Transient Fault Detection and Recovery via Simultaneous Multithreading

Nevroz ŞEN
26/04/2007

AGENDA

Introduction & Motivation
SMT, SRT & SRTR
Fault Detection via SMT (SRT)
Fault Recovery via SMT (SRTR)
Conclusion

INTRODUCTION

Transient Faults:

Faults that persist for a “short” duration
Caused by cosmic rays (e.g., neutrons)
Charge and/or discharge internal nodes of logic or SRAM cells; also caused by high-frequency crosstalk

Solution

No practical solution to absorb cosmic rays

1 fault per 1000 computers per year (estimated fault rate)

Future is worse

Smaller feature sizes, reduced voltage, higher transistor counts, reduced noise margins

INTRODUCTION

Fault tolerant systems use redundancy to improve reliability:

Time redundancy: separate executions
Space redundancy: separate physical copies of resources

DMR/TMR

Data redundancy

ECC
Parity
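
As a rough illustration of the redundancy styles listed above, the following sketch shows a DMR comparison, a TMR majority vote, and a single even-parity bit. The function names are hypothetical and chosen only for this example; real hardware does this in logic, not software.

```python
from collections import Counter


def dmr_mismatch(a, b):
    """DMR: two copies can detect a disagreement but cannot tell which is wrong."""
    return a != b


def tmr_vote(a, b, c):
    """TMR: return the majority value of three copies, or fail if all differ."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among the three copies")
    return value


def even_parity_bit(word):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word, parity):
    """Data redundancy: a single flipped bit changes the recomputed parity."""
    return even_parity_bit(word) == parity


stored_word, stored_parity = 0b1011, even_parity_bit(0b1011)
flipped = stored_word ^ 0b0100  # model a transient single-bit flip
print(tmr_vote(7, 7, 3), parity_ok(flipped, stored_parity))
```

DMR and parity only detect a fault, while TMR can also mask one, at the cost of a third copy.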

MOTIVATION

Simultaneous Multithreading improves the performance of a processor by allowing multiple independent threads to execute simultaneously (same cycle) in different functional units
Use the replication provided by the different threads to run two copies of the same program so we are able to detect errors

MOTIVATION

Less hardware compared to replicated microprocessors

SMT needs ~5% more hardware over uniprocessor
SRT adds very little hardware overhead to existing SMT

Better performance than complete replication

Better use of resources

Lower cost

Avoids complete replication
Market volume of SMT & SRT

MOTIVATION - CHALLENGES

Cycle-by-cycle output comparison and input replication (cycle-by-cycle lockstepping) is hard in an SMT processor:

Equivalent instructions from different threads might execute in different cycles
Equivalent instructions from different threads might execute in different order with respect to other instructions in the same thread

Precise scheduling of the threads is crucial

Branch misprediction
Cache miss

SMT – SRT - SRTR

SRT: Simultaneous & Redundantly Threaded Processor
SRT = SMT + Fault Detection
SRTR: Simultaneous & Redundantly Threaded Processor with Recovery
SRTR = SRT + Fault Recovery

Fault Detection via SMT - SRT

Sphere of Replication (SoR)
Output comparison
Input replication
Performance Optimizations for SRT
Simulation Results

SRT - Sphere of Replication (SoR)

Logical boundary of redundant execution within a system
Components inside sphere are protected against faults using replication
External components must use other means of fault tolerance (parity, ECC, etc.)
Its size matters:

Error detection latency
Stored-state size

SRT - Sphere of Replication (SoR) for SRT

OUTPUT COMPARISON

Compare & validate output before sending it outside the SoR - Catch faults before propagating to rest of system
No need to compare every instruction: an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, so checking only stores suffices.
Check:

Address and data for stores from redundant threads. Both comparison and validation at commit time
Address for uncached load from redundant threads
Address for cached load from redundant threads: not required

Other output comparison based on the boundary of an SoR
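
The store check described above can be sketched as a toy model. It assumes stores from both threads arrive in program order and that a leading store is held until the trailing thread produces its copy; `StoreComparator` and `FaultDetected` are hypothetical names, not taken from the slides.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when the two redundant threads disagree on a store."""


class StoreComparator:
    def __init__(self):
        self.pending = deque()  # (address, data) of leading stores, in program order
        self.memory = {}        # stands in for everything outside the SoR

    def leading_store(self, address, data):
        # The leading store is buffered; nothing leaves the SoR yet.
        self.pending.append((address, data))

    def trailing_store(self, address, data):
        # Compare address and data against the oldest buffered leading store.
        if not self.pending:
            raise RuntimeError("trailing store arrived before the leading store")
        lead = self.pending.popleft()
        if lead != (address, data):
            raise FaultDetected(f"leading {lead} != trailing {(address, data)}")
        self.memory[address] = data  # validated: release outside the SoR


sc = StoreComparator()
sc.leading_store(0x100, 42)
sc.trailing_store(0x100, 42)  # matches, so the store becomes visible
print(sc.memory)
```

A mismatch raises before the value is written, which is the point of comparing at the SoR boundary.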

OUTPUT COMPARISON – Store Queue

INPUT REPLICATION

Replicate & deliver the same input (coming from outside the SoR) to the redundant copies. To do this:
Instructions: assume no self-modifying code, so no check is needed
Cached load data:

Active Load Address Buffer
Load Value Queue

Uncached load data:

Synchronize when comparing addresses that leave the SoR
When data returns, replicate the value for the two threads

External Interrupts:

Stall lead thread and deliver interrupt synchronously
Record interrupt delivery point and deliver later

INPUT REPLICATION – Active Load Address Buffer (ALAB)

Delays a cache block’s replacement or invalidation until after the retirement of the trailing load
Counter tracks trailing thread’s outstanding loads
When a cache block is about to be replaced:

The ALAB is searched for an entry matching the block’s address
If counter != 0 then:

Do not replace or invalidate until the trailing thread is done
Set the pending-invalidate bit

Else replace or invalidate
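
The ALAB bookkeeping above can be sketched as follows. The class and method names are hypothetical. Stalling leading loads while the pending-invalidate bit is set is an assumption added here so the counter can drain; the slide itself only says the bit is set.

```python
class ALAB:
    """Toy Active Load Address Buffer keyed by cache block address."""

    def __init__(self):
        # block address -> {"count": outstanding trailing loads, "pending": bit}
        self.entries = {}

    def leading_load(self, block):
        """Record a leading load; return False if the load should stall."""
        entry = self.entries.setdefault(block, {"count": 0, "pending": False})
        if entry["pending"]:
            return False  # assumption: stall so the counter can reach zero
        entry["count"] += 1
        return True

    def trailing_load_retired(self, block):
        """The matching trailing load has retired, so one fewer is outstanding."""
        self.entries[block]["count"] -= 1

    def may_replace(self, block):
        """Called when the cache wants to replace or invalidate a block."""
        entry = self.entries.get(block)
        if entry is None or entry["count"] == 0:
            self.entries.pop(block, None)
            return True          # nothing outstanding: replace or invalidate
        entry["pending"] = True  # remember that the cache is waiting
        return False


alab = ALAB()
alab.leading_load(0x40)
print(alab.may_replace(0x40))  # trailing load still outstanding
alab.trailing_load_retired(0x40)
print(alab.may_replace(0x40))  # counter is back to zero
```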

INPUT REPLICATION – Load Value Queue (LVQ)

An alternative to ALAB – Simpler
Pre-designated leading & trailing threads
Protected by ECC

INPUT REPLICATION – Load Value Queue (LVQ)

Advantages over ALAB:

Reduces the pressure on data cache ports
Accelerates detection of faulty addresses
Simple design
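
A minimal sketch of the LVQ idea, assuming the leading thread's loads and the trailing thread's loads reach the queue in the same program order. `LoadValueQueue` and `FaultDetected` are hypothetical names, and the default capacity is only an illustrative value.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when leading and trailing load addresses disagree."""


class LoadValueQueue:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.fifo = deque()  # (address, value) pairs written by the leading thread

    def leading_load(self, address, value):
        """Leading thread reads the cache and forwards the value; False means stall."""
        if len(self.fifo) >= self.capacity:
            return False
        self.fifo.append((address, value))
        return True

    def trailing_load(self, address):
        """Trailing thread takes its value from the queue, not from the cache."""
        lead_address, value = self.fifo.popleft()
        if lead_address != address:
            raise FaultDetected(f"address {address:#x} != {lead_address:#x}")
        return value


lvq = LoadValueQueue()
lvq.leading_load(0x200, 99)
print(lvq.trailing_load(0x200))
```

Because the trailing load never touches the data cache, both threads see the same value even if the block changes in between, and an address mismatch is caught at the load instead of at a later store.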

Performance Optimizations for SRT

Idea: use one thread to improve cache and branch prediction behavior for the other thread. Two techniques:

Slack Fetch

Maintains a constant slack of instructions between the threads
Prevents the trailing thread from seeing mispredictions and cache misses

Branch Outcome Queue (BOQ)
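
The two techniques can be sketched together. This is a simplified fetch policy, not the mechanism evaluated in the slides: `pick_fetch_thread` and `BranchOutcomeQueue` are hypothetical names, and a real SMT fetch unit would fold the slack into its existing thread-selection heuristic.

```python
from collections import deque


def pick_fetch_thread(leading_fetched, trailing_fetched, slack=256):
    """Slack fetch: let the trailing thread fetch only once the leading
    thread is at least `slack` instructions ahead."""
    if leading_fetched - trailing_fetched < slack:
        return "leading"
    return "trailing"


class BranchOutcomeQueue:
    """Leading thread records resolved branch targets; the trailing thread
    consumes them in order and uses them as its predictions."""

    def __init__(self):
        self.fifo = deque()

    def leading_branch_committed(self, target_pc):
        self.fifo.append(target_pc)

    def trailing_next_target(self):
        """Return the next recorded target, or None if the trailing thread must wait."""
        return self.fifo.popleft() if self.fifo else None


boq = BranchOutcomeQueue()
boq.leading_branch_committed(0x4010)
print(pick_fetch_thread(100, 0), hex(boq.trailing_next_target()))
```

Since the trailing thread follows outcomes the leading thread has already resolved, it should not fetch down a wrong path unless a fault has occurred.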

Performance Optimizations for SRT - Branch Outcome Queue (BOQ)

Simulation Results

Simulation Environment:
Modified Simplescalar “sim-outorder”
Long front-end pipeline because of out-of-order nature and SMT
Simple approximation of trace cache
Used 11 SPEC95 benchmarks

Simulation Results

ORH: On-Chip Replicated Hardware

ORH-Dual -> two pipelines, each with half the resources
SMT-Dual -> Replicated threads with no detection hardware

Simulation Results - Slack Fetch & Branch Outcome Queue

Max 27% performance improvements for SF, BOQ, and SF + BOQ
Performance better with slack of 256 instructions over 32 or 128
Prevents trailing thread from wasting resources by speculating

Simulation Results - Input Replication

Very low performance degradation for 64-entry ALAB or LVQ
On average a 16-entry ALAB and a 16-entry LVQ degrade performance by 8% and 5% respectively.

Simulation Results - Overall

Comparison with ORH-Dual
SRT processor: 256 slack fetch, BOQ with 128 entries, 64-entry store buffer, and 64-entry LVQ
Average improvement: 16%; maximum: 29% over a lockstepping processor with the “same” hardware

Fault Recovery via SMT (SRTR)

What is wrong with SRT: A leading non-store instruction may commit before the check for the fault occurs

Relies on the trailing thread to trigger the detection
However, an SRT processor works well in a fail-fast architecture
A faulty instruction cannot be undone once the instruction commits.

Fault Recovery via SMT (SRTR) - Motivation

In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection.
In contrast, SRTR must not allow any leading instruction to commit before checking occurs.
SRTR uses the time between the completion and the commit of a leading instruction, and checks the results as soon as the trailing instruction completes
In SPEC95, complete to commit takes about 29 cycles
This short slack has some implications:

Leading thread provides branch predictions
The StB, LVQ and BOQ need to handle mispredictions

Fault Recovery via SMT (SRTR) - Motivation

The leading thread provides the trailing thread with branch predictions instead of outcomes (as in SRT).
A register value queue (RVQ) stores register values and other information necessary for checking instructions, avoiding bandwidth pressure on the register file.
Dependence-based checking elision (DBCE) is developed to reduce the number of checks.
Recovery uses the traditional rollback ability of modern pipelines.

SRTR Additions to SMT

SRTR – AL & LVQ

Leading and trailing instructions occupy the same positions in their ALs (private for each thread)

They may enter their ALs and become ready to commit at different times

The LVQ has to be modified to allow speculative loads

The Shadow Active List holds pointers to LVQ entries
A trailing load might issue before the leading load
Branches place the LVQ tail pointer in the SAL
The LVQ’s tail pointer has to be rolled back on a misprediction

SRTR – PREDQ

Leading thread places predicted PC
Similar to BOQ but only holds predictions instead of outcomes
Using the predQ, the two threads fetch essentially the same instructions
On detecting a misprediction, the leading thread clears the predQ
ECC protected

SRTR – RVQ & CV

SRTR checks when the trailing instruction completes
The Register Value Queue is used to store register values for checking, avoiding pressure on the register file

RVQ entries are allocated when instructions enter the AL
Pointers to the RVQ entries are placed in the SAL to facilitate their search

If the check succeeds, the entries in the commit vector (CV) are set to checked-ok and the instructions may commit
If the check fails, the entries in the CV are set to failed

Rollback is done when the failed entries reach the head of the AL
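
The checking flow above can be sketched as a small state machine. It assumes the leading instruction completes before its trailing copy and that both use the same index, matching their shared AL position; the class name, the "not-checked" initial state, and the returned action strings are hypothetical.

```python
from enum import Enum


class CheckState(Enum):
    NOT_CHECKED = "not-checked"
    CHECKED_OK = "checked-ok"
    FAILED = "failed"


class RegisterValueChecker:
    def __init__(self, size):
        self.rvq = [None] * size                     # leading results awaiting a check
        self.cv = [CheckState.NOT_CHECKED] * size    # one CV entry per AL position

    def leading_complete(self, index, value):
        # The leading result is parked in the RVQ, not re-read from the register file.
        self.rvq[index] = value

    def trailing_complete(self, index, value):
        # Check as soon as the trailing instruction completes.
        same = self.rvq[index] == value
        self.cv[index] = CheckState.CHECKED_OK if same else CheckState.FAILED

    def action_at_head(self, index):
        """Decide what happens when this entry reaches the head of the AL."""
        state = self.cv[index]
        if state is CheckState.CHECKED_OK:
            return "commit"
        if state is CheckState.FAILED:
            return "rollback"
        return "wait"


checker = RegisterValueChecker(size=4)
checker.leading_complete(0, 10)
checker.trailing_complete(0, 10)
print(checker.action_at_head(0))
```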

SRTR - Pipeline

SRTR – DBCE

SRTR uses a separate structure, the register value queue (RVQ), to store register values and other information necessary for checking of instructions, avoiding bandwidth pressure on the register file.
Checking every instruction puts bandwidth pressure on the RVQ
The DBCE (Dependence-Based Checking Elision) scheme reduces the number of checks and, thereby, the RVQ bandwidth demand.

SRTR – DBCE

Idea:

Faults propagate through dependent instructions
Exploits register dependence chains so that only the last instruction in a chain uses the RVQ, and has the leading and trailing values checked.
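
A deliberately simplified illustration of the chain idea: an instruction's check is skipped when a later instruction in the window consumes its result, so only chain tails are checked. The representation (a list of destination and source registers) and the function name are assumptions for this sketch, and any further constraints the real scheme places on chain formation are not modeled.

```python
def instructions_to_check(instrs):
    """instrs: list of (dest_reg, [src_regs]) in program order.
    Return the indices whose results would still be checked via the RVQ."""
    to_check = []
    for i, (dest, _) in enumerate(instrs):
        consumed = False
        for later_dest, later_srcs in instrs[i + 1:]:
            if dest in later_srcs:
                consumed = True   # a fault here would propagate to the consumer
                break
            if later_dest == dest:
                break             # value overwritten before anyone read it
        if not consumed:
            to_check.append(i)    # last instruction of its chain
    return to_check


window = [
    ("r1", ["r0"]),  # r1 feeds the next instruction
    ("r2", ["r1"]),  # r2 feeds the next instruction
    ("r3", ["r2"]),  # tail of the r1 -> r2 -> r3 chain
    ("r4", ["r0"]),  # independent instruction, its own tail
]
print(instructions_to_check(window))
```

In this window only two of the four results would need an RVQ check.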

SRTR – DBCE

SRTR - Performance

Detection performance between SRT & SRTR
Better results in the interaction between branch mispredictions and slack.
SRTR is 1%-7% better than SRT

SRTR - Performance

CONCLUSION

A more efficient way to detect Transient Faults is presented
The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared.

Defined some concepts: LVQ, ALAB, Slack Fetch and BOQ

An SRT processor can provide higher performance than an equivalently sized on-chip HW replicated solution.
SRT can be extended for fault recovery (SRTR)

REFERENCES

T. N. Vijaykumar, I. Pomeranz, and K. Cheng, “Transient Fault Recovery Using Simultaneous Multithreading,” Proc. 29th Annual Int’l Symp. on Computer Architecture (ISCA), May 2002.
S. K. Reinhardt and S. S. Mukherjee, “Transient-Fault Detection via Simultaneous Multithreading,” Proc. 27th Annual Int’l Symp. on Computer Architecture (ISCA), pp. 25–36, June 2000.
E. Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors,” Proc. Fault-Tolerant Computing Systems (FTCS), 1999.
S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” Proc. Int’l Symp. on Computer Architecture (ISCA), 2002.
