Fault tolerant systems use redundancy to improve reliability:
Time redundancy: separate executions
Space redundancy: separate physical copies of resources
DMR/TMR
Data redundancy
ECC
Parity
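As a rough illustration of the space- and data-redundancy items above, here is a minimal Python sketch of a TMR majority voter and an even-parity check. The function names (`tmr_vote`, `even_parity_bit`, `parity_ok`) are hypothetical and chosen only for this example; real hardware does this with voters and XOR trees, not software.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Space redundancy: return the majority value of three redundant results."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three copies disagree, so no majority exists.
        raise RuntimeError("no majority: all three copies disagree")
    return value

def even_parity_bit(word):
    """Data redundancy: parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2

def parity_ok(word, parity):
    """Detects any single-bit flip in word or parity (but not double flips)."""
    return even_parity_bit(word) == parity
```

With DMR (two copies) a mismatch can only be detected; the third copy in TMR is what allows the voter to mask a single faulty result.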
MOTIVATION
Simultaneous Multithreading (SMT) improves the performance of a processor by allowing multiple independent threads to execute simultaneously (in the same cycle) in different functional units
Use the replication provided by the different threads to run two copies of the same program so that errors can be detected
Sphere of Replication (SoR): logical boundary of redundant execution within a system
Components inside sphere are protected against faults using replication
External components must use other means of fault tolerance (parity, ECC, etc.)
Its size matters:
Error detection latency
Stored-state size
SRT - Sphere of Replication (SoR) for SRT
OUTPUT COMPARISON
Compare & validate output before sending it outside the SoR: catch faults before they propagate to the rest of the system
No need to compare every instruction: an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, so checking only stores suffices.
Check:
Address and data for stores from redundant threads. Both comparison and validation at commit time
Address for uncached load from redundant threads
Address for cached load from redundant threads: not required
Other output comparison based on the boundary of an SoR
OUTPUT COMPARISON – Store Queue
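The store check can be sketched in Python as follows. This is a simplified model, not the actual hardware design: the class `StoreComparator`, the exception `FaultDetected`, and the dictionary standing in for memory outside the SoR are all assumptions made for illustration. Leading-thread stores are held back until the trailing thread produces the matching store, and only a verified store leaves the sphere.

```python
from collections import deque

class FaultDetected(Exception):
    """Raised when the redundant threads disagree on a store."""

class StoreComparator:
    def __init__(self):
        self.pending = deque()  # leading-thread stores awaiting verification, in program order
        self.memory = {}        # stands in for everything outside the SoR

    def leading_store(self, addr, data):
        # The leading store is buffered; it must not leave the SoR yet.
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        # Compare address and data against the oldest unverified leading store.
        if not self.pending:
            raise FaultDetected("trailing store has no matching leading store")
        lead = self.pending.popleft()
        if lead != (addr, data):
            raise FaultDetected(f"store mismatch: leading={lead}, trailing={(addr, data)}")
        # Only a verified store is released outside the SoR.
        self.memory[addr] = data
```

Because stores are matched strictly in program order, a simple FIFO is enough in this sketch; no address search is needed.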
INPUT REPLICATION
Replicate & deliver the same input (coming from outside the SoR) to the redundant copies. To do this:
Instructions: Assume no self-modification. No check
Cached load data:
Active Load Address Buffer
Load Value Queue
Uncached load data:
Synchronize when comparing addresses that leave the SoR
When data returns, replicate the value for the two threads
External Interrupts:
Stall lead thread and deliver interrupt synchronously
Record interrupt delivery point and deliver later
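For cached loads, the Load Value Queue option can be illustrated with a small Python sketch. It assumes, for simplicity, that both threads issue loads in program order; the class name `LoadValueQueue` and the dictionary used as memory are hypothetical. Only the leading thread reads the memory system, and the trailing thread receives the same value from the queue, so an intervening write from outside the SoR cannot make the two copies diverge.

```python
from collections import deque

class LoadValueQueue:
    def __init__(self, memory):
        self.memory = memory  # stands in for the cache/memory outside the SoR
        self.fifo = deque()   # (address, value) pairs in leading-thread load order

    def leading_load(self, addr):
        # Only the leading thread actually accesses memory.
        value = self.memory[addr]
        self.fifo.append((addr, value))
        return value

    def trailing_load(self, addr):
        # The trailing thread gets the recorded value; its address is checked
        # against the leading thread's address to detect a fault.
        lead_addr, value = self.fifo.popleft()
        if lead_addr != addr:
            raise RuntimeError(f"load address mismatch: {lead_addr:#x} vs {addr:#x}")
        return value
```

For example, if another agent overwrites the location between the two loads, the trailing thread still sees the value the leading thread saw.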
INPUT REPLICATION – Active Load Address Buffer (ALAB)
Delays a cache block’s replacement or invalidation until after the retirement of the trailing load
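One way to picture the ALAB is as a table of per-block counters. The Python sketch below is a simplified model under stated assumptions: the table is unbounded, blocks are identified by a plain address, and the class and method names (`ALAB`, `leading_load_executes`, `trailing_load_retires`, `may_replace`, `request_invalidate`) are invented for this example. A block is locked while any leading load to it has not yet been matched by a retired trailing load.

```python
class ALAB:
    def __init__(self):
        # block address -> [outstanding load count, pending-invalidate flag]
        self.entries = {}

    def leading_load_executes(self, block):
        entry = self.entries.setdefault(block, [0, False])
        entry[0] += 1

    def trailing_load_retires(self, block):
        """Returns True if a deferred invalidation may now proceed."""
        entry = self.entries[block]
        entry[0] -= 1
        if entry[0] == 0:
            del self.entries[block]
            return entry[1]
        return False

    def may_replace(self, block):
        # Replacement is delayed while trailing loads are still outstanding.
        return block not in self.entries

    def request_invalidate(self, block):
        """Returns True if the invalidation can happen immediately."""
        if block in self.entries:
            self.entries[block][1] = True  # remember it and defer
            return False
        return True
```

A real structure has a fixed number of entries, so the leading thread would also have to stall when the table is full; that case is left out here.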
In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection.
In contrast, SRTR must not allow any leading instruction to commit before checking occurs.
SRTR uses the time between the completion and commit of a leading instruction, and checks the results as soon as the trailing instruction completes
In SPEC95, the time from completion to commit is about 29 cycles
This short slack has some implications:
Leading thread provides branch predictions
The StB, LVQ and BOQ need to handle mispredictions
Fault Recovery via SMT (SRTR) - Motivation
Leading thread provides the trailing thread with branch predictions instead of outcomes (SRT).
Register value queue (RVQ), to store register values and other information necessary for checking of instructions, avoiding bandwidth pressure on the register file.
Dependence-based checking elision (DBCE) is developed to reduce the number of checks
Recovery via traditional rollback ability of modern pipelines
SRTR Additions to SMT
SRTR – AL & LVQ
Leading and trailing instructions occupy the same positions in their ALs (private for each thread)
They may enter their ALs and become ready to commit at different times
The LVQ has to be modified to allow speculative loads
The Shadow Active List holds pointers to LVQ entries
If check fails, the entries in the CV vectors are set to failed
Rollback is done when the failed entries reach the head of the AL
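The commit rule implied by the points above (no commit before the check, rollback when a failed entry reaches the head) can be sketched in Python. This is a deliberately reduced model: it uses a single list instead of one AL per thread, stores the check state as a field on each entry rather than in separate CV vectors, and all names (`CheckedActiveList`, `try_commit`, the state constants) are hypothetical.

```python
from collections import deque

UNCHECKED, PASSED, FAILED = "unchecked", "passed", "failed"

class CheckedActiveList:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left (head)

    def leading_completes(self, value):
        # A leading instruction has a result but has not been checked yet.
        entry = {"lead": value, "cv": UNCHECKED}
        self.entries.append(entry)
        return entry

    def trailing_completes(self, entry, value):
        # The check happens as soon as the trailing copy completes.
        entry["cv"] = PASSED if entry["lead"] == value else FAILED

    def try_commit(self):
        if not self.entries or self.entries[0]["cv"] == UNCHECKED:
            return "stall"      # never commit before the check has occurred
        if self.entries[0]["cv"] == FAILED:
            self.entries.clear()  # squash the failed instruction and everything younger
            return "rollback"
        self.entries.popleft()
        return "commit"
```

Since nothing unchecked ever commits, the architectural state is always fault-free, which is what lets the pipeline's existing rollback mechanism serve as the recovery path.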
SRTR - Pipeline
SRTR – DBCE
SRTR uses a separate structure, the register value queue (RVQ), to store register values and other information necessary for checking of instructions, avoiding bandwidth pressure on the register file.
Checking every instruction puts bandwidth pressure on the RVQ
The DBCE (Dependence-Based Checking Elision) scheme reduces the number of checks and, thereby, the RVQ bandwidth demand.
Idea:
Faults propagate through dependent instructions
Exploits register dependence chains so that only the last instruction in a chain uses the RVQ, and has the leading and trailing values checked.
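The idea can be illustrated with a simplified Python sketch. It is not the chain-formation hardware: it assumes a small window of instructions given as `(dest, srcs)` register tuples, treats any instruction whose result is read later in the window as covered by its consumer, and ignores instructions that could mask an upstream fault. The function name `instructions_to_check` is hypothetical.

```python
def instructions_to_check(window):
    """window: list of (dest, srcs) tuples in program order.

    Returns the indices of instructions whose values still need an explicit
    check. An instruction is elided when a later instruction reads its result
    before the register is overwritten, since a fault would propagate to
    that consumer.
    """
    checked = []
    for i, (dest, _) in enumerate(window):
        consumed = False
        for later_dest, later_srcs in window[i + 1:]:
            if dest in later_srcs:
                consumed = True
                break
            if later_dest == dest:
                break  # overwritten before any use: no consumer covers it
        if not consumed:
            checked.append(i)
    return checked

# r1 -> r4 -> r5 form a chain; r7 is independent.
window = [
    ("r1", ("r2", "r3")),
    ("r4", ("r1",)),
    ("r5", ("r4", "r6")),
    ("r7", ("r8",)),
]
print(instructions_to_check(window))  # [2, 3]
```

In this window only two of the four instructions are checked: the end of the chain and the independent instruction.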
SRTR - Performance
Detection performance between SRT & SRTR
Better results in the interaction between branch mispredictions and slack.
The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared.
Defined some concepts: LVQ, ALAB, Slack Fetch and BOQ
An SRT processor can provide higher performance than an equivalently sized on-chip HW replicated solution.
SRT can be extended for fault recovery (SRTR)
REFERENCES
T. N. Vijaykumar, I. Pomeranz, and K. Cheng, “Transient Fault Recovery using Simultaneous Multithreading,” Proc. 29th Annual Int’l Symp. on Computer Architecture (ISCA), May 2002.
S. K. Reinhardt and S. S. Mukherjee, “Transient-Fault Detection via Simultaneous Multithreading,” Proc. 27th Annual Int’l Symp. on Computer Architecture (ISCA), pp. 25–36, June 2000.
E. Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors,” Proc. Fault-Tolerant Computing Systems (FTCS), 1999.
S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” Proc. Int’l Symp. on Computer Architecture (ISCA), 2002.