

Transient Fault Detection and Recovery via Simultaneous Multithreading

  • Nevroz ŞEN

  • 26/04/2007


AGENDA

  • Introduction & Motivation

  • SMT, SRT & SRTR

  • Fault Detection via SMT (SRT)

  • Fault Recovery via SMT (SRTR)

  • Conclusion



INTRODUCTION

  • Transient Faults:

    • Faults that persist for a “short” duration
    • Caused by cosmic rays (e.g., neutrons)
    • Charge or discharge internal nodes of logic or SRAM cells; also caused by high-frequency crosstalk
  • Solution

    • No practical solution to absorb cosmic rays
      • 1 fault per 1000 computers per year (estimated fault rate)
  • Future is worse

    • Smaller feature sizes, reduced voltages, higher transistor counts, and reduced noise margins


INTRODUCTION

  • Fault-tolerant systems use redundancy to improve reliability:

    • Time redundancy: separate executions
    • Space redundancy: separate physical copies of resources
      • DMR/TMR
    • Data redundancy (a minimal parity sketch follows this list)
      • ECC
      • Parity
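
As background for the data-redundancy bullet above, here is a minimal, illustrative parity sketch (not taken from the papers): a single stored parity bit detects any single-bit upset in a word.

```cpp
#include <bitset>
#include <cstdint>
#include <iostream>

// Even parity over a 32-bit word: the stored parity bit makes the total
// number of 1 bits (data + parity) even.
uint32_t parity_bit(uint32_t word) {
    return static_cast<uint32_t>(std::bitset<32>(word).count() & 1u);
}

int main() {
    uint32_t data = 0xDEADBEEFu;
    uint32_t stored_parity = parity_bit(data);

    // A transient fault flips a single bit of the stored word.
    uint32_t faulty = data ^ (1u << 7);

    // Recomputing the parity on read exposes the single-bit upset.
    std::cout << std::boolalpha
              << "clean word passes check:  " << (parity_bit(data) == stored_parity) << '\n'
              << "faulty word passes check: " << (parity_bit(faulty) == stored_parity) << '\n';
}
```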


MOTIVATION

  • Simultaneous Multithreading improves the performance of a processor by allowing multiple independent threads to execute simultaneously (same cycle) in different functional units

  • Use the replication that the different threads already provide to run two copies of the same program, so that transient errors can be detected by comparing them



MOTIVATION



MOTIVATION



MOTIVATION

  • Less hardware compared to replicated microprocessors

    • SMT needs ~5% more hardware than a uniprocessor
    • SRT adds very little hardware overhead to an existing SMT design
  • Better performance than complete replication

    • Better use of resources
  • Lower cost

    • Avoids complete replication
    • Market volume of SMT & SRT


MOTIVATION - CHALLENGES

  • Cycle-by-cycle output comparison and input replication (cycle-by-cycle lockstepping) does not map well onto SMT:

    • Equivalent instructions from the two threads may execute in different cycles
    • Equivalent instructions from the two threads may execute in a different order with respect to other instructions in the same thread
  • Precise scheduling of the threads is crucial, but is disturbed by:

    • Branch mispredictions
    • Cache misses


SMT – SRT - SRTR



SMT – SRT - SRTR

  • SRT: Simultaneous & Redundantly Threaded Processor

  • SRT = SMT + Fault Detection

  • SRTR: Simultaneous & Redundantly Threaded Processor with Recovery

  • SRTR = SRT + Fault Recovery



Fault Detection via SMT - SRT

  • Sphere of Replication (SoR)

  • Output comparison

  • Input replication

  • Performance Optimizations for SRT

  • Simulation Results



SRT - Sphere of Replication (SoR)

  • Logical boundary of redundant execution within a system

  • Components inside sphere are protected against faults using replication

  • External components must use other means of fault tolerance (parity, ECC, etc.)

  • Its size matters:

    • Error detection latency
    • Stored-state size


SRT - Sphere of Replication (SoR) for SRT



OUTPUT COMPARISON

  • Compare and validate outputs before they leave the SoR, catching faults before they propagate to the rest of the system

  • No need to compare every instruction: an incorrect value caused by a fault propagates through the computation and is eventually consumed by a store, so checking only stores suffices

  • Check:

    • Address and data of stores from the redundant threads; both comparison and validation happen at commit time (a sketch of this check follows this list)
    • Addresses of uncached loads from the redundant threads
    • Addresses of cached loads from the redundant threads: not required
  • Other outputs are compared depending on where the SoR boundary is drawn
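
To make the store check concrete, here is a hypothetical sketch (the names and structure are illustrative, not the papers' implementation) of comparing the leading and trailing copies of a store before it is allowed to leave the SoR.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <optional>

// Illustrative SRT-style output comparison: a store leaves the sphere of
// replication only after the leading and trailing copies agree on both
// address and data.
struct Store { uint64_t addr; uint64_t data; };

class StoreComparator {
    std::deque<Store> leading_, trailing_;   // per-thread store queues
public:
    void commit_leading(Store s)  { leading_.push_back(s); }
    void commit_trailing(Store s) { trailing_.push_back(s); }

    // Returns the store to send to memory, or nothing (no pair ready, or a
    // mismatch was detected and a fault is signalled instead).
    std::optional<Store> try_release() {
        if (leading_.empty() || trailing_.empty()) return std::nullopt;
        Store a = leading_.front(), b = trailing_.front();
        leading_.pop_front(); trailing_.pop_front();
        if (a.addr != b.addr || a.data != b.data) {
            std::cerr << "transient fault detected on store\n";
            return std::nullopt;          // fail-fast: do not let it escape the SoR
        }
        return a;                         // checked store may leave the SoR
    }
};

int main() {
    StoreComparator sq;
    sq.commit_leading({0x1000, 42});
    sq.commit_trailing({0x1000, 42});
    if (auto s = sq.try_release()) std::cout << "store released: " << s->data << '\n';
}
```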



OUTPUT COMPARISON – Store Queue



INPUT REPLICATION

  • Replicate inputs coming from outside the SoR and deliver identical copies to the redundant threads. To do this:

  • Instructions: assumed not to be self-modifying, so no check is needed

  • Cached load data:

    • Active Load Address Buffer (ALAB)
    • Load Value Queue (LVQ)
  • Uncached load data:

    • Synchronize the threads when the compared address leaves the SoR
    • When the data returns, replicate the value for the two threads
  • External interrupts:

    • Stall the leading thread and deliver the interrupt synchronously, or
    • Record the interrupt delivery point and deliver it later at the same point in the trailing thread


INPUT REPLICATION – Active Load Address Buffer (ALAB)

  • Delays a cache block’s replacement or invalidation until the corresponding trailing loads have retired

  • A per-entry counter tracks the trailing thread’s outstanding loads to the block

  • When a cache block is about to be replaced or invalidated:

    • The ALAB is searched for an entry matching the block’s address
    • If the counter != 0, the replacement is delayed until the trailing thread’s outstanding loads to that block complete
    • Otherwise the block is replaced or invalidated (see the sketch after this list)
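
A hypothetical sketch of the ALAB bookkeeping described above (illustrative only): per-block counters of the trailing thread's outstanding loads gate cache replacement.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Illustrative ALAB bookkeeping: a counter per cache-block address tracks
// how many leading loads have retired whose trailing counterparts have not
// yet issued; replacement or invalidation of a block is delayed while its
// counter is non-zero.
class ALAB {
    std::unordered_map<uint64_t, int> outstanding_;   // block address -> count
public:
    void leading_load_retired(uint64_t block) { ++outstanding_[block]; }
    void trailing_load_issued(uint64_t block) {
        auto it = outstanding_.find(block);
        if (it != outstanding_.end() && --it->second == 0) outstanding_.erase(it);
    }
    // The cache consults the ALAB before evicting or invalidating a block.
    bool may_replace(uint64_t block) const {
        return outstanding_.find(block) == outstanding_.end();
    }
};

int main() {
    ALAB alab;
    alab.leading_load_retired(0x1000);
    std::cout << "replace ok? " << alab.may_replace(0x1000) << '\n';   // 0: delayed
    alab.trailing_load_issued(0x1000);
    std::cout << "replace ok? " << alab.may_replace(0x1000) << '\n';   // 1: allowed
}
```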


INPUT REPLICATION – Load Value Queue (LVQ)

  • A simpler alternative to the ALAB

  • Pre-designated leading and trailing threads: the leading thread enqueues each committed load’s address and value, and the trailing thread reads its load values from the queue in program order instead of accessing the data cache

  • Protected by ECC



INPUT REPLICATION – Load Value Queue (LVQ)

  • Advantages over the ALAB (a sketch of the LVQ follows this list):

    • Reduces pressure on the data cache ports
    • Accelerates detection of faulty load addresses
    • Simpler design
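
A hypothetical sketch of the LVQ behavior (illustrative, not the papers' design): the leading thread forwards its load addresses and values, the trailing thread consumes them in program order instead of reading the cache, and an address mismatch signals a fault.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <stdexcept>

// Illustrative Load Value Queue: the pre-designated leading thread enqueues
// each load's address and value; the trailing thread reads its load values
// from the queue in program order instead of from the data cache, so both
// threads see identical load data. (Would be ECC-protected in hardware.)
struct LoadEntry { uint64_t addr; uint64_t value; };

class LVQ {
    std::deque<LoadEntry> q_;
public:
    void leading_load(uint64_t addr, uint64_t value) { q_.push_back({addr, value}); }

    uint64_t trailing_load(uint64_t addr) {
        LoadEntry e = q_.front();
        q_.pop_front();
        if (e.addr != addr)                 // address mismatch: fault detected early
            throw std::runtime_error("transient fault: load address mismatch");
        return e.value;                     // replicated input for the trailing thread
    }
};

int main() {
    LVQ lvq;
    lvq.leading_load(0x2000, 7);                       // leading thread did the real cache access
    std::cout << lvq.trailing_load(0x2000) << '\n';    // trailing thread gets the same value
}
```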


Performance Optimizations for SRT

  • Idea: use one thread to improve the cache and branch prediction behavior of the other thread. Two techniques (sketched after this list):

    • Slack Fetch
      • Maintains a constant slack of instructions between the threads
      • Prevents the trailing thread from seeing mispredictions and cache misses
    • Branch Outcome Queue (BOQ)
      • Delivers the leading thread’s resolved branch outcomes to the trailing thread, so the trailing thread never mispredicts
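
The sketch below illustrates both techniques under simplifying assumptions (retired-instruction counts stand in for real fetch slots; all names are hypothetical): trailing fetch is gated to keep a fixed slack, and resolved branch outcomes flow through a queue to the trailing thread.

```cpp
#include <deque>

// Illustrative slack fetch + Branch Outcome Queue (BOQ): the trailing thread
// is only allowed to fetch once it is at least `slack` instructions behind
// the leading thread, and it consumes the leading thread's resolved branch
// outcomes instead of using a branch predictor.
class SlackFetchBOQ {
    long leading_retired_ = 0, trailing_retired_ = 0;
    const long slack_;                 // e.g. 256 instructions
    std::deque<bool> boq_;             // resolved taken / not-taken outcomes
public:
    explicit SlackFetchBOQ(long slack) : slack_(slack) {}

    // Fetch gate: keep the trailing thread `slack` instructions behind.
    bool may_fetch_trailing() const {
        return leading_retired_ - trailing_retired_ >= slack_;
    }
    void leading_retire(bool is_branch, bool taken) {
        ++leading_retired_;
        if (is_branch) boq_.push_back(taken);   // outcome now available to the trailer
    }
    void trailing_retire() { ++trailing_retired_; }

    // The trailing thread's "prediction" is the leading thread's outcome.
    bool trailing_branch_direction() {
        bool taken = boq_.front();
        boq_.pop_front();
        return taken;
    }
};

int main() {
    SlackFetchBOQ sf(256);
    for (int i = 0; i < 256; ++i) sf.leading_retire(/*is_branch=*/false, /*taken=*/false);
    return sf.may_fetch_trailing() ? 0 : 1;     // slack reached: trailing may fetch
}
```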


Performance Optimizations for SRT - Branch Outcome Queue (BOQ)



Simulation Results

  • Simulation Environment:

  • Modified SimpleScalar “sim-outorder” simulator

  • Long front-end pipeline because of the out-of-order core and SMT

  • A simple approximation of a trace cache

  • 11 SPEC95 benchmarks used



Simulation Results

  • ORH: On-Chip Replicated Hardware



Simulation Results - Slack Fetch & Branch Outcome Queue

  • Up to 27% performance improvement with SF, BOQ, and SF + BOQ

  • Performance is better with a slack of 256 instructions than with 32 or 128

  • The slack prevents the trailing thread from wasting resources on speculation



Simulation Results - Input Replication

  • Very low performance degradation with a 64-entry ALAB or LVQ

  • On average, a 16-entry ALAB and a 16-entry LVQ degrade performance by 8% and 5%, respectively



Simulation Results - Overall

  • Comparison with ORH-Dual

  • SRT processor configuration: slack fetch of 256 instructions, 128-entry BOQ, 64-entry store buffer, and 64-entry LVQ

  • Average 16%, maximum 29% better performance than a lockstepping processor with the “same” hardware



Fault Recovery via SMT (SRTR)

  • What is wrong with SRT: a leading non-store instruction may commit before the check for a fault occurs

    • SRT relies on the trailing thread to trigger the detection
    • This approach works well only in a fail-fast architecture
    • A faulty instruction cannot be undone once it commits


Fault Recovery via SMT (SRTR) - Motivation

  • In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection

  • In contrast, SRTR must not allow any leading instruction to commit before it has been checked

  • SRTR exploits the time between completion and commit of a leading instruction, checking the result as soon as the trailing instruction completes

  • In SPEC95, the gap between completion and commit is about 29 cycles on average

  • This short slack has some implications:

    • The leading thread provides branch predictions (not outcomes)
    • The StB, LVQ and BOQ need to handle mispredictions


Fault Recovery via SMT (SRTR) - Motivation

  • The leading thread provides the trailing thread with branch predictions instead of resolved outcomes (as in SRT)

  • A register value queue (RVQ) stores register values and other information needed for checking instructions, avoiding bandwidth pressure on the register file

  • Dependence-based checking elision (DBCE) is introduced to reduce the number of checks

  • Recovery uses the traditional rollback ability of modern pipelines



SRTR Additions to SMT



SRTR – AL & LVQ

  • Leading and trailing instructions occupy the same positions in their active lists (ALs), which are private to each thread

    • They may enter their ALs and become ready to commit at different times
  • The LVQ has to be modified to allow speculative loads

    • The Shadow Active List (SAL) holds pointers to LVQ entries
    • A trailing load might issue before the leading load
    • Branches place the LVQ tail pointer in the SAL
    • On a branch misprediction, the LVQ tail pointer is rolled back to the value saved in the SAL (see the sketch after this list)
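
A hypothetical sketch of the speculative-LVQ rollback (illustrative; the real SAL holds per-instruction pointers): a branch checkpoints the LVQ tail, and a misprediction restores it, discarding wrong-path entries.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative speculative LVQ with SAL-style checkpoints: the leading
// thread appends load values at the tail; each branch records the current
// tail (as the SAL would), and a misprediction rolls the tail back,
// discarding the wrong-path entries.
class SpeculativeLVQ {
    std::vector<uint64_t> values_;
    std::size_t tail_ = 0;                       // next free LVQ slot
public:
    std::size_t checkpoint() const { return tail_; }   // saved at branch dispatch

    void leading_load(uint64_t v) {
        if (tail_ == values_.size()) values_.push_back(v);
        else values_[tail_] = v;                 // reuse slots freed by a rollback
        ++tail_;
    }
    void rollback(std::size_t saved_tail) { tail_ = saved_tail; }  // on misprediction
    std::size_t size() const { return tail_; }
};

int main() {
    SpeculativeLVQ lvq;
    lvq.leading_load(1);
    std::size_t at_branch = lvq.checkpoint();    // branch enters the pipeline
    lvq.leading_load(2);                         // wrong-path load
    lvq.rollback(at_branch);                     // misprediction detected
    return lvq.size() == 1 ? 0 : 1;
}
```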


SRTR – PREDQ

  • The leading thread places the predicted PC into the predQ

  • Similar to the BOQ, but it holds predictions instead of resolved outcomes

  • Using the predQ, the two threads fetch essentially the same instructions

  • When the leading thread detects a misprediction, it clears the predQ

  • ECC-protected (a sketch follows below)
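
A minimal sketch of the predQ behavior described above (illustrative only): predicted PCs flow from the leading thread to the trailing thread's fetch stage, and the queue is squashed on a detected misprediction.

```cpp
#include <cstdint>
#include <deque>

// Illustrative predQ: holds the leading thread's *predicted* fetch PCs
// (not resolved outcomes, unlike the BOQ); the trailing thread fetches
// along the same predicted path, and the queue is cleared when the leading
// thread detects a misprediction. ECC-protected in hardware.
class PredQ {
    std::deque<uint64_t> predicted_pcs_;
public:
    void leading_predict(uint64_t pc) { predicted_pcs_.push_back(pc); }
    bool empty() const { return predicted_pcs_.empty(); }
    uint64_t trailing_next_pc() {
        uint64_t pc = predicted_pcs_.front();
        predicted_pcs_.pop_front();
        return pc;
    }
    void squash() { predicted_pcs_.clear(); }   // leading thread detected a misprediction
};
```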



SRTR – RVQ & CV

  • SRTR performs the check when the trailing instruction completes

  • The Register Value Queue (RVQ) is used to store register values for checking, avoiding pressure on the register file

    • RVQ entries are allocated when instructions enter the AL
    • Pointers to the RVQ entries are placed in the SAL so they can be found quickly
  • If the check succeeds, the corresponding entries in the commit vector (CV) are set to checked-ok and the instructions can commit

  • If the check fails, the entries in the CV are set to failed

    • Rollback is triggered when a failed entry reaches the head of the AL (see the sketch after this list)
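
A hypothetical sketch of the check-at-completion flow (illustrative data structures; the real RVQ and CV are more involved): the leading value waits in the RVQ, the trailing completion performs the comparison, and the CV entry either permits commit or triggers rollback.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative RVQ + commit vector (CV): the leading instruction's result is
// parked in the RVQ; when the trailing copy completes, the values are
// compared and the shared CV entry becomes checked-ok or failed. Commit at
// the head of the AL proceeds only for checked-ok entries; a failed entry
// at the head triggers a pipeline rollback.
enum class CVState { NotChecked, CheckedOk, Failed };

struct RVQEntry { uint64_t leading_value = 0; bool leading_done = false; };

class Checker {
    std::vector<RVQEntry> rvq_;
    std::vector<CVState>  cv_;
public:
    explicit Checker(std::size_t al_size)
        : rvq_(al_size), cv_(al_size, CVState::NotChecked) {}

    void leading_complete(std::size_t al_idx, uint64_t value) {
        rvq_[al_idx] = {value, true};
    }
    void trailing_complete(std::size_t al_idx, uint64_t value) {
        if (!rvq_[al_idx].leading_done) return;   // a real design would stall the check
        cv_[al_idx] = (rvq_[al_idx].leading_value == value) ? CVState::CheckedOk
                                                            : CVState::Failed;
    }
    // Consulted when the entry reaches the head of both active lists.
    bool may_commit(std::size_t al_idx)    const { return cv_[al_idx] == CVState::CheckedOk; }
    bool must_rollback(std::size_t al_idx) const { return cv_[al_idx] == CVState::Failed; }
};

int main() {
    Checker chk(8);
    chk.leading_complete(0, 42);
    chk.trailing_complete(0, 42);          // values match: checked-ok
    return chk.may_commit(0) ? 0 : 1;
}
```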


SRTR - Pipeline



SRTR – DBCE

  • SRTR stores register values and other information needed for checking in a separate structure, the register value queue (RVQ), avoiding bandwidth pressure on the register file

  • Checking every instruction still puts bandwidth pressure on the RVQ

  • The DBCE (Dependence-Based Checking Elision) scheme reduces the number of checks and thereby the RVQ bandwidth demand



SRTR – DBCE

  • Idea:

    • Faults propagate through dependent instructions
    • DBCE exploits register dependence chains so that only the last instruction in a chain uses the RVQ and has its leading and trailing values checked (a simplified sketch follows below)
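
A simplified, hypothetical sketch of the chain-tail selection behind DBCE (it ignores details such as chain-length limits and masking instructions): only instructions whose results are not consumed by a later instruction in the window are value-checked through the RVQ.

```cpp
#include <cstddef>
#include <vector>

// Simplified DBCE-style chain-tail marking: a fault in any instruction of a
// register dependence chain propagates to the chain's last result, so only
// an instruction whose result is never consumed later in the window (a
// chain tail) needs its leading/trailing values checked through the RVQ.
struct Inst {
    int dest_reg;
    std::vector<int> src_regs;
    bool needs_check = false;   // set for chain tails only
};

void mark_chain_tails(std::vector<Inst>& window) {
    for (std::size_t i = 0; i < window.size(); ++i) {
        bool consumed = false;
        for (std::size_t j = i + 1; j < window.size() && !consumed; ++j) {
            for (int src : window[j].src_regs) {
                if (src == window[i].dest_reg) { consumed = true; break; }
            }
            if (!consumed && window[j].dest_reg == window[i].dest_reg)
                break;          // value overwritten before use: this chain ends here
        }
        // Earlier instructions in a chain elide their checks; they still
        // commit only after the chain tail's check succeeds.
        window[i].needs_check = !consumed;
    }
}

int main() {
    // r1 = ...; r2 = f(r1); r3 = f(r2)  ->  only the last needs a check.
    std::vector<Inst> w = { {1, {}}, {2, {1}}, {3, {2}} };
    mark_chain_tails(w);
    return (!w[0].needs_check && !w[1].needs_check && w[2].needs_check) ? 0 : 1;
}
```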


SRTR – DBCE



SRTR - Performance

  • Performance comparison between SRT and SRTR (fault detection only)

  • The results are shaped by the interaction between branch mispredictions and the slack

  • SRTR performs within 1% to 7% of SRT



SRTR - Performance



CONCLUSION

  • A more efficient way to detect transient faults was presented

  • The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared

    • Key mechanisms were introduced: LVQ, ALAB, Slack Fetch and BOQ
  • An SRT processor can provide higher performance than an equivalently sized on-chip hardware-replicated solution

  • SRT can be extended to provide fault recovery (SRTR)



REFERENCES

  • T. N. Vijaykumar, I. Pomeranz, and K. Cheng, “Transient-Fault Recovery Using Simultaneous Multithreading,” Proc. 29th Annual Int’l Symp. on Computer Architecture (ISCA), May 2002.

  • S. K. Reinhardt and S. S. Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” Proc. 27th Annual Int’l Symp. on Computer Architecture (ISCA), pp. 25–36, June 2000.

  • E. Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors,” Proc. Int’l Symp. on Fault-Tolerant Computing (FTCS), 1999.

  • S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” Proc. 29th Annual Int’l Symp. on Computer Architecture (ISCA), 2002.


