Soft Error Benchmarking of
L2 Caches with PARMA
Jinho Suh
Mehrtash Manoochehri
Murali Annavaram
Michel Dubois
Outline
• Introduction
  • Soft Errors
  • Evaluating Soft Errors
• PARMA: Precise Analytical Reliability Model for Architecture
  • Model
  • Applications
• Conclusion and Future work
Soft Errors
• Random errors in otherwise fault-free circuits, mostly affecting SRAMs
  • From alpha particles (electrical noise)
  • From neutron strikes in cosmic rays
• Severe problem, especially with power-saving techniques
  • Superlinear increase with voltage and capacitance scaling
  • Near-/sub-threshold Vdd operation for power savings
• Concerns in:
  • Large servers
  • Avionic or space electronics
  • SRAM caches with drowsy modes
  • eDRAM caches with reduced refresh rates
Why Benchmark Soft Errors
• Designers need a good estimate of expected errors to incorporate a ‘just-right’ solution at design time
• Good estimation is non-trivial
  • Multi-Bit Errors are expected
  • Masking effects: not every Single Event Upset leads to an error [Mukherjee’03]
    • Faults become errors when they propagate to the outer scope
    • Faults can be masked off at various levels
• Design decisions
  • Is the protection under consideration too much or too little?
  • Is a newly proposed protection scheme better?
• The impact of soft errors needs to be addressed at design time
• Estimating soft error rates for target application domains is an important issue
Evaluating Soft Errors:
Some Reliability Benchmarking Approaches
• Fundamental difficulty: soft errors happen very rarely
• Field Analysis / Life Testing
  • Difficulty in collecting data [Ziegler]
  • Obsolete for design iteration
• Fault Injection / Accelerated Testing
  • Require massive experiments
  • Distortion in measurement/interpretation
• Analytical Modeling (Intrinsic SER, AVF, SoftArch)
  • Better for estimating SER in short time
  • Complexity determines preciseness
  • Intrinsic FIT (Failure-in-Time) rate
    • Highly pessimistic: no consideration of masking effects
    • Unclear for protected caches
  • AVF [Mukherjee’03] and SoftArch [Li’05]
    • Quickly compute SDC without protection or DUE under parity
    • Ignore temporal/spatial MBEs
    • Can’t account for error detection/correction schemes
Two Components of PARMA
(Precise Analytical Reliability Model for Architecture)
1. Fault generation model
  • Poisson Single Event Upset model
  • Probability distribution of having k faulty bit(s) in a domain (set of bits) during multiple cycles
2. Fault propagation model
  • A fault becomes an error when the faulty bit is consumed
    • An instruction with a faulty bit commits
    • A load commits and its operand has a faulty bit
• PARMA measures:
  Generated faults → Propagated faults → Expected errors → Error rate
Using Vulnerability Clock Cycles
to Track Bit Lifetime
• Vulnerability clocks (VCs) track the cycles that any bit spends in the vulnerable component: L2$
  • Ticks when a bit resides in L2
  • Stops when a bit stays outside L2
• Similar to lifetime analysis in the AVF method
• When a word is updated to hold new data, its VC resets to zero
• When this block is refilled later, VCs should start ticking from here
• An L2 block is NOT dead even when it is evicted to MEM, because it can be refilled into L2 later
• Accesses to L1$ determine consumption of the faulty bits
• When an L1 block is evicted, the REAL impact of the soft error on the system is finalized
[Figure: a set of bits moving among Proc, L1$, L2$ and Main Memory; the VC ticks while the bits are in L2$ and stops elsewhere; per-word VC values for Word# 0 to 3 are shown at successive steps]
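The tick/stop/reset rules for a vulnerability clock can be sketched as a small per-word counter. This is only an illustration under simple assumptions, not the simulator's actual data structure; the class name `WordVC` and its method names are hypothetical.

```python
class WordVC:
    """Hypothetical per-word vulnerability clock (VC)."""

    def __init__(self):
        self.vc = 0          # accumulated vulnerable cycles
        self.in_l2 = False   # True while the word resides in L2

    def fill_into_l2(self):
        self.in_l2 = True    # VC resumes ticking from its current value

    def evict_from_l2(self):
        self.in_l2 = False   # VC stops but keeps its value

    def overwrite(self):
        self.vc = 0          # new data: the VC resets to zero

    def advance(self, cycles):
        # Only cycles spent inside L2 count as vulnerable.
        if self.in_l2:
            self.vc += cycles


word = WordVC()
word.fill_into_l2()
word.advance(100)    # ticks while in L2
word.evict_from_l2()
word.advance(400)    # stopped while outside L2
word.fill_into_l2()
word.advance(200)    # continues from 100 after the refill
vc_before_write = word.vc
word.overwrite()
```

Keeping the value across an eviction reflects the point that an evicted L2 block is not dead, since it may be refilled later.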
Probability of a Bit Flip in One Cycle
• SEU Model
  • p : probability that one bit is flipped during one cycle period
  • Poisson probability mass function gives p
$$ p = \sum_{\text{odd } j} \frac{\lambda^{j}}{j!} e^{-\lambda} $$
  • λ: Poisson rate of SEUs
    • ex) 10^-25/bit @ 65nm 3GHz CPU
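As a worked illustration of the odd-j Poisson sum, the series can be evaluated in closed form, since the sum over odd j of λ^j/j! equals sinh(λ). The sketch below is not taken from the slides; the function names are hypothetical, and the rate 1e-25 is the example value quoted for a 65nm 3GHz CPU.

```python
import math


def flip_probability(lam):
    """P(odd number of SEUs in one cycle) = exp(-lam) * sinh(lam)
    = (1 - exp(-2*lam)) / 2. expm1 keeps precision for tiny lam."""
    return -math.expm1(-2.0 * lam) / 2.0


def flip_probability_series(lam, terms=20):
    """Direct truncated evaluation of the odd-j sum, for cross-checking."""
    odd_sum = sum(lam ** j / math.factorial(j) for j in range(1, 2 * terms, 2))
    return math.exp(-lam) * odd_sum


p = flip_probability(1e-25)
```

For rates this small, p is numerically indistinguishable from λ itself; the odd-only sum starts to matter when the SEU rate is heavily accelerated.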
Temporal Expansion:
Probability of a Bit Flip in Nc vulnerability cycles
• q(Nc) : probability of a bit being faulty in Nc vulnerability cycles
[Figure: timeline of Nc one-cycle periods, each with flip probability p, spanning a vulnerability clock count of Nc and yielding q(Nc)]
• To be faulty at the end of Nc cycles, a bit must flip an odd number of times in Nc
$$ P_i(N_c) = \binom{N_c}{i} p^{i} (1-p)^{N_c-i}, \quad i = 0, \ldots, N_c $$
$$ q(N_c) = \sum_{i=0}^{\lfloor N_c/2 \rfloor} P_{2i+1}(N_c) $$
(terms with 2i+1 > Nc are zero)
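The odd-flip sum for q(Nc) can be checked numerically. The sketch below compares a direct evaluation against the standard binomial identity q = (1 - (1 - 2p)^Nc) / 2; that identity is a textbook result brought in here for illustration, not something stated on the slide, and the function names are hypothetical.

```python
from math import comb, expm1, log1p, isclose


def q_direct(p, n_c):
    """Sum of the binomial terms P_i(Nc) over odd i."""
    return sum(comb(n_c, i) * p ** i * (1 - p) ** (n_c - i)
               for i in range(1, n_c + 1, 2))


def q_closed(p, n_c):
    """Closed form (1 - (1 - 2p)**Nc) / 2, assuming 0 <= p < 0.5.
    log1p/expm1 avoid rounding 1 - 2p to exactly 1.0 for tiny p."""
    return -expm1(n_c * log1p(-2.0 * p)) / 2.0


q_small = q_closed(1e-25, 10 ** 6)   # a million vulnerable cycles
```

The closed form avoids summing Nc/2 terms, which matters because vulnerability clocks can reach very large cycle counts.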
Spatial Expansion:
from a Bit to the Protection Domain (Word)
• ^SQ(k)
  • Probability of the set of bits S having k faulty bits inside (during Nc cycles)
[Figure: timeline of one-cycle periods (p) over Nc vulnerability cycles giving q(Nc) per bit; bits grouped into bytes (qb(k)) and bytes into the protection domain S, a word (^SQ(k))]
• Choose cases where there are k faulty bits in S
  • S has [S] bits inside
  • Assumes that all the bits in the word have the same VC
  • Otherwise, discrete convolution should be used
$$ {}^{S}Q(k) = \binom{[S]}{k} q(N_c)^{k} (1-q(N_c))^{[S]-k}, \quad k = 0, \ldots, [S] $$
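Under the stated assumption that every bit in the domain shares the same vulnerability clock, the distribution of k faulty bits is a plain binomial. A minimal sketch follows; the function name `domain_fault_pmf` and the value q = 1e-3 are illustrative only.

```python
from math import comb


def domain_fault_pmf(q, n_bits):
    """Probability of k faulty bits, k = 0..n_bits, when each bit is
    independently faulty with the same probability q."""
    return [comb(n_bits, k) * q ** k * (1 - q) ** (n_bits - k)
            for k in range(n_bits + 1)]


word = domain_fault_pmf(q=1e-3, n_bits=32)   # 32-bit word, illustrative q
```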
Faults in the Access Domain (Block)
• ^DQ(k)
  • Probability of k faulty bits in any protection domain inside of D (∪ Sm)
[Figure: access domain D (a block) made up of protection domains S (words), each made up of bytes with per-byte distribution qb(k) and per-bit probability q(Nc)]
• Choose cases where there are k faulty bits in each Sm
• Sum for all Sm in D
$$ {}^{D}Q(k) = \sum_{m=1}^{M} {}^{S}Q_m(k) $$
• So far, the masking effect has not been considered
  • Only the expected number of intrinsic faults/errors has been calculated so far
Considering Masking Effect:
Separating TRUE from Intrinsic Faults
• If all faults occur in unconsumed bits, then don’t care (FALSE events)
• TRUE faults = {All faults in S} – {All faults in unconsumed bits}
$$ {}^{S}Q(k) - {}^{C}Q(0) \cdot {}^{\bar{C}}Q(k) $$
[Figure: bits of a protection domain; grey-colored: consumed bits (C); white-colored: unconsumed bits (C̄)]
• ^CQ(0)·^C̄Q(k): probability that C̄ has k faults and C has 0 faults: FALSE or masked faults
• Deduct the probability that ALL k faulty bits are in the unconsumed bytes from the probability that the protection domain S has k faulty bits to obtain the probability of TRUE faults, which become SDCs or TRUE DUEs
• C and C̄ are obtained through simulations
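The subtraction of masked faults can be sketched numerically. The example assumes, for simplicity, that every bit has the same fault probability q (the same-VC assumption); the function names and arguments are hypothetical, and in PARMA the consumed and unconsumed sets come from simulation.

```python
from math import comb


def binom_pmf(k, n, q):
    """Probability of exactly k faulty bits among n independent bits."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * q ** k * (1 - q) ** (n - k)


def true_fault_probability(k, q, n_consumed, n_unconsumed):
    """P(k faults in S) minus P(all k faults fall in unconsumed bits)."""
    all_faults = binom_pmf(k, n_consumed + n_unconsumed, q)
    masked = binom_pmf(0, n_consumed, q) * binom_pmf(k, n_unconsumed, q)
    return all_faults - masked
```

Two limiting cases serve as a sanity check: if nothing is consumed, every fault is masked and the result is zero; if everything is consumed, nothing is masked and the result equals the intrinsic probability.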
Using PARMA to Measure Errors in Block
Protected by block-level SECDED
$$ {}^{B}E_{SB,SDC} = \sum_{k=3}^{8N_B} {}^{B}Q(k) - \sum_{i=3}^{8N_{\bar{C}}} {}^{C}Q(0) \cdot {}^{\bar{C}}Q(i) $$
(first sum: >=3 faults in block, k>=3 is SDC; second sum: all faulty bits unconsumed)
• Undetected error that affects reliability (SDC): three or more faulty bits in the block; at least one faulty bit in the consumed bits
$$ {}^{B}E_{SB,TRUE\_DUE} = {}^{B}Q(2) - {}^{C}Q(0) \cdot {}^{\bar{C}}Q(2) $$
(first term: 2 faults in block, k=2 is DUE; second term: all faulty bits unconsumed)
• Detected error that affects reliability (TRUE DUE): exactly two faulty bits in the block; at least one faulty bit in the consumed bits
See paper for how to apply PARMA to the different protection schemes
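The two block-level SECDED expressions can be evaluated with a short sketch. It assumes a single fault probability q for all bits in the block and takes the consumed/unconsumed bit counts as inputs; the function names are hypothetical and this is not the PARMA implementation.

```python
from math import comb


def binom_pmf(k, n, q):
    """Probability of exactly k faulty bits among n independent bits."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * q ** k * (1 - q) ** (n - k)


def secded_block_errors(q, n_consumed, n_unconsumed):
    """Return (SDC, TRUE DUE) probabilities for one block access."""
    n = n_consumed + n_unconsumed
    c_clean = binom_pmf(0, n_consumed, q)   # no fault in consumed bits
    sdc = (sum(binom_pmf(k, n, q) for k in range(3, n + 1))
           - c_clean * sum(binom_pmf(i, n_unconsumed, q)
                           for i in range(3, n_unconsumed + 1)))
    true_due = binom_pmf(2, n, q) - c_clean * binom_pmf(2, n_unconsumed, q)
    return sdc, true_due


sdc_all, due_all = secded_block_errors(0.1, 8, 0)     # everything consumed
sdc_none, due_none = secded_block_errors(0.1, 0, 8)   # nothing consumed
```

The small block size and large q are chosen only so the limiting cases are easy to verify by hand.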
Four Contributions
Modeling
1. Development of the rigorous analytical model called PARMA
Application
2. Measuring SERs on structures protected by various schemes
3. Observing possible distortions in the accelerated studies
  • Quantitatively
  • Qualitatively
4. Verifying approximate models
Measuring SERs on Structures
Protected by Various Schemes
• Target Failures-In-Time of IBM Power6
  • SDC: 114
  • DUE: 4,566
• Average L2 (256KB, 32B block) cache FITs:

  Scheme              SDC        DUE (TRUE+FALSE)  Latency [cycles]  Checkbits per 256 bits
  No Protection       155.66     N/A               10                0
  1-bit Odd Parity    2.53E-15   372.83            10                1
  Block-level SECDED  8.34E-31   7.04E-15          14                10
  Word-level SECDED   2.92E-33   6.32E-16          13                56

  • Results were verified with AVF simulations
  • 100M-instruction SimPoint simulations of 18 benchmarks from SPEC2K, on sim-outorder
• Implies word-level SECDED might be overkill in most cases
• Implies increasing the protection domain size: ex) CPPC @ISCA2011
• Partially protected caches or caches with adaptive protection schemes need to be carefully quantified for their FITs
  • PARMA provides a comprehensive framework that can measure the effectiveness of such schemes
Observing Possible Distortions
in the Accelerated Tests
• Highly accelerated tests
  • SPEC2K benchmarks end in several minutes (wall-clock time)
  • Need to accelerate the SEU rate 10^17 times to see a reasonable number of faults
• How to scale down the results?
  • Results multiplied by 10^-17?
  • Can distort results quantitatively
• SDC > DUE?
  • Cases with more than two errors overwhelm the cases with two errors
  • Can be misleading qualitatively
[Figure: FIT for ammp (SDC, DUE and MAX possible errors) versus SEU rate, both axes on log scales; SEU rate from 1.E+16 to 1.E+22, FIT from 1.E+03 to 1.E+21]
Results were verified with fault-injection simulations
Verifying Approximate Models
• Example: model for word level SECDED protected cache
  • Methods for determining cache scrubbing rates [Mukherjee’04][Saleh’90]
  • Ignoring cleaning effects at accesses: overestimate by how much?
• New model with geometric distribution of Bernoulli trials
  • Assumption: at most two bits are flipped between two accesses to the same word
  • Every access results in a detected error or in no-error (corrected)
  • PDUE: pmf of two Poisson arrivals
  • TAVG: average access interval between two accesses to the same word
$$ MTTF_{SECDED\text{-}Word} = \frac{1}{P_{DUE}} \times T_{AVG} \; [\text{sec}] \quad \text{(mean of geometric distribution)} $$
[Figure: timeline for word #1 with ACE and unACE intervals. <1> Read word #1: ECC is activated, removing an existing single faulty bit. <2> Write to word #1: the word is updated, removing any faulty bit. The MTTF for having a 2nd faulty bit in the same word is extended due to <1> and <2>.]
• FIT
  AVF x FIT from previous method   2.1454 FIT
  New approximate model            2.8246E-14 FIT
  PARMA                            6.3170E-16 FIT
• PARMA provides rigorous reliability measurements
  • Hence, it is useful to verify the faster, simpler approximate models
Conclusion and Future Work
• Summary
  + PARMA is a rigorous model measuring Soft Error Rates in architectures
  + PARMA works with a wide range of SEU rates without distortion
  + PARMA handles temporal MBEs
  + PARMA quantifies SDC or DUE rates under various error detection/protection schemes
  - PARMA does not address spatial MBEs yet
  - PARMA does not model TAG yet
  - Due to the complexity, PARMA is slow
• Future Work
  • Extend PARMA to account for spatial MBEs and TAG vulnerability
  • Develop sampling methods to accelerate PARMA
THANK YOU!
QUESTIONS?
(Some) References
[Biswas’05] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. Mukherjee, and R. Rangan. Computing Architectural Vulnerability Factors for Address-Based Structures. In Proceedings of the 32nd International Symposium on Computer Architecture, 532-543, 2005.
[Li’05] X. Li, S. Adve, P. Bose, and J. A. Rivers. SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors. In Proceedings of the International Conference on Dependable Systems and Networks, 496-505, 2005.
[Mukherjee’03] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th International Symposium on Microarchitecture, 29-40, 2003.
[Mukherjee’04] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proceedings of the 10th IEEE Pacific Rim Symposium on Dependable Computing, 37-42, 2004.
[Saleh’90] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques for Memory Systems. In IEEE Transactions on Reliability, 39(1), 114-122, 1990.
[Ziegler] J. F. Ziegler and H. Puchner. SER – History, Trends and Challenges. Cypress Semiconductor Corp.
Addendum
Some Definitions
• SDC = Silent Data Corruption
• DUE = Detected and Unrecoverable Error
• SER = Soft Error Rate = SDC + DUE
• Errors are measured as
  • MTTF = Mean Time to Failure
  • FIT = Failure in Time; 1 FIT = 1 failure in a billion hours
  • 1 year MTTF = 1 billion/(24*365) = 114,155 FIT
• FIT is commonly used since FIT is additive
• Vulnerability Factor = fraction of faults that become errors
  • Also called derating factor or soft error sensitivity
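The FIT and MTTF definitions translate directly into two conversion helpers. The sketch below assumes a constant failure rate (the condition under which FIT and MTTF are interchangeable) and a 365-day year, as in the slide's arithmetic; the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365
BILLION_HOURS = 1e9


def mttf_years_to_fit(mttf_years):
    """FIT = failures per billion device-hours."""
    return BILLION_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit):
    """Inverse conversion, valid for a constant failure rate."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


one_year_fit = mttf_years_to_fit(1.0)
```

Applied to the IBM Power6 targets quoted earlier in the deck, 114 FIT corresponds to roughly 1,000 years of MTTF and 4,566 FIT to roughly 25 years.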
Soft Errors and Technology Scaling
• Hazucha & Svensson model
$$ \text{Circuit\_SER} = \text{Const} \times \text{Flux} \times \text{Area} \times e^{-Q_{crit}/Q_{coll}} $$
• For a specific size of SRAM array:
  • Flux depends on altitude & geomagnetic shielding (environmental factor)
  • (Bit) Area is process technology dependent (technology factor)
  • Qcoll is the charge collection efficiency, technology dependent
  • Qcrit ∝ Cnode * Vdd
• According to scaling rules, both C and V decrease and hence Qcrit decreases rapidly
• Static power saving techniques (on caches) with drowsy mode or using near-/sub-threshold Vdd make cells more vulnerable to soft errors
Hazucha et al, “Impact of CMOS technology scaling on the atmospheric neutron soft error rate”
Error Classification
• Silent Data Corruption (SDC)
• TRUE- and FALSE- Detected Unrecoverable Error (DUE)
[Figure: error classification diagram with two “Consumed?” decision points]
C. Weaver et al, “Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor,” ISCA 2004
Soft Error Rate (SER)
• Intrinsic SER – more from the component’s view
  • Assumes all bits are important all the time
  • Intrinsic SER projections from ITRS2007 (High Performance model)

  Year of production                          2010     2013     2016     2019     2022
  Feature size [nm]                           45       35       25       18       13
  Gate length [nm]                            18       13       9        6        4.5
  Soft Error Rate [FIT per Mb]                1200     1250     1300     1350     1400
  Failure Rate in 1Mb [fails/hour]            1.2E-6   1.25E-6  1.3E-6   1.35E-6  1.4E-6
  % Multi-Bit Upsets in Single Event Upsets   32%      64%      100%     100%     100%

  • Intrinsic SER of caches protected by SECDED code?
    • Cleaning effect on every access
• Realistic SER – more from the system’s view
  • Some soft errors are masked and do not cause system failure
  • EX) AVF x Intrinsic SER: what about caches with protection code?
Soft Error Estimation Methodologies: Industries
• Field analysis
  • Statistically analyzes reported soft errors in market products
  • Uses repair records and sales of replacement parts
  • Provides obsolete data
• Life testing
  • Tester constantly cycles through 1,000 chips looking for errors
  • Takes around six months
  • Expensive, not fast enough for the chip design process
  • Usually used to confirm the accuracy of accelerated testing (x2 rule)
• Accelerated testing
  • Chips are placed under various particle beams, under a well-defined test protocol
    • Terrestrial neutrons – particle accelerators (protons)
    • Thermal neutrons – nuclear reactors
    • Radioactive contamination – radioactive materials
• Hardship
  • Data rarely published: potential liability problems of products
  • Comparisons of accelerated testing vs life testing are even rarer
    • IBM and Cypress published a small amount of data showing correlation
J. F. Ziegler and H. Puchner, “SER – History, Trends and Challenges,” Cypress Semiconductor Corp
Soft Error Estimation Methodologies:
Common Approaches in Research
• Fault-injection
  • Generate artificial faults based on the fault model
  + Applicable to a wide range of design levels (from RTL to system simulations)
  - A massive number of simulations is necessary to be statistically valid
    • A highly accelerated Single Event Upset (SEU) rate is required for soft errors
    • How to scale down the measurements to the ‘real environment’ is unclear
• Architectural Vulnerability Factor
  • Find the derating factor (Faults → Errors) by {ACE bits}/{total bits} per cycle
• SoftArch
  • Extrapolate AVG(TTFs) from one program to MTTF using infinite executions
• AVF and SoftArch – use a simplified Poisson fault generation model
  + Work well with small-scale systems in the current technology at the earth’s surface: single-bit-error dominant environment
  - Can’t account for error protection/detection schemes (ECC)
  - Unable to address temporal & spatial MBEs
• AVF is NOT an absolute metric for reliability
  • FITstructure = intrinsic_FITstructure * AVFstructure
M. Li et al, “Accurate Microarchitecture-Level Fault Modeling for Studying Hardware Faults,” HPCA 2009
S. S. Mukherjee et al, “A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor.” MICRO 2003
X. Li et al, “SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors.” DSN 2005
Evaluating Soft Errors:
Some Reliability Benchmarking Approaches
• Intrinsic FIT (Failure-in-Time) rate – highly pessimistic
  • Every bit is vulnerable in every cycle
  • Unclear how to compute intrinsic FIT rates for protected caches
• Architectural Vulnerability Factor [Mukherjee’03]
  • Lifetime analysis on Architecturally Correct Execution bits
  • De-rating factor (Faults → Errors); realistic FIT = AVF x Intrinsic FIT
• SoftArch [Li’05]
  • Computes TTF for one program run and extrapolates to MTTF
• AVF and SoftArch
  + Quickly compute SDC with no parity or DUE under parity
  - Ignore temporal MBEs
    • Two SEUs on one word become two faults instead of one fault
    • Two SEUs on the same bit become two faults instead of zero faults
  - Ignore spatial MBEs
  - Can’t account for error detection / correction schemes
To compare SERs of various error correcting schemes:
• Temporal/spatial MBEs must be accurately counted
Prior State of the Art Reliability Model: AVF
• Architectural Vulnerability Factor (AVF)
  • AVFbit = probability that a bit matters (for a user-visible error)
    = number of bits that affect the user-visible outcome / total number of bits
  • If we assume AVF = 100%, we will over-design the system
  • Need to estimate AVF to optimize the system design for reliability
  • AVF equation for a target structure
$$ AVF_{structure} = \frac{\sum_{i=0}^{N} (\text{bitwise AVF})_i}{N} = \frac{\sum_{i=0}^{N} \text{ACEcycles}_i}{N \times \text{total\_cycles}} = \frac{\text{Average number of ACE bits in a structure in a cycle}}{\text{Total number of bits in a structure}} \quad \ldots\ldots \text{(Eq. 1)} $$
• AVF is NOT an absolute metric for reliability
  • FITstructure = intrinsic_FITstructure * AVFstructure
Shubu Mukherjee, “Architecture design for soft errors”
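Eq. 1 and the derating relation FITstructure = intrinsic_FITstructure * AVFstructure can be illustrated with a few lines. The per-bit ACE cycle counts below are made-up toy values, and the function names are hypothetical.

```python
def structure_avf(ace_cycles_per_bit, total_cycles):
    """Average fraction of bit-cycles that are ACE, per Eq. 1."""
    n_bits = len(ace_cycles_per_bit)
    return sum(ace_cycles_per_bit) / (n_bits * total_cycles)


def derated_fit(intrinsic_fit, avf):
    """FIT of the structure after applying the AVF derating factor."""
    return intrinsic_fit * avf


# Toy example: 4 bits observed over 100 cycles.
avf = structure_avf([50, 100, 0, 50], total_cycles=100)
fit = derated_fit(intrinsic_fit=1200.0, avf=avf)
```

The example makes the limitation visible: the result is a single scaling factor, so it carries no information about how many faults coincide in one word.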
ACEness of a bit
• ACE (Architecturally Correct Execution) bit
  • An ACE bit affects program outcome: correctness is subjective (user-visible)
  • Microarchitectural ACE bits
    • Invisible to the programmer, but affect program outcome
    • Easier to consider Un-ACE bits
      – Idle/Invalid/Misspeculated state
      – Predictor structures
      – Ex-ACE state (architecturally dead or invisible states)
  • Architectural ACE bits
    • Visible to the programmer
    • Transitive (an ACE bit in the word makes the Load instruction ACE)
    • Easier to consider Un-ACE bits
      – NOP instructions
      – Performance-enhancing operations (non-opcode field of non-binding prefetch, branch prediction hint)
      – Predicated false instructions (except predicate status bit)
      – Dynamically dead instructions
      – Logical masking
• AVF framework = lifetime analysis to correctly find ACEness of bits in the target structure for every operating cycle
Shubu Mukherjee, “Architecture design for soft errors”
Rigorous Failure/Error Rate Modeling
• Existing methodologies, such as AVF multiplied by the intrinsic rate
  • Estimation is simple and easy
  • Imprecise, but a safe overestimate
• Downside of the classical approach (i.e. AVF-based methodology)
  • An SEU is a very rare event while program execution time is rather short
    • In a 3GHz processor, the SEU rate is 1.0155E-25 within one cycle for one bit
    • Equivalently, the probability of a bit being hit by an SEU and becoming faulty is 1.0155E-25
  • Simplifying assumption that one SEU directly results in one fault/error, whereas
    • the same bit may be hit multiple times, and/or
    • multiple bits may become faulty in a word
• In space, or when extremely low Vdd is supplied to SRAM cells:
  • SEU rate could rise sharply (more than 10E6 times)
  • Second-order effects become significant
• With data protection methodology:
  • How to measure vulnerability is uncertain due to the simplifying assumption
Reliability Theory (1)
• Fundamental definition of probability in Reliability Theory
  • Number(Event)/Number(Trials): approximation of the true Prob(Event)
  • The true probability is rarely known
  • approx → true when trials → ∞, by the Law of Large Numbers
• Two events in R-T: Survival & Failure of a component/system
• Reliability Functions
  • (Component/system) Reliability R(t), and Probability of Failure Q(t)
$$ R(t) = \frac{N_s}{N_s + N_f}, \quad Q(t) = \frac{N_f}{N_s + N_f}, \quad R(t) + Q(t) = 1 $$
  • Prob(Event) up to and at time t: conditional probability
  • Note that R(t), Q(t) are time dependent in general
  • (Conditional) Instantaneous Failure Rate λ(t), a.k.a. Hazard function h(t)
$$ \lambda(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt} $$
Reliability Theory (2)
• Reliability functions (cont’d)
  • (Unconditional) Failure Density Function f(t)
$$ f(t) = -\frac{dR(t)}{dt}, \quad \lambda(t) = \frac{f(t)}{R(t)} $$
  • Average Failure Rate from time 0 to T
$$ AFR(0,T) = \frac{\int_{0}^{T} \lambda(t)\,dt}{T} $$
  • Discrete dual of λ(t): Hazard Probability Mass Function h(j)
$$ h(j) = Prob(t = j \mid t \ge j) = Prob(\text{fail at } j \mid \text{survived until } j-1) $$
  • Average Failure Rate from timeslot 0 to T
$$ AFR(0,T) = \frac{\sum_{j=1}^{T} h(j)}{T} $$
Reliability Theory (3)
• How to measure Reliability
  • R(t) itself
    • Events with constant failure rate
$$ R(T) = \exp\left(-\int_{0}^{T} \lambda(t)\,dt\right) $$
  • MTTF
$$ MTTF = \int_{0}^{\infty} t \cdot f(t)\,dt $$
    • Sampling issue: usually no test can aggregate total test time to ∞
    • (Right) censorship with no replacement, then Maximum Likelihood Estimation
      – by B. Epstein, 1954
      – At the end of the test time tr, measure TTFs (ti) for samples that failed and truncate the lifetime of all surviving samples to tr
      – Then, the MLE of MTTF is
$$ \hat{m} = \frac{\sum_{i=1}^{r} t_i + (n-r)\,t_r}{r}, \quad \text{where } n: \text{number of total items}, \; r: \text{number of failed items} $$
  • FIT – one intuitive form of failure rate
    • Failures in 1E9 hours
    • Interchangeable with MTTF only when the failure rate is constant
    • Additive between independent components
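The censored maximum likelihood estimate of MTTF is a one-line computation. A minimal sketch follows; the function name and the sample numbers are illustrative only.

```python
def mttf_mle(failure_times, n_items, t_end):
    """MLE of MTTF under right censoring without replacement:
    failed items contribute their time to failure, surviving items
    contribute the full test time t_end."""
    r = len(failure_times)
    if r == 0:
        raise ValueError("at least one failure is needed for the estimate")
    survivors = n_items - r
    return (sum(failure_times) + survivors * t_end) / r


# 5 items tested for 40 hours; 3 failed at 10, 20 and 30 hours.
estimate = mttf_mle([10.0, 20.0, 30.0], n_items=5, t_end=40.0)
```

The division by r rather than n is what makes the estimate undefined when no failure is observed, which is the usual situation for soft errors at natural rates.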
Vulnerability Clock
• Used to track cycles that any bit spends in the vulnerable component: L2$
  • Ticks when a bit resides in L2
  • Stops when a bit stays outside L2
[Figure: state diagram of the vulnerability clocks VC_L1, VC_L2 and VC_MEM for a block as it moves among the L1 caches (resistant to SEUs), the L2 cache (vulnerable to SEUs) and memory (resistant to SEUs), from START (first memory block access) to END (program ends). Labeled events include: cold miss, block fetched to L2, block fetched to L1 by an L1 miss, store or consume on L1, L1 block replaced or discarded, L1 block writeback with the bit untouched, L1 block writeback with the bit updated or consumed, L2 block replaced and written back to memory, and the PARMA calculation for SDC/TRUE DUE. Clock actions shown include ticks, stops, updates and resets, with the assignments VC_L1 := VC_L2, VC_L2 := VC_MEM and VC_MEM := VC_L2.]
PARMA Model:
Measuring Soft Error FIT with PARMA
• PARMA measures failure rate by accumulating failure probability mass
  • Index processor cycle by j (1 ≤ j ≤ Texe)
  • Total failures observed during Texe (failure rate):
    • Equivalent to expected number of failures of type ERR
$$ H_{ERR}(T_{exe}) = \sum_{j=1}^{T_{exe}} h_{ERR}(j) \times 1 = E[ERR] $$
  • FIT extrapolation with infinite program execution assumption
$$ FIT_{ERR} = \frac{E[ERR]}{T_{exe} \times CyclePeriod} \times 3600 \times 10^{9} $$
• How to calculate hERR(j)?
  • Let’s start with p: the probability that one bit is flipped during one cycle period
  • Obtained from the Poisson SEU model
PARMA Model:
Fault Generation Model
• SEU Model
  • Assumptions:
    • All clock cycles are independent with respect to SEUs
    • All bits are independent with respect to SEUs (does not account for spatial MBEs)
  • Widely accepted model for SEU: Poisson model
  • p : probability that one bit is flipped during one cycle period (in SBE cases)
    • Spatial MBE case: probability that multiple bits become faulty during one cycle
  • Poisson probability mass function gives p
    • λ: Poisson rate of SEUs, ex) 10^-25/bit @ 65nm 3GHz CPU
$$ p = \sum_{\text{odd } j} \frac{\lambda^{j}}{j!} e^{-\lambda} $$
PARMA Model:
Measuring Soft Error FIT with PARMA
• PARMA measures failure rate by accumulating failure probability mass
  • Index processor cycle by j (1 ≤ j ≤ Texe)
  • A (conditional) failure probability mass at cycle j:
$$ h_{ERR}(j) = \Pr(\text{Type ERR failure at } j \mid \text{survived all types of faults until } j) $$
  • Total failures observed during Texe (failure rate):
    • Equivalent to expected number of failures of type ERR
$$ H_{ERR}(T_{exe}) = \sum_{j=1}^{T_{exe}} h_{ERR}(j) \times 1 = E[ERR] $$
  • FIT extrapolation with infinite program execution assumption
$$ FIT_{ERR} = \frac{E[ERR]}{T_{exe} \times CyclePeriod} \times 3600 \times 10^{9} $$
  • Average FIT with multiple programs
$$ FIT_{ERR} = \sum_{i \in \text{benchmarks}} f_i \times FIT_{i,ERR} $$
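The FIT extrapolation and the multi-program average can be written out as follows. This sketch assumes the cycle period is given in seconds and treats the f_i as caller-supplied weights; the slide does not define f_i further, so that reading is an assumption, as are the function names.

```python
def fit_from_expected_errors(expected_errors, t_exe_cycles, cycle_period_s):
    """FIT = E[ERR] / (Texe * CyclePeriod) * 3600 * 1e9, i.e. expected
    errors per hour of execution, scaled to a billion hours."""
    exec_seconds = t_exe_cycles * cycle_period_s
    return expected_errors / exec_seconds * 3600.0 * 1e9


def weighted_average_fit(weights, fits):
    """Sum of f_i * FIT_i over benchmarks (weights assumed by caller)."""
    if len(weights) != len(fits):
        raise ValueError("weights and fits must have the same length")
    return sum(w * f for w, f in zip(weights, fits))


# 3.6e12 cycles at 1 ns per cycle is exactly one hour of execution.
fit = fit_from_expected_errors(2e-9, t_exe_cycles=3.6e12, cycle_period_s=1e-9)
avg = weighted_average_fit([0.5, 0.5], [2.0, 4.0])
```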
Failures Measured in PARMA
• No-protection, 1-bit Parity, 1-bit ECC on Word and 1-bit ECC on Block

  Scheme, failure type              Access domain            Protection domain  Faulty bits                     Notation
  No parity, SDC                    Block                    N/A                ≥1 in C                         ^B E_{NP,SDC}
  1-bit Parity, TRUE DUE            Block                    Block              ∀odd in S, ≥1 in C              ^B E_{P1B,TRUE_DUE}
  1-bit Parity, SDC                 Block                    Block              ∀even >0 in S, ≥1 in C          ^B E_{P1B,SDC}
  1-bit ECC word-level, TRUE DUE    Blk containing M words   Word               2 in any Sm, ≥1 in that Cm      ^W E_{SW,TRUE_DUE}
  1-bit ECC block-level, TRUE DUE   Block                    Block              2 in S, ≥1 in C                 ^B E_{SB,TRUE_DUE}
  1-bit ECC word-level, SDC         Blk containing M words   Word               ≥3 in any Sm, ≥1 in that Cm     ^W E_{SW,SDC}
  1-bit ECC block-level, SDC        Block                    Block              ≥3 in S, ≥1 in C                ^B E_{SB,SDC}
Spatial Expansion:
From a Bit to a Byte in Nc Vulnerability Cycles
• qb(k)
  • Probability of a byte having k faulty bits (in Nc vulnerability cycles)
[Figure: timeline of one-cycle periods (p) over Nc vulnerability cycles giving q(Nc) per bit; eight bits form one byte with distribution qb(k)]
• From the 8 bits in the byte, choose k faulty bits
$$ q_b(k) = \binom{8}{k} q(N_c)^{k} (1-q(N_c))^{8-k}, \quad k = 0, \ldots, 8 $$
Spatial Expansion:
from a Byte to the Protection Domain (Word)
• ^SQ(k)
  • Probability of the set of bits S having k faulty bits inside (during Nc cycles)
[Figure: protection domain S (a word) made up of bytes, each with per-byte distribution qb(k) and per-bit probability q(Nc)]
• Choose cases where there are k faulty bits in S
  • Enumerate all possibilities of faulty bits in the bytes of S such that their total number = k
$$ {}^{S}Q(k) = \sum_{\sum_j l_j = k} \; \prod_{j \in S} q_{b,j}(l_j) $$
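Enumerating all per-byte fault counts that total k is a discrete convolution of the per-byte distributions. The sketch below shows this with hypothetical helper names; each byte may have its own per-bit fault probability, reflecting different vulnerability clocks.

```python
from math import comb


def byte_pmf(q):
    """qb(k) for one byte whose 8 bits share fault probability q."""
    return [comb(8, k) * q ** k * (1 - q) ** (8 - k) for k in range(9)]


def convolve(a, b):
    """Discrete convolution of two probability mass functions."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def word_pmf(byte_qs):
    """Distribution of faulty bits in a word built from the given bytes."""
    pmf = [1.0]                      # zero bits: zero faults with certainty
    for q in byte_qs:
        pmf = convolve(pmf, byte_pmf(q))
    return pmf


same_vc = word_pmf([0.01] * 4)               # reduces to a 32-bit binomial
mixed_vc = word_pmf([0.01, 0.02, 0.03, 0.04])
```

When all bytes share the same q, the result collapses to the single binomial used for a word with uniform vulnerability clocks.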
Faults in the Access Domain (Block)
• ^DQ(k)
  • Probability of k faulty bits in any protection domain inside of D (∪ Sm)
[Figure: access domain D (a block) made up of protection domains S (words), each made up of bytes with per-byte distribution qb(k) and per-bit probability q(Nc)]
• Choose cases where there are k faulty bits in Sm
• Sum for all Sm in D
$$ {}^{D}Q(k) = \sum_{m=1}^{M} {}^{S}Q_m(k) = \sum_{m=1}^{M} \; \sum_{\sum_j l_j = k} \; \prod_{j \in S_m} q_{b,j}(l_j) $$
• So far, the masking effect has not been considered
  • Only the expected number of intrinsic faults/errors has been calculated so far
PARMA Model:
Failures Measured in PARMA (1)
• Unprotected cache
$$ {}^{B}E_{NP,SDC} = \sum_{k=1}^{8N_B} {}^{B}Q(k) - \sum_{i=1}^{8N_{\bar{C}}} {}^{C}Q(0) \cdot {}^{\bar{C}}Q(i) $$
  (second sum: all faults unconsumed)
  • Without protection, any nonzero number of faulty bits will cause an SDC failure
  • SDCs: having at least one faulty bit in the consumed bits
• Odd parity per block
$$ {}^{B}E_{P1B,SDC} = \sum_{\text{even } k > 0}^{8N_B} {}^{B}Q(k) - \sum_{\text{even } i > 0}^{8N_{\bar{C}}} {}^{C}Q(0) \cdot {}^{\bar{C}}Q(i) $$
  • A nonzero, even number k of faulty bits in the block is an SDC
  • SDCs: having at least one faulty bit in the consumed bytes, from having a nonzero, even number of faulty bits in the block
$$ {}^{B}E_{P1B,TRUE\_DUE} = \sum_{\text{odd } k}^{8N_B} {}^{B}Q(k) - \sum_{\text{odd } i}^{8N_{\bar{C}}} {}^{C}Q(0) \cdot {}^{\bar{C}}Q(i) $$
  • TRUE DUEs: having at least one faulty bit in the consumed bytes, from having an odd number of faulty bits in the block
PARMA Model:
Failures Measured in PARMA (2)
• SECDED per block
$$ {}^{B}E_{SB,SDC} = \sum_{k=3}^{8N_B} {}^{B}Q(k) - \sum_{i=3}^{8N_{\bar{C}}} {}^{C}Q(0) \cdot {}^{\bar{C}}Q(i) $$
  (first sum: >=3 faults in block, k>=3 is SDC; second sum: all faults unconsumed)
$$ {}^{B}E_{SB,TRUE\_DUE} = {}^{B}Q(2) - {}^{C}Q(0) \cdot {}^{\bar{C}}Q(2) $$
  • SDCs: having at least one faulty bit in the consumed bits, from having more than two faulty bits in the block
  • TRUE DUEs: having at least one faulty bit in the consumed bits, from having exactly two faulty bits in the block
• SECDED per word
$$ {}^{B}E_{SW,SDC} = \sum_{m=1}^{M} {}^{W}E_{m}^{SW,SDC} = \sum_{m=1}^{M} \left[ \sum_{k=3}^{8N_W} {}^{W}Q_m(k) - \sum_{i=3}^{8N_{\bar{C}_m}} {}^{C_m}Q_m(0) \cdot {}^{\bar{C}_m}Q_m(i) \right] $$
$$ {}^{B}E_{SW,TRUE\_DUE} = \sum_{m=1}^{M} {}^{W}E_{m}^{SW,TRUE\_DUE} = \sum_{m=1}^{M} \left[ {}^{W}Q_m(2) - {}^{C_m}Q_m(0) \cdot {}^{\bar{C}_m}Q_m(2) \right] $$
  (sums over m: for all the words in the block)
  • Additive because the FIT from each word is independent and counted separately
  • Same as the ‘per block’ case except that the protection domain is a word
  • Because the access domain is a block, all the words in the same block are addressed by adding FITs
PARMA Simulations
• Target processor
  • 4-wide OoO processor
  • 64-entry ROB
  • 32-entry LSQ
  • McFarling’s hybrid branch predictor
• Cache configuration

  Cache          Size    Associativity  Latency [cyc]
  IL1: 32B BLK   16KB    1-way          2
  DL1: 32B BLK   16KB    4-way          3
  UL2: 32B BLK   256KB   8-way          NP/P1b: 10, SW(4B): 13, SB(64B): 14

• sim-outorder was modified and executed with the Alpha ISA
• 18 benchmarks from SPEC2000 were used with SimPoint sampling of 100M-instruction samples
Evaluating Soft Errors:
AVF or Fault-Injection, Why Not?
• AVF fails to handle scenarios under error protection schemes
• Why not use fault injection for such scenarios?
  • Possible distortion in the interpretation of results due to the highly accelerated experiments
Simulations with PARMA: Results in FIT (1)
(a) NP_SDC: no-protection/SDC (≈ AVF_SDC); P1B_TRUE_DUE: odd parity/TRUE DUE
(b) P1B_FALSE_DUE: odd parity/FALSE DUE
(c) P1B_SDC: odd parity/SDC
(d) SB_TRUE_DUE: block-level SECDED/TRUE DUE
(e) SB_FALSE_DUE: block-level SECDED/FALSE DUE
(f) SB_SDC: block-level SECDED/SDC
(g) SW_TRUE_DUE: word-level SECDED/TRUE DUE
(h) SW_FALSE_DUE: word-level SECDED/FALSE DUE
(i) SW_SDC: word-level SECDED/SDC

  Bench     (a)      (b)       (c)        (d)
  ammp      320.32   419.27    2.50E-14   2.53E-14
  art       48.76    16.74     1.22E-16   1.22E-16
  crafty    429.47   716.45    1.99E-14   2.34E-14
  eon       382.25   298.23    1.45E-14   1.63E-14
  facerec   98.08    0.59      9.80E-17   9.79E-17
  galgel    60.35    77.61     9.52E-17   9.52E-17
  gap       138.59   22.27     3.32E-16   3.94E-16
  gcc       349.11   229.96    3.76E-15   4.86E-15
  gzip      547.53   1115.56   1.05E-14   1.17E-14
  mcf       14.71    14.43     1.28E-17   1.28E-17
  mesa      460.52   112.19    4.50E-15   5.03E-15
  parser    138.54   380.34    1.86E-15   2.00E-15
  perlbmk   100.82   315.37    2.74E-15   2.97E-15
  sixtrack  76.24    7.92      3.75E-16   3.91E-16
  twolf     193.40   419.25    1.52E-15   1.56E-15
  vortex    831.26   324.74    8.57E-15   9.63E-15
  vpr       184.31   369.25    2.16E-15   2.18E-15
  wupwise   146.69   130.00    1.99E-15   2.02E-15
  Average   155.66   217.17    2.53E-15   3.45E-15

  Bench     (e)        (f)        (g)        (h)        (i)
  ammp      1.53E-14   1.32E-29   1.99E-15   2.92E-15   1.02E-31
  art       3.70E-17   3.89E-34   1.03E-17   9.01E-18   3.21E-36
  crafty    4.74E-14   4.24E-30   1.85E-15   6.41E-15   2.83E-32
  eon       6.03E-15   2.72E-30   1.69E-15   9.77E-16   3.57E-32
  facerec   1.37E-18   8.88E-35   1.18E-17   2.45E-19   1.22E-36
  galgel    8.53E-17   3.54E-34   8.15E-18   1.38E-17   2.72E-36
  gap       5.26E-17   2.34E-33   4.30E-17   8.80E-18   2.59E-35
  gcc       3.25E-15   1.89E-31   4.65E-16   4.68E-16   1.86E-33
  gzip      9.56E-15   2.86E-31   1.30E-15   1.24E-15   3.67E-33
  mcf       3.23E-17   2.93E-35   1.13E-18   4.35E-18   1.89E-37
  mesa      8.57E-16   4.01E-32   5.44E-16   1.52E-16   4.73E-34
  parser    4.12E-15   4.59E-32   1.47E-16   5.82E-16   2.97E-34
  perlbmk   8.85E-15   6.96E-32   2.36E-16   1.17E-15   4.46E-34
  sixtrack  1.39E-16   9.70E-33   4.26E-17   2.14E-17   1.12E-34
  twolf     2.88E-15   1.45E-32   1.24E-16   4.11E-16   1.06E-34
  vortex    2.80E-15   1.70E-31   1.04E-15   4.31E-16   1.93E-33
  vpr       3.27E-15   3.04E-32   1.83E-16   4.79E-16   2.45E-34
  wupwise   7.28E-16   1.75E-32   1.63E-16   1.70E-16   1.42E-34
  Average   3.59E-15   8.34E-31   2.25E-16   4.07E-16   2.92E-33
PARMA Application: a Gold-Standard for Developing
New Approximate Model (3)
• Results

  Name      AVF       AVFxFIT from      FIT from new        FIT from PARMA
                      previous method   approximate model
  ammp      40.977%   2.9374            8.4182E-14          4.9114E-15
  art       2.849%    0.2042            4.5179E-16          1.93476E-17
  crafty    61.078%   4.3784            3.3463E-14          8.25685E-15
  eon       99.049%   7.1003            1.3441E-13          2.67121E-15
  facerec   4.319%    0.3096            1.3138E-15          1.20444E-17
  galgel    6.010%    0.4308            7.0577E-16          2.19513E-17
  gap       7.118%    0.5103            4.5248E-16          5.18293E-17
  gcc       27.658%   1.9827            7.7612E-15          9.33375E-16
  gzip      83.466%   5.9832            6.9763E-14          2.53328E-15
  mcf       1.267%    0.0908            2.9364E-16          5.47892E-18
  mesa      30.070%   2.1555            1.6881E-14          6.96557E-16
  parser    22.983%   1.6475            1.8796E-14          7.28453E-16
  perlbmk   31.621%   2.2667            2.9209E-14          1.41053E-15
  sixtrack  3.916%    0.2807            5.9788E-16          6.39658E-17
  twolf     26.750%   1.9176            2.2392E-14          5.35443E-16
  vortex    53.171%   3.8115            3.6437E-14          1.4704E-15
  vpr       24.232%   1.7371            4.0074E-14          6.61992E-16
  wupwise   12.183%   0.8733            1.1242E-14          3.33145E-16
  Average   27.33%    2.1454            2.8246E-14          6.31707E-16

With PARMA, we can verify newly developed approximate models
Simulation with PARMA: Overhead
• Need to track the entire memory footprint
  • Vulnerability clock cycles for L1, L2 and Memory copies
• Data structure: Binary Search Tree
  • Quick search and insertion
  • Memory footprint never decreases
• Memory overhead: ~17 bytes for tracking 1 byte of memory footprint
• Computation overhead: O(n^3) with non-parallelized code
  • n : number of bits in the block
  • Probability calculation for having k specific faulty bits is O(n^2)
  • Need to know the probability distribution on k in [0, n]
• Overall ~25x slowdown in simulation time from base sim-outorder
  • Still much faster than doing a massive number of tests with fault injection
PARMA Application: a Gold-Standard for Developing
New Approximate Model (1)
• PARMA provides rigorous reliability measurements
  • Hence, it is useful to verify faster, simpler approximate models
• Example: model for word-level SECDED protected cache
  • Known methods for determining cache scrubbing rates
  • Model from previous work [Mukherjee’04][Saleh’90]
$$ MTTF = \frac{1}{L} \sqrt{\frac{\pi}{2M}}, \quad L: \text{SER for a word}, \; M: \text{number of words in memory} $$
  • Ignores cleaning effects at accesses
    – Okay for determining cache scrubbing rates because it overestimates
    – But by how much does it overestimate?
[Figure: timeline for word #1. <1> Read word #1: ECC is activated, removing an existing single faulty bit. <2> Write to word #1: the word is updated, removing any faulty bit. The MTTF for having a 2nd faulty bit in the same word is extended due to <1> and <2>.]
PARMA Application: a Gold-Standard for Developing
New Approximate Model (2)
• New model: model with Bernoulli attempts
  • Assumption: at most two bits are flipped between two accesses to the same word
  • Every access results in a detected error or in no error (corrected)
  • An interval between two accesses is a binomial event
[Figure: execution time Texe divided into ACE and unACE intervals, with averages ACE_AVG and unACE_AVG and the resulting Quantum_Binomial]
  • Expected number of attempts in the binomial process until success = MTTF_SECDED
  • Then:
$$ MTTF_{SECDED} = \frac{1}{P_{DUE}} \times Quantum_{Binomial} \; [\text{sec}] $$
  • PDUE = Poisson PMF (, , )
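The mean-of-geometric-distribution step can be sketched as below. The per-interval probability `p_due` and the interval length `quantum_s` are taken as inputs because the slide does not spell out the Poisson PMF arguments; the function names are hypothetical, and the FIT conversion assumes a constant failure rate.

```python
def mttf_secded_seconds(p_due, quantum_s):
    """Mean of a geometric distribution: on average 1/p_due intervals
    of length quantum_s elapse before the first detected error."""
    if not 0.0 < p_due <= 1.0:
        raise ValueError("p_due must be in (0, 1]")
    return quantum_s / p_due


def mttf_seconds_to_fit(mttf_s):
    """Failures per billion hours, assuming a constant failure rate."""
    return 1e9 * 3600.0 / mttf_s


mttf = mttf_secded_seconds(p_due=0.5, quantum_s=10.0)
fit = mttf_seconds_to_fit(3600.0)    # MTTF of one hour
```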
PARMA Application: a Gold-Standard for Developing
New Approximate Model (3)
• Word level SECDED average vulnerability, converted to FIT rate
  AVF x Intrinsic FIT from previous method   2.1454 FIT
  New approximate model                      2.8246E-14 FIT
  PARMA                                      6.3170E-16 FIT
With PARMA, we can verify newly developed approximate models
