Fault Injector - School of Computing

Report
Towards Formal Approaches
to System Resilience
Vishal Chandra Sharma*, Arvind Haran*,
Zvonimir Rakamaric*, Ganesh Gopalakrishnan*§
{vcsharma, haran, zvonimir, [email protected]
School of Computing, University of Utah
*Supported
in part by NSF Award CCF 1255776 and SRC contract 2013-TJ-2426.
§Faculty Associate, SUPER (http://super-scidac.org/)
Overview
• Introduction
• Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
2
Motivation
• Recent studies show resiliency as a growing area of
concern [arg13] [lanl05]
• MTBF decreasing at a faster rate in exascale computing
• Dynamic voltage/frequency scaling in low power
computing
• Our goal is to improve application-level resiliency
• Primary focus is to detect transient faults in a
software
 Silent data corruption (SDC)
3
Motivating Example
int x = 2;
int y = 11;
if (x < 3 && y > 10)
y++;
else
x++;
printf(“x=%d, y=%d” ,x ,y)
4
Motivating Example
int x = 2;
int y = 11;
if (x < 3 && y > 10)
y++;
else
x++;
printf(“x=%d, y=%d” ,x ,y)
Program output:
x=2, y=12
5
Motivating Example
int x = 3;
int y = 11;
LSB position of
x flipped
if (x < 3 && y > 10)
y++;
else
x++;
printf(“x=%d, y=%d” ,x ,y)
6
Motivating Example
int x = 3;
int y = 11;
LSB position of
x flipped
if (x < 3 && y > 10)
y++;
else
x++;
printf(“x=%d, y=%d” ,x ,y)
Program output:
x=4, y=11
SDC in the output
value of x
7
Our Contribution
• A LLVM-level fault Injector for evaluation purpose
[llvm04]
• A simple case study on sorting algorithms
• Demonstrates effectiveness of our solution
• Highlights importance of design space exploration w.r.t.
resiliency
• A software-level fault detector based on idea of
predicate abstraction
• Applying it in resiliency research is a novel direction!
• Introduced by Ball to define a novel program coverage
metrics [pct05]
8
Closely Related Work
• Low-cost software level detectors
 iSWAT by Sahoo et. al. uses likely program invariants [iswat08]
 Derives likely invariants by monitoring program properties
 Hardware-assisted framework to detect false positives
 Error detector by Sloan et.al. [sloan13]
 Algorithm based error detector applied to linear solvers
 Utilizes algorithmic properties of linear solvers to detect
and isolate errors
• Software-level fault injectors
 LLVM-level fault injector developed by Kuijif et. al. [relax10]
 Publicly unavailable
 A recent study done by a user suggests our fault injector
has better fine-grained options [schen13]
 LLFI fault injector by Thomas et. al. [thomas13]
 Developed around same time as our fault injector, shares
many similar features
9
Overview
• Introduction
• Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
10
Fault Injector
Kontrollable Utah’s LLVM based Fault Injector  KULFI
KULFI  Indian dessert 
11
KULFI: Fault Injection Logic
Start
Forall dynamic
instructions
Feasible?
No
Yes
Inject Fault with user
provided probability
Stop
12
KULFI: Fault Injection Process
Program
Clang
KULFI
LLVM bitcode
Program
Input Vectors
Execution
Outcome
LLVM
LLVM
KULFI
Dynamic
Instruction
Count
Fault
Injecting
LLVM bitcode
SDC
SegFault
Benign
13
Overview
• Introduction
• Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
14
Case Study
Sorting routines - Bubblesort, Quicksort, Mergesort, Radixsort,
Heapsort
1 Experiment
200 Fault Injection
Campaigns
1 Fault Injection
Campaign
100 Program Runs
Total number of fault injections = 200*100 = 20,000!
15
Case Study
• A dynamic instruction is chosen at random for fault
injection
• A single-bit fault in a random bit position of the
dynamic instruction’s register
• For each fault injection campaign, categorize outcome
into SDC, Benign, or Segmentation fault categories
 Benign: 41, Segmentation: 29, SDC: 30
16
Case Study
• Plot fault count from a fault category corresponding to
a fault injection campaign
• X axis: Fault count corresponding to a fault injection
campaign
• Y axis: Sorting routines
• Result shows strong clustering pattern with statistically
significant distribution for each fault category
17
Results
18
Results
19
Results
20
Overview
• Introduction
• Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
21
A Software-Level Approach to Fault Detection
• Predicates: Boolean program conditionals
• Predicate State: <PP,BV>
• PP: Program point between two successive program
statements
• BV: Bit-vector representing concrete boolean values of
program conditionals at a given program point
• Predicate State Transition: Current State  Next State
22
A Software-Level Approach to Fault Detection
Foo(int x, int y){
<PP0,TT>
PP0:
If ( x<3 && y>10 )
{
<PP1,TT>
PP1:
y++;
PP2:
}
else
{
PP3:
x++;
PP4:
}
PP5:
printf(“x=%d, y=%d”,x , y)
}
Predicates: x<3, y>10
Program Points: PP0, PP1, PP2, PP3,
PP4, PP5
Input Vectors: x = 2, y = 11
Predicate State at PP0: <PP0, TT>
Predicate State at PP1: <PP1, TT>
Predicate State Transition:
<PP0, TT>  <PP1, TT>
23
A Software-Level Approach to Fault Detection
Start
Start
Program P
Program P
Instrumented Program P2
Instrumented Program P1
Get Predicate Transition
Extract predicate
transitions
Stop
Profile valid predicate
transitions
Check if
Valid ?
Yes
No
last
transition
?
No
Fault Detected
Yes
Stop
Detect transient faults
24
Predicate Transition Diagram (PTD)
Start
Program
Inject Fault
Track Predicate
Transitions
Track Predicate
Transitions
Merge
Predicate
Transition Diagram
Stop
25
PTD of Foo()
26
PTD of dgstrf() in SuperLU [slu99,05,11]
• SuperLU is a direct linear solver for sparse and nonsymmetric
systems of linear equations
• Available at: http://crd-legacy.lbl.gov/~xiaoye/SuperLU/
27
PTD of BlkSchlsEqEuroNoDiv() in Blackscholes
• Financial analysis using blackscholes PDE
• Part of Parsec 3.0 benchmark suite [parsec08]
28
Overview
• Introduction
• Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
29
Concluding Remarks
• A novel software-level fault detector
• Enabling infrastructure for resiliency analysis and
evaluation through KULFI
• Recommended use during design space exploration
• Try out KULFI: https://github.com/soar-lab/KULFI
30
References
[arg13] Snir, M., et al. Addressing Failures in Exascale Computing. No. ANL/MCSTM-33. Argonne National Laboratory (ANL), 2013
[lanl05] Michalak, Sarah E., et al. "Predicting the number of fatal soft errors in Los
Alamos National Laboratory's ASC Q supercomputer." IEEE Transactions on
Device and Materials Reliability, 2005
[llvm04] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong
program analysis & transformation,” in International Symposium on Code
Generation and Optimization (CGO), 2004
[pct05] T. Ball, “A theory of predicate-complete test coverage and generation,” in
International Conference on Formal Methods for Components and Objects
(FMCO), 2005
[iswat08] S. K. Sahoo, M. lap Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y.
Zhou, “Using likely program invariants to detect hardware errors,” in IEEE
International Conference on Dependable Systems and Networks (DSN), 2008
[sloan13] Sloan, Joseph, Rakesh Kumar, and Greg Bronevetsky. "An algorithmic
approach to error localization and partial recomputation for low-overhead fault
tolerance.“, in IEEE International Conference on Dependable Systems and
Networks (DSN), 2013
31
References
[slu99] Demmel, James W., et al. "A supernodal approach to sparse partial
pivoting.“ SIAM Journal on Matrix Analysis and Applications, 1999
[slu05] Li, Xiaoye S. "An overview of SuperLU: Algorithms, implementation, and
user interface." ACM Transactions on Mathematical Software (TOMS), 2005
[slu11] Li, X. S., Demmel, J. W., Gilbert, J. R., Grigori, L., Shao, M., & Yamazaki, I.
(2011). SuperLU Users’ Guide. url: http://crd. lbl. gov/~
xiaoye/SuperLU/superlu_ug. Pdf.
[sprs11] Davis, Timothy A., and Yifan Hu. "The University of Florida sparse matrix
collection." ACM Transactions on Mathematical Software (TOMS), 2011
[parsec08] C. Bienia, S. Kumar, J. Singh, and K. Li, “The PARSEC benchmark suite:
Characterization and architectural implications,” ser. PACT, 2008
[relax10] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An ar- chitectural
framework for software recovery of hardware faults,” in International Symposium
on Computer Architecture (ISCA), 2010
[thomas13] Thomas, Anna, and Karthik Pattabiraman. "Error Detector
Placement for Soft Computation." in International Conference on Dependable
Systems and Networks (DSN), 2013.
[schen13] S. Chen, personal communication, 2013.
32
Acknowledgements
•
•
•
•
•
•
Pedro Diniz
Prabhakar Kudva
Shuvendu Lahiri
Karthik Pattabiraman
Sui Chen
Anonymous reviewers of PRDC conference who reviewed
our paper
33
Thank you!
34

similar documents