Design and Test Technology for Automotive Electronic Systems

New Approaches to
Fault-Tolerant Systems Design
Andreas Steininger
Vienna University of Technology
My contact data
Andreas Steininger
Vienna University of Technology
Faculty of Informatics
Institute of Computer Engineering
Embedded Computing Systems Group
Treitlstrasse 3
A- 1040 Vienna
Austria
[email protected]
http://ti.tuwien.ac.at/ecs
A. Steininger
page 2
Main Contributors to this Material
• Dr. Thomas Kottke, R. Bosch AG / EADS
• Dr. Peter Tummeltshammer, R. Bosch AG / Thales
• Dr. Christoph Scherrer, Alcatel / Thales
• Dr. Eric Armengaud, DecomSys / VirtualVehicle
• Dr. Karl Thaller, DecomSys / Elektrobit Austria
• Dr. Martin Horauer, UAT Technikum Wien
• Paul Milbredt, AUDI AG

Outline
• Fault tolerance – some (very) basics
• Automotive electronics: the specific situation
• Design of a cost efficient fault tolerant node
– Basic architecture
– Temporal diversity
– Treatment of common cause faults
– Switching performance mode / safety mode
– Fault-tolerance validation by fault injection
Faults, Errors and Failures
[Figure: chain fault → error → failure in a computer; the error shows up as a wrong bit value]

Error Detection
• Fault detection: usually too difficult (too many possible faults)
• Failure detection: too late, since we want to prevent the failure!
• So detect the error: to decide that „1“ is wrong we need a reference.
Where to get this reference from?
• Option 1: perform the same computation a second time
(hopefully the fault is gone by then…)
=> Time redundancy
• Option 2: use a second computer in parallel
(hopefully this one works well…)
=> Space redundancy
• Option 3: add additional information
(hopefully not affected as well…)
=> Information redundancy

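The three options for obtaining a reference can be illustrated with a minimal Python sketch. The toy `compute` function, its single-bit-flip fault model and all function names are assumptions made for illustration; they do not come from the slides.

```python
def compute(x, fault=False):
    """Toy computation; a (hypothetical) transient fault flips the lowest result bit."""
    result = x + 1
    return result ^ 1 if fault else result


def parity(value):
    """Even-parity bit over the binary representation of a non-negative integer."""
    return bin(value).count("1") & 1


def time_redundancy_detects(x, fault_in_first_run):
    """Option 1: run the same computation twice on one computer and compare."""
    first = compute(x, fault=fault_in_first_run)
    second = compute(x)  # assumes the transient fault is gone by now
    return first != second


def space_redundancy_detects(x, fault_in_computer_a):
    """Option 2: run the computation on two computers in parallel and compare."""
    result_a = compute(x, fault=fault_in_computer_a)
    result_b = compute(x)  # assumes the second computer works
    return result_a != result_b


def information_redundancy_detects(x, fault_on_result):
    """Option 3: attach a parity bit to the result and re-check it later."""
    result = compute(x)
    check_bit = parity(result)  # assumes the check bit itself is not hit
    if fault_on_result:
        result ^= 1  # single bit flip after the parity was generated
    return parity(result) != check_bit
```

Each function returns `True` when the error is detected. Note that every option carries the "hopefully" caveat from the slide as an explicit assumption in a comment.
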
Achieving Fault Tolerance
• Fail safe: system can be safely stopped when error is detected
– example: train
• Fail operational: system must keep on working when error is detected
– example: autopilot in airplane
[Figure: block diagrams of both variants, built from computers and error detection (ED) units]

Outline
✓ Fault tolerance – some (very) basics
• Automotive electronics: the specific situation
• Design of a cost efficient fault tolerant node
– Basic architecture
– Temporal diversity
– Treatment of common cause faults
– Switching performance mode / safety mode
– Fault-tolerance validation by fault injection
Electronics in Cars – some Facts
• high proportion of value: up to 30%
• high development potential: more than 80% of the innovations
• high number of Electronic Control Units (ECUs): up to 70
• complex distributed system: different networks & topologies

Electronics in Cars - Benefits
• cheap alternative to existing mechanical solutions
– lighter, smaller, cheaper, more flexible, …
• enabler for further optimizations
– electronic ignition, motor management, …
• key to new functionality
– safety: ESP, active suspension, crash sensing, …
– comfort: air conditioning, infotainment, …
– security: immobilizer, alarm, electronic key, GPS tracking, …
– autonomy: anticipatory braking, lane keeping, …

Key Demands
Safety
Real-Time
Low Cost
Robustness
Testability
Key Demands
• Safety
– high risk potential (energy!)
– high public awareness
– no safe state (in general)
– certification required
– high complexity of system & application
– legal issues (liability)
(EN 61508, ISO 26262)
• Real-Time
• Low Cost
• Robustness
• Testability

Key Demands
• Safety
• Real-Time
– engine: 6000 rpm = 1 revolution per 10 ms
– VDM: 100 km/h = 28 cm per 10 ms
– need to synchronize distributed activities
– real-time communication
– image processing tasks
• Low Cost
• Robustness
• Testability

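The two real-time figures on this slide follow from simple unit conversions. The short Python sketch below reproduces them; the function names are chosen here for illustration.

```python
def revolution_period_ms(rpm):
    """Duration of one engine revolution in milliseconds."""
    return 60_000.0 / rpm


def distance_cm(speed_kmh, interval_ms):
    """Distance in centimetres travelled at speed_kmh during interval_ms."""
    metres_per_second = speed_kmh / 3.6
    return metres_per_second * 100.0 * (interval_ms / 1000.0)


print(revolution_period_ms(6000))      # 10.0  (one revolution every 10 ms)
print(round(distance_cm(100, 10), 1))  # 27.8  (about 28 cm in 10 ms)
```
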
Key Demands
• Safety
• Real-Time
• Low Cost
– extreme competition
– high cost inhibits introduction
– tailored safety concepts: minimum degree of replication, use structural redundancies
– generic solutions: scalable, configurable, flexible
– marginal costs beat NRE
• Robustness
• Testability

Current Status
• fail safe functions realized:
– shut off upon error
– mechanical fall-back system assumes control
– no true “by wire” functions
– single-channel solutions sufficient
• tolerance against random faults
– avoid design faults by field experience => no diversity
– avoid common cause faults by design (?)
• single fault assumption
– keep faults rare (shielding, etc.)

Outline
✓ Fault tolerance – some (very) basics
✓ Automotive electronics: the specific situation
• Design of a cost efficient fault tolerant node
– Basic architecture
– Temporal diversity
– Treatment of common cause faults
– Switching performance mode / safety mode
– Fault-tolerance validation by fault injection
A Fault Tolerant Node
• mission: make a node (processor) fault tolerant
• need to consider CPU and memory
• aim is “fail safe” (but keep option for fail op in mind)
– simplex unit with error detection capabilities
– duplication and comparison
– hybrid approach

Options for the CPU Core
• Single core + ED: modify custom CPU core
– parity for buses
– two-rail coding for signals
– self-checking implementation of simple units
– duplicate & compare for complex units
– careful layout
• Dual core + cmp
• Superscalar proc. + cmp + ED

Options for the CPU Core
• Single core + ED
• Dual core + cmp: duplicate custom CPU core
– master/checker operation
– shared (safe) memory
– self-checking comparator checks equality of outputs
– validity check for inputs
– option: clock delay
– option: mode switch
• Superscalar proc. + cmp + ED

Solution Example “Dual Core Frame”
• benefits
– can use custom core without modifications
– safety analysis valid for other cores as well
– promises high ED coverage with moderate efforts
– CPU is hard to protect otherwise
• crucial points
– enable easy recovery (=> keep outage short)
– eliminate single points of failure
– detect common cause faults

Protection in the Dual Core Frame
[Figure: Core 1 (Master) and Core 2 (Checker) share the „safe memories“ (Instr. Mem, Data Mem). Self-checking comparators (=?) compare instruction address, data address and data out of both cores and drive Error_Sig. Buses are protected by parity, single signals by dual-rail coding.]

Potential for Common Cause Faults
• identical input data
• identical clock (lock step)
• shared clock generator
• shared power supply
• both processors on same die
(physical proximity; thermal & mechanical coupling)

Temporal Diversity
• operate checker with a delay against master
– same fault hits at different point of computation
– therefore different effect => detect by comparison
– different critical paths emerge
• store master output for comparison
• choose delay of 1 / 1.5 / 2 clock cycles
– larger delay causes high effort for little gain (=> experiments)
– error detection latency is equal to the delay
– need to delay memory write and outputs by this amount

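Why a delayed checker exposes a common cause fault can be shown with a small simulation. This is only a sketch under stated assumptions: the "core" is a toy 8-bit state machine, the fault is a single bit flip that hits both cores in the same wall-clock cycle, and only whole-cycle delays are modelled (the 1.5-cycle option from the slide is not). All names are hypothetical.

```python
def run_core(steps, fault_step=None, mask=0x04):
    """Toy core: an 8-bit state machine whose state is also its output.

    If fault_step is given (and non-negative), one state bit is flipped
    right before that step is computed.
    """
    state = 1
    outputs = []
    for step in range(steps):
        if step == fault_step:
            state ^= mask  # injected bit flip
        state = (state * 5 + 3) & 0xFF
        outputs.append(state)
    return outputs


def first_mismatch(delay, fault_cycle, steps=20):
    """Return the first computation step at which master and checker differ.

    The checker runs `delay` cycles behind the master, so a fault that hits
    both cores in wall-clock cycle `fault_cycle` hits the checker at
    computation step `fault_cycle - delay`. The master outputs are assumed
    to be buffered so that equal steps can be compared.
    """
    master = run_core(steps, fault_step=fault_cycle)
    checker = run_core(steps, fault_step=fault_cycle - delay)
    for step, (m, c) in enumerate(zip(master, checker)):
        if m != c:
            return step
    return None  # outputs identical: the comparator sees nothing


print(first_mismatch(delay=0, fault_cycle=10))  # None: lock step hides the fault
print(first_mismatch(delay=1, fault_cycle=10))  # 9: delayed checker reveals it
```

In lock step both replicas are corrupted identically, so the comparison stays silent. With a delay of one cycle the same fault event hits the two replicas at different points of the computation and the outputs diverge.
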
Temporal Diversity: Implementation
[Figure: Core #1 (Master) and Core #2 (Checker) connected to Instr. Mem and Data Mem. Delay elements (ΔT) shift the checker against the master; comparators (=?) on instruction address, data address and data out raise Error on mismatch.]

Fail Safe Dual Core Frame – Summary
• safe memories for instructions and data
• comparison of all core outputs
• parity protection for buses (data, address)
• dual rail coding for single signals (int, rst, err)
• totally self-checking comparators
• temporal diversity
How safe is the proposed solution?

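Two of the protection mechanisms in this summary, bus parity and dual-rail coding, are simple enough to sketch in a few lines of Python. Even parity and the function names are assumptions for illustration; the slides do not specify the parity convention.

```python
def parity_bit(word):
    """Even-parity bit for a bus word given as a non-negative integer."""
    return bin(word).count("1") & 1


def bus_word_ok(word, received_parity):
    """A single bit flip on the bus changes the parity and is detected."""
    return parity_bit(word) == received_parity


def dual_rail_encode(bit):
    """Encode one signal (e.g. an error line) as a complementary pair."""
    return (bit, bit ^ 1)


def dual_rail_valid(pair):
    """Valid code words are (0, 1) and (1, 0); (0, 0) and (1, 1) signal a fault."""
    return pair[0] != pair[1]


word = 0b1011_0010
p = parity_bit(word)
print(bus_word_ok(word, p))            # True
print(bus_word_ok(word ^ 0b1000, p))   # False: one flipped bus line

rail_true, rail_false = dual_rail_encode(1)
print(dual_rail_valid((rail_true, rail_false)))  # True
print(dual_rail_valid((0, rail_false)))          # False: true rail stuck at 0
```

The dual-rail pair detects a stuck-at fault on either rail as soon as the encoded signal takes the value that the stuck rail cannot follow.
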
Assessment of the Solution’s Quality
• How to measure quality? (Aim is fail safe)
– error detection coverage => detect all errors
– error detection latency => detect them quickly
• Which method to choose?
– theoretical analysis / modelling
– experimental fault injection
– field observation

Fault Injection Experiment
• 2 SPEAR cores in fail safe frame (= DUT)
• synthesized to EDIF netlist
• faults injected one by one into netlist
• exhaustive list of stuck-at-1 and stuck-at-0 faults
• download to FPGA, application run
• “golden device” as reference (= REF)
• upon mismatch (DUT ≠ REF) => check comparator

Results of FI Experiment
                              master   slave   frame   overall
detected, no effect              204   51170    3517     54891
detected before effect         19047      98     734     19879
detected during effect, RD         0       0       0         0
detected during effect, WR       559       0     921      1480
detected after effect, RD      31455       0      87     31542
detected after effect, WR          0       0       0         0
not detected, no effect         4269    4276    1073      9618
not detected, with effect          0       0       0         0
overall                        55534   55544    6332    117410
• No change of memory contents in case of error
• Erroneous read access is uncritical

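The overall figures of this experiment can be summarized with a few lines of Python. The sketch assumes that the row labels read as "detected" (no effect, before effect, during effect RD/WR, after effect RD/WR) and "not detected" (no effect, with effect), and it defines "effective" faults as all faults outside the two "no effect" rows; both are interpretations made here, not definitions taken from the slides.

```python
# Overall fault counts per category ("overall" column of the results).
overall = {
    "detected, no effect": 54891,
    "detected before effect": 19879,
    "detected during effect (RD)": 0,
    "detected during effect (WR)": 1480,
    "detected after effect (RD)": 31542,
    "detected after effect (WR)": 0,
    "not detected, no effect": 9618,
    "not detected, with effect": 0,
}

total = sum(overall.values())
detected = sum(n for label, n in overall.items() if label.startswith("detected"))

# Assumption: a fault counts as "effective" unless it is in a "no effect" row.
effective = total - overall["detected, no effect"] - overall["not detected, no effect"]
undetected_effective = overall["not detected, with effect"]

print(total)                                   # 117410 injected faults
print(detected)                                # 107792 faults flagged by the frame
print(1.0 - undetected_effective / effective)  # 1.0: no effective fault escaped
```

Under these assumptions every fault that had an effect was detected; the undetected faults are exactly those that never influenced the outputs.
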
Enabling fast Recovery
• error signal (dual rail)
– notifies external component / memory
– turns any further WR into RD (error confinement)
– triggers processor interrupt
• status register (memory mapped)
– updated by HW
– indicates source of error (data parity, address mismatch, …)
• recovery
– can build on uncorrupted status
– can benefit from detailed status information

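The error confinement rule ("turns any further WR into RD") can be modelled in a few lines. The class below is a hypothetical software model for illustration only; in the slides this behavior is realized in hardware, and the class and method names are invented here.

```python
class ConfinedMemory:
    """Toy memory model: once the error signal is active, writes become reads."""

    def __init__(self, size):
        self.cells = [0] * size
        self.error = False  # models the error signal driven by the comparator

    def raise_error(self):
        """Called when the comparator (or a parity check) reports an error."""
        self.error = True

    def access(self, address, write_enable, data=0):
        """Perform one access and return the cell content afterwards.

        With the error signal active, a write access is executed as a read,
        so the memory content stays uncorrupted for the recovery routine.
        """
        if write_enable and not self.error:
            self.cells[address] = data
        return self.cells[address]


mem = ConfinedMemory(8)
mem.access(3, write_enable=True, data=7)         # normal write
mem.raise_error()                                # error detected
print(mem.access(3, write_enable=True, data=9))  # 7: the write was suppressed
```

Because the memory is guaranteed to hold the last error-free state, recovery can start from it instead of re-initializing everything.
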
Why is fast Recovery important?
• application specific fault-tolerance time
– application can “survive” without computer
– even in fail-operational case
– typ. some 10 ms for car (recall: 100 km/h = 28 cm per 10 ms)
• meaning of fast recovery
– if failed computer recovers within FT time, no need for hot standby => COST!
– re-booting after failure is pragmatic and safe, but expensive!

Fail Safe Dual Core – Summary 1
• duplicate & compare
– generic approach, applicable to any core type
– covers all (local) errors
– need to carefully eliminate single points of failure
– need to complement with protection for signals & buses
• temporal diversity
– mitigates (many) common cause failures
– requires output delay to ensure error confinement

Possible Sources of CCFs
• Design & process
– design fault or (latent) process deficiency
• Thermal coupling
– hot spot affects both replicas in the same way
• Mechanical defect
– affects both replicas symmetrically
• Electrical coupling
– wire bound (shared lines: VDD, reset, clock)
– wireless (EMI)

Why use Single Die then?
• cheaper and faster
– use two instances of same design
– fast & comprehensive comparison
• CCFs on single die
– intuitively higher threat
– quantification of threat?
– mitigation techniques?

The Actual Problem with CCFs
• One fault event affects both replicas
AND
• is not detected by comparator
i.e. leads to “symmetric” fault effect
AND
• produces an erroneous output
i.e. does not crash the cores

Possible Countermeasures for CCFs
• Design & process
– design fault or (latent) process deficiency
=> diversity, burn-in, fault avoidance
• Thermal coupling
– hot spot affects both replicas in the same way
• Mechanical defect
– affects both replicas symmetrically
=> (thermal & mechanical) asymmetric propagation paths
• Electrical coupling
– wire bound (shared lines: VDD, reset, clock)
=> asymmetric critical paths
– wireless (EMI)
=> asymmetric antennas (?)

Propagation Speed Comparison
• Thermal & mechanical propagation is relatively slow
• 10000s of clock cycles within 1 ms

Experimental Assessment
• Evaluation experiments (faults injected into Core 1 and Core 2):
1) single corresponding points, with offset Δt
2) multiple corresponding points, with offset Δt
3) single non-corresponding points, no offset
[Figure: master and checker (signals Iaddr, Addr, Data, We) feed the compare unit; a golden node serves as reference to decide: erroneous write access?]

Symmetry Requirements for CCF
• even a small offset …
• fault multiplicity …
• asymmetry of impact …
… improve detection coverage
[Figure: processor units with their labels: PSW (308), ALU (2472), ExVecTab (8202), P2 (158), DEC (152), RF (7028), PC+P1 (182)]

Squeezing out more Efficiency
• dual core is expensive
• normally yields performance improvement
• would be welcome here as well:
increasing performance demand @ limited clock rates
• but: exclusively dedicated to safety here
• observation: not all tasks are safety critical
=> enable flexible switching between “safety mode” and “performance mode”

Operation in Performance Mode
• cores execute different instruction streams in parallel
• both cores have direct access to memory / peripherals
• instruction caches introduced to minimize penalties from conflicting access
• temporal diversity disabled
• comparator disabled

Requirements on the Mode Switching
• coherent operation in safety mode
– internal states of cores must be aligned before switching to safety mode (register file, cache)
• safe operation in safety mode
– switching must not introduce safety leakage
– no corruption of safety-relevant data in performance mode
• low performance penalty for mode switching
– slow or complicated switching would spoil the anticipated performance gain

Implementation of the Split Core Frame
[Figure: Core 1 and Core 2, each with instruction cache, ModeSwitch Detect unit, wait signal, interrupt and clk inputs, and instruction address / instruction / data address / data out / data in ports. A central ModeSwitch unit and the instruction and data RAM control (with mode switch input) connect both cores to the safe instruction memory and the safe data memory. Addresses, instructions and data carry parity.]

Mode Switch: Safety => Performance
Instruction sequence with annotations:
LDL r1, 248       load ID reg address
LDH r1, 255
mode switching    mode switch instr => core1 wait => core2 wait => clk align => switch mode
LDW r2, r1        load & check ID bit => cond branch core2
BTEST r2, 1
JMPI_CT
[Timing diagram: clk, core1 signal, wait1, message2, clk_core2, core2 signal, wait2, message1, status (starting in safety mode)]

Mode Switch: Performance => Safety
• core1 encounters mode switch instr
=> trigger MSU (core1 signal)
=> halt core1 (wait1)
=> interrupt core2 (message2)
• core2 encounters interrupt
=> save context
=> jump to mode switch instr
• core2 executes mode switch
=> halt core2 & switch clock
=> resume core1
=> resume core2 after delay
[Timing diagram: clk, core1 signal, wait1, message2, clk_core2, core2 signal, wait2, message1, status (performance mode => safety mode)]

Fault Injection in Safety Mode
                             master   slave   frame   overall
detected, no effect            1029   56962    5334     63325
detected before effect         5026       0    1324      6350
detected within 1.5 cy        50956       0     569     51525
detected later                    0       0       0         0
not detected, no effect        7055    7102    4275     18432
not detected, with effect         0       0       0         0
overall                       64066   64064   11502    139632
• Delayed WR still ensures error confinement

Fault Injection in Performance Mode
• fault injected in performance mode, then switch to safety mode
[Table: fault counts by where the effect occurs (perf only / both modes / safety only / none) versus detection in perf mode (early / late / stuck) and detection in safety mode (≤1.5cy / >1.5cy / never). Values in extraction order, cell positions not recoverable: 1149, 423, 25617, 34583, 458, ---, 0, 0, 0, ---, 9654, 0, 0, 1473, 47715, 18560]
• No undetected effects / late detections in safety mode
• Watchdog important to prevent hang-up in perf mode

We still need a “Safe Memory”
Why not duplicate & compare?
• detect bit flips in storage cells
– parity (or EDC/ECC)
• detect erroneous address decoding
– special decoder logic design
• protect interfaces
– parity for data, address and control buses
• prevent illegal WR access
– provide mask input for write enable

Possible Address Decoder Errors
• correct behavior:
– any given address activates exactly one assigned memory cell
• erroneous behaviors:
– an address activates no memory cell at all
– an address activates more than one memory cell
– an address activates a wrong memory cell

Checking the Address Decoder
• check for missing or multiple cell activations:
XOR(upper half) ≠ XOR(lower half) ?
• re-check parity behind cell array:
OR over even cells ≠ parity ?
• large decoders built from cascade of smaller ones
[Figure: address lines A0, A1, A2 and address parity AP drive AND gates (&) that select the memory cell array; XOR trees over the select lines feed dual-rail checkers, which produce the parity error signal (pe)]

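The two decoder checks can be tried out on a model of a 3-bit decoder. The Python sketch below is an illustration under assumptions: the address parity AP is taken as the XOR of all address bits, "even cells" are read as cells whose index has even parity, and the decoder is modelled as a flat list of select lines rather than a cascade. All names are hypothetical.

```python
def xor_all(bits):
    """XOR over a sequence of 0/1 values."""
    result = 0
    for b in bits:
        result ^= b
    return result


def index_parity(index):
    """Parity of a cell index (assumed definition of AP: XOR of address bits)."""
    return bin(index).count("1") & 1


def decode(address, n_bits=3):
    """Fault-free decoder: exactly one active select line."""
    return [int(i == address) for i in range(2 ** n_bits)]


def activation_check_ok(select_lines):
    """Check 1: XOR(upper half) must differ from XOR(lower half)."""
    half = len(select_lines) // 2
    return xor_all(select_lines[half:]) != xor_all(select_lines[:half])


def parity_check_ok(select_lines, address_parity):
    """Check 2: OR over the even-parity cells must differ from the address parity."""
    or_even = int(any(line for i, line in enumerate(select_lines)
                      if index_parity(i) == 0))
    return or_even != address_parity


address = 5
ap = index_parity(address)
good = decode(address)
print(activation_check_ok(good), parity_check_ok(good, ap))  # True True

print(activation_check_ok([0] * 8))                   # False: no cell active
print(activation_check_ok([0, 1, 0, 0, 0, 1, 0, 0]))  # False: two cells active
print(parity_check_ok(decode(4), ap))                 # False: wrong cell (4 instead of 5)
```

In this simplified flat model the checks have blind spots, for example a wrong cell whose index has the same parity as the requested address. The sketch makes no claim about how the cascaded hardware decoder handles such cases.
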
Summary
• the automotive domain has its own laws and rules
– need “extremely cost-effective robust solutions for safety-critical real-time applications, versatile and custom tailored”
• on node level
– different redundancy concepts applicable
– example: dual core CPU and memory with protection mechanisms
– on-line testing for memory may be required
• on system level
– crucial role of communication infrastructure
– advantages of time triggered approach
– insufficient suitability of structural testing

Hungry for more?
http://ti.tuwien.ac.at/ecs
[email protected]
Related Projects
STEACS (Systematic Test of Embedded Automotive Communication Systems)
http://embsys.technikum-wien.at/projects/steacs/index.html
EXTRACT (Exploiting Synchrony for Transparent Communication Services Testing)
http://ti.tuwien.ac.at/ecs/research/projects/extract
DARTS (Distributed Algorithms for Robust Tick Synchronization)
http://ti.tuwien.ac.at/ecs/research/projects/DARTS