Safety-Critical Systems - TKK / Laboratory for Theoretical

Report
Safety-Critical Systems 2
T-79.232
Risk analysis and design for safety
Ilkka Herttua
V - Lifecycle model
[Diagram: V-model of the development lifecycle. The descending leg runs through Requirements Analysis, Systems Analysis & Design, Software Design, and Software Implementation & Unit Test; the ascending leg runs through Module Integration & Test, System Integration & Test, and System Acceptance. The Requirements Model feeds Requirements Analysis and the Functional/Architectural Model feeds Systems Analysis & Design; the artifacts produced along the way are the Requirements Document, the Specification Document, Test Scenarios, and a Knowledge Base*.]
* Configuration-controlled knowledge that increases in understanding
until completion of the system:
• Requirements Documentation
• Requirements Traceability
• Model Data/Parameters
• Test Definition/Vectors
Overall safety lifecycle
1. Concept
2. System Definition and Application Conditions
3. Risk Analysis
4. System Requirements
5. Apportionment of System Requirements
6. Design and Implementation
7. Manufacture
8. Installation
9. System Validation (including Safety Acceptance and Commissioning)
10. System Acceptance
11. Operation and Maintenance
12. Performance Monitoring
13. Modification and Retrofit – re-apply lifecycle (see note)
14. Decommissioning and Disposal
Note: The phase at which a modification enters the lifecycle will depend upon both the system
being modified and the specific modification under consideration.
Risk Analysis
• Risk is a combination of the severity (class) and the
frequency (probability) of a hazardous event.
• Risk analysis is the process of evaluating the probability
of hazardous events.
• The value of life?
The value of a life is estimated at between 0.75M and 2M GBP;
US figures are higher.
Risk Analysis
• Classes:
- Catastrophic – multiple deaths (>10)
- Critical – a death or severe injuries
- Marginal – a severe injury
- Insignificant – a minor injury
• Frequency categories (events/year):
- Frequent – 0.1
- Probable – 0.01
- Occasional – 0.001
- Remote – 0.0001
- Improbable – 0.00001
- Incredible – 0.000001
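The severity classes and frequency categories above can be combined into a qualitative risk level. A minimal sketch follows; the additive scoring and the level names/thresholds are illustrative assumptions, not taken from any standard.

```python
# Hypothetical risk-matrix sketch: combine a severity class and a
# frequency category into a qualitative risk level. The scoring and
# thresholds are assumptions for illustration only.

SEVERITY = ["insignificant", "marginal", "critical", "catastrophic"]
FREQUENCY = ["incredible", "improbable", "remote",
             "occasional", "probable", "frequent"]

def risk_level(severity: str, frequency: str) -> str:
    """Return a qualitative risk level from severity and frequency."""
    s = SEVERITY.index(severity)    # 0 (least severe) .. 3 (worst)
    f = FREQUENCY.index(frequency)  # 0 (rarest) .. 5 (most frequent)
    score = s + f                   # simple additive combination (assumed)
    if score >= 7:
        return "intolerable"
    if score >= 5:
        return "undesirable"
    if score >= 3:
        return "tolerable"
    return "negligible"

print(risk_level("catastrophic", "frequent"))
print(risk_level("insignificant", "incredible"))
```

A real project would replace the additive score with the risk matrix mandated by its applicable standard.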
Hazard Analysis
• A hazard is a situation in which there is
actual or potential danger to people or to the
environment.
• Analytical techniques:
- Failure modes and effects analysis (FMEA)
- Failure modes, effects and criticality analysis (FMECA)
- Hazard and operability studies (HAZOP)
- Event tree analysis (ETA)
- Fault tree analysis (FTA)
Fault Tree Analysis 1
The diagram shows a heater controller for a
tank of toxic liquid. The computer controls
the heater using a power switch on the basis
of information obtained from a temperature
sensor. The sensor is connected to the
computer via an electronic interface that
supplies a binary signal indicating when the
liquid is up to its required temperature. The
top event of the fault tree is the liquid being
heated above its required temperature.
Fault tree symbols:
- Fault event not fully traced to its source
- Basic event (input)
- Fault event resulting from other events
- OR connection
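The gate arithmetic behind a fault tree can be sketched in a few lines, assuming independent basic events. The event probabilities below are made-up figures, not values from the heater-controller example.

```python
# Fault-tree gate arithmetic for independent basic events.
# OR gate: the top event occurs if ANY input event occurs.
# AND gate: the top event occurs only if ALL input events occur.

from math import prod

def or_gate(probs):
    """P(any input occurs) = 1 - product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probs)

def and_gate(probs):
    """P(all inputs occur) = product of p_i."""
    return prod(probs)

# Hypothetical basic-event probabilities feeding an OR gate, e.g.
# sensor failure, interface stuck, power switch welded closed:
p_top = or_gate([1e-4, 5e-5, 1e-5])
print(f"P(top event) ~ {p_top:.2e}")
```

Note that the OR-gate formula only holds when the basic events are statistically independent; common-cause failures break that assumption.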
Risk acceptability
• National/international decision – the level of an acceptable loss
(ethical, political and economic)
Risk analysis evaluation principles:
ALARP – as low as reasonably practicable (UK, USA)
“Societal risk has to be examined when there is a possibility of a
catastrophe involving a large number of casualties”
GAMAB – Globalement Au Moins Aussi Bon = globally at least as good
(France)
“All new systems must offer a level of risk globally at least as good as
the one offered by any equivalent existing system”
MEM – minimum endogenous mortality (Germany)
“The hazard due to a new system must not significantly augment the figure
of the minimum endogenous mortality for an individual”
Risk acceptability
Tolerable hazard rate (THR) – a hazard rate which guarantees that the
resulting risk does not exceed a target individual risk.
SIL 4: 10^-9 ≤ THR < 10^-8
SIL 3: 10^-8 ≤ THR < 10^-7
SIL 2: 10^-7 ≤ THR < 10^-6
SIL 1: 10^-6 ≤ THR < 10^-5
(per hour and per function)
Potential Loss of Life (PLL) – the expected number of casualties per year.
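The THR bands above translate directly into a lookup. A small sketch, following the SIL 1-4 bands listed in these slides:

```python
# Map a tolerable hazard rate (failures per hour, per function)
# to a safety integrity level, using the bands given above.

def sil_from_thr(thr: float) -> int:
    """Return the SIL band for a given THR (per hour, per function)."""
    if 1e-9 <= thr < 1e-8:
        return 4
    if 1e-8 <= thr < 1e-7:
        return 3
    if 1e-7 <= thr < 1e-6:
        return 2
    if 1e-6 <= thr < 1e-5:
        return 1
    raise ValueError("THR outside the SIL 1-4 bands")

print(sil_from_thr(5e-9))  # a THR of 5e-9/h falls in the SIL 4 band
```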
Current situation / critical systems
• Based on data on recent failures of critical systems, the
following can be concluded:
a) Failures are becoming more and more distributed and often
nation-wide (e.g. commercial systems such as credit-card
authorisation denial).
b) The source of failure is more rarely in hardware
(physical faults), and more frequently in system design
or end-user operation/interaction (software).
c) The harm caused by failures is mostly economic, but
sometimes health and safety concerns are also involved.
d) Failures can impact many different aspects of
dependability (dependability = the ability to deliver service
that can justifiably be trusted).
Examples of computer failures in
critical systems
Driving force: federation
• Safety-related systems have traditionally been based on the
idea of federation: a failure of any piece of equipment
should be confined, and should not cause the collapse of
the entire system.
• When computers were introduced into safety-critical
systems, the principle of federation was in most cases kept
in force.
• Applying federation means that the Boeing 757/767 flight
management control system has 80 distinct
microprocessors (300, if redundancy is taken into account).
Although this number of microprocessors is no
longer too expensive, the principle of federation causes
other problems.
Designing for Safety
• Fault groups:
- requirement/specification errors
- random component failures
- systematic design faults (software)
• Approaches to tackling these problems:
- the right (fault-tolerant) system architecture
- reliability engineering (component, system)
- quality management (design and production processes)
Designing for Safety
• Hierarchical design
- simple modules, encapsulated functionality
- a separated safety kernel for safety-critical functions
• Maintainability
- preventive versus corrective maintenance
- scheduled maintenance routines for the whole lifecycle
- faults easy to find and repair: short MTTR (mean time to repair)
• Human error
- a proper HMI (human-machine interface)
Hardware Faults
Intermittent faults
- The fault occurs and recurs over time (e.g. a loose
connector)
Transient faults
- The fault occurs and may not recur (e.g. lightning)
- Electromagnetic interference
Permanent faults
- The fault persists, e.g. a physical processor failure
(design fault: overcurrent)
Fault Tolerance
• Fault-tolerant hardware
- Achieved mainly by redundancy
• Redundancy
- Adds cost, weight, power consumption and
complexity
• Other means:
- Improved maintenance; a single system with better
materials (higher MTBF)
Redundancy types
• Active redundancy:
- Redundant units are always
operating.
• Dynamic redundancy (standby):
- The failure has to be detected
- Changeover to the other module
Hardware redundancy techniques
• Active techniques:
- Parallel (k of N)
- Voting (majority/simple)
• Standby techniques:
- Operating: hot standby
- Non-operating: cold standby
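Majority voting can be sketched compactly: three channels compute the same output and a 2-of-3 voter masks a single faulty channel. The readings below are hypothetical sensor values, not from any example in these slides.

```python
# Minimal sketch of active redundancy with majority voting (2-of-3).
# A single faulty channel is outvoted by the two healthy ones.

from collections import Counter

def majority_vote(values):
    """Return the value reported by a strict majority of channels."""
    value, count = Counter(values).most_common(1)[0]
    if count <= len(values) // 2:
        # No strict majority: the voter cannot mask the fault,
        # so the system must fall back to a safe state.
        raise RuntimeError("no majority - fail safe")
    return value

# One of three redundant temperature channels returns a bad reading:
readings = [72, 72, 99]
print(majority_vote(readings))
```

Note the failure mode: with two faulty channels (or three disagreeing ones) there is no majority, which is why voting only masks single-channel faults.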
Reliability prediction
• Electronic components
- Based on probability and statistics
- MIL-HDBK-217 – experimental data
on actual device behaviour
- Manufacturer information and allocated
circuit types
- Bathtub curve: burn-in – useful life –
wear-out
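During the flat "useful life" portion of the bathtub curve, the failure rate is roughly constant and reliability follows an exponential law. A short sketch; the failure-rate figure is an assumption, not a MIL-HDBK-217 value.

```python
# Constant-failure-rate reliability arithmetic for the useful-life
# (flat) region of the bathtub curve. The rate below is an assumed
# illustrative figure, not taken from MIL-HDBK-217.

from math import exp

failure_rate = 2e-6        # lambda, failures per hour (assumed)
mtbf = 1.0 / failure_rate  # mean time between failures, in hours

def reliability(t_hours: float) -> float:
    """R(t) = exp(-lambda * t) under a constant failure rate."""
    return exp(-failure_rate * t_hours)

print(f"MTBF = {mtbf:.0f} h, R(1 year) = {reliability(8760):.4f}")
```

Burn-in and wear-out fall outside this model; there the failure rate is time-dependent and a Weibull distribution is the usual choice.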
Safety-Critical Hardware
Fault detection:
- Routines to check that the hardware works
- Signal comparisons
- Information redundancy – parity checks
etc.
- Watchdog timers
- Bus monitoring – check that the processor is
alive
- Power monitoring
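Of the mechanisms above, the watchdog timer is the easiest to sketch: the monitored task must "kick" the watchdog within a timeout, or the watchdog declares it hung. This toy version uses wall-clock time in software; a real watchdog is an independent hardware counter that resets the processor.

```python
# Toy software sketch of a watchdog timer. The class name and API
# are invented for illustration; real watchdogs are hardware counters
# independent of the processor they supervise.

import time

class Watchdog:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_kick = time.monotonic()

    def kick(self):
        """Called periodically by the healthy task."""
        self.last_kick = time.monotonic()

    def expired(self) -> bool:
        """True if the task has missed its deadline."""
        return time.monotonic() - self.last_kick > self.timeout_s

wd = Watchdog(timeout_s=0.05)
wd.kick()
print(wd.expired())   # task just kicked the watchdog
time.sleep(0.1)       # simulate a hung task that stops kicking
print(wd.expired())   # deadline missed: the watchdog trips
```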
Safety-Critical Hardware
Possible hardware:
COTS microprocessors
- No safety firmware, least assurance
- Redundancy improves matters, but common-mode
failures are possible
- Fabrication failures, microcode and
documentation errors
- Use components with a service history and
statistics.
Safety-Critical Hardware
Special microprocessors
- Collins Avionics/Rockwell AAMP2
- Used in the Boeing 747-400 (30+ units)
- High cost: bench testing,
documentation, formal verification
- Other models: SPARC V7, TSC695E,
ERC32 (ESA radiation-tolerant),
68HC908GP32 (airbag)
Safety-Critical Hardware
Programmable logic controllers (PLCs)
• Contain a power supply, interfaces and one or
more processors
• Designed for high MTBF
• Firmware
• Program stored in EEPROM
• Programmed with ladder or function-block
diagrams
Safety management
• Safety culture/policy of the organisation
- A task for management (targets)
• Safety planning
- A task for the safety manager (how to)
• Safety reporting
- All personnel
- Safety log / validation reports
Home assignments
• 4.18 (tolerable risk)
• 5.10 (incompleteness within specification)
Email before 2 March to [email protected]
