Report

Data collection and analysis
Jørn Vatn, NTNU

Objectives of data collection and analysis
- Collection and analysis of safety and reliability data is an important element of safety management and continuous improvement
- There are several aspects of utilizing experience data; in the following we focus on
  1. Learning from experience
  2. Identification of common problems: "Top ten" lists (visualized by Pareto diagrams)
  3. A basis for estimation of reliability parameters: MTTF, MDT, aging parameters

Collection of data
We differentiate between
- Accident and incident reporting systems
  - These data are event-based, i.e. we report into the system only when critical events occur
  - Examples of such systems are Synergy and Tripod Delta
- Databases with the aim of estimating reliability parameters
  - These databases contain system descriptions, failure events, and maintenance activities
  - The Offshore Reliability Data (OREDA) database is one such database
  - Such databases are denoted RAMS databases in the following

RAMS data: Boundary description
- A clear boundary description is imperative for collecting, merging and analyzing RAMS data from different industries, plants or sources
- The merging and analysis will otherwise be based on incompatible data
- For each equipment class a boundary must be defined. The boundary defines what RAMS data are to be collected

[Figure: boundary definition for a compressor unit. Labels: inlet gas conditioning, inlet valve, starting system, driver, power transmission, compressor unit (1st stage, 2nd stage), interstage conditioning, recycle valve, after cooler, outlet valve, lubrication system, shaft seal system, control and monitoring, miscellaneous; power, coolant and remote instrumentation connections; boundary line.]

RAMS data: Equipment hierarchy
- The highest level is the equipment unit class
- The number of levels for subdivision will depend on the complexity of the equipment unit and the use of the data

[Figure: equipment hierarchy, shown as hardware classification against boundary classification. Equipment class: Turnout. Equipment units (boundary level): Turnout 1, Turnout 2, Turnout 3, ..., Turnout i, ..., Turnout n. Subunit (sub-boundary level): e.g. switch mechanism; a turnout contains several subunits. Maintainable item (maintainable item level): e.g. electrical motor; a switch mechanism contains several maintainable items.]

Data categories
- Equipment data
- Failure data
- Maintenance data
- State information

RAMS database structure
[Figure: record types in the database: Inventory 1, Inventory 2, ...; Failure 1, Failure 2, ...; Maintenance 1, Maintenance 2, Maintenance 3, ...; State information.]

Equipment data
- Identification data, e.g. equipment location, classification, installation data, equipment unit data
- Design data, e.g. manufacturer's data, design characteristics
- Application data, e.g. operation, environment

Equipment data (adapted from ISO 14224)
Listed as main category, then sub-category: data.

Identification
- Equipment location: equipment tag number (*)
- Classification: equipment unit class (*); equipment type (see Annex A) (*); application (see Annex A) (*)
- Installation data: country; line (from A to B); type of line, e.g. double track, high-speed line; type of track, e.g. main track
- Equipment unit data: equipment unit description (nomenclature); unique number, e.g. serial number; subunit redundancy, e.g. number of redundant subunits; cost data

Design
- Manufacturer's data: manufacturer's name (*); manufacturer's model designation (*)
- Design characteristics: relevant for each equipment class, e.g. turnout radius, current feeder voltage, see Annex A (*)

Application
- Operation (normal use):
  - Mode while in the operating state, e.g. continuous running, standby, normally closed/open, intermittent
  - Date the equipment unit was installed or date of production start-up
  - Surveillance period (calendar time) (*)
  - The accumulated operating time during the surveillance period
  - Number of demands during the surveillance period, as applicable
  - Operating parameters as relevant for each equipment class, e.g. number of trains passing per hour, see Annex A
- Environmental factors: external environment (severe, moderate, benign)

Remarks
- Additional information: additional information in free text as applicable

Failure data
- Identification data: failure record and equipment location
- Failure data for characterizing a failure, e.g. failure date, maintainable items failed, severity class, failure mode, failure cause, method of observation

Failure data (from ISO 14224)
Listed as category, then data: description.

Identification
- Failure record (*): unique failure identification
- Equipment location (*): tag number

Failure data
- Failure date (*): date the failure was detected (year/month/day)
- Failure mode (*): at equipment unit level as well as at maintainable item level
- Impact of failure on operation: a detailed list exists
- Severity class (*): effect on equipment unit function: critical failure, noncritical failure
- Failure descriptor: the descriptor of the failure (see Table 18)
- Failure cause: the cause of the failure (see Table 19)
- Subunit failed: name of the subunit that failed (see examples in Annex A)
- Maintainable item(s) failed: specify the failed maintainable item(s) (see examples in Annex A)
- Method of observation: how the failure was detected (see Table 20)

Remarks
- Additional information: give more details, if available, on the circumstances leading to the failure, additional information on failure cause, etc.

Failure causes (failure descriptors, from ISO 14224)
Listed as No., notation: description.

- 1.0 Mechanical failure, general: a failure related to some mechanical defect, but where no further details are known
- 1.1 Leakage: external and internal leakages, either liquids or gases. If the failure mode at equipment unit level is leakage, a more causally oriented failure descriptor should be used wherever possible
- 1.2 Vibration: abnormal vibration. If the failure mode at equipment level is vibration, a more causally oriented failure descriptor should be used wherever possible
- 1.3 Clearance/alignment failure: failure caused by faulty clearance or alignment
- 1.4 Deformation: distortion, bending, buckling, denting, yielding, shrinking, etc.
- 1.5 Looseness: disconnection, loose items
- 1.6 Sticking: sticking, seizure, jamming due to reasons other than deformation or clearance/alignment failures

Failure causes, cont.
- 2.0 Material failure, general: a failure related to a material defect, but no further details known
- 2.1 Cavitation: relevant for equipment such as pumps and valves
- 2.2 Corrosion: all types of corrosion, both wet (electrochemical) and dry (chemical)
- 2.3 Erosion: erosive wear
- 2.4 Wear: abrasive and adhesive wear, e.g. scoring, galling, scuffing, fretting, etc.
- 2.5 Breakage: fracture, breach, crack
- 2.6 Fatigue: if the cause of breakage can be traced to fatigue, this code should be used
- 2.7 Overheating: material damage due to overheating/burning
- 2.8 Burst: item burst, blown, exploded, imploded, etc.

Failure causes, cont.
- 3.0 Instrument failure, general: failure related to instrumentation, but no details known
- 3.1 Control failure
- 3.2 No signal/indication/alarm: no signal/indication/alarm when expected
- 3.3 Faulty signal/indication/alarm: signal/indication/alarm is wrong in relation to the actual process. Could be spurious, intermittent, oscillating, arbitrary
- 3.4 Out of adjustment: calibration error, parameter drift
- 3.5 Software failure: faulty or no control/monitoring/operation due to software failure
- 3.6 Common mode failure: several instrument items failed simultaneously, e.g. redundant fire and gas detectors

Failure causes, cont.
- 4.0 Electrical failure, general: failures related to the supply and transmission of electrical power, but where no further details are known
- 4.1 Short circuiting: short circuit
- 4.2 Open circuit: disconnection, interruption, broken wire/cable
- 4.3 No power/voltage: missing or insufficient electrical power supply
- 4.4 Faulty power/voltage: faulty electrical power supply, e.g. overvoltage
- 4.5 Earth/isolation fault: earth fault, low electrical resistance

Failure causes, cont.
- 5.0 External influence, general: the failure was caused by some external events or substances outside the boundary, but no further details are known
- 5.1 Blockage/plugged: flow restricted/blocked due to fouling, contamination, icing, etc.
- 5.2 Contamination: contaminated fluid/gas/surface, e.g. lubrication oil contaminated, gas detector head contaminated
- 5.3 Miscellaneous external influences: foreign objects, impacts, environmental influence from neighbouring systems
- 6.0 Miscellaneous, general: descriptors that do not fall into one of the categories listed above
- 6.1 Unknown: no information available related to the failure descriptor

Maintenance data
Maintenance is carried out
- To correct a failure (corrective maintenance)
- As a planned and normally periodic action to prevent failure from occurring (preventive maintenance)

Maintenance data (from ISO 14224)
Listed as category, then data: description.

Identification
- Maintenance record (*): unique maintenance identification
- Equipment location (*): tag number
- Failure record (*): corresponding failure identification (corrective maintenance only)

Maintenance data
- Date of maintenance (*): date when the maintenance action was undertaken
- Maintenance category: corrective maintenance or preventive maintenance
- Maintenance activity: description of the maintenance activity (see Table 21)
- Impact of maintenance on operation: zero, partial or total (safety consequences may also be included)
- Subunit maintained: name of the subunit maintained (see Annex A). NOTE: for corrective maintenance, the subunit maintained will normally be identical to the one specified on the failure event report
- Maintainable item(s) maintained: specify the maintainable item(s) that were maintained (see Annex A)

Maintenance resources
- Spare parts: spare parts required to restore the item; cost of spare parts, or links to a cost structure database
- Maintenance man-hours, per discipline: maintenance man-hours per discipline (mechanical, electrical, instrument, others)
- Maintenance man-hours, total: total maintenance man-hours

Maintenance time
- Active maintenance time: time duration for active maintenance work on the equipment
- Down time: the time interval during which an item is in a down state

State information
State information (condition monitoring information) may be collected in the following ways:
- Readings and measurements during maintenance
- Observations during normal operation
- Continuous measurements by use of sensor technology

State information, discrete readings
Listed as category, then data: description.

Identification
- State information record: unique state information identification
- Equipment location: tag number
- Maintenance record: corresponding maintenance identification, i.e. an observation is recorded in relation to either corrective or preventive maintenance
- Failure record: corresponding failure identification (if no maintenance is performed in relation to the failure)

State information
- Date of observation: date when the state information was read
- Type of measurement: what measurement is obtained? For example a distance measure
- Value: what are the readings of the measurement?

Remarks
- Additional information: give more details

If the readings are taken during normal operation, there will not be a corresponding maintenance or failure record. In this case the state information is linked directly to the inventory record.

State information, continuous readings
Listed as category, then data: description.

Identification
- State information record: unique state information identification
- Equipment location: tag number

State information
- Type of measurement: what measurement is obtained? For example a distance measure
- Sampling frequency: what is the sampling frequency?
- Sensor: what type of sensor is used?
- Data compression principle: how is the data compressed, e.g. Fast Fourier Transform

Remarks
- Additional information: give more details

For continuous readings, the state information is linked directly to the inventory record.

Data analysis
- Graphical techniques
  - Histogram
  - Bar charts
  - Pareto diagrams
  - Visualization of trends
- Parametric models
  - Estimation of constant failure rate
  - Estimation of increasing hazard rate
  - Estimation of global trends (over the system lifecycle)

Pareto diagram ("Top ten", components)
[Figure: Pareto diagram; y-axis: contribution to delay time [%], from 0 % to 30 %.]

Presenting raw data and rates from accident and incident reporting systems
- When presenting a "snapshot" of the indicators we often compare with target values
- Colour codes may be used
- For "occurrences" we just plot the raw data
- For frequencies we need to establish the "exposure"
  - Number of working hours in the period
  - Number of critical work operations

Cross-tabulation
- To see the effect of explanatory variables we can plot the number of occurrences or the frequencies as a function of one or two explanatory variables
- This gives an indication of whether the risk is unexpectedly high among certain groups of workers, during specific work operations, in special periods, etc.

Example of cross-tabulation (dummy figures)

                    Onshore                Offshore
ONGC employees      4 per 10^6 hrs         7.1 per 10^6 hrs
Contractors         3 per 10^6 hrs         2.4 per 10^6 hrs
Sub-contractors     8.2 per 10^6 hrs       12 per 10^6 hrs

Root cause analysis
- The objective is to present the contributing factors to the HSE indicators
- Occurrences and/or frequencies are plotted against the causation codes, see next slide
- Challenges
  - How to treat more than one causation code?
  - Causation codes are organised in a structure

Causation codes in an MTO structuring
[Figure: MTO structure of causation codes. Labels: triggering factors; underlying causes; work organisation; work supervision; change routines; communication; working environment; requirements/procedures/guidelines; management of company/entity; deficient safety culture; poor quality of established systems.]

KEP 2005: HSE deviations per triggering factor
[Figure: bar chart, y-axis 0 to 300, one series per year (2003, 2004, 2005).]

Triggering factors (translated from the abbreviated Norwegian chart labels, in chart order):
1. Did not implement sufficient securing of the workplace
2. Failed to inform/warn/communicate
3. Untidy workplace/lack of cleaning
4. Used equipment/tools in the wrong way
5. Deficient signage/cordoning off
6. Workplace was not properly prepared
7. Did not use correct personal protective equipment
8. Deficient quality control/verification
9. Violation of traffic rules
10. Did not observe signal/sign
11. Fault or failure in equipment/technical systems
12. Misplaced objects
13. Loose objects
14. Deficient protective equipment

Values per year, in chart order:
2003: 15, 4, 11, 11, 2, 8, 8, 8, 1, 6, 7, 5, 1, 4
2004: 262, 251, 124, 221, 163, 168, 169, 135, 97, 48, 79, 74, 48, 37
2005: 201, 81, 178, 69, 78, 54, 33, 62, 106, 126, 68, 70, 81, 68

Trend curves, three alternatives
1. Plot the number of occurrences as a function of time (histogram)
2. Plot frequencies (number/exposure) as a function of time
3. Plot both number and exposure as a function of time in the same diagram

KEP 2005: Quarterly HSE deviations
[Figure: incidents (left axis, 0 to 700) and exposure in hours worked (right axis, 0 to 900000) per quarter.]

Quarter    Number of registrations    Working hours
2003-1       2                        270743
2003-2       2                        127600
2003-3      22                        182706
2003-4      42                        297263
2004-1      67                        523307
2004-2     334                        774040
2004-3     635                        694858
2004-4     646                        850900
2005-1     518                        769685
2005-2     451                        402764
2005-3     216                        279959

Challenges
- It is difficult to see trends due to the stochastic nature of the number of events
- As an alternative, plot the cumulative number of events as a function of time (adjusted for exposure)
- A convex plot indicates an increasing risk level
- A concave plot indicates an improving situation
- The following example is based on the previous plot

[Figure: cumulative number of deviations (adjusted for exposure) per quarter, 2003-1 to 2005-3; y-axis 0 to 0.006.]

Interpretation of cumulative plot
- A convex plot indicates an increasing frequency of incidents
- A concave plot indicates improvement

Note
- Cross-tabulation and trend curves are used to focus on safety problems, but do not indicate improvement measures
- Root cause analysis identifies significant causes behind the undesired events/accidents, which gives a cue on measures
- A risk-reducing measure should be based on an understanding
  - That the measure is directed against one or more failure causes (causation codes)
  - That the measure is effective in terms of e.g. cost
  - That no negative effects of the measure are anticipated

Parameter estimation / bathtub curve
- The bathtub curve is a basis for reliability modelling, but
- There are two such curves
  - The hazard rate for "local time"
  - The failure intensity for "global time"
- Combining the two:

[Figure: bathtub curve; failure rate against time.]

[Figure: failure intensity/performance loss against global (system) time, with local-time curves (numbered 1 to 4) superimposed.]

Plotting techniques, lifetime data (local bathtub curve)
- Several plots exist to visualize characteristics of lifetime data
  - TTT plot
  - Kaplan-Meier plot
  - Hazard plot
- All these plots assume
  - Failure times are independent and identically distributed
  - I.e. no change over the system lifetime
- Examples of how lifetimes are generated are shown to the right

[Figure: seven lifetimes T1, ..., T7 (numbered 1 to 7, T5 marked with *) drawn on a time axis from t = 0 to End.]

TTT: Total Time on Test plot
- Let $T_1, T_2, T_3, \ldots, T_n$ be the recorded lifetimes
- Let $T_{(1)}, T_{(2)}, T_{(3)}, \ldots$ be the ordered lifetimes, i.e. $T_{(1)} \le T_{(2)} \le T_{(3)} \le \ldots$
- Define the total time on test at time $t$ by

  \[ TTT(t) = \sum_{j=1}^{i} T_{(j)} + (n - i)\,t \]

  where $i$ is such that $T_{(i)} \le t < T_{(i+1)}$
- The TTT plot is obtained by plotting, for $i = 1, \ldots, n$:

  \[ \left( \frac{i}{n},\ \frac{TTT(T_{(i)})}{TTT(T_{(n)})} \right) \]

Example

 i    T(i)    Sum T(i)   Sum T(i) + (n-i)T(i)   i/n    TTT    Transform
 0       0         0                0           0      0      0
 1    6000      6000            66000           0.09   0.35   0.51
 2    8000     14000            86000           0.18   0.46   0.63
 3   12000     26000           122000           0.27   0.65   0.71
 4   14000     40000           138000           0.36   0.74   0.78
 5   16000     56000           152000           0.45   0.81   0.83
 6   18000     74000           164000           0.55   0.88   0.87
 7   19000     93000           169000           0.64   0.90   0.91
 8   20000    113000           173000           0.73   0.93   0.94
 9   23000    136000           182000           0.82   0.97   0.96
10   24000    160000           184000           0.91   0.98   0.98
11   27000    187000           187000           1.00   1.00   1.00

Example plot
[Figure: TTT plot of the example data; both axes from 0 to 1.]

Interpretation
- A plot around the diagonal indicates a constant hazard rate, i.e. failure times can be considered exponentially distributed
- A concave plot (above the diagonal) indicates an increasing hazard rate (IFR)
- A convex plot (under the diagonal) indicates a decreasing hazard rate (DFR)
- A plot which is first convex and then concave indicates a bathtub-like hazard rate
- A plot which is first concave and then convex indicates heterogeneity in the data, see Vatn (1996)

Exercise
- Assume that the following failure data for one component type have been recorded (in months): 8, 9, 7, 6, 12, 18, 14, 6, 9, 11, 24
- Construct the TTT plot
- What would you say about the hazard rate?
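
The TTT construction defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure implemented in the course spreadsheet: the function name `ttt_points` is a hypothetical helper, and the input is the example data set of eleven lifetimes tabulated above (complete lifetimes, no censoring assumed).

```python
from itertools import accumulate


def ttt_points(lifetimes):
    """Return the TTT-plot points (i/n, TTT(T_(i)) / TTT(T_(n))) for i = 0..n.

    Assumes complete (uncensored) lifetimes; the name is illustrative only.
    """
    ordered = sorted(lifetimes)
    n = len(ordered)
    partial_sums = list(accumulate(ordered))  # sum of the i smallest lifetimes
    # TTT(T_(i)) = sum_{j<=i} T_(j) + (n - i) * T_(i)
    ttt = [s + (n - i) * t
           for i, (t, s) in enumerate(zip(ordered, partial_sums), start=1)]
    total = ttt[-1]  # TTT(T_(n)) equals the sum of all lifetimes
    points = [(0.0, 0.0)]
    points += [(i / n, value / total) for i, value in enumerate(ttt, start=1)]
    return points


# Example data from the table above (eleven recorded lifetimes)
example = [6000, 8000, 12000, 14000, 16000, 18000,
           19000, 20000, 23000, 24000, 27000]
points = ttt_points(example)
for x, y in points:
    print(f"{x:.2f}  {y:.2f}")
```

For the example data every point lies on or above the diagonal, which by the interpretation rules above suggests an increasing hazard rate. The exercise data can be passed to the same function.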

The Nelson-Aalen plot for global trend over the system lifetime
- The Nelson-Aalen plot shows the cumulative number of failures on the Y-axis, and the X-axis represents time
- A convex plot indicates a deteriorating system, whereas a concave plot indicates an improving system
- The idea behind the Nelson-Aalen plot is to plot the cumulative number of failures against time
- More precisely, we plot an estimate of $W(t)$, the expected cumulative number of failures up to time $t$

Nelson-Aalen procedure
- When estimating $W(t)$ we need failure data from one or more processes (systems)
- Each process (system) $i$ is observed in a time interval $(a_i, b_i]$, and $t_{ij}$ denotes failure time $j$ in process $i$ (global or calendar time)
- To construct the Nelson-Aalen plot the following algorithm can be used:
  1. Group all the $t_{ij}$'s and sort them, and denote the result $t_k$, $k = 1, 2, \ldots$
  2. For each $k$, let $O_k$ denote the number of processes that are under observation just before time $t_k$
  3. Let $\hat{W}_0 = 0$
  4. Let $\hat{W}_k = \hat{W}_{k-1} + 1/O_k$, $k = 1, 2, \ldots$
  5. Plot $(t_k, \hat{W}_k)$

Example of Nelson-Aalen plot

a_i    b_i    t_ij
  0     50    7, 20, 35, 44
 20     60    26, 33, 41, 48, 57
 40    100    50, 60, 69, 83, 88, 92, 99

[Figure: Nelson-Aalen plot of the example data; x-axis 0 to 100, y-axis 0 to 12.]

Parameter estimation
- Constant hazard rate, homogeneous sample
- Constant hazard rate, non-homogeneous sample
- Increasing hazard rate

Constant failure rate, homogeneous sample
- In this situation we only need the following information
  - $t$ = aggregated time in service
  - $n$ = the total number of observed failures in the period
- An estimate for the failure rate is given by

  \[ \hat{\lambda} = \frac{\text{Number of failures}}{\text{Aggregated time in service}} = \frac{n}{t} \]

Multi-sample problems
- In many cases we do not have a homogeneous sample of data
- The aggregated data for an item may come from different installations with different operational and environmental conditions, or we may wish to present an "average" failure rate estimate for slightly different items
- In these situations we may decide to merge several more or less homogeneous samples into what we call a multi-sample
- The various samples may have different failure rates, and different amounts of data

Illustration, multi-sample data
[Figure: failure rate estimates with uncertainty limits for samples 1, 2, 3, ..., k and for the total; x-axis: failure rate (failures per 10^4 hours), 1 to 12.]

Estimation principles, multi-sample
The OREDA estimator used in the OREDA data handbook is based on the following assumptions:
- We have $k$ different samples. A sample may e.g. correspond to a platform, and we may have data from similar items used on $k$ different platforms
- In sample no. $i$ we have observed $n_i$ failures during a total time in service $t_i$, for $i = 1, 2, \ldots, k$
- Sample no. $i$ has a constant failure rate $\lambda_i$, for $i = 1, 2, \ldots, k$
- Due to different operational and environmental conditions, the failure rate $\lambda_i$ may vary between the samples. This variation is described by a probability density function, say $\pi(\lambda)$
- The OREDA handbook presents the expectation and standard deviation of the estimated distribution $\pi(\lambda)$

Increasing hazard rate
- The estimation of parameters in e.g. the Weibull distribution requires Maximum Likelihood procedures
- Let $\theta$ be the parameter vector of interest, for example $\theta = [\alpha, \lambda]$ if the Weibull distribution is considered
- Let $t_j$ denote the observed lifetimes, both censored and real lifetimes
- The likelihood function is now given by

  \[ L(\theta; t) = \prod_{j \in C_L} F(t_j; \theta) \prod_{j \in U} f(t_j; \theta) \prod_{j \in C_R} R(t_j; \theta) \]

  where $C_L$, $U$ and $C_R$ are the sets of left-censored (start of observation not known), uncensored (real lifetimes) and right-censored (failure time not known) lifetimes
- The estimator $\hat{\theta}$ is the value of $\theta$ that maximizes $L(\theta; t)$

How to do it?
- Usually we use numerical methods to maximize the likelihood function
- Such a procedure is implemented in the TTTPlot.xls file
- For the example data we get $\hat{\alpha} = 3.05630302$ and $\hat{\lambda} = 0.000052$

Exercise
- Assume that the following failure data for one component type have been recorded (in months): 8, 9, 7, 6, 12, 18, 14, 6, 9, 11, 24
- Find the ML estimators for the parameters in the Weibull distribution
- Compare the parametric plot (Weibull) with the TTT plot, and judge how well the data fit the Weibull distribution

Likelihood function, Weibull model
- Assume $t_{(1)}, t_{(2)}, \ldots, t_{(n)}$ are ordered failure times, and $I_{(1)}, I_{(2)}, \ldots, I_{(n)}$ are indicators such that $I_{(i)} = 1$ if $t_{(i)}$ is a failure time, and $0$ if it is a censored lifetime
- The probability density function is given by

  \[ f(t) = \alpha \lambda (\lambda t)^{\alpha - 1} e^{-(\lambda t)^{\alpha}} \]

- The survival function is given by

  \[ R(t) = e^{-(\lambda t)^{\alpha}} \]

- The log likelihood function is given by

  \[ l(\alpha, \lambda) = \log L(\alpha, \lambda) = \sum_{i=1}^{n} I_{(i)} \left[ \ln \alpha + \alpha \ln \lambda + (\alpha - 1) \ln t_{(i)} \right] - \sum_{i=1}^{n} (\lambda t_{(i)})^{\alpha} \]

Exercise
- Assume that the following failure data for one component type have been recorded (in months): 8, 9, 7, 6, 12, 18, 14, 18*, 6, 9, 11, 24, 30*, 28*
- Find the ML estimators for the parameters in the Weibull distribution, where failure times with a star (*) represent censored lifetimes

Kaplan-Meier estimator
- The standard TTT plot assumes that we do not have censored lifetimes
- The Kaplan-Meier estimator and the corresponding plot may be used with censored lifetimes
- Let $t_{(1)}, t_{(2)}, \ldots, t_{(n)}$ be ordered failure times, with a corresponding indicator variable to indicate the real failure times
- Let $n_{(i)}$ be the number of components "at risk" just prior to $t_{(i)}$, and $s_{(i)}$ the number of "deaths" at that time
- The Kaplan-Meier estimator is given by

  \[ \hat{R}(t) = \prod_{i:\, t_{(i)} < t} \frac{n_{(i)} - s_{(i)}}{n_{(i)}} \]

Exercise
- Assume that the following failure data for one component type have been recorded (in months): 8, 9, 7, 6, 12, 18, 14, 18*, 6, 9, 11, 24, 30*, 28*
- Construct the Kaplan-Meier plot, and insert a Weibull distribution overlay curve with the parameters estimated

Estimation in NHPP
The Non-Homogeneous Poisson Process (NHPP) is a model defined by:
- A system is put into service at time $t = 0$
- If the system fails, a repair is conducted and the system is put back into service after a time that can be neglected
- The repair action sets the system back to a state as good as it was immediately prior to the failure, i.e. a minimal repair
- The important parameter is $w(t)$ = ROCOF = probability of failure in $(t, t + \Delta t)$ divided by $\Delta t$

For the NHPP we have
- The rate of occurrence of failures, ROCOF = $w(t)$, is generally not constant
- The number of failures in an interval $(a, b)$ is Poisson distributed with parameter $W(b) - W(a)$:

  \[ \Pr(N(b) - N(a) = n) = \frac{[W(b) - W(a)]^{n}}{n!}\, e^{-[W(b) - W(a)]} \]

- The mean number of failures in an interval $(a, b)$ is

  \[ E(N(b) - N(a)) = W(b) - W(a) = \int_a^b w(t)\, dt \]

- The cumulative number of failures up to time $t$ is

  \[ W(t) = \int_0^t w(u)\, du \]

Properties for selected NHPP models

Power law model
- ROCOF: $w(t) = \lambda \beta t^{\beta - 1}$
- $W(t) = \lambda t^{\beta}$
- System improves for $\beta < 1$
- System deteriorates for $\beta > 1$
- Average failure rate when replaced at time $\tau$: $\lambda \tau^{\beta - 1}$

Linear model
- ROCOF: $w(t) = \lambda (1 + \alpha t)$
- $W(t) = \lambda (t + \alpha t^2 / 2)$
- System improves for $\alpha < 0$
- System deteriorates for $\alpha > 0$
- Average failure rate when replaced at time $\tau$: $\lambda (1 + \alpha \tau / 2)$

Log-linear model
- ROCOF: $w(t) = e^{\alpha + \beta t}$
- $W(t) = (e^{\alpha + \beta t} - e^{\alpha}) / \beta$
- System improves for $\beta < 0$
- System deteriorates for $\beta > 0$
- Average failure rate when replaced at time $\tau$: $(e^{\alpha + \beta \tau} - e^{\alpha}) / (\beta \tau)$

Estimation in NHPP
- An NHPP is observed over a period $(a, b)$, $0 \le a < b$
- We have observed $n$ failure times $t_1, t_2, \ldots, t_n$, sorted in time
- The likelihood function, say $L(\theta, t)$, is the probability that we have observed the actual failure times, i.e. $t = [t_1, t_2, \ldots, t_n]$, as a function of $\theta$
- Consider small time intervals around the observed failure times and let $\Delta t_i$ be such a small time interval following $t_i$
- The likelihood function is

  \[ L(\theta, t) = \Pr(\text{exactly 1 failure in } (t_i, t_i + \Delta t_i),\ i = 1, \ldots, n\ \cap\ \text{no failures elsewhere in } (a, b)) \]

This yields
- $\Pr(\text{exactly 1 failure in } (t_i, t_i + \Delta t_i)) \approx w(t_i)\, \Delta t_i$
- Pr(no failures elsewhere in (a, b)) = Pr(no failures in (a, t_1), (t_1 + ...

Bayesian estimation
- In some situations we may have tacit knowledge in terms of expert knowledge
- Experts are typically experienced people in the project organisation
- By an elicitation procedure we may get the experts to state their uncertainty distribution regarding the parameters of interest
- This uncertainty distribution is combined with data to find the final parameter estimates

Procedure
- Specify a prior uncertainty distribution of the reliability parameter, $\pi(\theta)$
- Structure the reliability data information into a likelihood function, $L(\theta; x)$
- Calculate the posterior uncertainty distribution of the reliability parameter vector, $\pi(\theta \mid x)$
- The posterior is found by $\pi(\theta \mid x) \propto L(\theta; x)\, \pi(\theta)$, and the proportionality constant is found by requiring the posterior to integrate to one
- The Bayes estimate for the reliability parameter is given by the posterior mean

Example: Constant failure rate
- $\lambda$ = failure rate, treated as a random quantity
- Prior expert distribution from the elicitation procedure:
  - $E(\lambda) = 0.7 \cdot 10^{-6}$ (failures/hour)
  - $SD(\lambda) = 0.3 \cdot 10^{-6}$
- For mathematical simplicity, a gamma prior is used with parameters $\alpha$ and $\beta$, where $E = \alpha / \beta$ and $Var = \alpha / \beta^2$
  - $\beta = E / SD^2 = (0.7 \cdot 10^{-6}) / (0.3 \cdot 10^{-6})^2 = 7.78 \cdot 10^{6}$
  - $\alpha = \beta \cdot E = (7.78 \cdot 10^{6}) \cdot (0.7 \cdot 10^{-6}) = 5.44$

Example, cont.
- Data:
  - $t$ = total time in service = 525 600 hours (e.g. 60 detector years)
  - $n$ = 1 = number of failures observed
- With a constant failure rate (exponentially distributed failure times), the number of failures in a period of length $t$, $N(t)$, is Poisson distributed with parameter $\lambda t$
- The probability of observing $n$ failures is then

  \[ L(\lambda; n, t) = \Pr(N(t) = n) \propto \lambda^{n} e^{-\lambda t} = \text{likelihood} \]

Example, cont.
- The posterior distribution is found by multiplying the prior distribution with the likelihood function:

  \[ \pi(\lambda \mid n, t) \propto L(\lambda; n, t)\, \pi(\lambda) \propto \lambda^{n} e^{-\lambda t}\, \lambda^{\alpha - 1} e^{-\beta \lambda} \propto \lambda^{(\alpha + n) - 1} e^{-(\beta + t)\lambda} \]

- $\lambda^{(\alpha + n) - 1} e^{-(\beta + t)\lambda}$ is recognized as a gamma distribution with new parameters $\alpha' = \alpha + n$ and $\beta' = \beta + t$
- The Bayes estimate is given by the mean of this distribution, i.e.

  \[ \hat{\lambda} = \frac{\alpha + n}{\beta + t} = \frac{5.44 + 1}{7.78 \cdot 10^{6} + 525600} = 0.78 \cdot 10^{-6} \]

  (MLE: $1.9 \cdot 10^{-6}$, prior mean: $0.7 \cdot 10^{-6}$)
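
The arithmetic of the constant failure rate example can be checked with a short Python sketch. It assumes the gamma prior is parameterized by a shape and a rate (mean = shape/rate, variance = shape/rate^2), which is what the numbers in the example imply; the function names and the variable names `alpha` and `beta` are labels chosen here for illustration.

```python
def gamma_prior_from_moments(mean, sd):
    """Match a gamma(shape, rate) prior to an elicited mean and standard deviation."""
    rate = mean / sd ** 2   # rate = E / SD^2
    shape = rate * mean     # shape = rate * E
    return shape, rate


def gamma_poisson_update(shape, rate, n_failures, time_in_service):
    """Posterior gamma parameters after n failures in t hours (Poisson likelihood)."""
    return shape + n_failures, rate + time_in_service


# Elicited prior: E = 0.7e-6 failures/hour, SD = 0.3e-6
alpha, beta = gamma_prior_from_moments(0.7e-6, 0.3e-6)

# Data: n = 1 failure during t = 525 600 hours in service
n, t = 1, 525_600
alpha_post, beta_post = gamma_poisson_update(alpha, beta, n, t)

bayes_estimate = alpha_post / beta_post  # posterior mean
mle = n / t                              # estimate from the data alone

print(f"prior:  alpha = {alpha:.2f}, beta = {beta:.3g}")
print(f"Bayes estimate = {bayes_estimate:.2e} per hour")
print(f"MLE            = {mle:.2e} per hour")
```

The posterior mean falls between the prior mean and the MLE, and with only one observed failure it stays close to the prior.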