### Data Handling II - KEATS

```
Drug Development Statistics & Data Management
Data Handling II:
Dr Yanzhong Wang
Lecturer in Medical Statistics
Division of Health and Social Care Research
King's College London
Email: [email protected]
Types of data
• Quantitative data
– continuous, discrete
– distributions may be symmetric or skewed
• Qualitative (categorical) data
– binary
– nominal, ordinal
Skewed Distributions
[Figure: two frequency histograms. Positively skewed data have a long tail to the right; negatively skewed data have a long tail to the left.]
Symmetric Distribution
[Figure: symmetric density curve; horizontal axis 0 to 6, vertical axis 0 to 0.4.]
Summary statistics
• ‘Where the data are’ - location
– mean, median, mode, geometric mean
• Used to describe baseline data and main
outcomes
• ‘How variable the data are’ - spread
– standard deviation, variance, range, interquartile
range, 95% range
• Needed (primarily) to describe baseline data
in RCT and cohort study
Definition of the Mean
The mean of a sample of values is the
arithmetic average and is determined by
dividing the sum of the values by the
number of the values.
Definition of the Median
The median is the middle value of the ordered data.
• Not affected by skewness or outliers, but in theory less
precise than the mean.
Ordered Blood Glucose Values
(40 values in mmol/litre, in ascending order down each column)
2.2  3.6  3.8  4.2  4.7
2.9  3.6  3.9  4.3  4.7
3.3  3.6  4.0  4.4  4.8
3.3  3.6  4.0  4.4  4.9
3.3  3.7  4.0  4.4  4.9
3.4  3.7  4.1  4.5  5.0
3.4  3.8  4.1  4.6  5.1
3.4  3.8  4.1  4.7  6.0
Definition of the Mode
The mode is the most frequent value.
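Ordered Blood Glucose Values
[Figure: the 40 values stacked by value. 3.6 occurs most often (4 times), so the mode is 3.6.]
Value (count): 2.2 (1), 2.9 (1), 3.3 (3), 3.4 (3), 3.6 (4), 3.7 (2), 3.8 (3),
3.9 (1), 4.0 (3), 4.1 (3), 4.2 (1), 4.3 (1), 4.4 (3), 4.5 (1), 4.6 (1),
4.7 (3), 4.8 (1), 4.9 (2), 5.0 (1), 5.1 (1), 6.0 (1)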
Location = Central Tendency
• Mode - not necessarily central (categorical data)
• Median - only uses relative magnitudes
• Arithmetic Mean - outlier prone
[Figure: histogram of count (0 to 7) against blood glucose (mmol/litre), 2 to 6.]
Relation of mean, median and mode
• If the distribution is unimodal (has only one
mode) then, as a rule of thumb:
– Mean = median = mode for a symmetric
distribution.
– Mean > median > mode for a positively skewed
distribution.
– Mean < median < mode for a negatively skewed
distribution.
Serum Triglyceride Levels from Cord
Blood of 282 Babies
[Figure: histogram of count (0 to 80) against serum triglyceride level (0.2 to 1.7).]
Log(Serum Triglyceride Levels) from Cord
Blood of 282 Babies
[Figure: histogram of count (0 to 35) against log(serum triglyceride) level (-1.9 to 0.5).]
Definition of the Geometric Mean
The geometric mean of a sample of n values
is determined by multiplying all the values
together and taking the nth root (for only
two values this is the more familiar square
root).
Geometric Mean
• A common example of when the geometric
mean is the correct choice of average is
averaging growth rates.
• Another method: take the log of each value,
find the arithmetic mean and anti-log the result.
exp( (log(0.15) + … + log(1.66)) / 282 ) = 0.467
Serum Triglyceride Levels from Cord
Blood of 282 Babies
[Figure: histogram of count (0 to 80) against serum triglyceride level (0.2 to 1.7), marking median = 0.460, geometric mean = 0.467 and mean = 0.506.]
Why measures of variability are
important
Production of Aspirin
• New production process of 100 mg tablets
• Random sample from new process
– 96 97 100 101 101 mg - mean 99 mg
• Random sample from old process
– 88 93 100 104 110 mg - mean 99 mg
• Same means, but the new process is better because it is less
variable
Definition of Range
The range of a sample of values is the largest
value minus the smallest value.
• New process: the range is 101 - 96 = 5
• Old process: the range is 110 - 88 = 22
• Range is simple ... BUT
– Only uses min and max
– Tends to get larger as sample size increases
Definition of Inter-quartile Range
The inter-quartile range of a sample of values is
the difference between the upper and lower
quartiles. The lower quartile is the value which is
greater than ¼ of the sample and less than ¾ of
the sample. Conversely, the upper quartile is the
value which is greater than ¾ of the sample and
less than ¼ of the sample.
Ordered Blood Glucose Values
1/4 of 40 = 10
3/4 of 40 = 30
(40 values in mmol/litre, in ascending order down each column)
2.2  3.6  3.8  4.2  4.7
2.9  3.6  3.9  4.3  4.7
3.3  3.6  4.0  4.4  4.8
3.3  3.6  4.0  4.4  4.9
3.3  3.7  4.0  4.4  4.9
3.4  3.7  4.1  4.5  5.0
3.4  3.8  4.1  4.6  5.1
3.4  3.8  4.1  4.7  6.0
Inter-Quartile Range
[Figure: histogram of count (0 to 7) against blood glucose (mmol/litre), 2 to 6, marking the lower quartile, the upper quartile and the inter-quartile range between them.]
Standard deviation
• Neither the range nor the inter-quartile range uses all the
numerical values - only relative magnitudes
• A measure accounting for the values is the
standard deviation
• Consider the aspirin data from the new process
96 97 100 101 101 (mean 99 mg)
• Determine deviations from mean
-3 -2 1 2 2
• Square, add, average and square-root
sqrt( (9 + 4 + 1 + 4 + 4) / 5 ) = sqrt(4.4) = 2.098
Measures of scatter/dispersion – ‘how
variable the data are’
• Range – smallest to biggest value
– increases with sample size
• Standard deviation – measure of variation
around the mean
– affected by skewness and outliers
• Variance = square of standard deviation
• Interquartile range (IQR) – from 25th centile
to 75th centile
Plotting Data
• Histograms
• Stem and Leaf Plots
[Figure: histogram of blood glucose (mmol/litre) with the count labelled on each bar.]
Stem Leaf
60 0
58
56
54
52
50 00
48 000
46 0000
44 0000
42 00
40 000000
38 0000
36 000000
34 000
32 000
30
28 0
26
24
22 0
----+----+----+----+
Multiply Stem.Leaf by 10**-1
Box Plots
[Figure: box plot with axis from 2 to 5.]
Mean and standard deviation
• Best description if distribution reasonably
symmetric (and single mode)
• Give full description if data have Normal
distribution
[Figure: three Normal density curves for x from 0 to 10: mean 3, s.d. 1; mean 5, s.d. 1; mean 5, s.d. 2. Vertical axis 0 to 0.4.]
Properties of Normal distribution
• Symmetric distribution – mean, median and
mode equal
• Completely specified by mean and standard
deviation
• 95% of distribution contained within mean ±
1.96 standard deviations
• 68% within mean ± 1 standard deviation
Continuous data,
not Normally distributed
• If symmetric use mean and standard deviation
• If skewed use median and IQR
Unless
• Positively skewed, but log transformation
creates symmetric distribution – use
geometric mean
Nominal categorical data
• Mode.
• % in each category, especially when binary.
Wheeze in last 12 months   Frequency (n)       %
No                                  1945    75.2
Yes                                  642    24.8
Total                               2587   100.0
Ordinal categorical data
• Median and IQR if enough separate values.
• Otherwise as for nominal.
Discrete quantitative data
• As for continuous data if many values, as for
ordinal data if fewer.
Difference Between
Standard Deviation & Standard Error
Measure of Variability of the Sample Mean
• Range, inter-quartile range and standard
deviation describe the variability of individual values in the
population (or sample), not the variability of the sample mean.
• To understand the difference, carry out a
sampling experiment using the Ritchie
Index values
Values of the Ritchie Index (Measure of
Joint Stiffness) in 50 Untreated Patients
14 9 8 9 1 20 3 3 2 4
2 3 6 1 2 11 16 24 16 21
19 22 33 12 12 12 19 10 33 2
19 40 1 20 1 2 4 7 9 4
9 6 14 8 27 10 27 7 24 21
Mean = (14+…+21)/50 = 12.18
Location = Central Tendency
• Mode - not necessarily central (categorical data)
• Median - only uses relative magnitudes
• Arithmetic Mean - outlier prone
[Figure: histogram of the values of the Ritchie Index in groups 0-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40; count axis 0 to 16.]
Sampling Experiment
• Take a random sample (10) from the 50
values
• Determine the mean of the 10 values
• Repeat 50 times
• These means show variation - HOW
LARGE IS IT ?
Variations in Samples
[Figure: five histogram panels of values of the Ritchie Index (groups 0-5 to 36-40), labelled Mean=12.18, Mean=10.00, Mean=12.60, Mean=11.50 and Mean=13.40.]
Ritchie Values
Original values (mean = 12.18; sd = 9.69)
[Figure: histogram of the values of the Ritchie Index, groups 0-5 to 36-40; vertical axis 0 to 30.]
Ritchie Values
Sampling Experiment – Sample Means
Original values (mean = 12.18; sd = 9.69)
Sample means (mean = 12.21; sd = 2.97)
[Figure: histograms of the original values and of the sample means, groups 0-5 to 36-40; vertical axis 0 to 30.]
Definition of the Standard Error
The standard deviation of the sampling
distribution of the mean is called the
standard error of the mean.
Increasing Sample Size
n=10: sample means (mean = 12.21; sd = 2.97)
n=15: sample means (mean = 12.37; sd = 2.43)
[Figure: two histograms of sample means, one per sample size, groups 0-5 to 36-40; vertical axis 0 to 40.]
• Increased precision (smaller standard error)
• Less skewness
Standard error of the mean as a
function of the sample size
sd = σ
se = σ / √n
[Figure: standard error of the mean (0 to 10) plotted against sample size (0 to 40).]
Population of Gene Lengths
n=20,290
[Figure: histogram of frequency (0 to 3000) against gene length (# of nucleotides), 0 to 15000.]
Samples of size : n=100
[Figure: histogram of frequency (0 to 300) against gene length (# of nucleotides), 0 to 15000.]
Practical Confusion
• A mean is often reported in medical
papers as
12.18 ± 1.37
what is 1.37 ?
sd or se ?
Thanks!
Tea break
```
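The calculations in the slides can be checked with a short script. The following is a minimal sketch using only the Python standard library, not part of the original lecture. The data lists are copied from the slides; the function names (`spread`, `standard_error`, `sampling_experiment`) and the seed are illustrative assumptions. The slides do not say whether the sampling experiment drew with or without replacement, so the sketch assumes sampling without replacement. The aspirin slide divides by n when computing the standard deviation, whereas the Ritchie Index figure of 9.69 corresponds to dividing by n - 1, so both versions are returned.

```python
import math
import random
import statistics

# Aspirin tablet weights (mg), copied from the slides.
NEW_PROCESS = [96, 97, 100, 101, 101]
OLD_PROCESS = [88, 93, 100, 104, 110]

# The 40 ordered blood glucose values (mmol/litre), copied from the slides.
GLUCOSE = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4,
    3.6, 3.6, 3.6, 3.6, 3.7, 3.7, 3.8, 3.8,
    3.8, 3.9, 4.0, 4.0, 4.0, 4.1, 4.1, 4.1,
    4.2, 4.3, 4.4, 4.4, 4.4, 4.5, 4.6, 4.7,
    4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

# Ritchie Index values for the 50 untreated patients, copied from the slides.
RITCHIE = [
    14, 9, 8, 9, 1, 20, 3, 3, 2, 4,
    2, 3, 6, 1, 2, 11, 16, 24, 16, 21,
    19, 22, 33, 12, 12, 12, 19, 10, 33, 2,
    19, 40, 1, 20, 1, 2, 4, 7, 9, 4,
    9, 6, 14, 8, 27, 10, 27, 7, 24, 21,
]


def spread(values):
    """Return (range, SD dividing by n, SD dividing by n - 1)."""
    return (
        max(values) - min(values),
        statistics.pstdev(values),  # divides by n, as on the aspirin slide
        statistics.stdev(values),   # divides by n - 1, the usual sample SD
    )


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def sampling_experiment(values, sample_size, repeats=50, seed=1):
    """Draw repeated random samples and return the list of sample means.

    Assumption: samples are drawn without replacement (the slides do not say).
    """
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(values, sample_size))
        for _ in range(repeats)
    ]


if __name__ == "__main__":
    print("Aspirin, new process:", statistics.mean(NEW_PROCESS), spread(NEW_PROCESS))
    print("Aspirin, old process:", statistics.mean(OLD_PROCESS), spread(OLD_PROCESS))
    print("Glucose median and mode:", statistics.median(GLUCOSE), statistics.mode(GLUCOSE))
    print("Ritchie mean, SD, SE:", statistics.mean(RITCHIE),
          statistics.stdev(RITCHIE), standard_error(RITCHIE))
    for n in (10, 15):
        means = sampling_experiment(RITCHIE, n)
        print(f"n={n}: mean of sample means, SD of sample means:",
              statistics.mean(means), statistics.stdev(means))
```

The sample means from `sampling_experiment` depend on the random seed, so they will not match the slide figures (12.21 and 2.97 for n=10) exactly. For the 50 Ritchie values, the sample SD divided by the square root of 50 comes to about 1.37, which is the second number in the "12.18 ± 1.37" example.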