### S1: Chapter 6 Correlation

```S1: Chapter 4
Representation of Data
Dr J Frost
Stem and Leaf recap
Put the following measurements into a stem and leaf diagram:
4.7 3.6 3.8 4.7 4.1 2.2 3.6 4.0 4.4 5.0 3.7 4.6 4.8 3.7 3.2
2.5 3.6 4.5 4.7 5.2 4.7 4.2 3.8 5.1 1.4 2.1 3.5 4.2 2.4 5.1
1 | 4                          (1)
2 | 1 2 4 5                    (4)
3 | 2 5 6 6 6 7 7 8 8          (9)
4 | 0 1 2 2 4 5 6 7 7 7 7 8    (12)
5 | 0 1 1 2                    (4)
Key: 2 | 1 means 2.1
Now find:
Mode = 4.7
Lower quartile = 3.6
Upper quartile = 4.7
Median = 4.05
Back-to-Back Stem and Leaf recap
Girls: 55 92 66 90 80 84 91 98 40 60 72 96 85 76 54 58 78 80 79 80 64 88 92
Boys: 80 60 91 65 67 59 75 46 72 71 74 57 64 60 50 68
The data above shows the pulse rate of boys and girls in a school.
Comment on the results.
        Girls |   | Boys
            0 | 4 | 6
        8 5 4 | 5 | 0 7 9
        6 4 0 | 6 | 0 0 4 5 7 8
      9 8 6 2 | 7 | 1 2 4 5
  8 5 4 0 0 0 | 8 | 0
  8 6 2 2 1 0 | 9 | 1
Key: 0|4|6 means 40 for girls and 46 for boys.
The back-to-back stem and leaf diagram shows that boys’ pulse rates tend to be lower than girls’.
Box Plot recap
Box Plots allow us to visually represent the distribution of the data.
Minimum: 3
Lower Quartile: 15
Median: 17
Upper Quartile: 22
Maximum: 27
[Box plot sketched on an axis from 0 to 30.]
How is the IQR represented in this diagram? By the width of the box.
How is the range represented in this diagram? By the distance between the ends of the two whiskers.
Box Plots recap
Sketch a box plot to represent the given weights of cats:
5lb, 6lb, 7.5lb, 8lb, 8lb, 9lb, 12lb, 14lb, 20lb
Minimum: 5
Lower Quartile: 7.5
Median: 8
Upper Quartile: 12
Maximum: 20
[Box plot sketched on an axis from 0 to 24 lb.]
Outliers
An outlier is an extreme value.
More specifically, a value is generally classed as an outlier when it lies more than 1.5 IQRs below the lower quartile or above the upper quartile.
(But you will be told in the exam if the rule differs from this.)
[Box plot on an axis from 0 to 30, marking the point beyond which values are outliers.]
Outliers
We can display outliers as crosses on a box plot.
But if we have one, how do we display the marks for the minimum/maximum?
[Box plot on an axis from 0 to 30, with outliers at the lower end.]
The maximum point is not an outlier, so it remains unchanged.
But we have points that are outliers at the other end. This mark becomes the ‘outlier boundary’, rather than the minimum.
Examples
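Example 1
Smallest values: 0, 3
Lower Quartile: 8
Median: 10
Upper Quartile: 14
Largest values: 21, 27
IQR = 14 − 8 = 6, so the outlier boundaries are 8 − 1.5 × 6 = −1 and 14 + 1.5 × 6 = 23. Only 27 is an outlier.
[Box plot to be sketched on an axis from 0 to 30.]
Example 2
Smallest values: 3, 7
Lower Quartile: 12
Median: 13
Upper Quartile: 16
Largest values: 20, 25, 26
IQR = 16 − 12 = 4, so the outlier boundaries are 12 − 1.5 × 4 = 6 and 16 + 1.5 × 4 = 22. The values 3, 25 and 26 are outliers.
[Box plot to be sketched on an axis from 0 to 30.]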
Exercises
Page 58 Exercise 4B
Q2
Page 59 Exercise 4C
Q1, 2
Comparing Box Plots
Box Plot comparing house prices of Croydon and Kingston-upon-Thames.
[Two box plots, one for Croydon and one for Kingston, on a shared axis from £100k to £450k.]
“Compare the prices of houses in Croydon with those in Kingston”. (2 marks)
For 1 mark, one of:
• The interquartile range of house prices in Kingston is greater than in Croydon.
• The range of house prices in Kingston is greater than in Croydon.
For 1 mark:
• The median house price in Kingston was greater than that in Croydon.
• i.e. compare some measure of location (could be minimum, lower quartile, etc.)
Bar Charts vs Histograms
Histograms
• For continuous data. (Use this as a reason to justify use of a histogram.)
• Data divided into (potentially uneven) intervals.
• [GCSE definition] Frequency given by area of bars.*
• No gaps between bars.
[Example: frequency density against height, from 1.0m to 1.8m.]
Bar Charts
• For discrete data.
• Frequency given by height of bars.
[Example: frequency against shoe size, from 6 to 9.]
* Not actually true. We’ll correct this in a sec.
Bar Charts vs Histograms
Weight (w kg)   Frequency   Frequency Density
0 < w ≤ 10      40          4
10 < w ≤ 15     6           1.2
15 < w ≤ 35     52          2.6
35 < w ≤ 45     10          1
Still using the ‘incorrect’ GCSE formula: Frequency Density = Frequency ÷ Width
[Histogram: frequency density (0 to 5) against height in m (10 to 50); the bars have frequencies 15, 25, 30 and 40.]
Area = frequency?
The area of each bar in fact isn’t necessarily equal to the frequency.
Actually:
frequency ∝ area, i.e. frequency = k × area
Similarly:
frequency density ∝ frequency ÷ class width
However, we often let k = 1, so that the ∝ becomes an =, as we were allowed to assume at GCSE.
The key to almost every histogram question…
…This diagram!
Area × k = Frequency
For a given histogram, there’s some scaling k to get from an area (whether the total area or the area of a particular bar) to the corresponding frequency.
Once you’ve worked out this scaling, any subsequent areas you calculate can be converted to frequencies.
Area = frequency?
There were 60 runners in a 100m race. The following histogram represents their times.
Determine the number of runners with times above 14s.
[Histogram: frequency density (0 to 5) against time in s, with 9, 12 and 18 marked on the time axis.]
We first find what area represents the total frequency.
Total area = 15 + 9 = 24
Area 24 → Frequency 60, so the scaling is × 2.5.
Then use this scaling along with the desired area.
Area above 14s = 4 × 1.5 = 6
Area 6 → Frequency 6 × 2.5 = 15
Frequency Density = Frequency ÷ Class width?
Weight (to nearest kg)   Frequency
1-2                      4
3-6                      3
7-9                      3
Note the gaps! We can use the complete set of information in the first row combined with the bar to again work out the correct ‘scaling’.
[Histogram: frequency density (0 to 5) against a horizontal axis from 1 to 10, labelled “Time (s)” on the slide.]
May 2012
A policeman records the speed of the traffic on a busy road with a 30 mph speed limit.
He records the speeds of a sample of 450 cars. The histogram in Figure 2 represents the
results.
(a) Calculate the number of cars that were exceeding the speed limit by at least 5
mph in the sample. (4 marks)
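[Figure 2: histogram of speeds, with the frequency density axis marked from 1 to 7.]
We can make the frequency density scale what we like.
M1 A1: Determine what one small square or one large square is worth (i.e. work out the area → frequency scaling).
Total area 112.5 → Frequency 450, so the scaling is × 4.
M1 A1: Use this to find the number of cars travelling >35mph.
Area 22.5 → Frequency 22.5 × 4 = 90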
May 2012
A policeman records the speed of the traffic on a busy road with a 30 mph speed limit.
He records the speeds of a sample of 450 cars. The histogram in Figure 2 represents the
results.
(b) Estimate the value of the mean speed of the cars in the sample. (3 marks)
M1 M1: Use the histogram to construct the sum of speeds.
(30 × 12.5 + 240 × 25 + ⋯) ÷ 450
A1: Correct value = 28.8
Bro Tip: Whenever you are asked to calculate mean, median or quartiles from a histogram, form a grouped frequency table. Use your scaling factor to work out the frequency of each bar.
May 2012
Speed    Midpoint   Frequency
10-15    12.5       30
20-30    25         240
30-35    32.5       90
35-40    37.5       30
40-45    42.5       60
Jan 2012
Bro Tip: Be careful that you use the correct class widths!
14
5
21 + 45 + 3 = 69
Jan 2008
M1
A1
B1
M1
= 12 runners
A1
? is continuous
Note the gaps in the class intervals!
4 / 5 = 0.8
19 / 5 = 3.8
53 / 10 = 5.3
...
Jun 2007
35
15
(5 × 5) + 15 = 40
Skew
Skew gives a measure of whether the values are more spread out above the
median or below the median.
[Sketches: two frequency curves, one against height and one against weight, each marked with its mode, median and mean.]
When the tail is on the right, the order from left to right is mode, median, mean.
We say this distribution has positive skew.
(To remember, think that the ‘tail’ points in the positive direction.)
When the tail is on the left, the order from left to right is mean, median, mode.
We say this distribution has negative skew.
Skew
Remember, think what direction the ‘tail’ is likely to point.
Distribution: Salaries in the UK.
Skew: High salaries drag the mean up. So positive skew.
Mean > Median

Distribution: IQ
Skew: A symmetrical distribution, i.e. no skew.
Mean = Median

Distribution: Heights of people in the UK
Skew: Will probably be a nice ‘bell curve’, i.e. no skew.
Mean = Median

Distribution: Age of retirement
Skew: Likely to be people who retire significantly before the median age, but not many who retire significantly after. So negative skew.
Mean < Median
Exam Question
In the previous parts of a question you’ve calculated that the mean mark of students in a test was 55.48 and the median was 56.
(d) Describe the skewness of the marks of the students, giving a reason for your answer. (2)
1st mark: Negative skew
2nd mark: because mean < median
Skew
[Three box plot sketches: positive skew, negative skew and no skew.]
Given the quartiles and median, how would you work out whether the distribution had positive or negative skew?
Exam Question
1st mark: Q3 − Q2 > Q2 − Q1
2nd mark: Therefore positive skew.
Calculating Skew
One measure of skew can be calculated using the following formula:
(Important Note: this will be given to you in the exam if required)
skew = 3(mean − median) ÷ standard deviation
When mean > median, mean < median, and mean = median, we can see this gives us a positive value, negative value, and 0 respectively, as expected.
Find the skew of the following teachers’ annual salaries:
£3 £3.50 £4 £7 £100
Mean = £23.50
Median = £4
Standard Deviation = £38.28
Skew = 3(23.50 − 4) ÷ 38.28 = 1.53
S1: Chapter 4 Revision!
Revision
Stem and leaf diagrams:
• Can you construct one, and write the appropriate key?
• Can you calculate mode, mean, median and quartiles?
• Can you assess skewness by using these above values?
Back-to-back stem and leaf diagrams:
• Can you construct one with appropriate key?
• Can you compare the data on each side?
1 | 4                          (1)
2 | 1 2 4 5                    (4)
3 | 2 5 6 6 6 7 7 8 8          (9)
4 | 0 1 2 2 4 5 6 7 7 7 7 8    (12)
5 | 0 1 1 2                    (4)
Key: 2 | 1 means 2.1
Mode = 4.7
Lower quartile = 3.6
Upper quartile = 4.7
Median = 4.05
Type of skew: Positive
Reason: Q3 − Q2 > Q2 − Q1 (4.7 − 4.05 = 0.65 and 4.05 − 3.6 = 0.45)
Revision
        Girls |   | Boys
            0 | 4 | 6
        8 5 4 | 5 | 0 7 9
        6 4 0 | 6 | 0 0 4 5 7 8
      9 8 6 2 | 7 | 1 2 4 5
  8 5 4 0 0 0 | 8 | 0
  8 6 2 2 1 0 | 9 | 1
Notice the values go outwards from the centre.
Key: 0|4|6 means 40 for girls and 46 for boys.
The data above shows the pulse rate of boys and girls in a school.
Comment on the results.
Boys’ pulse rates tend to be lower than girls’.
Revision
Histograms
Can you:
• Appreciate that the frequency density scale doesn’t matter. This is why frequency is only proportional to area, and not equal to it.
• You often need to identify the scaling k, where area × k = frequency.
You might only be given the total frequency (in which case you need to find the total area of the histogram to find k).
But if you know the frequency associated with a particular bar, just find the area of that single bar.
• If you don’t care about the scaling, then frequency density = frequency ÷ class width.
• Be incredibly careful about class widths (i.e. widths of boxes). If the class interval in the frequency table was 20 − 25 with gaps, then you’d draw 19.5 − 25.5 on the histogram, and use 6 as the width of the box.
• If you want to find the quartiles/median/mean, you need to first construct a grouped frequency table using the histogram.
• When asked to find the number of people with values in a certain range (e.g. with times between 10 and 15s) and it crosses multiple ranges/bars, it’s easier to use the frequency table you’ve constructed from the histogram. Use linear interpolation where necessary.
Revision
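Given that an outlier is a value more than 1.5 × IQR outside the lower and upper quartiles…
Example 1
Smallest values: 0, 3
Lower Quartile: 8
Median: 10
Upper Quartile: 14
Largest values: 21, 27
IQR = 6, so the outlier boundaries are −1 and 23. Only 27 is an outlier.
[Box plot to be sketched on an axis from 0 to 30.]
Example 2
Smallest values: 3, 7
Lower Quartile: 12
Median: 13
Upper Quartile: 16
Largest values: 20, 25, 26
IQR = 4, so the outlier boundaries are 6 and 22. The values 3, 25 and 26 are outliers.
[Box plot to be sketched on an axis from 0 to 30.]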
Revision
Skewness
You can determine skewness in three ways:
• Comparing quartiles:
When Q3 − Q2 > Q2 − Q1, the right-hand part of the box in the box plot is wider, so it’s positive skew.
If a box plot is drawn, it should be immediately obvious!
• Comparing mean/median:
When mean > median, large values have dragged up the mean, so there’s a tail in the positive direction, and thus the skew is positive.
• Looking at the shape of the distribution. If there’s a ‘positive tail’, the skew is positive.
When asked to justify your answer for skewness, you’re expected to put something like “Q3 − Q2 > Q2 − Q1” or “mean > median”.
You will always be given a formula if you have to calculate a value for skew. But for all formulae, 0 means no skew (i.e. a “symmetric distribution”), >0 means positive skew and <0 means negative skew.
Find the skew of the following teachers’ annual salaries:
£3 £3.50 £4 £7 £100
Mean = £23.50
Median = £4
Standard Deviation = £38.28
Skew = 3(mean − median) ÷ standard deviation = 3(23.50 − 4) ÷ 38.28 = 1.53
```
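The 1.5 × IQR outlier rule from the box plot slides can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `outlier_boundaries` and `find_outliers` are made up for illustration, and the quartiles are taken as given (as in the two worked examples) rather than computed from raw data.

```python
def outlier_boundaries(lower_quartile, upper_quartile, multiplier=1.5):
    """Return the (low, high) boundaries beyond which a value is an outlier."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - multiplier * iqr, upper_quartile + multiplier * iqr


def find_outliers(values, lower_quartile, upper_quartile, multiplier=1.5):
    """Return the values lying strictly outside the outlier boundaries."""
    low, high = outlier_boundaries(lower_quartile, upper_quartile, multiplier)
    return [v for v in values if v < low or v > high]


# First example from the slides: LQ = 8, UQ = 14, extremes 0, 3, 21, 27.
print(outlier_boundaries(8, 14))                   # (-1.0, 23.0)
print(find_outliers([0, 3, 21, 27], 8, 14))        # [27]

# Second example: LQ = 12, UQ = 16, extremes 3, 7, 20, 25, 26.
print(outlier_boundaries(12, 16))                  # (6.0, 22.0)
print(find_outliers([3, 7, 20, 25, 26], 12, 16))   # [3, 25, 26]
```

The `multiplier` argument is there because the slides note that the exam may state a different rule from 1.5 × IQR.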
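The "area × k = frequency" idea from the histogram slides can be sketched in Python using the 100m runners question. The bar shapes below are an assumption: the slides only give the areas 15 and 9, the time-axis marks 9, 12 and 18, and the calculation 4 × 1.5, so the bars are taken here to be 9 to 12s at height 5 and 12 to 18s at height 1.5. The function name `frequency_in_range` is hypothetical.

```python
def frequency_in_range(bars, total_frequency, start, end):
    """Estimate the frequency between start and end on a histogram.

    bars is a list of (left, right, height) tuples, where height is read
    off the frequency density axis in whatever scale the diagram uses.
    """
    # Scaling k converts any area on this histogram into a frequency.
    total_area = sum((right - left) * height for left, right, height in bars)
    k = total_frequency / total_area

    # Add up the part of each bar that lies inside [start, end].
    area = 0.0
    for left, right, height in bars:
        overlap = min(right, end) - max(left, start)
        if overlap > 0:
            area += overlap * height
    return k * area


# Assumed bars for the runners histogram (areas 15 and 9, total 24).
runner_bars = [(9, 12, 5), (12, 18, 1.5)]
print(frequency_in_range(runner_bars, 60, 14, 18))  # 15.0
```

Working out `k` from the total area first mirrors the method in the slides: once the scaling is known, any other area on the same histogram converts to a frequency in one step.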
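The skew formula 3(mean − median) ÷ standard deviation can be verified on the teachers' salaries example with Python's standard `statistics` module. This is a sketch under one assumption: the population standard deviation (`pstdev`) is used, because that is the version that reproduces the £38.28 quoted in the slides. The function name `skew_coefficient` is made up for illustration.

```python
from statistics import mean, median, pstdev


def skew_coefficient(values):
    """Return 3(mean - median) / standard deviation for the given values."""
    return 3 * (mean(values) - median(values)) / pstdev(values)


salaries = [3, 3.50, 4, 7, 100]
print(mean(salaries))                        # 23.5
print(median(salaries))                      # 4
print(round(pstdev(salaries), 2))            # 38.28
print(round(skew_coefficient(salaries), 2))  # 1.53
```

The positive result matches the intuition in the slides: the single very large salary drags the mean well above the median.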