### correlations

```Statistics
Correlation and regression
Introduction

Some methods involve one variable
  is Treatment A as effective in relieving arthritic pain as Treatment B?
Correlation and regression are used to investigate relationships between variables
  most commonly linear relationships
  between two variables
  is BMD (bone mineral density) related to dietary calcium level?
Contents

Coefficients of correlation
  meaning
  values
  role
  significance
Regression
  line of best fit
  prediction
  significance
Introduction

Correlation
  measures the strength of the linear relationship between two variables
Regression analysis
  determines the nature of the relationship
Is there a relationship between the number of units of alcohol consumed and the likelihood of developing cirrhosis of the liver?
Pearson’s coefficient of correlation


Denoted by r
Measures the strength of the linear relationship between one dependent and one independent variable
  curvilinear relationships need other techniques
Values lie between +1 and -1
  perfect positive correlation: r = +1
  perfect negative correlation: r = -1
  no linear relationship: r = 0

Pearson’s coefficient of correlation
[Figure: four example scatter plots, showing r = +1, r = -1, r = 0 and r = 0.6]
Scatter plot


[Figure: scatter plot of BMD against calcium intake]
BMD: dependent variable
Calcium intake: independent variable
  inferences are made from it
  controlled in some cases
Non-Normal data
Normalised
Calculating r

The value and significance of r are calculated by SPSS
SPSS output: scatter plot
SPSS output: correlations
Interpreting correlation

Large r does not necessarily imply:
  strong correlation
    a large r can arise by chance in a small sample, so its significance depends on sample size
  cause and effect
    strong correlation between the number of televisions sold and the number of cases of paranoid schizophrenia
    this does not show that watching TV causes paranoid schizophrenia
    may be due to an indirect relationship
Interpreting correlation

Variation in dependent variable due to:
  relationship with independent variable: r²
  random factors: 1 - r²
r² is the Coefficient of Determination
  e.g. r = 0.661
  r² = 0.661 × 0.661 = 0.44
  less than half of the variation in the dependent variable is due to the independent variable
Agreement

Correlation should never be used to determine the level of agreement between repeated measures:
  measuring devices
  users
  techniques
It measures the degree of linear relationship, not agreement
  1, 2, 3 and 2, 4, 6 are perfectly positively correlated, yet no pair of values agrees
Assumptions


Errors are differences of predicted values of Y from actual values
To ascribe significance to r:
  distribution of errors is Normal
  variance is same for all values of independent variable X
Non-parametric correlation



Make no assumptions about the distribution of the data
Carried out on ranks
Spearman’s ρ (rho)
  easy to calculate
Kendall’s τ (tau)
  distribution has better statistical properties
  easier to identify concordant / discordant pairs
Usually both lead to same conclusions
Calculation of value and significance

Computer does it!
Role of regression


Shows how one variable changes with another
By determining the line of best fit
  linear
  curvilinear
Line of best fit


Simplest case linear
Line of best fit between:
  dependent variable Y: BMD
  independent variable X: dietary intake of calcium
Y = a + bX
  a: value of Y when X = 0 (intercept)
  b: change in Y when X increases by 1 (slope)
Role of regression

Used to predict
  the value of the dependent variable
  when value of independent variable(s) known
  within the range of the known data
    extrapolation risky!
    relation between age and bone age
Does not imply causality
SPSS output: regression
Assumptions

Needed only if statistical inferences are to be made about:
  significance of regression
  values of slope and intercept
Assumptions


If values of independent variable are randomly chosen then no further assumptions necessary
Otherwise
  as in correlation, assumptions based on errors
    balance out (mean = 0)
    variances equal for all values of independent variable
    not related to magnitude of independent variable
Multivariate regression
More than one independent variable
BMD dependent on:
  age
  gender
  calorific intake
  etc
Logistic regression

The dependent variable is binary
  yes / no
  predict whether a patient with Type 1 diabetes will undergo limb amputation given history of prior ulcer, time diabetic etc
  result is a probability
Can be extended to more than two categories
  Outcome after treatment
    recovered, in remission, died
Summary

Correlation
  strength of linear relationship between two variables
  Pearson’s - parametric
  Spearman’s / Kendall’s - non-parametric
  Interpret with care!
Regression
  line of best fit
  prediction
  multivariate
  logistic
```
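The slides leave the calculation of r to SPSS. As a rough illustration of what is being computed, the sketch below works out Pearson's r from its textbook definition in plain Python, then applies it to the slides' agreement example (1, 2, 3 against 2, 4, 6) and to the worked value r = 0.661. The function name `pearson_r` is a hypothetical helper introduced here, not something taken from the slides or from SPSS.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: co-variation of xs and ys divided by the product of their spreads.

    Hypothetical helper for illustration only; it does not compute significance.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Agreement example from the slides: the second series is double the first,
# so the two never agree, yet the linear relationship is perfect.
r_agreement = pearson_r([1, 2, 3], [2, 4, 6])

# Coefficient of determination for the slides' worked value r = 0.661.
r_squared = round(0.661 ** 2, 2)

print(r_agreement)  # 1.0
print(r_squared)    # 0.44
```

The first result shows why correlation is a poor measure of agreement: r is 1 even though every second reading is twice the first.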
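The line of best fit Y = a + bX can be sketched in the same way. The example below assumes ordinary least squares, which is the usual meaning of "line of best fit" but is not stated explicitly in the slides. The data values are invented purely for illustration and are not BMD or calcium measurements, and `line_of_best_fit` and `predict` are hypothetical helper names. The prediction helper refuses to extrapolate, mirroring the slides' warning that prediction is only safe within the range of the known data.

```python
def line_of_best_fit(xs, ys):
    """Return (a, b) for Y = a + bX by ordinary least squares (an assumption here)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope: change in Y when X increases by 1
    a = mean_y - b * mean_x  # intercept: value of Y when X = 0
    return a, b


def predict(a, b, x, x_min, x_max):
    """Predict Y at x, but only inside the observed range of X."""
    if not x_min <= x <= x_max:
        raise ValueError("x is outside the range of the known data; extrapolation is risky")
    return a + b * x


# Invented illustrative data lying exactly on the line Y = 1 + 2X.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

a, b = line_of_best_fit(xs, ys)
y_hat = predict(a, b, 2.5, min(xs), max(xs))
print(a, b, y_hat)  # 1.0 2.0 6.0
```

Raising an error on out-of-range inputs is one possible design choice; a softer alternative would be to return the value together with a warning flag.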
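For the non-parametric coefficients, the sketch below computes Spearman's coefficient as Pearson's r applied to ranks, and Kendall's coefficient by counting concordant and discordant pairs. Several choices here are assumptions rather than anything the slides specify: tied values receive the average of their ranks, the Kendall version is the simple "tau-a" form with no correction for ties, and all function names are hypothetical. The data are invented: Y is the square of X, so the relationship is perfectly monotonic but not linear.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson's r from sums of squared and cross deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson's r carried out on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented data: monotonic but curved (Y = X squared).
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

rho = spearman_rho(xs, ys)
tau = kendall_tau_a(xs, ys)
r = pearson_r(xs, ys)
print(rho, tau, r)
```

On these data both rank-based coefficients equal 1 while Pearson's r falls slightly below 1, which illustrates that the rank methods respond to any monotonic relationship and Pearson's only to a linear one.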