Lecture 10

Transformations
Getting normal or using the
linear model
Two Reasons to Transform
• Variables do not fit a normal
distribution and parametric tests are
desired
• A relationship between two
variables is non-linear but
transformation would allow the use
of linear regression
Non-Normal Data
• Reasons real data can fail to follow a
normal distribution:
– Errors in measurement are multiplicative
rather than additive, e.g. ± 2% rather than
± 2mm
– Constraints on the dimensions of an
artifact feature are not symmetrical, e.g.
point length must exceed haft length but
can be as long as the material allows
Non-Normal Data 2
– Measurements are products rather
than sums of other measurements, e.g.
area, volume
– Counts follow binomial, Poisson, or
negative binomial distributions, which
are often asymmetrical unless sample
sizes are large
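A small simulation can illustrate the first and third reasons. This is a sketch on made-up numbers, not data from the lecture: the variable names, the error ranges, and the `skew()` helper are all assumptions (base R has no skewness function, so a simple moment-based one is defined here).

```r
# Sketch with simulated data: additive vs. multiplicative errors
set.seed(1)
n <- 5000   # number of simulated measurements
k <- 10     # number of error components per measurement

# Simple moment-based skewness (about 0 for a symmetric distribution)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Additive errors: each component adds or subtracts up to 4 units
x_add <- 50 + rowSums(matrix(runif(n * k, -4, 4), nrow = n))

# Multiplicative errors: each component scales the value by 0.7 to 1.3
x_mult <- 50 * apply(matrix(runif(n * k, 0.7, 1.3), nrow = n), 1, prod)

skew(x_add)         # near 0: roughly symmetric
skew(x_mult)        # clearly positive: tail to the right
skew(log(x_mult))   # near 0 again: the log turns products into sums
```

The last line is the reason logarithms come up so often below: a product on the original scale is a sum on the log scale.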
Solutions
• Use non-parametric methods that do
not depend on the normality of the
data (increasingly easy to do)
• Use data transformations that reshape
the distribution so it is closer to normal
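The two options can be tried side by side. The sketch below uses simulated, right-skewed measurements for two hypothetical sites (`site_a` and `site_b` are invented names and values, not data from the lecture). It also shows that a rank-based test gives the same result before and after logging, because ranks ignore spacing.

```r
# Sketch with simulated data: non-parametric test vs. transform-then-test
set.seed(2)
site_a <- rlnorm(30, meanlog = 3.0, sdlog = 0.6)   # hypothetical site A
site_b <- rlnorm(30, meanlog = 3.4, sdlog = 0.6)   # hypothetical site B

# Option 1: Wilcoxon rank-sum test, no normality assumption
p_rank_raw <- wilcox.test(site_a, site_b)$p.value
p_rank_log <- wilcox.test(log(site_a), log(site_b))$p.value

# Option 2: log-transform, then use a parametric t test
p_t_log <- t.test(log(site_a), log(site_b))$p.value

c(rank_raw = p_rank_raw, rank_log = p_rank_log, t_on_logs = p_t_log)
all.equal(p_rank_raw, p_rank_log)   # TRUE: logging does not change the ranks
```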
Transformation
• The goal is to change the spacing of
the data to compress a long tail and
draw out a flat tail
• The transformation must preserve
the order of the original data – we
only change the spacing between
data points
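A quick check of the order-preserving requirement, using a handful of made-up values (the vectors `x` and `y` are illustrative only):

```r
# Monotonic transformations keep the order and change only the spacing
x <- c(2, 150, 7, 40, 12, 900)

identical(order(x), order(log(x)))    # TRUE
identical(order(x), order(sqrt(x)))   # TRUE

diff(sort(x))        # gaps on the original scale: the largest gap is in the tail
diff(sort(log(x)))   # gaps on the log scale: far more even

# Squaring is not monotonic when values change sign, so it reorders the data
y <- c(-3, -1, 2, 4)
identical(order(y), order(y^2))       # FALSE
```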
Transformation
• Right-skewed data with many zeros
cannot be transformed effectively,
since nothing can stretch out
observations that have the same
value – e.g. artifact counts by site or
grid square are often Poisson
distributed with many zeros
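The problem with zeros can be seen on simulated counts. This is a sketch: the Poisson mean of 0.5 and the name `counts` are assumptions chosen for illustration, and `log1p()` (the log of x + 1) is used only because a plain log is undefined at zero.

```r
# Sketch with simulated counts per grid square
set.seed(3)
counts <- rpois(200, lambda = 0.5)   # about 61% zeros expected at this mean

# The share of zeros is identical before and after transforming
mean(counts == 0)
mean(sqrt(counts) == 0)
mean(log1p(counts) == 0)

# The number of distinct values does not change either
length(unique(counts))
length(unique(sqrt(counts)))
```

Whatever the transformation, the spike at zero stays a spike at zero.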
An Example
• Using the DartPoints data set, we
saw that Length was asymmetrical
• Plot the kernel density of Length
with and without a log scale to see
the difference
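• To transform Length we would use
– logLength <- log(DartPoints$Length)
# Assumes DartPoints is already loaded, e.g. data(DartPoints) from the archdata package
# Kernel density of Length on the original scale
plot(density(DartPoints$Length), main="Dart Point Length",
xlab="Normal scale")
# Same density drawn on a logarithmic x axis
plot(density(DartPoints$Length), main="Dart Point Length",
xlab="Log scale", log="x")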
Common Transformations
• Tail to the right
– Natural or common (base 10) logarithm
– no zero or negative values
– Square root, cube root, etc – zeros ok
– Inverse, -1/x, -1/x^2, etc – no zero values
• Tail to the left
– Exponential e^x, 10^x (low values)
– Square, cube, etc
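One way to choose among the right-tail options is to try several and compare how close each result is to normal. The sketch below does this on simulated right-skewed data, using the Shapiro-Wilk W statistic as the yardstick (W near 1 means close to normal). The data, the names, and the use of W as the criterion are assumptions for illustration.

```r
# Sketch with simulated data: try several transformations for a right tail
set.seed(4)
x <- rlnorm(200, meanlog = 4, sdlog = 0.6)   # right-skewed, all values > 0

candidates <- list(
  raw         = x,
  sqrt        = sqrt(x),
  log         = log(x),
  neg_inverse = -1 / x    # the minus sign keeps the original order
)

# Shapiro-Wilk W for each candidate
w <- vapply(candidates,
            function(v) unname(shapiro.test(v)$statistic),
            numeric(1))
round(w, 3)
names(which.max(w))   # the candidate closest to normal
```

Here the square root does not pull the tail in enough and the inverse pulls too hard, so the log comes out best. With real data the answer may differ.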
Other Transformations
• arcsine (inverse sine) to
handle proportions between 0 and 1, used
for population studies of non-metric
traits
Transforming to Linear
• By transforming variables before
using linear regression we can fit
nonlinear equations
• In some cases we can express the
fitted equation in terms of the
original untransformed variables
Polynomial
• y = a + b1*x + b2*x^2 + b3*x^3 + b4*x^4 + . . .
• Create polynomial values or use the
function poly() within lm()
• Begin with linear and then work up
to quadratic, cubic, and so on until
the new terms are not significant
• e.g. lm(y~x+I(x^2)+I(x^3))
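The step-up procedure can be sketched on simulated data. The true curve here is quadratic by construction, and all names and values are assumptions for illustration. Note that `poly()` returns orthogonal polynomials by default; `raw = TRUE` is needed to get the same coefficients as the `I(x^2)` form.

```r
# Sketch with simulated data: build up a polynomial one term at a time
set.seed(5)
x <- seq(0, 10, length.out = 60)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(60, sd = 0.5)   # quadratic plus noise

fit1 <- lm(y ~ x)                       # linear
fit2 <- lm(y ~ x + I(x^2))              # quadratic
fit3 <- lm(y ~ x + I(x^2) + I(x^3))     # cubic

# Each row tests whether the added term improves on the previous model
anova(fit1, fit2, fit3)

# Same quadratic fit written with poly()
fit2_poly <- lm(y ~ poly(x, 2, raw = TRUE))
all.equal(unname(coef(fit2)), unname(coef(fit2_poly)))   # TRUE
```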
Power Function
• Log-log transformation
• Use log() to transform dependent
and independent variables
• Compute linear regression
– log(y) = a + b * log(x)
– y = A * x^b
(where A = exp(a))
• If b = 1, this reduces to a straight
line through the origin, y = A * x
• x, y > 0
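The log-log fit and the conversion back to the original scale can be sketched as follows. The data are simulated from a known power function so the estimates can be compared with the truth; `A_true`, `b_true`, and the noise level are assumptions for illustration.

```r
# Sketch with simulated data: fit a power function via a log-log regression
set.seed(6)
A_true <- 2.5
b_true <- 0.7
x <- runif(100, 1, 100)
y <- A_true * x^b_true * exp(rnorm(100, sd = 0.1))   # multiplicative noise

fit <- lm(log(y) ~ log(x))
a <- coef(fit)[[1]]   # intercept on the log scale
b <- coef(fit)[[2]]   # slope on the log scale = exponent of the power function
A <- exp(a)           # back-transform the intercept

c(A = A, b = b)       # should be close to 2.5 and 0.7

# Fitted values on the original scale
x_new <- c(10, 50)
A * x_new^b
```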
Exponential function
• Semi-log transformation
• Use log() to transform dependent
variable, y > 0
• Compute linear regression
– log(y) = a + b * x
– y = A * e^(b * x)
(where A = exp(a))
• Fits data that approach an asymptote
(when b is negative, y approaches 0 as x increases)
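A semi-log fit on simulated decay data, as a sketch. The starting value of 40 and the rate of -0.15 are assumptions chosen for illustration.

```r
# Sketch with simulated data: fit an exponential function via a semi-log regression
set.seed(7)
x <- seq(0, 20, length.out = 80)
y <- 40 * exp(-0.15 * x) * exp(rnorm(80, sd = 0.1))   # decay with multiplicative noise

fit <- lm(log(y) ~ x)   # only y is logged
A <- exp(coef(fit)[[1]])
b <- coef(fit)[[2]]

c(A = A, b = b)         # should be close to 40 and -0.15

# With a negative b the fitted curve levels off toward 0
A * exp(b * c(0, 10, 50, 100))
```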
Inverse Function
• Reciprocal transformation – 1/x
where x ≠ 0
– y = a + b * (1/x)
• Used for distance-decay models,
where y declines with distance
• Fits data with asymptotes (y
approaches a as x increases)
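A reciprocal fit can be written directly in the model formula with `I()`. The sketch below uses a simulated quantity that falls off with distance; `dist_km` and all values are made up for illustration.

```r
# Sketch with simulated data: y = a + b * (1/x)
set.seed(8)
dist_km <- 1:50
y <- 5 + 60 / dist_km + rnorm(50, sd = 1)

fit <- lm(y ~ I(1 / dist_km))
coef(fit)   # intercept near 5 (the asymptote), slope near 60

# Predictions level off at the intercept as distance grows
predict(fit, data.frame(dist_km = c(1, 10, 1000)))
```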
Other Functions
• Logarithmic – no zeros in x
– y = a + b * log(x)
• Square Root – no negative values in
x
– y = a + b * sqrt(x)
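When it is unclear which of these two curves suits the data, both can be fitted and their residual standard errors compared. In this sketch the data are simulated from the logarithmic form, so that model should fit better; the names and values are assumptions for illustration.

```r
# Sketch with simulated data: logarithmic vs. square-root model
set.seed(9)
x <- runif(120, 1, 200)
y <- 5 + 3 * log(x) + rnorm(120, sd = 0.2)

fit_log  <- lm(y ~ log(x))    # y = a + b * log(x)
fit_sqrt <- lm(y ~ sqrt(x))   # y = a + b * sqrt(x)

# Residual standard error: smaller means a closer fit
c(log = summary(fit_log)$sigma, sqrt = summary(fit_sqrt)$sigma)
```

Because y is untransformed in both models, the two residual standard errors are on the same scale and can be compared directly.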
Examples
• Human cranial capacity over the last
1.8 million years
• Number of Identified Specimens
(NISP) and Minimum Number of
Individuals (MNI) at Chucalissa
(Middle Mississippian site)
# BrainsCC.RData
# Explore logs with scatterplot
# Linear model
RegModel.1 <- lm(BrainCC~AgeKa, data=BrainsCC) # Rcmdr
summary(RegModel.1) # Rcmdr
# Create logged variables
BrainsCC$logAge <- with(BrainsCC, log(AgeKa)) # Rcmdr
BrainsCC$logBrain <- with(BrainsCC, log(BrainCC)) # Rcmdr
# Power (log-log) model
RegModel.2 <- lm(logBrain~logAge, data=BrainsCC) # Rcmdr
summary(RegModel.2) # Rcmdr
# Logarithmic model
RegModel.3 <- lm(BrainCC~logAge, data=BrainsCC) # Rcmdr
summary(RegModel.3) # Rcmdr
plot(BrainCC~AgeKa, data=BrainsCC, pch="+")
abline(RegModel.1, lty=1, lwd=2, col="black")
x <- seq(10, 1800, 10) # start above 0 because log(0) is -Inf
logx <- log(x)
lines(x, exp(predict(RegModel.2, data.frame(logAge=logx))), lty=1,
lwd=2, col="red")
lines(x, predict(RegModel.3, data.frame(logAge=logx)), lty=1, lwd=2,
col="blue")
legend("topright", c("Linear", "Power", "Logarithmic"), lty=1, lwd=2,
col=c("black", "red", "blue"))
LinearModel.4 <- lm(BrainCC ~ AgeKa + I(AgeKa^2), data=BrainsCC)
summary(LinearModel.4)
LinearModel.5 <- lm(BrainCC ~ AgeKa + I(AgeKa^2) + I(AgeKa^3),
data=BrainsCC)
summary(LinearModel.5)
LinearModel.6 <- lm(BrainCC ~ AgeKa + I(AgeKa^2) + I(AgeKa^3) +
I(AgeKa^4), data=BrainsCC)
summary(LinearModel.6)
plot(BrainCC~AgeKa, data=BrainsCC, pch="+")
abline(RegModel.1, lty=1, lwd=2, col="black")
x <- seq(0, 1800, 10)
lines(x, predict(LinearModel.4, data.frame(AgeKa=x)), lty=1, lwd=2,
col="red")
lines(x, predict(LinearModel.5, data.frame(AgeKa=x)), lty=1, lwd=2,
col="blue")
lines(x, predict(LinearModel.6, data.frame(AgeKa=x)), lty=1, lwd=2,
col="green")
legend("topright", c("Linear", "Quadratic", "Cubic", "Quartic"),
lty=1, lwd=2, col=c("black", "red", "blue", "green"))
plot(mni~nisp, data=Chucalissa)
RegModel.1 <- lm(mni~nisp, data=Chucalissa) #Rcmdr
summary(RegModel.1) #Rcmdr
abline(RegModel.1)
plot(mni~nisp, data=Chucalissa, log="xy") # Plot log-log transform
plot(mni~nisp, data=Chucalissa, log="y") # Plot semi-log transform
Chucalissa$logMNI <- log(Chucalissa$mni) # Create logged variables
Chucalissa$logNISP <- log(Chucalissa$nisp)
plot(logMNI~logNISP, data=Chucalissa)
RegModel.2 <- lm(logMNI~logNISP, data=Chucalissa) #Rcmdr
summary(RegModel.2) #Rcmdr
abline(RegModel.2)
plot(mni~nisp, data=Chucalissa) # plot log-log equation on original data
a2 <- exp(RegModel.2$coefficients[[1]]) # Convert a to exp(a)
b2 <- RegModel.2$coefficients[[2]]
a1 <- RegModel.1$coefficients[[1]]
b1 <- RegModel.1$coefficients[[2]]