Lecture 12 - 13 GLM - Animal Science Computer Labs

Use of PROC GLM to Analyze Experimental Data
Animal Science 500
Lecture No.
October , 2010
IOWA STATE UNIVERSITY
Department of Animal Science
PROC GLM
The GLM procedure uses the method of least squares to fit general linear models.
Among the statistical methods available in PROC GLM are:
- Regression
- Analysis of variance
- Analysis of covariance
- Multivariate analysis of variance (MANOVA)
- Partial correlation
SAS/STAT(R) 9.22 User's Guide
PROC GLM
PROC GLM analyzes data within the framework of general linear models.
PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables.
The independent variables can be either classification variables, which divide the observations into discrete groups, or continuous variables.
Thus, the GLM procedure can be used for many
different analyses, including the following:
simple regression
multiple regression
analysis of variance (ANOVA), especially for unbalanced data
analysis of covariance
response surface models
weighted regression
polynomial regression
partial correlation
multivariate analysis of variance (MANOVA)
repeated measures analysis of variance
PROC GLM

PROC GLM enables you to specify any degree of
interaction (crossed effects) and nested effects.

It also provides for polynomial, continuous-by-class, and
continuous-nesting-class effects.

Through the concept of estimability, the GLM
procedure can provide tests of hypotheses for the
effects of a linear model regardless of the number of
missing cells or the extent of confounding.

PROC GLM displays the sum of squares (SS)
associated with each hypothesis tested and, upon
request, the form of the estimable functions employed
in the test. PROC GLM can produce the general form of
all estimable functions.
PROC GLM

The REPEATED statement enables you to specify
effects in the model that represent repeated
measurements on the same experimental unit for the
same response, providing both univariate and
multivariate tests of hypotheses.

The RANDOM statement enables you to specify
random effects in the model; expected mean squares
are produced for each Type I, Type II, Type III, Type IV,
and contrast mean square used in the analysis. Upon
request, tests that use appropriate mean squares or
linear combinations of mean squares as error terms
are performed.
PROC GLM

The ESTIMATE statement enables you to specify an L
vector for estimating a linear function of the
parameters Lβ.

The CONTRAST statement enables you to specify a
contrast vector or matrix for testing the hypothesis
that Lβ = 0. When specified, the contrasts are also
incorporated into analyses that use the MANOVA and
REPEATED statements.

The MANOVA statement enables you to specify both
the hypothesis effects and the error effect to use for a
multivariate analysis of variance.
PROC GLM

PROC GLM can create an output data set containing
the input data set in addition to predicted values,
residuals, and other diagnostic measures.

PROC GLM can be used interactively. After you specify
and fit a model, you can execute a variety of
statements without recomputing the model parameters
or sums of squares.
PROC GLM

For analysis involving multiple dependent variables
but not the MANOVA or REPEATED statements, a
missing value in one dependent variable does not
eliminate the observation from the analysis for other
dependent variables. PROC GLM automatically groups
together those variables that have the same pattern of
missing values within the data set or within a BY
group. This ensures that the analysis for each
dependent variable brings into use all possible
observations.
Estimable Functions
SAS output often flags a result as "Non-est" (non-estimable).
What does this mean?
Estimability
Generalized inverses are used to obtain solutions for effects in general linear models.
- There are many generalized inverses.
- Many different sets of solutions are possible.
- Estimable functions are unique and do not depend on the generalized inverse used to obtain solutions.
To analyze data properly, that is, to answer the hypothesis being tested, the scientist should know which functions of the parameters in the model are being estimated.
Estimability
The hypothesis being tested is NOT about the absolute value for a level of a factor in the model.
Usually we are asking or hypothesizing that two means are different, or that some treatment is different from a control.
Hence the differences are the estimable functions, NOT the values (solutions) for the individual levels.
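As a small illustration of this point, consider a one-way model (an assumption made here for simplicity; it is not the two-factor model used elsewhere in the lecture). Taking the expected value of a treatment mean shows why a difference is estimable while an individual effect is not:

```latex
\begin{align*}
Y_{ij} &= \mu + \alpha_i + \varepsilon_{ij}, \qquad E(\varepsilon_{ij}) = 0 \\
E(\bar{Y}_{i\cdot}) &= \mu + \alpha_i \\
E(\bar{Y}_{1\cdot} - \bar{Y}_{2\cdot}) &= (\mu + \alpha_1) - (\mu + \alpha_2) = \alpha_1 - \alpha_2
\end{align*}
```

Adding a constant to µ and subtracting it from every α_i leaves every expected observation unchanged, so the data cannot separate µ from α_i. Only combinations such as µ + α_i and α_1 - α_2 have unique estimates.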
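The General Linear Model
The main effects general linear model can be parameterized as
$Y_{ij} = \mu + \alpha_i + b_j + \varepsilon_{ij}$
where
$Y_{ij}$ is the observation for the ith level of α and the jth level of b,
$\mu$ is the overall mean (unknown fixed parameter),
$\alpha_i$ is the effect of the ith level of α (the deviation of that level from µ),
$b_j$ is the effect of the jth level of b (the deviation of that level from µ), and
$\varepsilon_{ij}$ is the experimental error, distributed $N(0, \sigma^2)$.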
The General Linear Model
In matrix terminology, the general linear model may be expressed as
$Y = X\beta + \varepsilon$
where
$Y$ is the observed data vector,
$X$ is the design matrix,
$\beta$ is a vector of unknown fixed effect parameters, and
$\varepsilon$ is the vector of errors.
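For reference, the least squares fit can be written in this notation. This is the standard textbook result rather than something stated on the slide; here $b$ denotes a solution to the normal equations and $(X'X)^{-}$ denotes any generalized inverse of $X'X$:

```latex
\begin{align*}
X'X\,b &= X'Y \\
b &= (X'X)^{-} X'Y \\
\widehat{L\beta} &= L\,b \quad \text{(the same for every generalized inverse whenever } L\beta \text{ is estimable)}
\end{align*}
```

Different generalized inverses give different solution vectors $b$, which is why the solutions themselves should not be interpreted directly.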
Programming the General Linear Model
In the GLM procedure, an OUTPUT statement saves the data set plus the residuals, predicted values, studentized residuals, and other diagnostics in a data set, here called resdat.
PROC GLM;
  class machine operator;
  model yield=machine|operator;
  output out=resdat r=resid p=pred
         student=stdres rstudent=rstud
         cookd=cksd h=lev;
Run;
Quit;
Assumptions of the general linear model
$E(\varepsilon) = 0$
$\mathrm{var}(\varepsilon) = \sigma^2 I$
$\mathrm{var}(Y) = \sigma^2 I$
$E(Y) = X\beta$
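The statements about Y follow from the statements about ε, because Xβ is a fixed (non-random) quantity. A short derivation, included as a reading aid:

```latex
\begin{align*}
E(Y) &= E(X\beta + \varepsilon) = X\beta + E(\varepsilon) = X\beta \\
\mathrm{var}(Y) &= \mathrm{var}(X\beta + \varepsilon) = \mathrm{var}(\varepsilon) = \sigma^2 I
\end{align*}
```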
Assumptions of the Linear Regression Model
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
Explanation of the Assumptions
1. Linear functional form
   - Does not detect curvilinear relationships
2. Independent observations
   - Representative sample from some larger population
   - If the observations are not independent, the resulting autocorrelation inflates the t, r, and F statistics, which in turn distorts the significance tests
3. Normality of the residuals
   - Permits proper significance testing, similar to ANOVA and other statistical procedures
4. Equal variance (no heterogeneous variance)
   - Heteroskedasticity precludes generalization and external validity
   - This too distorts the significance tests being used
5. Multicollinearity (many of the traits exhibit collinearity)
   - Biases parameter estimation
   - Can prevent the analysis from running or converging (getting your answers)
6. Severe or several outliers will distort the results and may bias them
   - If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates
SAS test for residual normality
Proc univariate data=resdat normal plot;
var resid;
Run;
Quit;
Graphically examining residuals for homogeneity
Proc gplot data=resdat;
plot resid * pred;
Run;
Quit;
Examine the plot for a lack of pattern.
Testing for outliers
Proc freq data=resdat;
tables stdres cksd;
Run;
Quit;
1. Look for standardized residuals greater than 3.5 or less than -3.5.
2. Look for high Cook's D (greater than 4*p/(n-p-1)).
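A frequency table of residuals can get long. As a hedged alternative, the sketch below lists only the observations beyond the 3.5 cutoff. It assumes the resdat data set and the stdres and cksd variables created by the OUTPUT statement in the earlier program; the data set name flagged is made up for this illustration.

```sas
/* Keep only observations whose standardized (STUDENT=) residual is beyond +/- 3.5 */
data flagged;
  set resdat;
  if abs(stdres) > 3.5;
run;

/* Order the flagged observations so the largest Cook's D comes first */
proc sort data=flagged;
  by descending cksd;
run;

/* List the flagged observations with their diagnostics */
proc print data=flagged;
  var pred resid stdres cksd;
run;
```

An empty listing simply means no observation exceeded the cutoff.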
Class Statement
Variables included in the CLASS statement are referred to as class variables.
The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis.
Class variables represent the various levels of some factors or effects:
- Treatment (1, ..., n)
- Season (spring, summer, fall, and winter, coded 1 through 4)
- Breed
- Color
- Sex
- Line
- Day
- Laboratory
Evaluating outliers
1. Check coding to spot typos.
2. Correct typos.
3. If the outlying observation is correct:
   - Examine the DFFITS statistic to determine how much influence the outlier has on the fitting statistics. This shows the standardized influence of the observation on the fit.
   - If the influence of the outlier is large, then consider removing it by making it a missing observation ( . ).
Getting started with GLM
PROC GLM Syntax
PROC GLM <options> ;
CLASS variables </ option> ;
MODEL dependent-variables=independent-effects
</ options> ;
Positional Requirements for PROC GLM Statements
Statement   Must Precede...                                   Must Follow...
ABSORB      First RUN statement
BY          First RUN statement
CLASS       MODEL statement
CONTRAST    MANOVA, REPEATED, or RANDOM statement             MODEL statement
ESTIMATE                                                      MODEL statement
FREQ        First RUN statement
ID          First RUN statement
LSMEANS                                                       MODEL statement
MANOVA                                                        CONTRAST or MODEL statement
MEANS                                                         MODEL statement
MODEL       CONTRAST, ESTIMATE, LSMEANS, or MEANS statement   CLASS statement
OUTPUT                                                        MODEL statement
RANDOM                                                        CONTRAST or MODEL statement
REPEATED                                                      CONTRAST, MODEL, or TEST statement
TEST        MANOVA or REPEATED statement                      MODEL statement
WEIGHT      First RUN statement
Statements in the GLM Procedure
Statement   Description
ABSORB      Absorbs classification effects in a model
BY          Specifies variables to define subgroups for the analysis
CLASS       Declares classification variables
CONTRAST    Constructs and tests linear functions of the parameters
ESTIMATE    Estimates linear functions of the parameters
FREQ        Specifies a frequency variable
ID          Identifies observations on output
LSMEANS     Computes least squares (marginal) means
MANOVA      Performs a multivariate analysis of variance
MEANS       Computes and optionally compares arithmetic means
MODEL       Defines the model to be fit
OUTPUT      Requests an output data set containing diagnostics for each observation
RANDOM      Declares certain effects to be random and computes expected mean squares
REPEATED    Performs multivariate and univariate repeated measures analysis of variance
STORE       Requests that the procedure save the context and results of the statistical analysis into an item store
TEST        Constructs tests that use the sums of squares for effects and the error term you specify
WEIGHT      Specifies a variable for weighting observations
Class Variables
- Are usually things you would like to account for in your model
- Can be numeric or character
- Can be continuous values
- Are generally not used in regression analyses
  - What meaning would they have?
Class Statement Options
- Ascending: sorts the class variable in ascending order
- Descending: sorts the class variable in descending order
Other options for the CLASS statement generally depend on the procedure (PROC) being used, so they are not all covered here.
Discrete Variables
A discrete variable is one that cannot take on all values within the limits of the variable.
- Limited to whole numbers
- For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5.
- The variable cannot have the value 1.7. By contrast, a variable such as a person's height can take on any value.
Discrete variables are of two types:
1. unorderable (also called nominal variables)
2. orderable (also called ordinal variables)
Discrete Variables
Data are sometimes called categorical, as the observations may fall into one of a number of categories. For example:
- Any trait where you score the value
  - Lameness scores
  - Body condition scores
  - Soundness scoring
    - Reproductive
    - Feet and leg
  - Behavioral traits
    - Fear test
    - Back test
    - Vocal scores
  - Body lesion scores
Discrete Variables
When do discrete variables become continuous, or do they?
Is a trait like number born alive considered discrete or continuous?
Example Variables
Data:
The dependent variable (what is being measured) is aerial biomass, and there are five substrate measurements (these are the independent variables):
1. Salinity
2. Acidity
3. Potassium
4. Sodium
5. Zinc
Covariates
A covariate is an independent variable that contributes variation to the dependent variable of interest.
The researcher wants to account for the covariate differences that occur for each observation.
A covariate may be of direct interest, or it may be a confounding or interacting type of variable.
Covariates
 Examples
Weight of animal at measurement
Age of animal at measurement
Age of animal at weaning
Parity of sow for number born alive and
weaning weight
Days of lactation for milk weight
Covariates
A covariate may influence the dependent variable in the following ways:
- Linear covariate
- Quadratic covariate
- Cubic covariate
Covariates
Check to be sure your covariate is significant.
If the linear term is significant, test the quadratic.
If the linear and quadratic terms are significant sources of variation, test the cubic.
How do you do that?
Covariates
How do you do that?
- Linear: include the variable name in the model, but do not list it in the CLASS statement.
  Example: weight
- Quadratic: include the variable name as follows:
  weight*weight
- Cubic: include the variable name as follows:
  weight*weight*weight
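Putting these terms together, a sequential test of the polynomial terms might look like the sketch below. The data set name growth, the response backfat, and the class variable sex are hypothetical names chosen for illustration; only weight comes from the slide. The SS1 option requests Type I (sequential) sums of squares, so each polynomial term is tested after the terms listed before it.

```sas
/* Hypothetical data set and variables: growth, backfat, sex.
   weight is NOT listed in CLASS, so it is fitted as a continuous covariate. */
proc glm data=growth;
  class sex;
  /* Terms are entered in order: linear, then quadratic, then cubic */
  model backfat = sex weight weight*weight weight*weight*weight / ss1 solution;
run;
quit;
```

If the cubic term is not a significant source of variation, one reasonable next step is to drop it and refit with the linear and quadratic terms only.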
Covariates
A covariate may influence the dependent variable in the following ways:
- Linear covariate
  - The independent covariate affects the dependent variable in a linear manner
- Quadratic covariate
  - The independent covariate affects the dependent variable in a quadratic manner
  - Indicates there is one bend (turning point) in the response, and only one
- Cubic covariate
  - The independent covariate affects the dependent variable in a cubic manner
  - Indicates there can be two bends (turning points) in the response
Covariates
A covariate may influence the dependent variable in the following ways:
- Linear covariate
  - The independent covariate affects the dependent variable in a linear manner
  - The dependent variable increases or decreases at a constant rate
Covariates
A covariate may influence the dependent variable in the following ways:
- Quadratic covariate
  - The independent covariate affects the dependent variable in a quadratic manner
  - Indicates there is one bend (turning point) in the response, and only one
  - The dependent variable increases (or decreases) to some point and then either increases at an increasing rate (decreases at an increasing rate) or increases at a decreasing rate (or decreases at a decreasing rate)
  - Or there could be a directional change: up to some point the dependent variable increases, and after that point the response decreases, or vice versa
Covariates
- Cubic covariate
  - The independent covariate affects the dependent variable in a cubic manner
  - Indicates there can be two bends (turning points) in the response
  - Essentially the same as quadratic, but the changes can occur at an additional point
  - The dependent variable increases (or decreases) to some point and then either increases at an increasing rate (decreases at an increasing rate) or increases at a decreasing rate (or decreases at a decreasing rate)
  - Or there could be a directional change: up to some point the dependent variable increases, and after another point the response decreases, or vice versa
Model Development and Selection of Variables
Example:
The general problem addressed is to identify
important soil characteristics influencing aerial
biomass production of marsh grass, Spartina
alterniflora.
Example Data Origination
(Dr. P. J. Berger)
Data: The data were published as an exercise
by Rawlings (1988) and originally appeared as a
study by Dr. Rick Linthurst, North Carolina State
University (1979). The purpose of his research
was to identify the important soil characteristics
influencing aerial biomass production of the
marsh grass, Spartina alterniflora in the Cape
Fear Estuary of North Carolina. The design for
collecting data was such that there were three
types of Spartina vegetation in each of three
locations, and five random sites within each
location-vegetation type combination.
Example Data

Objective:

Find the substrate variable, or combination of
variables, showing the strongest relationship to
biomass.
Or,

From the list of five independent variables of salinity,
acidity, potassium, sodium, and zinc, find the
combination of one or more variables that has the
strongest relationship with aerial biomass.

Find the independent variables that can be used to
predict aerial biomass.
Example Data
CLASS vegetative_type location site;
- Recall 3 vegetative types evaluated
- Recall 3 locations where tests occurred
- Recall 5 sites within each location
MODEL biomass = vegetative_type location site(location)
      vegetative_type*location salinity acidity potassium
      sodium zinc;
Example Data
MODEL biomass = vegetative_type location site(location)
      vegetative_type*location salinity acidity potassium
      sodium zinc;
Assuming each linear effect was significant, you would then need to examine the higher-order terms:
- salinity*salinity
- salinity*salinity*salinity
- acidity*acidity
- acidity*acidity*acidity
- etc.
PROC GLM Example
Example: Strawberry yield is modeled as a function of strawberry variety, type of fertilizer, and their interaction.
PROC GLM DATA=berry;
  CLASS fertiliz variety;
  MODEL yield=fertiliz variety fertiliz*variety / SOLUTION;
  LSMEANS fertiliz variety;
Run;
Quit;
The SOLUTION option is useful for showing the relative effect sizes.
PROC GLM Example Output
General Linear Models Procedure
Class Level Information
Class      Levels   Values
FERTILIZ   2        K N
VARIETY    2        Red Sweet
Number of observations in data set = 24
This section lets us verify that we have two fertilizers and two varieties of interest, and that there are 24 observations in the data. Information about missing observations is also printed here, if applicable.
PROC GLM Example Output
Dependent Variable: YIELD
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3   0.87166667       0.29055556    2.59      0.0816
Error             20   2.24666667       0.11233333
Corrected Total   23   3.11833333

R-Square   C.V.       Root MSE    YIELD Mean
0.279530   3.790707   0.3351617   8.8416667
This section shows the ANOVA table, with degrees of freedom (DF), sums of squares, and an F value
which tests whether any of the terms in the model are significant. The C. V. (coefficient of variation) is
(root MSE/mean yield)(100%). R-Square is the model sum of squares divided by total sum of squares.
This is commonly used to evaluate how well the model fits the data, but it should not be the only
criterion of fit that you examine.
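As a check on these definitions, the summary statistics can be reproduced by hand from the numbers in the ANOVA table above:

```latex
\begin{align*}
F &= \frac{\text{MS}_{\text{Model}}}{\text{MS}_{\text{Error}}} = \frac{0.29055556}{0.11233333} \approx 2.59 \\
\text{Root MSE} &= \sqrt{0.11233333} \approx 0.3352 \\
\text{C.V.} &= \frac{0.3351617}{8.8416667} \times 100\% \approx 3.79\% \\
R^2 &= \frac{0.87166667}{3.11833333} \approx 0.2795
\end{align*}
```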
PROC GLM Example Output
Source     DF   Type I SS     Mean Square   F Value   Pr > F
FERTILIZ    1   0.37500000    0.37500000    3.34      0.0826
VARIETY     1   0.48166667    0.48166667    4.29      0.0515
FERT*VAR    1   0.01500000    0.01500000    0.13      0.7186

Source     DF   Type III SS   Mean Square   F Value   Pr > F
FERTILIZ    1   0.37500000    0.37500000    3.34      0.0826
VARIETY     1   0.48166667    0.48166667    4.29      0.0515
FERT*VAR    1   0.01500000    0.01500000    0.13      0.7186
SAS presents Type I and Type III sums of squares and F statistics for their
significance under a particular set of assumptions; namely, that fertilizer and variety
should be modeled with fixed effects, and that the random error terms satisfy their
requirements.
The F test statistics shown here are not always the proper results to interpret! This
depends on the design of the experiment.
PROC GLM Example Output
The Type I sums of squares are also called sequential sums of squares. Here, they test:
1. Whether fertilizer is a significant predictor
2. Whether variety is significant when considered in addition to fertilizer
3. Whether the interaction is significant when considered in addition to both fertilizer and variety
The Type III sums of squares are also called partial sums of squares. Here, they test:
1. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for fertilizers to be different from each other?
2. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for varieties to be different from each other?
3. Knowing that fertilizers and varieties could be different from each other, is the difference between fertilizers the same for both varieties?
PROC GLM Example Output
Because the experiment is balanced, the Type I and Type III sums of squares are identical.
Usually, the Type III sums of squares are used for inference, although the Type I sums of squares are used in specific situations.
SAS can calculate Type II and Type IV sums of squares as well.
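For instance, the MODEL statement options SS1 through SS4 request specific types. The sketch below reuses the berry data set and variable names from the example program and asks for all four types so they can be compared side by side:

```sas
/* Request Type I, II, III, and IV sums of squares for the strawberry model */
proc glm data=berry;
  class fertiliz variety;
  model yield = fertiliz variety fertiliz*variety / ss1 ss2 ss3 ss4;
run;
quit;
```

When none of these options is given, PROC GLM prints Type I and Type III by default, which is what the example output shows.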
PROC GLM Example Output
The SOLUTION option is used after the model statement (i.e. /solution;)

Parameter             Estimate   T for H0: Parameter=0   Prob > |T|   Std. Error of Estimate
INTERCEPT              9.13 B     66.75                   0.001        0.137
FERTILIZ   K          -0.30 B     -1.55                   0.137        0.194
           N           0.00 B       .                       .            .
VARIETY    Red        -0.33 B
           Sweet       0.00 B       .                       .            .
FERT*VAR   K Red       0.10 B      0.37                   0.719        0.274
           K Sweet     0.00 B       .                       .            .
           N Red       0.00 B       .                       .            .
           N Sweet     0.00 B       .                       .            .
PROC GLM Example Output
There are many ways to estimate effects in a linear model with categorical predictors (fixed effects).
SAS chooses to do so by alphabetizing the levels of each factor, then assigning an effect size of zero to the last alphabetically ordered level of each factor and its interactions.
To predict the response for, say, Fertilizer K for the Red variety, use the equation (Intercept) + (K effect) + (Red effect) + (K*Red interaction effect), or 9.13 - 0.30 - 0.33 + 0.10 = 8.60.
The t-test values listed on the right can be used to test if certain parameters are significantly different from zero; in this case, they compare the levels of each factor to the last alphabetically ordered level (which is forced to be zero).
The SOLUTION option is useful for determining how treatment effects can be contrasted or estimated within PROC GLM.
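The same prediction can be requested from SAS directly. The sketch below is one way to do it, assuming the berry data set from the example and the default level ordering (K before N, Red before Sweet); the label text is arbitrary. The ESTIMATE coefficients pick out the intercept, the K effect, the Red effect, and the K*Red interaction, mirroring the hand calculation.

```sas
proc glm data=berry;
  class fertiliz variety;
  model yield = fertiliz variety fertiliz*variety / solution;
  /* Coefficient order follows the sorted levels: fertiliz = K N,
     variety = Red Sweet, fertiliz*variety = K Red, K Sweet, N Red, N Sweet */
  estimate 'K, Red cell mean' intercept 1
           fertiliz 1 0
           variety 1 0
           fertiliz*variety 1 0 0 0;
  /* Least squares means for each fertilizer-by-variety cell, for comparison */
  lsmeans fertiliz*variety;
run;
quit;
```

The estimate and the corresponding least squares mean should agree with the hand calculation up to rounding of the printed solution values.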
PROC GLM Example: Examining the Error Values
An analysis of a general linear model should include a check of the assumptions about the random error terms.
To do this in PROC GLM, you must use an OUTPUT statement.
The following statements show how to produce a residual plot for the model above.
PROC GLM Example: Examining the Error Values
PROC GLM DATA=berry;
  CLASS fertiliz variety;
  MODEL yield=fertiliz variety fertiliz*variety/SOLUTION;
  OUTPUT OUT=results P=pred R=resid;
RUN;
QUIT;
PROC GPLOT DATA=results;
  PLOT resid*pred;
RUN;
Quit;