Use of Proc GLM to Analyze Experimental Data
Animal Science 500
Lecture No.
October , 2010
IOWA STATE UNIVERSITY
Department of Animal Science

PROC GLM
The GLM procedure uses the method of least squares to fit general linear models. Among the statistical methods available in PROC GLM are regression, analysis of variance, analysis of covariance, multivariate analysis of variance (MANOVA), and partial correlation.

PROC GLM (SAS/STAT(R) 9.22 User's Guide)
PROC GLM analyzes data within the framework of general linear models. PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables. The independent variables can be either classification variables, which divide the observations into discrete groups, or continuous variables. Thus, the GLM procedure can be used for many different analyses, including the following:
- simple regression
- multiple regression
- analysis of variance (ANOVA), especially for unbalanced data
- analysis of covariance
- response surface models
- weighted regression
- polynomial regression
- partial correlation
- multivariate analysis of variance (MANOVA)
- repeated measures analysis of variance

PROC GLM enables you to specify any degree of interaction (crossed effects) and nested effects. It also provides for polynomial, continuous-by-class, and continuous-nesting-class effects. Through the concept of estimability, the GLM procedure can provide tests of hypotheses for the effects of a linear model regardless of the number of missing cells or the extent of confounding.
PROC GLM displays the sum of squares (SS) associated with each hypothesis tested and, upon request, the form of the estimable functions employed in the test. PROC GLM can produce the general form of all estimable functions.

The REPEATED statement enables you to specify effects in the model that represent repeated measurements on the same experimental unit for the same response, providing both univariate and multivariate tests of hypotheses.

The RANDOM statement enables you to specify random effects in the model; expected mean squares are produced for each Type I, Type II, Type III, Type IV, and contrast mean square used in the analysis. Upon request, tests that use appropriate mean squares or linear combinations of mean squares as error terms are performed.

The ESTIMATE statement enables you to specify an L vector for estimating a linear function of the parameters, Lβ. The CONTRAST statement enables you to specify a contrast vector or matrix for testing the hypothesis that Lβ = 0. When specified, the contrasts are also incorporated into analyses that use the MANOVA and REPEATED statements.

The MANOVA statement enables you to specify both the hypothesis effects and the error effect to use for a multivariate analysis of variance.

PROC GLM can create an output data set containing the input data set in addition to predicted values, residuals, and other diagnostic measures.

PROC GLM can be used interactively. After you specify and fit a model, you can execute a variety of statements without recomputing the model parameters or sums of squares.
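
The statements described above can be combined in one run. The following is a minimal sketch, not taken from the lecture: the data set TRIAL, the response GAIN, and the class variables TRT (assumed to have three levels) and PEN are all hypothetical names chosen for illustration. It shows CONTRAST and ESTIMATE placed before RANDOM, and an LSMEANS request submitted after the first RUN while the procedure is still active.

```sas
/* Hypothetical data set TRIAL: response GAIN, class variables TRT (3 levels) and PEN */
proc glm data=trial;
  class trt pen;
  model gain = trt pen;
  /* CONTRAST must come before RANDOM; coefficients assume three TRT levels */
  contrast 'trt 1 vs trt 2' trt 1 -1 0;
  estimate 'trt 1 minus trt 2' trt 1 -1 0;
  /* declare PEN random; TEST requests F tests built from the expected mean squares */
  random pen / test;
run;

  /* PROC GLM is still running, so this reuses the model fitted above */
  lsmeans trt / stderr pdiff;
run;
quit;
```

The procedure stays active until QUIT (or another PROC or DATA step) is submitted, which is what allows the second block of statements to run without refitting.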
For analysis involving multiple dependent variables but not the MANOVA or REPEATED statements, a missing value in one dependent variable does not eliminate the observation from the analysis for other dependent variables. PROC GLM automatically groups together those variables that have the same pattern of missing values within the data set or within a BY group. This ensures that the analysis for each dependent variable brings into use all possible observations.

Estimable Functions
SAS output often shows "Non-est" in place of an estimate. What does this mean?

Estimability
Generalized inverses are used to obtain solutions for effects in general linear models.
- There are many generalized inverses.
- Many different sets of solutions are possible.
- Estimable functions are unique and do not depend on the generalized inverse used to obtain solutions.
To analyze data properly, that is, to answer the hypothesis being tested, the scientist should know what functions of the parameters in the model are being estimated.

Estimability
The hypothesis being tested is NOT about the absolute value for a level of a factor in the model. Usually we are asking or hypothesizing that two means are different, or that some treatment is different from a control. Hence the differences are the estimable functions, NOT the individual values (solutions) for the levels of a factor.
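
The distinction between a single solution and an estimable function can be tried directly. This is a sketch under assumptions: the data set TRIAL, response GAIN, and three-level class variable TRT are hypothetical. The first ESTIMATE asks for one treatment effect on its own, which is not estimable in the overparameterized model, so SAS should flag it as non-estimable rather than report a value. The other two requests are estimable.

```sas
/* Hypothetical one-way layout: data set TRIAL, response GAIN, class TRT with 3 levels */
proc glm data=trial;
  class trt;
  /* SOLUTION prints one set of solutions; E prints the general form of estimable functions */
  model gain = trt / solution e;
  estimate 'trt 1 effect alone' trt 1 0 0;              /* not estimable */
  estimate 'trt 1 minus trt 3'  trt 1 0 -1;             /* estimable difference */
  estimate 'trt 1 mean'         intercept 1 trt 1 0 0;  /* estimable: mu + alpha_1 */
run;
quit;
```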
The General Linear Model
The main effects general linear model can be parameterized as
  Y_ij = µ + α_i + b_j + ε_ij
where
  Y_ij is the observation for the ith level of α and the jth level of b,
  µ is the overall mean (unknown fixed parameter),
  α_i is the effect of the ith level of α (mean of level i minus µ),
  b_j is the effect of the jth level of b (mean of level j minus µ), and
  ε_ij is the experimental error, distributed N(0, σ^2).

The General Linear Model
In matrix terminology, the general linear model may be expressed as
  Y = Xβ + ε
where
  Y is the observed data vector,
  X is the design matrix,
  β is a vector of unknown fixed effect parameters, and
  ε is the vector of errors.

Programming the General Linear Model
In the GLM procedure, one saves the data set plus the residuals, predicted values, and studentized residuals with an OUTPUT statement in a data set called resdat.
  PROC GLM;
    class machine operator;
    Model yield=machine|operator;
    output out=resdat r=resid p=pred student=stdres rstudent=rstud h=lev cookd=cksd;
  Run; Quit;

Assumptions of the general linear model
  E(ε) = 0
  var(ε) = σ^2 I
  var(Y) = σ^2 I
  E(Y) = Xβ

Assumptions of the Linear Regression Model
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion

Explanation of the Assumptions
1. Linear functional form
   Does not detect curvilinear relationships.
2. The observations are independent
   Representative sample from some larger population. If the observations are not independent, the result is autocorrelation, which inflates the t, r, and F statistics and in turn distorts the significance tests.
3. Normality of the residuals
   Permits proper significance testing, similar to ANOVA and other statistical procedures.
4. Equal variance (no heterogeneous variance)
   Heteroskedasticity precludes generalization and external validity. This too distorts the significance tests being used.
5. Multicollinearity (many of the traits exhibit collinearity)
   Biases parameter estimation. Can prevent the analysis from running or converging (getting your answers).
6. Severe or several outliers will distort the results and may bias the results. If outliers have high influence and the sample is not large enough, then they may seriously bias the parameter estimates.

SAS test for residual normality
  Proc univariate data=resdat normal plot;
    var resid;
  Run; Quit;

Graphically examining residuals for homogeneity
  Proc gplot data=resdat;
    plot resid * pred;
  Run; Quit;
Examine the plot for a lack of pattern.

Testing for outliers
  Proc freq data=resdat;
    tables stdres cksd;
  Run; Quit;
1. Look for standardized residuals greater than 3.5 or less than -3.5.
2. Look for high Cook's D (greater than 4*p/(n-p-1)).

Class Statement
Variables included in the CLASS statement are referred to as class variables. The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis. Class variables represent the various levels of some factors or effects:
- Treatment (1, ..., n)
- Season (spring, summer, fall, and winter, coded 1 through 4)
- Breed
- Color
- Sex
- Line
- Day
- Laboratory

Evaluating outliers
1. Check coding to spot typos.
2. Correct typos.
3. If the outlying observation is correct, examine the DFFITS option to determine how much influence the outlier has on the fitting statistics.
   This will show the standardized influence of the observation on the fit. If the influence of the outlier is bad, then consider removal by making it a missing observation ( . ).

Getting started with GLM

PROC GLM Syntax
  PROC GLM <options> ;
    CLASS variables </ option> ;
    MODEL dependent-variables=independent-effects </ options> ;

Positional Requirements for PROC GLM Statements
  Statement   Must Precede...                                   Must Follow...
  ABSORB      First RUN statement                               -
  BY          First RUN statement                               -
  CLASS       MODEL statement                                   -
  CONTRAST    MANOVA, REPEATED, or RANDOM statement             MODEL statement
  ESTIMATE    -                                                 MODEL statement
  FREQ        First RUN statement                               -
  ID          First RUN statement                               -
  LSMEANS     -                                                 MODEL statement
  MANOVA      -                                                 CONTRAST or MODEL statement
  MEANS       -                                                 MODEL statement
  MODEL       CONTRAST, ESTIMATE, LSMEANS, or MEANS statement   CLASS statement
  OUTPUT      -                                                 MODEL statement
  RANDOM      -                                                 CONTRAST or MODEL statement
  REPEATED    -                                                 CONTRAST, MODEL, or TEST statement
  TEST        MANOVA or REPEATED statement                      MODEL statement
  WEIGHT      First RUN statement                               -

Statements in the GLM Procedure
  Statement   Description
  ABSORB      Absorbs classification effects in a model
  BY          Specifies variables to define subgroups for the analysis
  CLASS       Declares classification variables
  CONTRAST    Constructs and tests linear functions of the parameters
  ESTIMATE    Estimates linear functions of the parameters
  FREQ        Specifies a frequency variable
  ID          Identifies observations on output
  LSMEANS     Computes least squares (marginal) means
  MANOVA      Performs a multivariate analysis of variance
  MEANS       Computes and optionally compares arithmetic means
  MODEL       Defines the model to be fit
  OUTPUT      Requests an output data set containing diagnostics for each observation
  RANDOM      Declares certain effects to be random and computes expected mean squares
  REPEATED    Performs multivariate and univariate repeated measures analysis of variance
  STORE       Requests that the procedure save the context and results of the statistical analysis into an item store
  TEST        Constructs tests that use the sums of squares for effects and the error term you specify
  WEIGHT      Specifies a variable for weighting observations

Class Variables
- Are usually things you would like to account for in your model
- Can be numeric or character
- Can be continuous values
- They are generally not used in regression analyses. What meaning would they have?

Class Statement Options
- ASCENDING sorts the class variable in ascending order
- DESCENDING sorts the class variable in descending order
- Other options with the CLASS statement are generally related to the procedure (PROC) being used, so they will not all be covered here

Discrete Variables
A discrete variable is one that cannot take on all values within the limits of the variable.
It is limited to whole numbers. For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5. The variable cannot have the value 1.7. By contrast, a variable such as a person's height can take on any value.
Discrete variables are also of two types:
1. unorderable (also called nominal variables)
2. orderable (also called ordinal)

Discrete Variables
Data are sometimes called categorical, as the observations may fall into one of a number of categories. For example, any trait where you score the value:
- Lameness scores
- Body condition scores
- Soundness scoring
  - Reproductive
  - Feet and leg
- Behavioral traits
  - Fear test
  - Back test
  - Vocal scores
- Body lesion scores

Discrete Variables
When do discrete variables become continuous, or do they? Is a trait like number born alive considered discrete or continuous?

Example Variables
Data: The dependent variable (what is being measured) is aerial biomass, and there are five substrate measurements (these are the independent variables):
1. Salinity
2. Acidity
3. Potassium
4. Sodium
5. Zinc

Covariates
A covariate is an independent variable that contributes variation to the dependent variable of interest. The researcher wants to account for the covariate differences that occur for each observation.
A covariate may be of direct interest, or it may be a confounding or interacting type of variable.

Covariates
Examples:
- Weight of animal at measurement
- Age of animal at measurement
- Age of animal at weaning
- Parity of sow for number born alive and weaning weight
- Days of lactation for milk weight

Covariates
Check to be sure your covariate is significant.
- If the linear term is significant, test the quadratic.
- If the linear and quadratic terms are significant sources of variation, test the cubic.

How do you do that?
- Linear: include the variable name in the model, not listed in the CLASS statement. Example: weight
- Quadratic: the variable is included as weight*weight
- Cubic: the variable is included as weight*weight*weight

Covariates
A covariate may influence the dependent variable in the following ways.

Linear covariate
- The independent covariate affects the dependent variable in a linear manner.
- The dependent variable increases or decreases at a constant rate.

Quadratic covariate
- The independent covariate affects the dependent variable in a quadratic manner.
- Indicates there is one bend in the response (and only one).
- The dependent variable increases (or decreases) to some point and then either changes at an increasing rate or changes at a decreasing rate.
- Or there could be a directional change: up to some point the dependent variable increases, and after that point the response decreases, or vice versa.

Cubic covariate
- The independent covariate affects the dependent variable in a cubic manner.
- Indicates there can be two bends in the response.
- Essentially the same as the quadratic, but the changes can occur at an additional point.
- The dependent variable increases (or decreases) to some point and then either changes at an increasing rate or changes at a decreasing rate.
- Or there could be a directional change: up to some point the dependent variable increases, and after another point the response decreases, or vice versa.

Model Development and Selection of Variables
Example: The general problem addressed is to identify important soil characteristics influencing aerial biomass production of marsh grass, Spartina alterniflora.

Example Data Origination (Dr. P. J. Berger)
Data: The data were published as an exercise by Rawlings (1988) and originally appeared as a study by Dr. Rick Linthurst, North Carolina State University (1979). The purpose of his research was to identify the important soil characteristics influencing aerial biomass production of the marsh grass, Spartina alterniflora, in the Cape Fear Estuary of North Carolina.
The design for collecting data was such that there were three types of Spartina vegetation, in each of three locations, and five random sites within each location-vegetation type.

Example Data
Objective: Find the substrate variable, or combination of variables, showing the strongest relationship to biomass.
Or: from the list of five independent variables (salinity, acidity, potassium, sodium, and zinc), find the combination of one or more variables that has the strongest relationship with aerial biomass.
Find the independent variables that can be used to predict aerial biomass.

Example Data
  Class vegetative_type location site;
- Recall 3 vegetative types evaluated
- Recall 3 locations where tests occurred
- Recall 5 sites within each location
  Model Biomass = vegetative_type location site(location) vegetative_type*location
                  salinity acidity potassium sodium zinc;

Example Data
  Model Biomass = vegetative_type location site(location) vegetative_type*location
                  salinity acidity potassium sodium zinc;
Assuming each linear effect was significant, you would then need to examine:
- salinity*salinity
- salinity*salinity*salinity
- acidity*acidity
- acidity*acidity*acidity
- etc.

PROC GLM Example
Strawberry yield is modeled as a function of strawberry variety, type of fertilizer, and their interaction.
  PROC GLM DATA=berry;
    CLASS fertiliz variety;
    MODEL yield=fertiliz variety fertiliz*variety / SOLUTION;
    LSMEANS fertiliz variety;
  Run; Quit;
The SOLUTION option is useful for showing the relative effect sizes.
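
If comparisons among the least squares means are also of interest, the LSMEANS request in the strawberry example could be extended. The following is a sketch only, reusing the data set and variable names from the example above. The STDERR, PDIFF, and ADJUST=TUKEY options are additions for illustration and are not part of the lecture's code.

```sas
proc glm data=berry;
  class fertiliz variety;
  model yield = fertiliz variety fertiliz*variety / solution;
  /* main-effect means with standard errors and p-values for their differences */
  lsmeans fertiliz variety / stderr pdiff;
  /* the four fertilizer-by-variety cell means, with Tukey-adjusted pairwise comparisons */
  lsmeans fertiliz*variety / pdiff adjust=tukey;
run;
quit;
```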
PROC GLM Example Output
  General Linear Models Procedure
  Class Level Information

  Class      Levels   Values
  FERTILIZ   2        K N
  VARIETY    2        Red Sweet

  Number of observations in data set = 24

This section lets us verify that we have two fertilizers and two varieties of interest, and that there are 24 observations in the data. Information about missing observations is also printed here, if applicable.

PROC GLM Example Output
  Dependent Variable: YIELD

  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              3   0.87166667       0.29055556    2.59      0.0816
  Error             20   2.24666667       0.11233333
  Corrected Total   23   3.11833333

  R-Square   C.V.       Root MSE    YIELD Mean
  0.279530   3.790707   0.3351617   8.8416667

This section shows the ANOVA table, with degrees of freedom (DF), sums of squares, and an F value which tests whether any of the terms in the model are significant. The C.V. (coefficient of variation) is (Root MSE / YIELD Mean) * 100%; here 0.3351617 / 8.8416667 * 100 = 3.79. R-Square is the model sum of squares divided by the total sum of squares; here 0.87166667 / 3.11833333 = 0.2795. This is commonly used to evaluate how well the model fits the data, but it should not be the only criterion of fit that you examine.

PROC GLM Example Output
  Source     DF   Type I SS     Mean Square   F Value   Pr > F
  FERTILIZ    1   0.37500000    0.37500000    3.34      0.0826
  VARIETY     1   0.48166667    0.48166667    4.29      0.0515
  FERT*VAR    1   0.01500000    0.01500000    0.13      0.7186

  Source     DF   Type III SS   Mean Square   F Value   Pr > F
  FERTILIZ    1   0.37500000    0.37500000    3.34      0.0826
  VARIETY     1   0.48166667    0.48166667    4.29      0.0515
  FERT*VAR    1   0.01500000    0.01500000    0.13      0.7186

SAS presents Type I and Type III sums of squares and F statistics for their significance under a particular set of assumptions; namely, that fertilizer and variety should be modeled with fixed effects, and that the random error terms satisfy their requirements. The F test statistics shown here are not always the proper results to interpret!
This depends on the design of the experiment.

PROC GLM Example Output
The Type I sums of squares are also called sequential sums of squares. Here, they test:
1. Whether fertilizer is a significant predictor
2. Whether variety is significant when considered in addition to fertilizer
3. Whether the interaction is significant when considered in addition to both fertilizer and variety
The Type III sums of squares are also called partial sums of squares. Here, they test:
1. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for fertilizers to be different from each other?
2. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for varieties to be different from each other?
3. Knowing that fertilizers and varieties could be different from each other, is the difference between fertilizers the same for both varieties?

PROC GLM Example Output
Because the experiment is balanced, the Type I and Type III sums of squares are identical. Usually, the Type III sums of squares are used for inference, although the Type I sums of squares are used in specific situations. SAS can calculate Type II and Type IV sums of squares as well.

PROC GLM Example Output
SOLUTION option used after the MODEL statement (i.e. / solution;)

  Parameter              Estimate   T for H0: Parameter=0   Prob > |T|   Std. Error of Estimate
  INTERCEPT               9.13 B    66.75                   0.001        0.137
  FERTILIZ    K          -0.30 B    -1.55                   0.137        0.194
              N           0.00 B    .                       .            .
  VARIETY     Red        -0.33 B
              Sweet       0.00 B    .                       .            .
  FERT*VAR    K Red       0.10 B    0.37                    0.719        0.274
              K Sweet     0.00 B    .                       .            .
              N Red       0.00 B    .                       .            .
              N Sweet     0.00 B    .                       .            .

PROC GLM Example Output
There are many ways to estimate effects in a linear model with categorical predictors (fixed effects).
SAS chooses to do so by alphabetizing the levels of each factor, then assigning an effect size of zero to the last alphabetically-ordered level of each factor and its interactions. To predict the response for, say, Fertilizer K for the Red variety, use the equation (Intercept) + (K effect) + (Red effect) + (K*Red interaction effect), or 9.13 - 0.30 - 0.33 + 0.10 = 8.60. The t-test values listed on the right can be used to test if certain parameters are significantly different from zero; in this case, they compare the levels of each factor to the last alphabetically-ordered level (which is forced to be zero). The SOLUTION option is useful for determining how treatment effects can be contrasted or estimated within PROC GLM.

PROC GLM Example: Examining the Error Values
An analysis of a general linear model should include a check of the assumptions about the random error terms. To do this in PROC GLM, you must use an OUTPUT statement. The following statements show how to produce a residual plot for the model above.
  PROC GLM DATA=berry;
    CLASS fertiliz variety;
    MODEL yield=fertiliz variety fertiliz*variety / SOLUTION;
    OUTPUT OUT=results P=pred R=resid;
  RUN; QUIT;
  PROC PLOT DATA=results;
    PLOT resid*pred;
  RUN; QUIT;
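
The hand calculation of the K, Red prediction can also be requested from SAS with ESTIMATE statements. This is a sketch, not part of the lecture. It assumes the level ordering shown in the class level information (K before N, Red before Sweet), so that the interaction coefficients are listed in the order K Red, K Sweet, N Red, N Sweet. The labels in quotes are arbitrary.

```sas
proc glm data=berry;
  class fertiliz variety;
  model yield = fertiliz variety fertiliz*variety / solution;
  /* cell mean for K, Red: intercept + K effect + Red effect + K*Red interaction */
  estimate 'K Red cell mean'
           intercept 1
           fertiliz 1 0
           variety 1 0
           fertiliz*variety 1 0 0 0;
  /* K minus N within the Red variety: (K - N) + (K*Red - N*Red) */
  estimate 'K vs N within Red'
           fertiliz 1 -1
           fertiliz*variety 1 0 -1 0;
run;
quit;
```

With the rounded solutions from the table, the first estimate should be close to 8.60 and the second close to -0.30 + 0.10 = -0.20. SAS works from unrounded values, so small differences are expected.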