### run

```1
1 – Intro & Hist. - Na Chan
2 – Basics of ANOVA - Alla Tashlitsky
3 - Data Collection - Bryan Rong
4 - Checking Assumptions in SAS - Junying Zhang
5 - 1-Way ANOVA derivation - Yingying Lin and Wenyi Dong
6 - 1-Way ANOVA in SAS - Yingying Lin and Wenyi Dong
7 - 2-Way ANOVA derivation - Peng Yang
8 - 2-Way ANOVA in SAS - Phil Caffrey and Yin Diao
9 - Multi-Way ANOVA Derivation - Michael Biro
10 - ANOVA and Regression – Cris (Jiangyang) Liu
2
3
USES OF T-TEST
• A one-sample location test of whether the
mean of a normally distributed population has
a value specified in a null hypothesis.
• A two sample location test of the null
hypothesis that the means of two normally
distributed populations are equal
4
USES OF T-TEST
• A test of the null hypothesis that the
difference between two responses measured
on the same statistical unit has a mean value
of zero
• A test of whether the slope of a regression
line differs significantly from 0
5
BACKGROUND
• If comparing means among > 2 groups, 3 or
more t-tests are needed
-Time-consuming (Number of t-tests increases)
-Inherently flawed (Probability of making a
Type I error increases)
6
RONALD A. FISHER
• Biologist
• Eugenicist
• Geneticist
• Statistician

Informally used by
researchers in the 1800s

Formally proposed by
Ronald A. Fisher in 1918
“A genius who almost single-handedly created
the foundations for modern statistical science”
- Anders Hald
“The greatest of Darwin's successors”
-Richard Dawkins
7
HISTORY
• Fisher proposed a formal analysis of variance in
his paper The Correlation Between Relatives on
the Supposition of Mendelian Inheritance in 1918.
• His first application of the analysis of variance
was published in 1921.
• ANOVA became widely known after being included in
Fisher's 1925 book Statistical Methods for
Research Workers.
8
DEFINITION
• An abbreviation for: ANalysis Of VAriance
• A procedure for comparing the means of k
independent groups, where k is 2 or
greater.
9
ANOVA and T-TEST
• ANOVA and T-Test are similar
-Compare means between groups
• 2 groups: both work
• More than 2 groups: ANOVA is better
10
TYPES
• ANOVA - analysis of variance
– One way (F-ratio for 1 factor )
– Two way (F-ratio for 2 factors)
• ANCOVA - analysis of covariance
• MANOVA - multivariate analysis of variance
11
APPLICATION
• Biology
• Microbiology
• Medical Science
• Computer Science
• Industry
• Finance
12
13
Definition
• ANOVA can determine whether there is a significant
relationship between variables. It is also used to
determine whether a measurable difference exists
between two or more sample means.
• Objective: To identify important independent variables
(predictor variables – xi’s) and determine how they
affect the response variable.
• Whether one-way, two-way, or multi-way ANOVA applies depends on the
number of independent variables (factors) in the
experiment.
14
Model & Assumptions
• yi = β0 + β1 xi + εi (Simple Model)
• E(εi) = 0
• Var(ε1) = Var(ε2) = … = Var(εk): homoscedasticity
• All εi’s are independent.
• εi ~ N(0, σ²)
15
Classes of ANOVA
1. Fixed Effects: concrete (e.g. sex,
age)
2. Random Effects: representative
sample (e.g. treatments, locations,
tests)
3. Mixed Effects: combination of fixed
and random
16
Procedure
• H0: µ1=µ2=…=µk vs
Ha: at least one of the equalities does not hold
• F = MSR/MSE ~ f(k, n-(k+1)) under H0; F = t² when there are only 2
means
– With 2 means, mean square regression: MSR = SSR/1 and
mean square error: MSE = SSE/(n-2)
• The rejection region for a given significance level α
is F > f(k, n-(k+1), α)
17
Regression
• SST (sum of squares total) = SSR (sum of
squares regression) + SSE (sum of squares
error)
• SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
• Sample variance: S² = MSE = SSE/(n-k) →
Unbiased estimator for σ²
18
Mean
Variation
19
20
Data Collection
• 3 industries – Application Software, Credit
Service, Apparel Stores
• Sample 15 stocks from each industry
• For each stock, we observed the last 30 days
and calculated
– Mean daily percentage change
– Mean daily percentage range
– Mean Volume
21
Application software
• CA, Inc. [CA]
• Compuware Corporation [CPWR]
• Deltek, Inc. [PROJ]
• Epicor Software Corporation [EPIC]
• Fundtech Ltd. [FNDT]
• Intuit Inc. [INTU]
• Lawson Software, Inc. [LWSN]
• Microsoft Corporation [MSFT]
• MGT Capital Investments, Inc. [MGT]
• Magic Software Enterprises Ltd. [MGIC]
• SAP AG [SAP]
• Sonic Foundry, Inc. [SOFO]
• RealPage, Inc. [RP]
• Red Hat, Inc. [RHT]
• VeriSign, Inc. [VRSN]
22
Credit Service
• American Express Company [AXP]
• Asset Acceptance Capital Corp. [AACC]
• Capital One Financial Corporation [COF]
• CapitalSource Inc. [CSE]
• Cash America International, Inc. [CSH]
• Discover Financial Services [DFS]
• Equifax Inc. [EFX]
• Global Cash Access Holdings, Inc. [GCA]
• Federal Agricultural Mortgage Corporation [AGM]
• Intervest Bancshares Corporation [IBCA]
• Manhattan Bridge Capital, Inc. [LOAN]
• MicroFinancial Incorporated [MFI]
• Moody's Corporation [MCO]
23
APPAREL STORES
• Abercrombie & Fitch Co. [ANF]
• American Eagle Outfitters, Inc. [AEO]
• bebe stores, inc. [BEBE]
• DSW Inc. [DSW]
• Express, Inc. [EXPR]
• J. Crew Group, Inc. [JCG]
• New York & Company, Inc. [NWY]
• Nordstrom, Inc. [JWN]
• Pacific Sunwear of California, Inc. [PSUN]
• The Gap, Inc. [GPS]
• The Buckle, Inc. [BKE]
• The Children's Place Retail Stores, Inc. [PLCE]
• The Dress Barn, Inc. [DBRN]
• The Finish Line, Inc. [FINL]
• Urban Outfitters, Inc. [URBN]
24
25
26
Final Data look
27
28
Major Assumptions of Analysis of
Variance
• The Assumptions
– Normal populations
– Independent samples
– Equal (unknown) population variances
• Our Purpose
– Examine these assumptions by graphical analysis of the residuals
29
Residual plot
• Violations of the basic assumptions and model adequacy can
be easily investigated by the examination of residuals.
• We define the residual for observation j in treatment i as
e_{ij} = y_{ij} - \hat{y}_{ij}
• If the model is adequate, the residuals should be
structureless; that is, they should contain no obvious
patterns.
30
Normality
• Why normal?
– ANOVA is an Analysis of Variance
– Analysis of two variances, more specifically, the ratio of two variances
– Statistical inference is based on the F distribution, which is the distribution of
the ratio of two independent chi-squared variables, each divided by its degrees of freedom
– No surprise that each variance in the ANOVA ratio comes from a parent
normal distribution
• Normality is only needed for statistical inference.
31
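SAS code for getting residuals
PROC IMPORT datafile = 'C:\Users\junyzhang\Desktop\mydata.xls' out = stock;
RUN;
PROC PRINT DATA=stock;
RUN;
/* Fit the one-way model and save the fitted values and residuals. */
PROC GLM data=stock;
CLASS indu;
/* The MODEL statement was missing; the response is assumed to be ADPC, as on the later slides. */
MODEL ADPC = indu;
OUTPUT out=stock1 p=yhat r=resid;
RUN;
PROC PRINT DATA=stock1;
RUN;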
32
Normality test
The normal plot of the residuals is used to check
the normality assumption.
proc univariate data= stock1
normal plot;
var resid;
run;
33
Normality Tests
Tests for Normality
Test                 --Statistic---     -----p Value------
Shapiro-Wilk         W      0.731203    Pr < W      <0.0001
Kolmogorov-Smirnov   D      0.206069    Pr > D      <0.0100
Cramer-von Mises     W-Sq   1.391667    Pr > W-Sq   <0.0050
Anderson-Darling     A-Sq   7.797847    Pr > A-Sq   <0.0050
[Figure: Normal Probability Plot; vertical axis from 0.25 to 8.25, horizontal axis from -2 to +2]
Tests for Normality
Test                 --Statistic---     -----p Value------
Shapiro-Wilk         W      0.989846    Pr < W      0.6521
Kolmogorov-Smirnov   D      0.057951    Pr > D      >0.1500
Cramer-von Mises     W-Sq   0.03225     Pr > W-Sq   >0.2500
Anderson-Darling     A-Sq   0.224264    Pr > A-Sq   >0.2500
[Figure: Normal Probability Plot; vertical axis from -2.1 to 2.3, horizontal axis from -2 to +2]
34
Normality Tests
35
Independence
• Independent observations
– No correlation between error terms
– No correlation between independent variables and error
• Positively correlated data inflate the standard
error
– The estimates of the treatment means are more accurate than the
standard error shows.
36
SAS code for independence test
The plot of the residual against the factor is used
to check the independence.
proc plot;
plot resid* indu;
run;
37
Independence Tests
38
Homogeneity of Variances
• Eisenhart (1947) describes the problem of unequal
variances as follows
– The ANOVA model is based on the proportion of the mean
squares of the factors and the residual mean squares
– The residual mean square is the unbiased estimator of σ², the
variance of a single observation
– The between treatment mean squares takes into account not only
the differences between observations, σ², just like the residual
mean squares, but also the variance between treatments
– If there was non-constant variance among treatments, we can
replace the residual mean square with some overall variance, σa²,
and a treatment variance, σt², which is some weighted version of
σa²
– The “neatness” of ANOVA is lost
39
SAS code for Homogeneity of Variances test
The plot of residuals against the fitted value is
used to check constant variance assumption.
proc plot;
plot resid* yhat;
run;
40
Data with homogeneity of Variances
41
Tests for Homogeneity of Variances
42
– Normal populations
– Nearly independent samples
– Equal (unknown) population variances
So we can employ ANOVA to analyze our data.
43
44
Derivation – 1-Way ANOVA
• Hypotheses
– H0: μ= μ1 = μ2 = μ3 = … = μn
– H1: μi ≠ μj for some i,j
• We assume that the jth observation in group i is
related to the mean by xij = μ+ (μi – μ) + εij, where εij
is a random noise term.
• We wish to separate the variability of the individual
observations into parts due to differences between
groups and individual variability
45
Derivation – 1-Way ANOVA – Cont’
46
Derivation – 1-Way ANOVA – Cont’
• Using the above equation, we define
• We can show that
47
Derivation – 1-Way ANOVA – Cont’
• Given the distributions of the MSS values, we
can reject the null hypothesis if the between
group variance is significantly higher than the
within group variance. That is,
• We reject the null hypothesis if F > f(n-1, N-n, α)
48
Brief Summary Statistics
• Code
proc means data=stock maxdec=5 n mean std;
by industry;   /* BY requires the data to be sorted by industry */
var ADPC;
run;
Gets simple summary statistics (sample size,
sample mean and SD of each industry) with a
maximum of 5 decimal places
49
Brief Summary Statistics
• Output
Industry               N    Mean      Std Dev
Apparel Stores         15   0.00253   0.00356
Application Software   15   0.00413   0.00742
Credit Service         15   0.00135   0.00443
50
Data Plot
• Code
proc plot data=stock;
plot industry*ADPC; run;
Produces crude graphical output
51
Data Plot
• Output
[Figure: Plot of industry*ADPC. Legend: A = 1 obs, B = 2 obs, D = 4 obs. Vertical axis: industry (CreditSe, Applicat, ApparelS); horizontal axis: ADPC, from -0.015 to 0.025.]
52
One Way ANOVA Test
• Code
proc anova data=stock;
class industry;
model ADPC = industry;
means industry/tukey cldiff;
means industry/tukey lines;
run;
• The class statement indicates that “industry” is a factor.
• The model assumes “industry” influences average daily percentage change (ADPC).
• tukey cldiff: multiple comparison by Tukey’s method; gives the actual confidence intervals.
• tukey lines: gives a pictorial display of the comparisons.
53
GLM analysis
• Code
proc glm data=stock;
class industry;
model ADPC = industry;
output out=stockfit p=yhat r=resid;
run;
This procedure is similar to 'proc anova', but
'glm' also allows residual plots, at the cost of more verbose
output.
54
One Way ANOVA Test
• Output
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   0.00005833       0.00002916    1.00      0.3757
Error             42   0.00122217       0.00002910
Corrected Total   44   0.00128050

R-Square   Coeff Var   Root MSE   ADPC Mean
0.045552   201.8054    0.005394   0.002673

Source     DF   Anova SS     Mean Square   F Value   Pr > F
industry    2   0.00005833   0.00002916    1.00      0.3757
55
One Way ANOVA Test
Tukey's Studentized Range (HSD) Test for ADPC
Alpha                                 0.05
Error Degrees of Freedom              42
Error Mean Square                     .000029
Critical Value of Studentized Range   3.43582
Minimum Significant Difference        .0048
56
One Way ANOVA Test
Industry Comparison    Difference Between Means   Simultaneous 95% Confidence Limits
Applicat - ApparelS     0.001601                  -0.003184   0.006387
Applicat - CreditSe     0.002778                  -0.002008   0.007563
ApparelS - Applicat    -0.001601                  -0.006387   0.003184
ApparelS - CreditSe     0.001177                  -0.003609   0.005962
CreditSe - Applicat    -0.002778                  -0.007563   0.002008
CreditSe - ApparelS    -0.001177                  -0.005962   0.003609
57
Univariate Procedure
• Code
proc univariate data=stockfit plot normal;
var resid;
run;
We use proc univariate to produce
the stem-and-leaf and normal
probability plots, and we use the stem-and-leaf plot to visualize the overall
distribution of a variable.
58
Univariate Procedure
• Output
Moments
N                 45           Sum Weights        45
Mean              0            Sum Observations   0
Std Deviation     0.00527035   Variance           0.00002778
Skewness          1.33008795   Kurtosis           5.46395169
Uncorrected SS    0.00122217   Corrected SS       0.00122217
Coeff Variation   .            Std Error Mean     0.00078566
59
Tests for Location: Mu0=0
Test          -Statistic-   -----p Value-----
Student's t   t    0        Pr > |t|    1.0000
Sign          M   -1.5      Pr >= |M|   0.7660
Signed Rank   S   -43.5     Pr >= |S|   0.6288
60
Basic Statistical Measures
Location             Variability
Mean      0.00000    Std Deviation         0.00527
Median   -0.00048    Variance              0.0000278
Mode      .          Range                 0.03389
                     Interquartile Range   0.00623
61
Tests for Normality
Test                 --Statistic---     -----p Value-----
Shapiro-Wilk         W      0.904256    Pr < W      0.0013
Kolmogorov-Smirnov   D      0.112584    Pr > D      >0.1500
Cramer-von Mises     W-Sq   0.096018    Pr > W-Sq   0.1266
Anderson-Darling     A-Sq   0.781507    Pr > A-Sq   0.0410
62
Quantiles
Quantile      Estimate
100% Max       0.021509105
99%            0.021509105
95%            0.007261567
90%            0.005106613
75% Q3         0.002667399
50% Median    -0.000477723
25% Q1        -0.003565176
10%           -0.004824061
5%            -0.005444811
1%            -0.012376248
0% Min        -0.012376248
63
Extreme Observations
-------Lowest------    -------Highest-----
Value          Obs     Value          Obs
-0.01237625    41      0.00510661     6
-0.00807339    25      0.00596875     34
-0.00544481    13      0.00726157     29
-0.00483936    3       0.00814126     27
-0.00482406    28      0.02150911     22
64
Stem Leaf Plot and Boxplot
Stem Leaf        #   Boxplot
  20 5           1      *
  18
  16
  14
  12
  10
   8 1           1      |
   6 03          2      |
   4 4561        4      |
   2 0027922     7   +-----+
   0 334669      6   |  +  |
  -0 9809753     7   *-----*
  -2 97688551    8   +-----+
  -4 4888772     7      |
  -6                    |
  -8 1           1      |
 -10                    |
 -12 4           1      |
     ----+----+----+----+
Multiply Stem.Leaf by 10**-3
65
Plot
• Code
proc plot;
plot resid*industry;
plot resid*yhat;
run;
Plot the residuals against industry, and the
residuals against the fitted ADPC value.
66
Normal Probability Plot
[Figure: Normal probability plot of the residuals; vertical axis from -0.013 to 0.021, horizontal axis from -2 to +2]
67
Graph
[Figure: Plot of resid*industry. Legend: A = 1 obs, B = 2 obs, D = 4 obs. Vertical axis: resid, from -0.015 to 0.025; horizontal axis: industry (ApparelS, Applicat, CreditSe).]
68
Plot of resid*yhat
[Figure: Plot of resid*yhat. Legend: A = 1 obs, B = 2 obs, D = 4 obs. Vertical axis: resid, from -0.015 to 0.025; horizontal axis: yhat, from 0.0010 to 0.0035.]
69
Conclusion
• The one-way ANOVA test gives F = 1.00 and p = 0.3757.
Since the p-value exceeds the 0.05 significance level, we fail to
reject the null hypothesis: there is no evidence of a
difference between the mean daily
percentage changes of stocks in different
industries. Thus, these data give no evidence that it matters
which industry's stocks we buy.
70
71
We now have two factors (A & B)
Totaling
Tests to Conduct
72
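Linear Model
y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \quad i = 1,\dots,a;\ j = 1,\dots,b;\ k = 1,\dots,n
Dot Notation
\bar{\mu}_{i.} = \frac{1}{b}\sum_{j=1}^{b} \mu_{ij}, \quad \bar{\mu}_{.j} = \frac{1}{a}\sum_{i=1}^{a} \mu_{ij}, \quad \bar{\mu}_{..} = \frac{1}{ab}\sum_{i=1}^{a}\sum_{j=1}^{b} \mu_{ij}
letting
\mu = \bar{\mu}_{..}, \quad \tau_i = \bar{\mu}_{i.} - \mu, \quad \beta_j = \bar{\mu}_{.j} - \mu, \quad (\tau\beta)_{ij} = \mu_{ij} - \mu - \tau_i - \beta_j
y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}
\sum_{i=1}^{a} \tau_i = \sum_{j=1}^{b} \beta_j = 0, \quad \sum_{i=1}^{a} (\tau\beta)_{ij} = \sum_{j=1}^{b} (\tau\beta)_{ij} = 0 \quad \forall\ i, j.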
73
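Least Square Method
(a levels of factor A, b levels of factor B, n observations per cell)
SST = SSA + SSB + SSAB + SSE
SST = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{...})^2
SSA = bn \sum_{i=1}^{a} (\bar{y}_{i..} - \bar{y}_{...})^2
SSB = an \sum_{j=1}^{b} (\bar{y}_{.j.} - \bar{y}_{...})^2
SSAB = n \sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2
SSE = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij.})^2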
74
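Test Criteria and Rejection Conditions
H_{0A}: \tau_1 = \tau_2 = \dots = \tau_a = 0 vs. H_{1A}: at least one \tau_i \neq 0, i = 1,2,\dots,a.
Reject H_{0A} if F_A = MSA/MSE > f_{a-1,\ ab(n-1),\ \alpha}
H_{0B}: \beta_1 = \beta_2 = \dots = \beta_b = 0 vs. H_{1B}: at least one \beta_j \neq 0, j = 1,2,\dots,b.
Reject H_{0B} if F_B = MSB/MSE > f_{b-1,\ ab(n-1),\ \alpha}
H_{0AB}: (\tau\beta)_{11} = (\tau\beta)_{12} = \dots = (\tau\beta)_{ab} = 0 vs. H_{1AB}: at least one (\tau\beta)_{ij} \neq 0, i = 1,2,\dots,a;\ j = 1,2,\dots,b.
Reject H_{0AB} if F_{AB} = MSAB/MSE > f_{(a-1)(b-1),\ ab(n-1),\ \alpha}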
75
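Pivotal Quantity
H_{0A}: \tau_1 = \tau_2 = \dots = \tau_a = 0 vs. H_{1A}: at least one \tau_i \neq 0, i = 1,2,\dots,a.
y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}
76
Pivotal Quantity (Cont’)
SSE/\sigma^2 \sim \chi^2_{ab(n-1)}
Under H_{0A}: SSA/\sigma^2 \sim \chi^2_{a-1}, independent of SSE
\therefore F_A = \frac{SSA/(a-1)}{SSE/(ab(n-1))} = \frac{MSA}{MSE} \sim F_{a-1,\ ab(n-1)}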
Two-Way ANOVA in SAS
By: Philip Caffrey
&
Yin Diao
78
Model
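• An extension of one-way ANOVA. It provides more
insight into how the two IVs interact and individually
affect the DV. Thus, the main effects and the interaction
effect of the two IVs on the DV need to be tested.
• Model:
y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}
• Hypotheses:
H_{0A}: \tau_1 = \tau_2 = \dots = \tau_a = 0 vs. H_{1A}: at least one \tau_i \neq 0, i = 1,2,\dots,a.
H_{0B}: \beta_1 = \beta_2 = \dots = \beta_b = 0 vs. H_{1B}: at least one \beta_j \neq 0, j = 1,2,\dots,b.
H_{0AB}: (\tau\beta)_{11} = (\tau\beta)_{12} = \dots = (\tau\beta)_{ab} = 0 vs. H_{1AB}: at least one (\tau\beta)_{ij} \neq 0, i = 1,2,\dots,a;\ j = 1,2,\dots,b.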
79
Sum of Squares
The ratio of each term's mean square to the error mean square follows an F
distribution under the corresponding null hypothesis. In this way, we can test whether there
is a main effect or an interaction effect.
SSTOTAL = SSA + SSB + SSINTERACTION + SSERROR
80
Example
Using the same data from the One-Way
analysis, we will now separate the data further
by introducing a second factor, Average Daily
Volume.
81
Example
Factor 1: Industry
• Apparel Stores
• Application Software
• Credit Services
Factor 2: Average Daily Volume
• Low
• Medium
• High
82
Two-Way Design
[Figure: 3 × 3 design grid. Rows (VOLUME): High, Medium, Low. Columns (INDUSTRY): Credit, Apparel, Software. Each cell is repeated 5 times.]
83
Using SAS
SAS code:
PROC IMPORT
DATAFILE='G:\Stony Brok Univ Text Books\AMS Project\Data.xls' OUT=TWOWAY;
RUN;
PROC ANOVA DATA = TWOWAY;
TITLE "ANALYSIS OF STOCK DATA";
CLASS INDUSTRY VOLUME;
MODEL ADPC = INDUSTRY | VOLUME;
MEANS INDUSTRY | VOLUME / TUKEY CLDIFF;
RUN;
84
Using SAS
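/*PLOT THE CELL MEANS*/
/* The CLASS, VAR and PLOT statements are assumed; they were missing from the slide. */
PROC MEANS DATA=TWOWAY NWAY NOPRINT;
CLASS INDUSTRY VOLUME;
VAR ADPC;
OUTPUT OUT=MEANS MEAN=;
RUN;
PROC GPLOT DATA=MEANS;
PLOT ADPC*INDUSTRY=VOLUME;
RUN;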
85
ANOVA Table
Tests of Between-Subjects Effects
Source              Sum of Squares   df   Mean Square   F       Sig.
Corrected Model     .000a             8   3.335E-5      1.184   .335
Industry            6.906E-5          2   3.453E-5      1.226   .305
Volume              9.534E-5          2   4.767E-5      1.693   .198
Industry * Volume   7.950E-5          4   1.988E-5      .706    .593
Error               .001             36   2.816E-5
Corrected Total     .001             44
No significant results
86
Using SAS
To test the main effect of one IV, we
combine all the data across the levels of the other IV.
This is what the one-way ANOVA does.
From the ANOVA we know there are no
significant main effects and no significant interaction effect
of the two IVs.
To check for an interaction effect,
we can plot the means of each cell formed
by the combinations of all levels of the IVs.
87
PLOT OF CELL MEANS
Industry by Average Daily Volume
88
Interpreting the Output
Given that the F tests were not significant we would
normally stop our analysis here.
If the F test is significant, we would want to know
exactly which means are different from each other.
Use Tukey’s Test.
MEANS INDUSTRY | VOLUME / TUKEY CLDIFF;
89
Interpreting the Output
Comparing Means
Comparison            Diff. b/w Means   95% CI
Software - Apparel     0.001601         [-0.003184 0.006387]
Software - Credit      0.002778         [-0.002008 0.007563]
Credit - Apparel      -0.001177         [-0.005962 0.003609]
MedVol. - LowVol.     -0.003698         [-0.008435 0.001038]
Med.Vol. - HighVol.   -0.001252         [-0.005989 0.003484]
HighVol. - LowVol.    -0.002446         [-0.007182 0.002290]
90
Conclusion
• We cannot conclude that there is a significant
difference between any of the group means.
• We find no evidence that the two IVs affect the DV.
91
92
M-way ANOVA
(Derivation)
• Let us have n factors, A1, A2, …, An, with
a1, a2, …, an levels, respectively (each 2 or more). Then there
are N = a1·a2·…·an treatments to conduct,
with treatment i having sample size ni. Let
x_{i1 i2 … in k} be the kth observation from treatment
i1 i2 … in.
• By the assumption for ANOVA, x_{i1 i2 … in k} is a
random variable that follows the normal
distribution. We use the model x_{i1 i2 … in k} = µ_{i1 i2 … in} +
ε_{i1 i2 … in k}, where the (residual) terms ε_{i1 i2 … in k} are i.i.d. and
follow N(0, σ²).
93
M-way ANOVA
(Derivation)
Using “dot notation”, let
,
, …,
,…,
.
Let
,
, where
mean, and
is the grand mean (see above),
is the mean effect of factor
and
is the mean effect of factor subtract by the grand
subtract by the grand mean. Then we can model the above
as a linear equation of
94
M-way ANOVA
(Derivation)
Applying Least Square Estimation we get
Which is the ANOVA Identity,
95
M-way ANOVA
(Derivation)
• These are all distributed as independent χ2
random variables (when multiplied by the
correct constants and when some hypotheses
hold) with d.f. satisfying the equation:
96
M-way ANOVA
(Derivation)
• There are a total of 2^m hypotheses in an m-way ANOVA.
– The null hypothesis, which states that there is no
difference or interaction between factors
– For k from 1 to m, there are mCk alternative
hypotheses about the interaction between every
collection of k factors.
– Then we have 1 + mC1 + mC2 + … + mCm = 2^m by
a well known combinatorial identity.
97
M-way ANOVA
(Derivation)
• These hypotheses are:
At least one
At least one
...
At least one
At least one
...
Test for all combination of
98
M-way ANOVA
(Derivation)
• We want to see if the variability between
groups is larger than the variability within the
groups.
• To do this, we use the F distribution for our
pivotal quantity, and then we can derive the
proper tests, very similar to the 1-way and 2-way tests.
99
M-way ANOVA
(Derivation)
...
...
...
Continue to see whether all combination of
100
RELATIONSHIP BETWEEN
ANOVA and Regression
Presenter: Cris J.Y. Liu
101
• What we know:
– regression is the statistical model that you use to predict
a continuous outcome on the basis of one or more
continuous predictor variables.
– ANOVA compares several groups (usually categorical
predictor variables) in terms of a certain dependent
variable(continuous outcome )
(If there is a mixture of categorical and continuous predictors,
ANCOVA is an alternative method.)
• Take a second look:
They are just different sides of the same coin!
102
Review of ANOVA
• Compare the means of different groups
• n groups, ni elements in the ith group, N elements
in total.
• SST = SSbetween + SSwithin
What if there are two groups, X and Y,
each with n data points?
103
Review of Simple Linear Regression
• We try to find a line y = β0 + β1 x that best fits our
data so that we can calculate the best estimate of y
from x
• It finds the β0 and β1 that minimize the distance Q = Σ (yi − β0 − β1 xi)² between the actual
and estimated scores
• Let the predicted values form one group, while
the other group consists of all the original values.
• It is a special (and also simple) case of ANOVA!
104
Review of Regression
Total       =  Model (Between)  +  Error (Within)
d.f.: n-1   =  d.f.: 2-1 = 1    +  d.f.: n-2
105
ANOVA table of Regression
106
How are they alike?
• If we use the group means as our X values from
which we predict Y, we can see that ANOVA and
regression are the same!
• The group mean is the best prediction of a Y-score.
107
Term comparison
Regression
ANOVA
Dependent variable
Explanatory variable
total mean
SSR
SSE
SSbetween
SSwithin
108
Term comparison
if more than one predictor…..
Regression
ANOVA
Multiple Regression
Multi-way ANOVA
dummy variable
categorical variable
interaction effect
covariance
………………….
……………
109
Notes:
• Both of them are applicable only when
outcome variables are continuous.
• They share basically the same procedure of
checking the underlying assumption.
110
Robust ANOVA
-Taguchi Method
111
What is Robustness?
• The term “robustness” is often used to refer to methods
designed to be insensitive to distributional assumptions
(such as normality) in general, and unusual observations
(“outliers”) in particular.
• Why Robust ANOVA?
• There is always the possibility that some observations may
contain excessive noise.
• excessive noise during experiments might lead to incorrect
inferences.
• Widely used in Quality control
112
Robust ANOVA
• What do we want from robust ANOVA?
Robust ANOVA methods should withstand non-ideal conditions while being no more difficult to
perform than ordinary ANOVA
• The standard technique, the least squares method, is
highly sensitive to unusual observations
113
Robust ANOVA
Our aim is to minimize Σ ρ(yi − xi'β) by choosing β.
In standard ANOVA, we let ρ(x) = x² (least squares);
we can also try some other ρ(x).
114
Least absolute deviation
• It is well-known that the median is much more robust
to outliers than the mean.
• The least absolute deviation (LAD) estimate takes ρ(x) = |x|
• How is LAD related to median?
the LAD estimator determines the “center” of the data
set by minimizing the sum of the absolute deviations
from the estimate of the center, which turns out to be
the median.
• It has been shown to be quite effective in the presence
of fat tailed data
115
M-estimation
• M-estimation is based on replacing ρ(.) with a
function that is less sensitive to unusual
observations than is the quadratic .
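• The “M” stands for “maximum-likelihood-type”: ρ plays the role of a negative log-likelihood.
• LAD, with ρ(x) = |x|, is an example of a robust
M-estimator.
• Another popular choice of ρ is the Tukey bisquare:
ρ(r; c) = 1 − [1 − (r/c)²]³ if |r| ≤ c,
and ρ(r; c) = 1 otherwise, where r is the residual
and c is a constant.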
116
Suggestion
• these robust analyses may not take the place
of standard ANOVA analyses in this context;
• Rather, we believe that the robust analyses
should be undertaken as an adjunct to the
standard analyses
117
118
119
```
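
The one-way ANOVA table in the slides can be cross-checked by hand from the per-industry summary statistics (N, mean, standard deviation) shown on the "Brief Summary Statistics" slide. The following is a minimal Python sketch of that arithmetic, not the presenters' SAS workflow; the function name `one_way_anova_from_summary` and the dictionary keys are illustrative assumptions. Because the slide rounds the means and standard deviations to five decimals, the result only approximately matches the SAS output (F = 1.00 with 2 and 42 degrees of freedom).

```python
# Sketch: rebuild the one-way ANOVA F statistic from group summary statistics.
# Names are illustrative and do not come from the slides.

def one_way_anova_from_summary(groups):
    """groups: list of (n, mean, sd) tuples, one tuple per group."""
    k = len(groups)
    total_n = sum(n for n, _, _ in groups)
    # Grand mean is the sample-size-weighted average of the group means.
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-group sum of squares: (n - 1) * sample variance, summed over groups.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = k - 1
    df_within = total_n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f_value": f_value,
    }

# (n, mean, sd) for Apparel Stores, Application Software, Credit Service,
# copied from the summary statistics slide.
summary = [
    (15, 0.00253, 0.00356),
    (15, 0.00413, 0.00742),
    (15, 0.00135, 0.00443),
]
result = one_way_anova_from_summary(summary)
print(result)
```

Working from summary statistics keeps the sketch independent of the raw stock data, which is not reproduced in the slides; with the raw data, a library routine for one-way ANOVA would also report the p-value.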