Session slides - Kellogg School of Management

Correlation
... beware
Definition
Var(X+Y) = Var(X) + Var(Y) + 2·Cov(X,Y)
Corr(X,Y) = Cov(X,Y) / (StdDev(X) · StdDev(Y))
The correlation between two random variables
is a dimensionless number between 1 and -1.
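The definition can be checked numerically. The sketch below uses a small made-up dataset (the values of x and y are assumptions, not from the slides) and population formulas to compute the correlation and confirm the variance identity.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def var(xs):
    """Variance is the covariance of a variable with itself."""
    return cov(xs, xs)

def corr(xs, ys):
    """Corr(X,Y) = Cov(X,Y) / (StdDev(X) * StdDev(Y))."""
    return cov(xs, ys) / (math.sqrt(var(xs)) * math.sqrt(var(ys)))

# Hypothetical illustrative data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
total = [a + b for a, b in zip(x, y)]

r = corr(x, y)
lhs = var(total)                        # Var(X+Y)
rhs = var(x) + var(y) + 2 * cov(x, y)   # Var(X) + Var(Y) + 2 Cov(X,Y)

print(f"Corr(X,Y) = {r:.3f}")
print(f"Var(X+Y) = {lhs:.3f}; Var(X) + Var(Y) + 2 Cov(X,Y) = {rhs:.3f}")
```

Dividing by the two standard deviations removes the units, which is why the result is dimensionless.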
Interpretation
Correlation measures the strength of the linear
relationship between two variables.
• Strength
– not the slope
• Linear
– misses nonlinearities completely
• Two
– shows only “shadows” of multidimensional
relationships
A correlation of +1 would
arise only if all of the
points lined up perfectly.
Stretching the diagram horizontally or
vertically would change the perceived
slope, but not the correlation.
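A minimal sketch of the stretching claim, using made-up data (an assumption, not from the slides): multiplying every x-value by 10 stretches the diagram horizontally, which changes the least-squares slope but leaves the correlation alone.

```python
import math

def sums_of_squares(xs, ys):
    """Return (Sxx, Syy, Sxy): sums of squared and cross deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def corr(xs, ys):
    sxx, syy, sxy = sums_of_squares(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    sxx, _, sxy = sums_of_squares(xs, ys)
    return sxy / sxx

# Hypothetical illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
x_stretched = [10 * v for v in x]   # stretch the horizontal axis

r_before, r_after = corr(x, y), corr(x_stretched, y)
slope_before, slope_after = slope(x, y), slope(x_stretched, y)

print(f"correlation: {r_before:.3f} -> {r_after:.3f}")
print(f"slope:       {slope_before:.3f} -> {slope_after:.3f}")
```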
A positive correlation
signals that large values of
one variable are typically
associated with large
values of the other.
Correlation measures the
“tightness” of the clustering
about a single line.
A negative correlation
signals that large values of
one variable are typically
associated with small
values of the other.
Independent random
variables have a
correlation of 0.
But a correlation of 0
most certainly does
not imply
independence.
Indeed, correlations can
completely miss
nonlinear relationships.
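A small sketch of a relationship that correlation misses entirely. The data are made up for illustration: Y is completely determined by X (Y = X²), yet the correlation is zero because the relationship has no linear component over this symmetric range.

```python
import math

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: x is symmetric about zero, y depends on x exactly.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 2 for v in x]

r = corr(x, y)
print(f"Corr(X, X^2) = {r:.3f}")   # zero, although Y is a function of X
```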
Correlations Show (only)
Two-Dimensional Shadows
In the motorpool case, the correlations between Age and Cost, and between
Make and Cost, show precisely what the manager’s two-dimensional tables
showed:
          Costs   Mileage      Age     Make
Costs     1.000     0.771    0.023   -0.240
Mileage   0.771     1.000   -0.496   -0.478
Age       0.023    -0.496    1.000    0.164
Make     -0.240    -0.478    0.164    1.000
There’s little linkage directly between Age and Cost.
Fords had higher average costs than did Hondas.
But each of these facts is due to the confounding effect of Mileage!
The pure effect of each variable on its own is only revealed in the most complete model.
Potential for Misuse
(received via email from a former student, all employer references
removed)
“One of the pieces of the research is to identify key attributes that
drive customers to choose a vendor for buying office products.
“The market research guy that we have hired (he is an MBA/PhD from
Wharton) says the following:
“‘I can determine the relative importance of various attributes that
drive overall satisfaction by running a correlation of each one of them
against overall satisfaction score and then ranking them based on the
(correlation) coefficient scores.’
“I am not really certain if we can do that. I would tend to think we
should run a regression to get relative weightage.”
Customer Satisfaction
• Consider overall customer satisfaction (on a 100-point scale) with
a Web-based provider of customized software as the order
leadtime (in days), product acquisition cost, and availability of
online order-tracking (0 = not available, 1 = available) vary.
• Here are the correlations:




Correlations with Satisfaction
leadtime       -0.766
ol-tracking    -0.242
cost            0.097
Customers forced to wait are unhappy.
Those without access to online order tracking are more satisfied.
Those who pay more are somewhat happier.
?????
The Full Regression
Regression: satisfaction
                     constant   leadtime   cost      ol-track
coefficient          192.7338   -6.8856    -1.8025   8.5599
std error of coef    16.1643    0.5535     0.3137    4.0729
t-ratio              11.9234    -12.4391   -5.7453   2.1017
significance         0.0000%    0.0000%    0.0000%   4.0092%
beta-weight                     -1.0879    -0.4571   0.1586
standard error of regression     13.9292
coefficient of determination     75.03%
adjusted coef of determination   73.70%
Customers dislike high cost, and like online order tracking.
Why does customer satisfaction vary? Primarily because
leadtimes vary; secondarily, because cost varies.
Reconciliation
               satisfaction   leadtime     cost   ol-tracking
satisfaction          1.000     -0.766    0.097        -0.242
leadtime             -0.766      1.000   -0.543         0.465
cost                  0.097     -0.543    1.000        -0.230
ol-tracking          -0.242      0.465   -0.230         1.000
• Customers can pay extra for expedited service (shorter
leadtime at moderate extra cost), or for express service
(shortest leadtime at highest cost)
– Those who chose to save money and wait longer ended up
(slightly) regretting their choice.
• Most customers who chose rapid service weren’t given
access to order tracking.
– They didn’t need it, and were still happy with their fast
deliveries.
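The two views can be reconciled arithmetically. In a regression on standardized variables, the correlation of each explanatory variable with the dependent variable equals its own beta-weight plus the other beta-weights, each multiplied by the correlation between the two explanatory variables. The sketch below applies that identity to the beta-weights and correlations reported on the slides; the dictionary layout and function names are illustrative assumptions.

```python
# Beta-weights from the full regression (values from the slides).
beta = {"leadtime": -1.0879, "cost": -0.4571, "ol-tracking": 0.1586}

# Correlations among the explanatory variables (values from the slides).
pair_corr = {
    ("leadtime", "cost"): -0.543,
    ("leadtime", "ol-tracking"): 0.465,
    ("cost", "ol-tracking"): -0.230,
}

def r(a, b):
    """Correlation between two explanatory variables (1.0 with itself)."""
    if a == b:
        return 1.0
    return pair_corr[(a, b)] if (a, b) in pair_corr else pair_corr[(b, a)]

# Implied correlation with satisfaction: direct effect plus the indirect
# effects carried by the correlated explanatory variables.
implied = {v: sum(beta[u] * r(v, u) for u in beta) for v in beta}

for name, value in implied.items():
    print(f"{name:12s} beta = {beta[name]:+.4f}   implied corr = {value:+.3f}")
```

For cost, the direct effect (-0.4571) is outweighed by the indirect effect through leadtime (-1.0879 × -0.543), so the simple correlation comes out slightly positive, about +0.097, matching the "Correlations with Satisfaction" slide.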
Finally …
• The correlations between the explanatory variables can help flesh out the
“story.”
• In a “simple” (i.e., one explanatory variable) regression:
– The (meaningless) beta-weight is the correlation between the two variables.
– The square of the correlation is the unadjusted coefficient of determination (r-squared).
If you give me a correlation, I’ll interpret it by squaring it and looking at it as a coefficient of determination.
Regression: Costs
                     constant     Mileage
coefficient          364.476942   19.812076
std error of coef    76.8173302   4.54471998
t-ratio              4.7447       4.3594
significance         0.0383%      0.0774%
beta-weight                       0.7706
standard error of regression     73.8638412
coefficient of determination     59.38%
adjusted coef of determination   56.26%
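Both simple-regression facts can be verified directly. The sketch below first squares the Costs/Mileage beta-weight reported above, then repeats the check on a small made-up dataset (an assumption, not the motorpool data) by fitting the regression from scratch.

```python
import math

def simple_regression(xs, ys):
    """Return (correlation, beta-weight, r-squared) for y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = my - b * mx                    # intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / syy          # unadjusted coefficient of determination
    beta_weight = b * math.sqrt(sxx / syy)   # slope in standard-deviation units
    correlation = sxy / math.sqrt(sxx * syy)
    return correlation, beta_weight, r_squared

# Check against the slide: beta-weight 0.7706, r-squared 59.38%.
slide_r_squared = 0.7706 ** 2
print(f"0.7706 squared = {slide_r_squared:.4f}")

# Hypothetical illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 2.5, 5.0, 4.5, 7.5, 6.0]
correlation, beta_weight, r_squared = simple_regression(x, y)
print(f"corr = {correlation:.4f}, beta-weight = {beta_weight:.4f}, "
      f"r-squared = {r_squared:.4f}")
```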
A Pharmaceutical Ad
Correlation(anxiety,depression) = 0.7
[Scatterplot: anxiety (vertical axis, 0 to 100) against depression (horizontal axis, 0 to 100)]
Diagnostic scores from sample of patients receiving psychiatric care
So, if your patients have anxiety problems, consider prescribing our antidepressant!
Evaluation
• At most 49% of the variability in patients’
anxiety levels can potentially be explained by
variability in depression levels.
– “potentially” = might actually be explained by
something else which covaries with both.
• The regression provides no evidence that
changing a patient’s depression level will
cause a change in their anxiety level.
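A simulation sketch of the "something else which covaries with both" possibility. Everything here is a made-up assumption: a hidden factor drives both scores, and depression has no effect on anxiety by construction, yet the two scores still correlate at about 0.7.

```python
import math
import random

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)        # fixed seed so the run is repeatable
n = 5000
noise_sd = math.sqrt(3 / 7)   # chosen so the expected correlation is 0.7

# Hypothetical hidden factor that drives both scores.
hidden = [rng.gauss(0, 1) for _ in range(n)]
anxiety = [z + rng.gauss(0, noise_sd) for z in hidden]
depression = [z + rng.gauss(0, noise_sd) for z in hidden]

r = corr(anxiety, depression)
max_share = 0.7 ** 2   # share of variability that could be explained at r = 0.7

print(f"simulated correlation: {r:.2f}")
print(f"at most {max_share:.0%} of variability potentially explained")
```

In this simulated world, treating depression would leave anxiety unchanged, because neither score feeds into the other.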
Association vs. Causality
• Polio and Ice Cream
• Regression (and correlation) deal only with association
– Example: Greater values for annual mileage are typically
associated with higher annual maintenance costs.
– No matter how “good” the regression statistics look, they will
not make the case that greater mileage causes greater costs.
– If you believe that driving more during the year causes higher
costs, then it’s fine to use regression to estimate the size of the
causal effect.
• Evidence supporting causality comes only from controlled
experimentation.
– This is why macroeconomists continue to argue about which
aspects of public policy are the key drivers of economic growth.
– It’s also why the cigarette companies won all the lawsuits filed
against them for several decades.
Structural Variations
• Interactions
– When the effect of one explanatory variable on the
dependent variable depends on the value of another
explanatory variable
• The “trick”: Introduce the product of the two as a new
artificial explanatory variable.
• Nonlinearities
– When the impact of an explanatory variable on the
dependent variable “bends”
• The “trick”: Introduce the square of that variable as a new
artificial explanatory variable.
Interactions: Summary
• When the effect (i.e., the coefficient) of one explanatory
variable on the dependent variable depends on the value
of another explanatory variable
– Signaled only by judgment
– The “trick”: Introduce the product of the two as a new artificial
explanatory variable. After the regression, interpret in the
original “conceptual” model.
– For example, Cost = a + (b1 + b2·Age)·Mileage + … (rest of model)
– The latter explanatory variable (in the example, Age) might or
might not remain in the model
– Cost: We lose a meaningful interpretation of the beta-weights
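The interaction "trick" can be sketched on synthetic data. The coefficients, the noise-free data, and the small least-squares helper below are all assumptions for illustration: the product Age × Mileage is added as an artificial column, and after fitting, the result is read back in the conceptual form Cost = a + (b1 + b2·Age)·Mileage + c·Age.

```python
def fit_least_squares(rows, ys):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for i in range(k):
            if i != col:
                factor = a[i][col] / a[col][col]
                a[i] = [u - factor * v for u, v in zip(a[i], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Hypothetical "true" model used to generate the data.
TRUE_A, TRUE_B1, TRUE_B2, TRUE_C = 100.0, 2.0, 0.5, 3.0
cases = [(age, miles) for age in (1, 2, 3, 4, 5) for miles in (5, 10, 15, 20)]
costs = [TRUE_A + (TRUE_B1 + TRUE_B2 * age) * miles + TRUE_C * age
         for age, miles in cases]

# Columns: constant, Mileage, Age, and the artificial product Age*Mileage.
rows = [[1.0, miles, age, age * miles] for age, miles in cases]
a, b1, c, b2 = fit_least_squares(rows, costs)

def mileage_slope(age):
    """Effect of one more unit of Mileage on Cost, for a car of this Age."""
    return b1 + b2 * age

for age in (1, 3, 5):
    print(f"Age {age}: each unit of Mileage adds {mileage_slope(age):.2f} to Cost")
```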
Examples from the Sample Exams
Caligula’s Castle:
Regression: Revenue
              constant   Age        Age²      Sex        Direct     Indirect   SexInd
coefficient   -1224.84   62.37502   -0.5201   -121.899   1.992615   0.85276    1.43767
Revenuepred = -1224.84 + 62.37·Age - 0.5201·Age² - 121.9·Sex + 1.99·Direct + (0.8528 + 1.4377·Sex)·Indirect
revenue / $ incentive   direct   indirect
Men (Sex=0)             $1.99    $0.85
Women (Sex=1)           $1.99    $2.29
Give direct incentives (house chips, etc.) to men.
Give indirect incentives (flowers, meals) to women.
The Age effect on Revenue is greatest at
Age = -(62.37)/(2(-0.5201)) = 59.96 years
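The Caligula's Castle figures follow mechanically from the coefficients. The sketch below plugs the reported coefficients into the conceptual model; the dictionary keys and function names are illustrative assumptions.

```python
# Coefficients as reported in the regression output.
coef = {
    "constant": -1224.84,
    "Age": 62.37502,
    "Age2": -0.5201,      # coefficient of Age squared
    "Sex": -121.899,
    "Direct": 1.992615,
    "Indirect": 0.85276,
    "SexInd": 1.43767,    # coefficient of the product Sex * Indirect
}

def revenue_per_direct_dollar(sex):
    """Direct incentives have no interaction term, so the effect is constant."""
    return coef["Direct"]

def revenue_per_indirect_dollar(sex):
    """Effect of Indirect depends on Sex: 0.85276 + 1.43767 * Sex."""
    return coef["Indirect"] + coef["SexInd"] * sex

# Vertex of the parabola in Age: -b / (2c).
peak_age = -coef["Age"] / (2 * coef["Age2"])

for label, sex in (("Men", 0), ("Women", 1)):
    print(f"{label}: direct ${revenue_per_direct_dollar(sex):.2f}, "
          f"indirect ${revenue_per_indirect_dollar(sex):.2f}")
print(f"Age effect peaks at {peak_age:.2f} years")
```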
Examples from the Sample Exams
Hans and Franz:
Regression: CustSat
              constant     Wait         Wait²        Size         Franz?       SizeFranz?
coefficient   84.4016931   -0.8666595   -0.0556165   -5.6022949   -40.084506   8.77474654
CustSatpred = 84.40 - 0.8667·Wait - 0.0556·Wait² - 5.602·Size + (-40.0845 + 8.7747·Size)·Franz?
Set Franz? = 0 (assign Hans) when the
party size is < 40.0845/8.7747 = 4.568
Customers’ anger grows more quickly the longer they wait
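Both conclusions for Hans and Franz can be reproduced from the coefficients. In this sketch (names are illustrative assumptions), the Franz effect is the bracketed term of the conceptual model, and the effect of one more unit of waiting is the derivative of the Wait terms.

```python
# Coefficients as reported in the regression output.
coef = {
    "constant": 84.4016931,
    "Wait": -0.8666595,
    "Wait2": -0.0556165,      # coefficient of Wait squared
    "Size": -5.6022949,
    "Franz": -40.084506,      # Franz? = 1 if Franz is assigned, 0 for Hans
    "SizeFranz": 8.77474654,  # coefficient of the product Size * Franz?
}

def franz_effect(size):
    """Change in predicted CustSat from assigning Franz instead of Hans."""
    return coef["Franz"] + coef["SizeFranz"] * size

def wait_slope(wait):
    """Marginal effect of waiting: derivative of b*Wait + c*Wait^2."""
    return coef["Wait"] + 2 * coef["Wait2"] * wait

# Party size at which the two maitres d' are predicted to do equally well.
break_even_size = -coef["Franz"] / coef["SizeFranz"]

print(f"break-even party size: {break_even_size:.3f}")
for size in (2, 4, 5, 8):
    better = "Franz" if franz_effect(size) > 0 else "Hans"
    print(f"party of {size}: Franz effect {franz_effect(size):+.2f} -> assign {better}")
for wait in (0, 5, 10):
    print(f"after waiting {wait}: one more unit changes CustSat by {wait_slope(wait):.2f}")
```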
Nonlinearity: Summary
• When the direct relationship between an explanatory variable
and the dependent variable “bends”
– Signaled by a “U” in a plot of the residuals against an explanatory
variable
– The “trick”: Introduce the square of that variable as a new artificial
explanatory variable: Y = a + bX + cX² + … (rest of model)
– One trick can capture 6 different nonlinear “shapes”
– Always keep the original variable (the linear term, with coefficient
“b”, allows the parabola to take any horizontal position)
– The sign of c gives the direction of the bend (positive = upward-bending parabola, negative = downward-bending)
– -b/(2c) indicates where the vertex (either maximum or minimum) of
the parabola occurs
– Cost: We lose a meaningful interpretation of the beta-weights
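The nonlinearity "trick" can be sketched end to end on synthetic data. The data-generating curve and the least-squares helper are assumptions for illustration: a straight-line fit leaves a "U" in the residuals, and adding the square of X recovers the bend and its vertex at -b/(2c).

```python
def fit_least_squares(rows, ys):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for i in range(k):
            if i != col:
                factor = a[i][col] / a[col][col]
                a[i] = [u - factor * v for u, v in zip(a[i], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Hypothetical upward-bending relationship: Y = 10 - 6X + 0.5X^2.
xs = [float(v) for v in range(11)]
ys = [10 - 6 * x + 0.5 * x ** 2 for x in xs]

# Step 1: straight-line fit, then inspect the residuals for a "U".
a_lin, b_lin = fit_least_squares([[1.0, x] for x in xs], ys)
residuals = [y - (a_lin + b_lin * x) for x, y in zip(xs, ys)]

# Step 2: keep X and add its square as an artificial explanatory variable.
a, b, c = fit_least_squares([[1.0, x, x ** 2] for x in xs], ys)
vertex = -b / (2 * c)
direction = "upward" if c > 0 else "downward"

print("residuals from the straight line:", [round(e, 1) for e in residuals])
print(f"quadratic fit: a = {a:.2f}, b = {b:.2f}, c = {c:.2f} ({direction}-bending)")
print(f"vertex at X = {vertex:.2f}")
```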
Sample Datasets
• Four datasets continuing to review material
from Session 2, with some added modeling
issues.
• Two very thorough sample exams.
– One based on Harrah’s success in understanding
its patrons
– One based on a restaurateur comparing
maitres d’hotel, with a 90-minute prerecorded
Webex tutorial
