### Bi-Variate Regression

```Inferential
Statistics
Inferential Statistics
With inferential statistics you can do the
following:
Determine probability of characteristics of
population based on the characteristics of
your sample. You know what the relationship
“looks like” in the sample. You have the
actual numbers. Inferential statistics tells
you the probability that the same relationship
exists in the population.
Assess the strength of the relationship
between independent and dependent
variables. Inferential statistics allows you to
determine if the relationship is statistically
significant and helps you decide if it is
substantially significant (e.g., is the
relationship strong enough to matter).
Inferential Statistics
Inferential statistics are used to test
hypotheses
Research hypotheses state there is a
relationship between two or more variables in
Initially you do NOT assume there is a
relationship in the population (i.e., null
hypothesis claims there is NO relationship).
You compute a statistic that indicates the
probability that we can reject the null
hypothesis and support the research
hypothesis (e.g., the same relationship we see
in our sample also exists in the population). If it
is 95% probable that the sample relationship
exists in the population, we say it is significant
at the .05 level (5% chance of making an error,
or we say there is also a relationship in the
population from which this sample was drawn).
How Do We Apply What We Learn
From Inferential Statistics?
• BEFORE you use any Intervention, you should
determine if there is evidence that it works.
• For instance, what is the probability that fertilizer will
increase crop yield for farmers?
• BEFORE you work with any group, you may
want to know the characteristics of that group.
• For instance, what proportion of abused women are
eventually killed by their partner? What is the probability
that “risk” increases after they leave the partner?
• BEFORE you make recommendations, you want
to understand the probabilities of success.
• What is the probability that those who participate in 4-H
will graduate from college? Are they significantly more
likely to graduate than those who do not participate?
Following Questions
• Is the Intervention I am currently using worth my time?
• Does it work with 5% or 95% of program participants?
Are participants more likely than general population to
reach goals?
• What factors are most important when attempting to
increase effectiveness of intervention?
• If I use a second intervention, does it increase success by
10%, 15%, or 60%?
• Are characteristics of program participants important?
• Policy Implications - Is it worth the tax payer’s dollars?
• Is this a Spurious Relationship?
• Is there a difference between groups receiving
intervention and those not receiving it
• Is it large enough to warrant use of limited resources?
(substantial significance)
• Is it large enough to argue that this intervention “works”
in different settings/situations?
Example of Applying Information
Learned From Inferential Statistics
You implement an “on-line” program to improve communication
amongst young married couples.
 Using measurement of long term objectives, is it successful?
• What percent of my respondents stayed married? How
much lower is the divorce rate for those who participated
than for general population?
 What factors are most important?
• Does required payment increase success?
• Do older/younger couples respond more
effectively to this counseling?
 Policy Implications - Is it worth Tax Payer’s Dollars?
• Is this relationship spurious (i.e., proactive individuals are
more likely to seek intervention and also more likely to have
good marriages)?
• Is the intervention “cost effective”? Is difference big
enough to matter?
• Is there evidence that this program could work in
other settings/situations or with other couples?
Correlation and
Regression
Stating Hypothesis for
Regression
• Null Hypothesis
• There is not relationship between any of the
independent variables and the dependent variable
• Technically, all of the slopes are zero
• Research Hypothesis
• There is a relationship between at least one of the
independent variables and the dependent variable
• Technically, at least one of the slopes are zero
• This relationship could be positive or negative
Inferential Statistics
First consider bi-variate statistics
Pearson Correlation
When is it used?
When you have a continuous independent
variable and a continuous dependent variable.
How do you interpret it?
When the probability associated with the ___
statistics is .05 or less then you can assume
there is a relationship between the dependent
and independent variable
For instance you may want to know if the
number of hours participants spend in your
program is positively related to their scores on
school exams
* NOTE The Pearson Correlation and Bi-variate Regression are
very similar
Inferential Statistics
First consider bi-variate statistics
Bi-variate Regression
When is it used?
When you have a continuous independent variable
and a continuous dependent (outcome) variable
For instance, you may want to know if the
number of hours participants spend in your
program is positively related to their scores on
school exams
How do you interpret it?
When the probability associated with the Fstatistic is .05 or less then you can assume there
is a relationship between the dependent and the
independent variable
• NOTE The Pearson Correlation and Bi-variate Regression
are very similar
Pearson Correlation
 Consists of a continuous independent and a
continuous dependent variable (i.e., X and Y)
 A Pearson correlation coefficient is used to estimate
the strength of the relationship between X and Y in the
population
 A Pearson correlation coefficient ranges from -1 to +1
 The closer to -1 or +1 it is, the stronger the
relationship between X and Y, and the lower the
probability that we would make a mistake if we
claimed there is a relationship between X and Y in
the population
 A scatter plot can give a visual representation of the
relationship between X and Y
 A scatter plot shows all of the data points/plots
and their relationship, using an X and Y axis
 On the following slide, respondents’ scores on
BETA and SAT were plotted so that there is one
data point for someone who scored 1000 on the
SAT and 12 on the BETA.
Bi-variate scatterplot showing a
strong positive relationship
If all of the data points were on the regression line, then the
correlation coefficient would be 1. This would indicate that if we
know a person’s score on the SAT we can predict their score on
the BETA 100% of the time.
Bi-variate scatterplot showing a
strong negative/inverse relationship
.
The regression line or slope indicates where the data points
would be if you could predict Y after knowing X 100% of the
time. It is the “predicted” Y.
Correlation Matrix
The following slide contains a computer generated
correlation matrix. A correlation matrix can provide
the following information:
 Strength of the relationship between any two of
the variables
 The probability that you would make a mistake if
you claimed any two variables are related in the
population
 At the top of the correlation matrix, the following
information is reported:
 The mean of each continuous variable
 The sample size
 The standard deviation of each continuous
variable
 The range of scores for each continuous
variable
 The standard deviation of each continuous
variable
Pearson Correlation Coefficient
What does it tell us about the strength of the
relationship between X and Y?
Strength of Relationship
r value R2 values
Perfect
1.0
Strong
.8
.64
Moderate
.5
.25
Weak
.2
.04
No Relationship
0
0
Weak
- .2
- .04
Moderate
- .5
- .25
Strong
- .8
- .64
Perfect
-1.0
1.0
-1.0
Strength of relationship (r or R2)
 The closer to 1 or -1 the R and R2 are, the stronger
the relationship. .
Significance
 The stronger the relationship the more likely it is
significant.
Correlation Matrix
SR90 = number of men per every 100 women
TPOV90 = % of people living in poverty
FHH = % of female headed households
EMPMAL = % of males employed
EMPFEM90 = % of females employer
The first number in the matrix (marked by the maroon textbox) is the
correlation coefficient. It indicates the strength of the relationship.
The second number is the probability. It must be .05 or less if you are
to generalize to the population.
There may be a third number in the matrix. It would indicate the
sample size.
Bi-Variate Regression
This is a bi-variate regression printout. It focuses specifically on the
relationship between two of the variables (e.g. FHH90 and TPOV90)
reported in the matrix on the previous slide. Note that the standardized
estimate is the same as the correlation coefficient. If you square this
number (.51215714) you would get .2623 (the R square). If you square
the standardized estimate you always get the R-square. It is the percent
increase in you ability to predict Y if you know X. In this example, your
ability to predict the poverty rate (TPOV90) in a city increases by 26% if
you know the percent of female headed households in that city..
The Prob>F is .0001 This indicates this relationship is significant.
We are more than 99% sure that this relationship exists in the
population.
This is a SAS printout of a Pearson Correlation
Matrix. This matrix reports the relationship between
3 continuous variables (i.e., GPR, grade in school
and number of times the student has skipped class).
Types of Regression
 Bi-Variate Regression
 Continuous dependent variable
 Continuous independent variable
 Relationship between the two can be negative or
inverse, positive, linear or curvilinear.
 Multiple Regression
 Continuous dependent variable
 You use multiple independent variables to predict
a continuous dependent variable
 For instance, you could use number of hours
participating in the program, score on attitude
index and age to predict success in school
(i.e., GPR)
 A variation – you use one or more continuous
independent variables and one categorical
variable to predict a dependent variable—The
categorical variable can have only two categories
(i.e., male or female)
 For instance, you would use gender, number
of hours participating n the program, score on
attitude index and age to predict success in
school (i.e., GPR)
Regression – continued
 Logit Regression
 Categorical dependent variable (two
categories)
 You use continuous independent variables to
predict the probability of falling into one
category or another
 For instance, how does number of hours
studying for exams, age, and number of
classes skipped, influence the probability
that a student will graduate from high
school?
 Graduation is measured simply as did
Regression Printout
This is a copy of a
multiple regression
printout and includes a
brief explanation of the
numbers reported.
Interpreting a regression printout
Pr > F is <.000l indicating we can reject the null hypothesis and at least
one independent variable is significantly related.
R-Square is .5187 indicating that our ability to predict posself (self-esteem
score) increases by 51% if we know the value of all of the independent
variables.
Pr . is less than .0001 for grades (abc) and how much like they other
students (likestu) indicating that these are the independent variables that
are related to posself (self-esteem score).
Interpreting Parameter Estimates
• Parameter estimates are how much Y
changes for every one unit change in X.
From printout on previous slide, we see
that for every grade increase (i.e., from C
to B) then Posself (self esteem score)
increases by .69 or 2/3 of a point.
• If the independent variable is a dummy
variable then we interpret it slightly
differently. It is how much Y changes
when we go from one category to the
other. From printout on previous slide we
see that as we go from the category of
male (coded 0) to female (coded 1) then
Posself decreases by .21688.
Interpreting Parameter Estimates
• Caution – When you interpret a
parameter estimate, you must consider
how you measured the X variables. If
your parameter estimate is 1 and you
measured money in \$10,000 increments,
then for every \$10,000 you spend on your
child, their SAT score would increase by
1. If your parameter estimate is 1 and
you measure money in dollars, then for
every dollar, their SAT score would
increase by 1. \$10,000 would result in an
increase of 10,000 on their SAT.
Predicting Y
 Why and when do we predict Y?
 Explain results to a nonacademic audience
 Explain regression in an interesting way
 How do we predict?
 First generate regression printout
 THEN use the prediction formula –
Y=a+b1x1+b2x2+b3x3+…….Where:
 Y=the intercept or constant
 b=slope or parameter estimate which tells
you the change in Y for every one unit
change in X
 X=value of each independent variable that
you select using the codebook
 Y=the predicted value of your dependent
variable as predicted by the combination
of X variables
Now You can do it
Contact Information
Dr. Carol Albrecht
USU Extension
Assessment Specialist
[email protected]
979-777-2421
```