Descriptive Statistics In SAS

Exploring Your Data
Summary Statistics
Before you begin any analyses, you will
want to get a feel for your data. What is
the mean and standard deviation of certain
variables? Are the data normally
distributed? You can use summary
statistics and visual aids, such as
histograms and box plots, to help you see
the distribution of your data.
Proc Univariate
PROC UNIVARIATE is a procedure in SAS that
provides summary statistics on any quantitative
variable.
We will create a data set called “demo” which
contains the weights of 57 day care children.
Copy the values from the file
http://www.biostat.umn.edu/~susant/PH6415DATA/
demo.txt and paste them into your SAS Editor
window with the following SAS code:
Example of Proc Univariate
DATA demo;
INPUT weight;
DATALINES;
68
63
…
12
;
RUN;
TITLE 'Proc Univariate';
PROC UNIVARIATE DATA = demo normal plot;
    VAR weight;
    histogram weight / midpoints = 10 to 80 by 5 normal;
RUN;
Notes about code:
• Include every observation in your code; the
example uses (…) to stand in for the observations
omitted here
• normal requests tests for normality
• plot requests stem-and-leaf plots, box plots, and
a normal probability plot
• VAR weight requests that SAS perform PROC
UNIVARIATE only on the variable weight. This
statement is useful when you have numerous
variables; if you do not specify which variables to
analyze, SAS will perform PROC UNIVARIATE
on every numeric variable, which generates a lot of
unnecessary output.
• histogram weight requests a histogram for the
variable weight. The options after the / specify
you want the midpoints of each bar in the
histogram to range from 10 to 80 and be 5 lbs.
apart. This is not necessary, but it gives you
more control over the appearance of your
histogram.
• normal, in the histogram statement, requests a
normal curve be drawn over the histogram,
which gives a visual comparison of what the
data should look like if they are normally
distributed.
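As an alternative to pasting every value after DATALINES, the data set can be read directly from a saved copy of the file. The following is a minimal sketch, assuming demo.txt has been saved locally with one weight per line and no header row; the path shown is hypothetical and must be changed to match your system.

```sas
/* Sketch only: the file path below is a hypothetical location. */
DATA demo;
    INFILE 'C:\mydata\demo.txt';  /* assumed local copy of demo.txt */
    INPUT weight;                 /* assumes one weight per line, no header */
RUN;
```

The PROC UNIVARIATE step is unchanged, because the resulting data set is still called demo and still contains the single variable weight.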
Run the PROC Univariate!
• Check your Log; are there any warnings or
errors? Do you have all 57 observations?
• Notice that a new window, GRAPH1, has
opened. This contains your histogram with
normal curve.
• By looking at the histogram, you see that the
largest number of observations falls to the left of
the center of the normal curve. This indicates
that the data may be skewed to the right.
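Besides reading the Log, the observation count can be checked with a quick PROC MEANS step. This is a sketch, assuming the data set is named demo as above; N counts the non-missing values of weight and NMISS counts the missing ones.

```sas
/* Sketch: confirm that all 57 weights were read in. */
PROC MEANS DATA = demo N NMISS MIN MAX;
    VAR weight;   /* N should be 57 and NMISS should be 0 */
RUN;
```

If N is smaller than 57, or NMISS is greater than 0, some values were probably lost or misread when the data were pasted in.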
Histogram in SAS
Output Window
• Look at the Output. Can you find the mean,
variance, and standard deviation of weight?
How about the median?
• Notice that the mean (36.68) is larger than the
median (32.0); this is another indication that the
data may be skewed to the right.
• Under the heading “Tests for Normality” you will
see four tests of normality (Shapiro-Wilk,
Kolmogorov-Smirnov, Cramer-von Mises, and
Anderson-Darling). All have a p-value <
0.05, so we reject Ho and conclude that the data are
NOT normally distributed. (Ho: data are from a
normal distribution; Ha: data are not from a
normal distribution.)
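If the full output is hard to scan, an ODS SELECT statement can limit it to the tables discussed here. This is a sketch, assuming the data set demo from above; Moments, BasicMeasures, and TestsForNormality are the ODS names of the PROC UNIVARIATE tables that hold the mean, variance, and standard deviation, the median, and the normality tests.

```sas
/* Sketch: print only the tables needed for this exercise. */
ODS SELECT Moments BasicMeasures TestsForNormality;
PROC UNIVARIATE DATA = demo normal;
    VAR weight;
RUN;
```

The ODS SELECT statement applies only to the procedure step that follows it, so later procedures print their full output again.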
Plots
• The output also includes a stem-and-leaf
plot and a box plot. The stem-and-leaf
plot resembles the histogram in shape.
The box plot shows the mean (+) greater
than the median (the middle *------*), and it
also shows an outlier (0), weight 79, which
could be the extreme observation
responsible for the skewed data.
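A higher-resolution box plot can also be drawn outside PROC UNIVARIATE. This is a sketch, assuming a SAS release that includes PROC SGPLOT (9.2 or later) and the data set demo from above; it is an alternative to the text-based box plot, not part of the original example.

```sas
/* Sketch: graphical box plot of weight (assumes PROC SGPLOT is available). */
TITLE 'Box Plot of Weight';
PROC SGPLOT DATA = demo;
    VBOX weight;   /* vertical box plot; outliers are drawn as separate markers */
RUN;
```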
Plots, continued
• The output also contains a normal
probability plot. This plots the values we
actually observe (*) along the y-axis against
the values we would expect under normality
along the x-axis. The plus signs (+) mark a
straight reference line. If the data are
normally distributed, the asterisks should
fall along this reference line.
• The plot shows some points do not fall on
the reference line, again indicating possibly
skewed data.
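A graphical version of the same comparison can be requested with the QQPLOT statement in PROC UNIVARIATE. This is a sketch, assuming the data set demo from above; MU=EST and SIGMA=EST ask SAS to draw the reference line using the sample mean and standard deviation.

```sas
/* Sketch: normal quantile-quantile plot with a reference line. */
PROC UNIVARIATE DATA = demo;
    VAR weight;
    QQPLOT weight / NORMAL(MU = EST SIGMA = EST);
RUN;
```

Points that bend away from the reference line at either end suggest skewness or heavy tails, just as in the text-based plot.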
Normal Probability Plot
Conclusions
• The weights of the 57 day care children do not
appear to come from a normal distribution.
• Proc Univariate is a valuable tool for
creating summary statistics for the data.
• Proc Univariate can be used to generate plots
and graphs of the data, in order to determine
whether the data come from a normal
distribution. It also provides formal tests of
normality when you add the “normal” option to the
PROC UNIVARIATE statement.
• Many statistical methods assume normality, so it
is often necessary to check whether the data
are normally distributed before analyzing them.
