### datastories - The Julia Group

```
SEEING IS BELIEVING:
Telling stories with statistics – in pictures
We’re failing
Do you see the same thing here?
              Gender
Military    Male    Female
--------    ----    ------
No           943     1,222
Yes          227        72
This is your brain on statistics
              Gender
Military    Male    Female
--------    ----    ------
No           943     1,222
Yes          227        72
The total sample is (roughly) evenly divided by gender.
Subtracting 72 from the 150 one would expect gives a
value of about 80, which squared is 6,400.
It is already obvious this is significant.
Just for closure ...
e       o       (e-o)^2    ((e-o)^2)/e
157     72      7225       46.01910828
142     227     7225       50.88028169
1028    943     7225       7.028210117
1137    1222    7225       6.354441513
Total                      110.2820416
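/* A minimal sketch, not from the original slides: the same 2 x 2 table   */
/* run through PROC FREQ. The dataset name MILITARY and the variable      */
/* names are assumptions made for illustration. PROC FREQ works from      */
/* unrounded expected counts, so its chi-square will differ slightly      */
/* from the hand total, which uses expected counts rounded to whole       */
/* numbers.                                                               */
data military ;
  input gender $ served $ count ;
  datalines ;
Male No 943
Female No 1222
Male Yes 227
Female Yes 72
;
run ;

proc freq data = military ;
  weight count ;                              /* each row is a cell count  */
  tables served * gender / chisq expected ;   /* test plus expected counts */
run ;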
Seeing is a learned skill
Statisticians may
see things in a
picture others
don’t
My points
(surprisingly, I do have some)
Data Visualization
• Graphics do not necessarily stand alone
Data visualization is all around us.
Visual representation in one context is
often misapplied to another.
Atomic numbers on your socks?
Data visualization needs to ADD
information
Basic Assumptions
• Our audience needs to be taught to read
visual data just as we read numeric data, and
we need to learn to have some discussion
beyond the choices of line graphs vs. pie
charts
You learned to read numbers
YOU NEED TO LEARN TO WRITE PICTURES
Or, to be more specific, you need to explain to others what you see in
pictures
Question + Data > Picture = Story
?
Bad visualization for one question can
be good for another
• Who will win the
election?
• Which regions support
the Democrats?
Poll dataset did not include Hawaii or Alaska
AN EXAMPLE OF PROGRAM EVALUATION
DATA VISUALIZATION BY EXAMPLE
The government is smarter than you
think
(No, I’m serious)
Was the program implemented as planned?
(This was done in JMP)
Did the program work?
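GOPTIONS HBY = 2 ;                       /* height of the BY-line text */
PROC GPLOT DATA = wussexample UNIFORM ;  /* UNIFORM: same axis scaling in every BY group */
  PLOT z_total_post * z_total_pre / VREF = 0 ;  /* reference line at post-test z = 0 */
  BY group ;                             /* data must be sorted by group */
RUN ;
QUIT ;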
NOTE: Regression equation : z_total_post =
0.13379 + 0.776552*z_total_pre.
NOTE: The above message was for the following BY
group: group=CONTROL
NOTE: Regression equation : z_total_post =
1.233616 + 0.578418*z_total_pre.
NOTE: The above message was for the following BY
group: group=EXPERIMENTAL
EQUATIONS IN THE SAS LOG FOR THE
STATISTICIAN IN YOU
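/* The PLOT code on the slide does not show what requested the fitted     */
/* lines. One way to get regression equations like these written to the   */
/* log is a SYMBOL statement with INTERPOL=RL placed before PROC GPLOT.   */
/* This is an assumption about the original program, not a quote from it. */
SYMBOL1 INTERPOL = RL VALUE = DOT ;   /* RL: linear regression line through the points */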
Is the intervention successful under all
conditions?
Admittedly, we did not train people while flying on a trapeze
TRAINING WAS ADMINISTERED TO
FOUR COHORTS
Creating the interaction graph
First, in the RESULTS window, type
sgedit on
ods listing sge = on ;   /* make ODS graphs editable in the ODS Graphics Editor */
ods graphics on ;
proc glm data = plots ;
  class TestType cohort ;
  model z_total = TestType cohort TestType*cohort ;
  where group = "EXPERIMENTAL" ;
run ;
quit ;
Click on the sge plot to edit it
Of course, that is kind of like being the smaller midget
ODDLY, THE MOST TIME-CONSUMING
PART OF THIS IS MAKING THE LINES
THICKER
Using SGEDIT to, well, edit
1. Double-click on the .sge
file in the RESULTS window
2. Right-click in the plot area
& select PLOT PROPERTIES
3. Select desired line
thickness
Yes, the TestType*Cohort*Group interaction (F=5.84, p < .0001) AND the
TestType*Group interaction (F=22.92, p < .0001) in the other repeated
measures ANOVA were significant.
LOOKING AT THE LITTLE PICTURE
(Especially true for small samples)
Are these tests related?
R=.22
Look!
Another example
• Years of Education as
predictor of gain score
• R-square = .46
(Correlation = .68)
• p < .01
Now looky here …
Is it a real relationship?
What should we do?
Throw the score out?
Keep the score in?
Something else?
Ignoring my partner …
with the people next to
you
Sometimes outliers are the most
interesting part of your study
PROC CORR
One last example on knowing
Not just telling a story,
having a conversation
PROC FREQ
Custom Map-making
How to plot the largest category in a
frequency distribution
1, 2, 3
1. PROC TABULATE -> output dataset
2. PROC FORMAT
3. PROC GMAP
WHERE IS DEMOCRATIC SUPPORT BASED?
DATA VISUALIZATION IN POLITICAL SURVEYS
DATA VISUALIZATION BY EXAMPLE
PROC TABULATE DATA = in.VOTE2008 OUT = SummaryVOTE2008 ;
  CLASS question3 state ;
  TABLE state, question3 * RowPctN ;
RUN ;
WARNING: Some observations were discarded when
charting PctN_01. Only first matching observation was
used. Use STATISTIC= option for summary statistics.
proc format ;
  value vote
    50.01 - 100 = "Obama"
    0 - 50      = "McCain" ;
run ;
PROC GMAP
DATA = SummaryVOTE2008 map = maps.us ;
ID state ;
CHORO PctN_01 / discrete LEGEND=LEGEND1 ;
ID statement uses the _map_geometry_ variable
that was merged in from the maps.us dataset
to identify the location on the map.
PROC GMAP DATA = SummaryVOTE2008 map = maps.us ;
  ID state ;
  CHORO PctN_01 / discrete LEGEND=LEGEND1 ;
  Pattern1 c = red ;
  Pattern2 c = blue ;
  format PctN_01 vote. ;
RUN ;
QUIT ;
PROC GMAP
CHORO PctN_01 / discrete LEGEND=LEGEND1 ;
FORMAT PctN_01 vote. ;
CHORO statement uses the first observation and
ignores the others.
Does Race Matter?
PROC GMAP
Vote2008 coded 0 = McCain
1 = Obama
Pctmin = Percentage of residents in voter’s
district from minority groups
PROC GMAP DATA = wuss map=maps.us ;
  ID state ;
  area vote2008 / discrete statistic = mean ;
  block pctmin / discrete statistic = mean ;
  format pctmin rangep. vote2008 voten. ;
RUN ;
QUIT ;
The BLOCK statement charts the pctmin
variable. The height of the block will be based
on the value of the variable, but the color will
be determined using the format specified.
mean minority percentage in
districts where Obama voters live is
21% versus 13% for McCain voters
(t= 5.73, p < .0001)
The usefulness of visual data
With one statement, I can change the
percentage of minority & re-run the chart
proc format ;
  value rangep
    0 - 15      = "0 - 15%"
    15.01 - 100 = "> 15%" ;
run ;
Decision Trees, ROC & Lift Curves to
Predict Military Service
DATA VISUALIZATION BY EXAMPLE
Speaking of easy, interactive,
graphics
JMP
libname writeout xport "e\wuss2010\crimes.xpt" ;
proc copy in = readin out = writeout ;
run ;
How to get a SAS .xpt file into
JMP, Step 1
File > Open
DECISION TREE
• ANALYZE > MODELING > PARTITION
• SELECT Y
• SELECT X VARIABLES
• Click on the SPLIT button
Click on the red arrow at the top left of the
partition window for pull-down options, which
include ROC and Lift curves.
ROC
• Sensitivity is the true positive rate, for
example, the percentage of people who actually
died whom you predicted would die.
• Specificity is the true negative rate,
for example, the percentage of people who
survived whom you predicted would NOT die.
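/* A minimal sketch of how the two rates are computed, using made-up      */
/* counts. The numbers and variable names are illustrative, not from the  */
/* study. An ROC curve plots sensitivity against 1 - specificity.         */
data _null_ ;
  tp = 40 ;   /* died, predicted to die          */
  fn = 10 ;   /* died, predicted to survive      */
  tn = 80 ;   /* survived, predicted to survive  */
  fp = 20 ;   /* survived, predicted to die      */
  sensitivity = tp / (tp + fn) ;   /* share of actual deaths caught     */
  specificity = tn / (tn + fp) ;   /* share of actual survivors cleared */
  put sensitivity= percent8.1 specificity= percent8.1 ;
run ;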
Comparing models
In JMP, use of training and testing
datasets is REALLY easy
EXCLUDE 25% or 50% of the data and
then re-run your analyses with the
excluded sample
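/* For readers working in SAS rather than JMP, a comparable holdout split */
/* can be sketched with PROC SURVEYSELECT. The dataset name CRIMES, the   */
/* output name SPLIT and the seed are assumptions for illustration.       */
proc surveyselect data = crimes out = split
                  method = srs samprate = 0.25 outall seed = 2010 ;
run ;
/* OUTALL keeps every row and adds Selected = 1 for the sampled 25%,      */
/* which can serve as the test set, and Selected = 0 for the rest.        */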
A statistician is a person who was good
at math but didn’t have enough
personality to be an accountant?
It is important that people
believe you
And that’s my story
AnnMaria De Mars
The Julia Group
2111 7th St #8
Santa Monica, CA 90405
(310) 717-9089
```