Computing for Research I
Spring 2012
Introduction to Stata
February 14
Primary Instructor:
Elizabeth Garrett-Mayer
• Stata is a powerful statistical package with
– smart data-management facilities
– a wide array of up-to-date statistical techniques,
– an excellent system for producing publication-quality graphs
• Stata is fast and easy to use
• Current version is Stata 11.
• Stata vs. Stata SE
– “standard” stata can handle up to 2047 variables
– SE can handle 32767 variables
– Number of observations is limited by your computer (up to
2 billion!)
Stata Interface
• Multiple Windows
• Other windows
Data editor
Data viewer
Stata Interface
Customizable windows
Can be resized
Edits to preferences are ‘remembered’
You can save (then load) different preferences.
Command line driven
But more recently, drop-down menu
Important Details
case sensitive!
return means ‘run’. there is no little running man to click.
you cannot run commands if your data editor is open
you need to ‘clear’ data before you bring in more data
you can only have one dataset active at a time
Save yourself some typing (and errors)
– Utilize the variables window
– Utilize the ‘command’ window
• abbreviations work for commands and variable names!
– d instead of describe
– case instead of caseid
– NOT always, but if they uniquely identify a variable name or command, they work
– Also true for some options.
– See Stata help files for how short you can go on abbreviations
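A short sketch of abbreviations in practice. It assumes Stata's shipped example dataset auto.dta (loaded with sysuse), which is not part of the course data, and it is meant to be run from a do-file so that the // comments are accepted.

```stata
* auto.dta is Stata's built-in example dataset (an assumption for illustration)
sysuse auto, clear
d                 // same as: describe
su mpg            // same as: summarize mpg
tab fore          // "fore" uniquely identifies the variable foreign
```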
• The most important part
• Two interactive options:
– help ‘command’
– help ‘search’
• Also LARGE pdfs that link from help files
• Plus:
link to Stata
command line help
No data?
• There are lots of things you can do without
data in stata!
• “immediate” commands
– An immediate command is a command that
obtains data not from the data stored in memory
but from numbers typed as arguments.
– Immediate commands, in effect, turn Stata into a
glorified hand calculator.
Some immediate commands
bitesti                Binomial probability test
cci, csi, iri, mcci    Tables for epidemiologists; see [ST] epitab
cii                    Confidence intervals for means, proportions, counts
prtesti                One- and two-sample tests of proportions
sampsi                 Sample size and power determination
sdtesti                Variance comparison tests
symmi                  Symmetry and marginal homogeneity tests
tabi                   One- and two-way tables of frequencies
ttesti                 Mean comparison tests
twoway pci             Paired-coordinate plot with spikes or lines
twoway pcarrowi        Paired-coordinate plot with arrows
twoway scatteri        Twoway scatterplot
see ‘help immediate’ for more information
Some examples
display 4.1-1.96*0.3
tabi 100 34 \ 17 294
tabi 100 34 \ 17 294, col
tabi 100 34 \ 17 294, col row cell chi
cci 100 34 17 294
cci 100 34 17 294, exact
sampsi 0.2 0.5
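A few more immediate commands, as a sketch. All of the numbers below are made up for illustration; none come from the course data.

```stata
* Two-sample t test from summary statistics: n, mean, SD for each group
ttesti 20 5.2 1.1 25 4.6 1.3
* One-sample test of a proportion: n, observed proportion, null value
prtesti 50 0.52 0.70
* Exact binomial test: 10 trials, 3 successes, null probability 0.5
bitesti 10 3 0.5
```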
But most of the time, we have datasets
• *.dta files are stata datasets
• To open:
– Option 1: use the “use” command:
• use "I:\MUSC Oncology\Cunningham,
– Option 2: menu-driven open
• File > Open…
• If you use Option 2, the associated command will
appear in your results window AND in your review window
• If you use Option 2, consider cutting and pasting the
command into your ‘do’ file for next time.
Other types of data?
• Stata can import
– ASCII files
– Sas export
– and a few others (that I have never heard of)
• Two options:
– menu-driven: File > Import…
– insheet command can be used for ascii files
• insheet using sampledata.csv, comma
• insheet using sampledata.csv, tab
– you can use any separator (use the
delimiter("char") option)
Two notes on opening files
• if you use the command line, you will have to either add
clear at the end of the line to clear the current dataset, or
type clear as a command prior to opening the new file
– insheet using sampledata.csv, comma clear
– clear
– insheet using sampledata.csv, comma
• you can use the cd command to tell Stata where to browse
for your file(s), instead of giving long path names. This is
particularly helpful if you are merging files from the same
folder
– cd "I:\Classes\StatComputingI"
Example: SC breast cancer registry
data from 2004
• All diagnoses of breast cancer in SC are
• N = 2633; 55 variables
• Demographic and clinical information
• Let’s read it in and explore it
– use cd
– use insheet
– use ‘use’
Exploring your dataset
describe (can be abbreviated ‘d’)
– a very good idea to make sure things look right
– tells you about types of variables, number of observations and number of variables
codebook
– summary per variable
– useful for seeing number of uniques and missings
summarize (can be abbreviated ‘sum’)
– statistical summary (N, mean, SD, etc.)
– only works on numerically coded variables
– sum, detail
inspect
– similar to codebook.
– provides rough histogram and neg, pos, missing
– all of these commands can be used with or without a varlist (e.g. sum race age)
– to ‘quit’ long output, type ‘q’ and Stata will stop sending output to the results window
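The four exploration commands side by side, as a sketch. It assumes the built-in example dataset auto.dta rather than the registry data, and should be run from a do-file.

```stata
* auto.dta is Stata's built-in example dataset (an assumption for illustration)
sysuse auto, clear
describe                 // variable names, storage types, labels, N
codebook rep78           // unique values and missing count for one variable
summarize price          // N, mean, SD, min, max
summarize price, detail  // adds percentiles, skewness, kurtosis
inspect rep78            // rough histogram; counts of negative, zero, positive, missing
```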
Exploring your dataset
Open dataset in editor or browser
Difference? edit capabilities
Allows you to sort
Variables manager (can access from viewer or
main toolbar)
– allows you to add labels simply
– includes coding
• Categorical variables can be summarized using
tabulate (tab) or table
– tab race
– table race
• list can help with a small dataset, or to look
at a subset of the dataset
– list race age if age<30
• Can also sort at command line
– sort age
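The same commands on a dataset anyone can load, as a sketch. It assumes the built-in auto.dta example data; the variable names below belong to that dataset, not to the registry.

```stata
sysuse auto, clear
tab foreign               // frequencies, percent, cumulative percent
tab rep78, missing        // show missing values as their own row
table foreign             // frequencies only, in table's layout
list make mpg if mpg>30   // look at a subset
sort mpg
list make mpg in 1/5      // the five lowest-mpg cars after sorting
```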
Interactive command line driven?
• Well, there is a little running man, after all!
open a ‘do’ file
enter all of your commands in the do file
you can select one or more to run at a time
SAVE your do file!!!
• Window > Do-file Editor
• how to include comments? * or /*…*/
* this is how we can make a table of race and ER
tab race ercat
/* our table looks very nice.
we should really make pretty tables all the time */
Do file of our commands so far
* slide 13: reading in data
cd "I:\Classes\StatComputingI"
insheet using "SCBC2004.csv", comma clear
use SCBC2004.dta, clear
* slide 14: exploring our dataset
* use d or describe
d ercat
codebook dodyr
sum ercat
codebook ercat
* slide 16: more exploration
tab race
table race
list race age if age<30
sort age
What about the output?
Sometimes you want to have a file that shows the results
Useful to share with investigators(?)
Nice to have output saved
My preference? keep a really good ‘do’ file and rerun it.
Log file setup steps:
– File > Log > Begin
– analyze data, etc.
– File > Log > Suspend (or End)
• Options for text (.log) or formatted (.smcl) files
– *.log can be opened in text editor
– *.smcl can only be opened in stata but looks nicer (and can be
printed from stata)
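The menu steps above have command equivalents that can live in a do-file. A minimal sketch, where mylog is a hypothetical file name and auto.dta stands in for real data:

```stata
* mylog is a hypothetical file name
log using mylog, text replace   // writes mylog.log; use smcl instead of text for a formatted log
sysuse auto, clear
summarize price mpg
log off                         // suspend logging
describe                        // this output is not written to the log
log on                          // resume logging
log close
```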
Getting stuff out of Stata
• Stata can be good for data management
• I prefer it to R
– step 1: data management in Stata
– step 2: write ‘clean’ file from Stata to csv
– step 3: read clean file into R
• Exporting:
– menu-driven: File > Export
– command line:
outsheet [varlist] using "file.csv", comma
**for command line, may need “replace” as an option if you
already have a file of the same name you want to replace.
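A concrete export, as a sketch. The output file name auto_subset.csv is hypothetical, and auto.dta is Stata's built-in example dataset.

```stata
sysuse auto, clear
* write three variables to a comma-separated file, overwriting any earlier copy
outsheet make price mpg using "auto_subset.csv", comma replace
```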
Saving Stata Data
• File > Save or Save as
• Command line:
– save "filename", replace
– save filename
– save filename.dta
– .dta will be added
– replace may be needed or not
What if you don’t want to save or export everything?
• You can use keep and drop commands to keep or drop
observations or variables before exporting/saving
• We want to analyze ER, PR status, stage, age and grade in
African American women.
– drop if race==1
– keep ercat prcat stagen age grade
• These observations and variables are GONE from
Stata’s memory
• If you want them back, you need to reload the original dataset
• BE CAREFUL: do NOT drop variables or observations
and then overwrite original data!
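One way to subset safely is to wrap the keep and drop steps in preserve and restore and to save under a new name. A sketch assuming the built-in auto.dta data and a hypothetical output file auto_foreign.dta:

```stata
sysuse auto, clear
preserve                           // snapshot of the data in memory
keep if foreign==1                 // keep only some observations
keep make price mpg                // keep only some variables
save "auto_foreign.dta", replace   // new file name, so the original is untouched
restore                            // the full dataset is back in memory
count
```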
Other options for subsetting
• by: performs command by categories
– by race, sort: sum age
– bysort ercat prcat: sum age
• if: performs command in a category/range
– tab ercat if stagen>1
– tab ercat if graden~=.
• Combine them:
– bysort ercat prcat: sum age if ercat<9 & prcat<9
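The same by and if patterns on the built-in auto.dta example data, as a sketch. The variables and cutoffs are chosen only for illustration.

```stata
sysuse auto, clear
bysort foreign: summarize mpg                            // one summary per category
tab rep78 if mpg>20                                      // restrict to a range
tab rep78 if price~=.                                    // restrict to non-missing price
bysort foreign: summarize mpg if rep78<. & price<10000   // by and if combined
```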
Working with variables
• new variables can be created with the
‘generate’ command (or just ‘gen’)
• Example: grade has 4 levels
. tab graden
     graden |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |
          2 |
          3 |
          4 |
------------+-----------------------------------
      Total |
• We want to create high vs low grade variable
Several approaches
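• Approach 1:
– gen highgrade = 1 if graden>2 & graden<.
– replace highgrade = 0 if graden<3
• Approach 2:
– gen highgrade=cond(graden>2,1,0)
– replace highgrade = . if graden==.
• Note well: Check coding of missing values!! Stata treats missing (.) as
larger than any number, so graden>2 is true when graden is missing.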
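Why the missing-value check matters can be seen on the built-in auto.dta data, where rep78 has some missing values. A sketch; highrep is a hypothetical variable name.

```stata
sysuse auto, clear
count if rep78>3                           // also counts observations where rep78 is missing
count if rep78>3 & rep78<.                 // excludes them
gen highrep = rep78>3 if !missing(rep78)   // 1/0 indicator; missing stays missing
tab highrep, missing
```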
Extensions to generate
• ‘egen’
• Same example: egen has a function ‘cut’
that can cut a continuous variable at a list of breakpoints
• categories are defined by < each breakpoint
egen highgrade=cut(graden), at(-1,3,5)
egen highgrade=cut(graden), at(-1,3,5) icodes
• use generate for transformations
– gen y = log(x)
– gen y = x^2
• generate random variables
– gen z1 = runiform()
– gen z2 = 2 + 2*runiform()
• generate ascending observation id
– gen id= _n
– bysort county: gen countyid=_n
Example of using these commands together
• We want to randomly select 10 women from
each of 46 counties in SC
• Step 1: generate random numbers
– gen z1=runiform()
• Step 2: sort and number women within counties
– sort county z1
– by county : gen countyid=_n
• Step 3: keep only 10 women per county
– drop if countyid>10
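The three steps as one do-file fragment, as a sketch. It assumes the registry data are already in memory with a variable named county. The set seed line is an addition not on the slide, and the seed value is arbitrary; it only makes the random draw reproducible.

```stata
* Assumes the registry data are in memory and contain a variable named county
set seed 20120214          // arbitrary seed (an addition for reproducibility)
gen z1 = runiform()        // Step 1: random number per woman
sort county z1             // Step 2: random order within county
by county: gen countyid = _n
drop if countyid > 10      // Step 3: keep the first 10 per county
tab county                 // counties with fewer than 10 women keep all of them
```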
Formatting Dates
• Dates do not always maintain formatting, especially
when reading data from csv files
• Two steps: generate and format
• Example stata syntax
– gen newdate=date(datevar, "MDY")
– format newdate %td
• Stata treats dates as integers (formatting is like labels)
so they can be manipulated
• Month, day and year can be extracted
• Also, see clock
• There are a lot of details that can be found in the help files
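A self-contained sketch of the generate-and-format steps, including extracting month, day and year. The two dates are made up; datevar stands in for a date that arrived as a string from a csv file.

```stata
clear
input str10 datevar
"02/14/2012"
"12/31/2011"
end
gen newdate = date(datevar, "MDY")   // string to Stata date (an integer)
format newdate %td                   // display the integer as a date
gen yr = year(newdate)
gen mo = month(newdate)
gen dy = day(newdate)
list
```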
Reshaping Data
• In Stata there is one command to reshape IF
your data is in the right format.
• From long to wide:
– i indexes the observation (e.g., patient, hospital)
– j indexes the repeats (e.g., year, cycle, visit)
– Also need to list which variables vary by j
Example: ceramide data
• Clinical trial in cancer patients
• Ceramide (et al.) were measured every two cycles
in patients
• Of interest: do changes in ceramide correlate
with outcome (e.g., response, survival)?
• Data provided in long format
i is patient_id
j is cycle
Ceramide, etc. vary per patient
Some variables are constant (and Stata can figure it out)
Reshaping ceramide data
• reshape wide collecteddate frombaselines1p, i(patient) j(cycle)
• reshape long: once Stata reshapes data
in its recent memory, it can reshape again
without any options
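A toy version of the round trip, as a sketch. The patient, cycle and level values are made up and only mimic the shape of the ceramide data.

```stata
clear
input patient cycle level
1 1 10
1 3 12
2 1 8
2 3 9
end
reshape wide level, i(patient) j(cycle)   // creates level1 and level3, one row per patient
list
reshape long                              // back to long; no options needed
list
```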
Reshaping wide to long
• Much more common
• Many researchers “grow” their datasets by
columns instead of rows
• Formatting needs to be specific
– Variable names must have numeric suffix
– Could require a fair amount of editing
– Depends on how many repeats and variables
there are
Reshaping wide to long
insheet using "ceramide2.csv"
rename cycle1totalceramidelevels totalceramidelevels1
rename cycle1diseasestatus diseasestatus1
rename cycle1c18ceramide c18ceramide1
rename cycle3totalceramidelevels totalceramidelevels3
rename cycle3diseasestatus diseasestatus3
rename cycle3c18ceramide c18ceramide3
rename cycle5totalceramidelevels totalceramidelevels5
rename cycle5diseasestatus diseasestatus5
rename cycle5c18ceramide c18ceramide5
rename cycle3daysfromstart daysfromstart3
rename cycle5daysfromstart daysfromstart5
reshape long daysfromstart diseasestatus totalceramidelevels ///
    c18ceramide , i(patient) j(cycle)
drop if totalcerami==.
replace daysfromstart=0 if cycle==1
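With more cycles or variables, the block of rename lines could be collapsed into a loop. A sketch, assuming every column follows the cycle<number><name> pattern used above. Note that capture suppresses any error from rename, not only the error for a column that does not exist (there is no cycle1daysfromstart here), so check the result with describe afterwards.

```stata
* Assumes ceramide2.csv has been read in and columns are named cycle<k><name>
foreach c of numlist 1 3 5 {
    foreach v in totalceramidelevels diseasestatus c18ceramide daysfromstart {
        capture rename cycle`c'`v' `v'`c'   // skip combinations that do not exist
    }
}
describe
```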
