Welcome to the R intro Workshop
Before we begin, please download the
“SwissNotes.csv” and “cardiac.txt” files from the
ISCC website, under the R workshop (more info).
www.iub.edu/~iscc
Introduction to R
Workshop in Methods from the Indiana Statistical Consulting
Center
Thomas A. Jackson
February 15, 2013
Overview
The R Project for Statistical Computing
http://cran.r-project.org
“R is a language and environment for statistical
computing and graphics. It is a GNU project which is
similar to the S language and environment which was
developed at Bell Laboratories (formerly AT&T, now
Lucent Technologies) by John Chambers and colleagues.
R can be considered as a different implementation of S.
There are some important differences, but much code
written for S runs unaltered under R.”
- Description from CRAN Website
Benefits
R…
• is free
• is interactive: we can type something in and work
with it
▫ How we analyze data can be broken into small steps
• is interpreted: we give it commands and it
translates them into mathematical procedures or
data management steps
• can be used in a batch: nice because it is
documented
• is a calculator: it is unlike other calculators
though because you can create variables and objects
Let’s Get R Started
• How to open R
→ Start Menu
→ Programs
→ Departmentally Supported
→ Stat/Math
→ R
Graphical User Interface (GUI)
Three Environments
• Command Window (aka Console)
• Script Window
• Plot Window
Command Window Basics
To quit: type q()
Save workspace image? Saving moves it from memory to the hard drive
Storing variables in memory
• <- , -> , or =
• a <- 5 stores the number 5 in the object “a”
• pi -> b stores the number π = 3.141593 in “b”
• x = 1 + 2 stores the result of the calculation (3) in “x”
• “=” assigns only to the name on its left (2 = x is an error)
Try not to overwrite built-in names such as t, c, and pi!
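A minimal sketch of the three assignment forms, and of why overwriting a built-in name is risky. The name pi_backup is made up for this illustration.

```r
# Three equivalent ways to store a value in an object
a <- 5        # leftward arrow (the most common form)
pi -> b       # rightward arrow: value on the left, name on the right
x = 1 + 2     # equals sign: the name must be on the left

# R lets us overwrite a built-in name, which silently masks it
pi_backup <- pi   # hypothetical helper name, keeps the true value
pi <- 3           # our own "pi" now hides the built-in constant
rm(pi)            # removing our copy makes the built-in visible again
pi == pi_backup   # TRUE
```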
Command Window Basics
Printing to output
• Calculations that are not stored print to output
> 3 + 5
[1] 8
• Type name to view stored object
> a
[1] 5
• Use print()
> print(a)
[1] 5
View objects in workspace
• objects() or ls()
Command Window Basics
Clearing the console (command window)
• Mac: Edit → Clear Console
• Windows: Edit → Clear Console
or
• Mac: Alt + Command + L
• Windows: Ctrl + L
Removing variables from memory
• rm() or remove()
> x <- 4
> rm(x)
• rm(list = ls()) remove all variables
Script Window Basics
Saving syntax (code)
• Mac: File → New
• Windows: File → New Script
Documenting code: # comments out everything
after it on the line
Running code from Script Window
• Mac: Command + Enter
• Windows: F5 or Ctrl + r
Working Directory
Obtaining working directory
• getwd()
• Mac: Misc → Get Working Directory
• Windows: File → Change dir...
Changing working directory
• setwd()
• Mac: Misc → Change Working Directory
• Windows: File → Change dir...
Path Names
Specify with forward slashes or double
backslashes
Enclose in single or double quotation marks
Examples
• setwd("C:/Program Files/R/R-2.6.1")
• setwd('C:\\Program Files\\R\\R-2.6.1')
R Help
Helpful commands
• If you know the function name: help() or ?
> help(log)
> ?exp
• If you do not know the function name:
help.search() or ??
> help.search("anova")
> ??regression
Documentation
Elements of a documentation file
• Function{Package}
• Description
• Usage: What your code should look like; “=”
marks an argument's default value
• Arguments: Inputs to the function
• Details
• Value: What the function will return
• See Also: Related functions
• Examples
Online Resources
• CRAN Website: http://cran.r-project.org/
• R Seek: http://www.rseek.org/
• Quick-R tutorial: http://www.statmethods.net/
• R Tutor: http://www.r-tutor.com/
• UCLA: http://www.ats.ucla.edu/stat/r/
• R listservs
• Google
Google tip: include “[R]” (instead of just “R”) with
search topic to help filter out non-R websites
Additional Packages
Over 2,500 listed on the CRAN website!
• Use with caution
• Initial download of R: base, graphics, stats, utils
1) Installing a package:
• Mac: Packages & Data → Package Installer
Use Package Search to locate and press ‘Install Selected’
• Windows: Packages → Install Packages
Locate desired package and press ‘OK’
• install.packages("MASS")
2) Using an installed package:
You MUST call it into active memory with library()
> library(MASS)
Data Structures
R has several basic types (or “classes”) of data:
• Numeric - Numbers
• Character – Strings (letters, words, etc.)
• Logical – TRUE or FALSE
• Vector
• Matrix
• Array
• Data Frame
• List
NOTE: There are other classes, but these are most
common. Understanding differences will save you some
headache.
Data Structures
• Find class of data
• Unknown class: class()
• Check particular class: is.classname()
> a <- 5
> class(a)
[1] "numeric"
> is.character(a)
[1] FALSE
Change class: as.classname()
> as.character(a)
[1] "5"
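As a small illustration of checking and changing classes, the sketch below also shows one thing the slide does not cover: coercing text that is not a number gives NA (with a warning). The object names a_chr and bad are made up for this example.

```r
a <- 5
class(a)                    # "numeric"
is.numeric(a)               # TRUE

a_chr <- as.character(a)    # change class: numeric -> character
class(a_chr)                # "character"

# Coercion that cannot succeed returns NA and raises a warning,
# which is suppressed here only to keep the output quiet
bad <- suppressWarnings(as.numeric("five"))
is.na(bad)                  # TRUE
```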
Vectors
Combine items into vector: c()
> c(1,2,3,4,5,6)
[1] 1 2 3 4 5 6
Repeat a number or a sequence of numbers: rep()
> rep(1,5)
[1] 1 1 1 1 1
> rep (c(2,5,7), times = 3)
[1] 2 5 7 2 5 7 2 5 7
Vectors
Sequence generation: seq()
> seq(1,5)
[1] 1 2 3 4 5
> seq(1,5, by = .5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Try 1:10 or 10:1
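A few more variations on these functions, as a sketch. The each and length.out arguments are standard arguments of rep() and seq(), but they are not shown on the slides.

```r
v <- c(1, 2, 3)

r_times <- rep(v, times = 2)          # repeat the whole vector: 1 2 3 1 2 3
r_each  <- rep(v, each = 2)           # repeat each element:     1 1 2 2 3 3

s_by  <- seq(0, 1, by = 0.25)         # fixed step size
s_len <- seq(0, 1, length.out = 3)    # fixed number of points: 0.0 0.5 1.0

down <- 10:1                          # the colon operator also counts down
```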
Matrices
Create matrix: matrix()
• 6 x 1 matrix: matrix(1:6, ncol = 1)
• 2 x 3 matrix: matrix(1:6, nrow =2, ncol =3)
• 2 x 3 matrix filling across rows first: matrix(1:6,
nrow = 2, ncol = 3, byrow = TRUE)
Create matrix of more than two dimensions
(array): array()
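To see what byrow changes, and what an array with three dimensions looks like, a short sketch (the object names are made up for illustration):

```r
m_col <- matrix(1:6, nrow = 2, ncol = 3)                # fills down columns
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fills across rows

m_col[1, ]    # first row when filled by column: 1 3 5
m_row[1, ]    # first row when filled by row:    1 2 3
dim(m_col)    # 2 3

# A 2 x 3 x 4 array: four "layers" of 2 x 3 matrices
arr <- array(1:24, dim = c(2, 3, 4))
arr[2, 3, 4]  # the last element, 24
```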
Lists
Create a list: list()
• Holds vectors, matrices, arrays, etc. of varying lengths
• Objects in the list can be named or unnamed
> list(matrix(0, 2, 2), y = rep(c("A", "B"), each = 2))
[[1]]
     [,1] [,2]
[1,]    0    0
[2,]    0    0

$y
[1] "A" "A" "B" "B"
Data Frame: specialized list that holds variables of same length
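The slides do not show how to get items back out of a list. One common way, sketched here, is double brackets for position or name and $ for named elements; the names lst and df are made up for this example.

```r
lst <- list(matrix(0, 2, 2), y = rep(c("A", "B"), each = 2))

first  <- lst[[1]]        # unnamed element, extracted by position
y_vals <- lst$y           # named element, extracted by name
third  <- lst[["y"]][3]   # third entry of y: "B"

# A data frame is a list whose elements (columns) all have the same length
df <- data.frame(x = 1:4, y = lst$y)
is.list(df)   # TRUE
length(df)    # 2, the number of columns
nrow(df)      # 4
```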
Data Frames
Create a data frame: data.frame()
• Like a matrix, holds specified number of rows and columns
> x <- 1:4
> y <- rep(c("A", "B"), each = 2)
> data.frame(x,y)
x y
1 1 A
2 2 A
3 3 B
4 4 B
• Unnamed variables get assigned names
> data.frame(1:2, c("A", "B"))
  X1.2 c..A....B..
1    1           A
2    2           B
Basic Operations
• Arithmetic: +, -, *, /
• Order of operations: ()
• Exponentiation: ^, exp()
• Other: log(), sqrt()
• Evaluate standard Normal density curve at x = 3
> x <- 3
> 1/sqrt(2*pi)*exp(-(x^2)/2)
[1] 0.004431848
Vectorization
R is great at vectorizing operations
• Feed a matrix or vector into an expression
• Receive an object of similar dimension as output
For example, evaluate the standard Normal density
at x = 0,1,2,3
> x <- c(0,1,2,3)
> 1/sqrt(2*pi)*exp(-(x^2)/2)
[1] 0.398942280 0.241970725 0.053990967 0.004431848
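As a check on the hand-typed formula, the built-in function dnorm() evaluates the same standard Normal density and is vectorized in the same way. The names by_hand and built_in are made up for this sketch.

```r
x <- c(0, 1, 2, 3)

by_hand  <- 1 / sqrt(2 * pi) * exp(-(x^2) / 2)  # formula applied to every element
built_in <- dnorm(x)                            # default is mean = 0, sd = 1

all.equal(by_hand, built_in)   # TRUE
length(by_hand)                # 4, same length as the input
```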
Logical Operations
• Compare: ==, >, <, >=, <=, !=
> a <- c(1,1,2,4,3,1)
> a == 2
[1] FALSE FALSE TRUE FALSE FALSE FALSE
• And: & or &&
• Or: | or ||
• Find location of TRUEs: which()
> which(a == 1)
[1] 1 2 6
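A short sketch combining comparisons. The single forms & and | work element by element on vectors; the double forms && and || are meant for single TRUE/FALSE values, such as the results of any() and all().

```r
a <- c(1, 1, 2, 4, 3, 1)

between <- a > 1 & a < 4     # FALSE FALSE TRUE FALSE TRUE FALSE
either  <- a == 1 | a == 4   # TRUE TRUE FALSE TRUE FALSE TRUE

which(between)               # positions 3 and 5
sum(a == 1)                  # TRUE counts as 1, so this counts the ones: 3

# && compares single values
ok <- any(a > 3) && all(a > 0)   # TRUE
```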
Subsetting
> a <- 1:5
> b <- matrix(1:12,nrow = 3)
Use Square brackets []
• Pick range of elements: a[1:3]
• Pick particular elements: a[c(1,3,5)]
• Do not include elements: a[-c(1,4)]
Subsetting (cont.)
Use commas in more than one dimension (matrices
& data frames)
• Pick particular elements: b[1:2,2:4]
• Give all rows and specified columns: b[,1:2]
• Give all columns and specified rows: b[1:2,]
• Note: b[2] treats the matrix as a vector and gives
the specified element
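The same objects can be used to try a few further cases, sketched below. Logical subsetting and the drop argument are standard R features, but they are additions to what the slides show.

```r
a <- 1:5
b <- matrix(1:12, nrow = 3)   # 3 x 4, filled down columns

big <- a[a > 2]               # logical subsetting: 3 4 5
one <- b[2, 3]                # row 2, column 3: 8
fifth <- b[5]                 # matrix treated as one long vector: 5

col_vec <- b[, 1]                 # a single column comes back as a plain vector
col_mat <- b[, 1, drop = FALSE]   # drop = FALSE keeps it as a 3 x 1 matrix
```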
Reading External Data Files
SwissNotes.csv Data set
• Compiled by Bernard Flury
• Contains measurements on 200 Swiss Bank
Notes
• 100 genuine and 100 counterfeit notes
Reading External Data Files (cont.)
Most general function: read.table()
read.table(file, header = FALSE, sep = "", ...)
• Creates a data frame
• File name must be in quotes, single or double
• File name is case sensitive
• Include the full path if the file is not in the working directory
> read.table("C:/Users/jacksota/Desktop/SwissNotes.csv", header = TRUE, sep = ",")
Don’t know the path? Try: file.choose()
> read.table(file.choose(), header = TRUE, sep = ",")
• sep defines the separator, e.g. "," or "\t" or ""
• header indicates variable names should be read from first row
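To try these arguments without the workshop files, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The two rows of values are invented for illustration and are not taken from SwissNotes.csv.

```r
# Write a small, made-up comma-delimited file to a temporary location
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(Length = c(214.8, 214.6),
                  Type   = c("Genuine", "Counterfeit"))
write.csv(toy, tmp, row.names = FALSE)

# Read it back: the first row holds variable names, fields are comma-separated
d1 <- read.table(tmp, header = TRUE, sep = ",")
d2 <- read.csv(tmp)   # read.csv() uses header = TRUE and sep = "," by default

names(d1)   # "Length" "Type"
nrow(d1)    # 2

unlink(tmp) # delete the temporary file
```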
Reading External Data Files
For comma delimited files: read.csv()
For tab delimited files: read.delim()
For Minitab, SPSS, SAS, STATA, etc. data:
foreign package
• Contains functions to read variety of file formats
• Functions operate like read.table()
• Contains functions for writing data into these
file formats
Data Frame Hints
• Identify variable names in data frame: names()
> data1 <- read.table("SwissNotes.csv", sep=",", header=TRUE)
> names(data1)
[1] "Length" "LeftHeight" "RightHeight" "LowerInner.Frame"
[5] "UpperInner.Frame" "Diagonal" "Type"
Assign name to data frame variables
> names(data1) <- c("Length", "LeftHeight", "RightHeight",
"LowerInner.Frame", "UpperInner.Frame", "Diagonal", "Type")
Note: names are strings and MUST be contained in quotes
Data Frame Hints (cont.)
Create objects out of each data frame variable:
attach()
In the Swiss Note data, to refer to Type as its own
object
> attach(data1)
> Type
[1] Genuine Genuine Genuine ...
Data Frame Hints (cont.)
Remove attached objects from workspace:
detach()
> detach(data1)
> Type
Error: object 'Type' not found
Note: Type is still part of original data frame, but
is no longer a separate object.
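Variables can also be reached without attaching anything, using $ or with(). This is a sketch of that alternative on a tiny made-up data frame; the values are invented and only the column names mirror the Swiss notes data.

```r
# Made-up stand-in for the real data frame
data1 <- data.frame(Length = c(214.8, 214.6, 214.9),
                    Type   = c("Genuine", "Genuine", "Counterfeit"))

# $ pulls one variable out of the data frame
types <- data1$Type

# with() evaluates an expression using the data frame's variables,
# without creating separate objects in the workspace
genuine_mean <- with(data1, mean(Length[Type == "Genuine"]))
genuine_mean   # 214.7
```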
plot() function
plot() is the primary plotting function
Calling plot will open a new plotting window
Documentation: ?plot
For complete list of graphical parameters to
manipulate: ?par
plot() function
Let’s visualize the SwissNotes.csv data.
After loading the data into R, attach the data frame using
attach(data).
Let’s try a scatter plot of LeftHeight by RightHeight.
>plot(LeftHeight, RightHeight)
plot() function
Change symbols: Option pch=.
See ?par for details.
>plot(LeftHeight,RightHeight,pch=2)
plot() Function
Change symbol color: Option col=
Specify by number or by name: col=2 or col="red"
Hint: Type palette() to see colors associated with number
Type colors() to see all possible colors
> plot(LeftHeight, RightHeight, col="red")
What types of points can we get?
plot() Function
Change plot type: Option type =
"p" for points
"l" for lines
"b" for both
"c" for the lines part alone of "b"
"o" for both overplotted
"h" for histogram-like (or high-density) vertical lines
"s" for stair steps
"S" for other steps (see ?plot)
"n" for no plotting
plot() Function
Points with lines… works better on a sorted list of points
> plot(LeftHeight, RightHeight, type="o")
Scatterplots for Multiple Groups
Use plot() with points() to plot different groups in same plot
Genuine notes vs. Counterfeit notes
> plot(LeftHeight[Type=="Genuine"], RightHeight[Type=="Genuine"],
col="red")
> points(LeftHeight[Type=="Counterfeit"], RightHeight[Type=="Counterfeit"],
col="blue")
Axis Labels and Plot Titles
The plot() call has options to
• Specify x-axis label: xlab = "X Label"
• Specify y-axis label: ylab = "Y Label"
• Specify plot title: main = "Main Title"
• Specify subtitle: sub = "Subtitle"
Axis Labels and Plot Titles
> plot(LeftHeight[Type=="Genuine"], RightHeight[Type=="Genuine"],
col="red", main="Plot of Bank Note Heights", sub="Measurements
are in mm", xlab="Height of Left Side", ylab="Height of Right Side")
> points(LeftHeight[Type=="Counterfeit"],
RightHeight[Type=="Counterfeit"], col="blue")
Legends
> legend("topleft", c("Genuine Notes",
"Counterfeit Notes"), pch=c(21,21), col=c("red","blue"))
Adding Lines
To add straight lines to plot: abline()
abline() refers to standard equation for a line:
y = bx + a
• Horizontal line: abline(h= )
• Vertical Line: abline(v= )
• Otherwise: abline(a= , b= ) or
abline(coef=c(a,b))
Adding Lines
> abline(coef=c(21.7104,0.8319))
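The slides do not say where the intercept and slope above come from. One common source, assumed here, is a least-squares fit with lm(). The sketch uses simulated heights, not the Swiss notes data, and draws to a null PDF device so it runs without opening a window.

```r
set.seed(1)
# Simulated stand-ins for the two height variables (made-up values)
left  <- rnorm(50, mean = 130, sd = 0.4)
right <- 26 + 0.8 * left + rnorm(50, sd = 0.2)

pdf(NULL)                           # null device: nothing is written to disk
plot(left, right, xlab = "Height of Left Side", ylab = "Height of Right Side")

fit <- lm(right ~ left)             # least-squares line: intercept a, slope b
abline(coef = coef(fit))            # same as abline(a = ..., b = ...)
abline(h = mean(right), lty = 2)    # horizontal reference line
abline(v = mean(left),  lty = 2)    # vertical reference line
dev.off()
```

On screen, leave out pdf(NULL) and dev.off() so the plot appears in the Plot Window.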
Histograms
Histograms are another popular plotting option.
> hist(Length)
pairs() Function
Using the SwissNotes data (read into a data frame named swiss)
> pairs(swiss)
Boxplots
To create boxplots: boxplot()
Specify one or more variables to plot.
> boxplot(swiss$Length)
> boxplot(swiss[,2:3])
Boxplots
Use a formula specification for side-by-side boxplots.
Note: boxplot() has many options, e.g. notches. See
?boxplot.
> boxplot(Length~Type,notch=TRUE,data=swiss)
Mean or Average
• mean()
> mean(swiss[,"Length"])
> sapply(swiss[,1:6], mean)
• rowMeans()
> rowMeans(swiss[,1:6])
• colMeans()
> colMeans(swiss[,1:6])
Variability
• Variance: var()
> var(swiss[,"Length"])
> var(swiss[,1:6])
• Covariance: cov()
> cov(swiss[,1:6])
• Correlation: cor()
> cor(swiss[,1:6])
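These summaries expect numeric columns, so a non-numeric column such as Type has to be left out first. A sketch on simulated data follows; the data frame toy and its columns are made up for illustration.

```r
set.seed(42)
toy <- data.frame(u   = rnorm(20),
                  v   = rnorm(20),
                  grp = rep(c("A", "B"), 10))   # non-numeric column

num <- toy[, 1:2]        # keep only the numeric columns

col_means <- colMeans(num)   # one mean per column
v_mat <- var(num)            # 2 x 2 variance-covariance matrix
c_mat <- cov(num)            # same matrix as var() for a data frame
r_mat <- cor(num)            # correlations, with 1s on the diagonal
```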
Five-number Summary
> summary(swiss[1:3])
     Length        LeftHeight     RightHeight
 Min.   :213.8   Min.   :129.0   Min.   :129.0
 1st Qu.:214.6   1st Qu.:129.9   1st Qu.:129.7
 Median :214.9   Median :130.2   Median :130.0
 Mean   :214.9   Mean   :130.1   Mean   :130.0
 3rd Qu.:215.1   3rd Qu.:130.4   3rd Qu.:130.2
 Max.   :216.3   Max.   :131.0   Max.   :131.1
Creating Tables
table() produces crosstabs of factors or categorical variables
Using the cardiac data:
> table(cardiac[,7:9])
, , newMI = 0

      chestpain
gender   0   1
     F   6  10
     M   4   8

, , newMI = 1

      chestpain
gender   0   1
     F 100 222
     M  62 146
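A two-way version of the same idea on a handful of made-up records (not the cardiac data), showing how to pull a single cell out of the table and turn counts into row proportions:

```r
# Five invented records, for illustration only
toy <- data.frame(gender    = c("F", "F", "M", "M", "F"),
                  chestpain = c(0, 1, 1, 0, 1))

tab <- table(toy$gender, toy$chestpain)   # rows: gender, columns: chestpain

f_with_pain <- tab["F", "1"]              # count of F with chestpain = 1: 2
row_props   <- prop.table(tab, margin = 1)  # proportions within each row
```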
Univariate t-tests
t.test() produces 1- and 2-sample (paired or independent) t-tests.
• 1-sample t-test
> t.test(x, alternative="two.sided", mu=0, conf.level=0.95)
• 2 independent samples t-test
> t.test(x, y, alternative="two.sided", mu=0, paired=FALSE,
conf.level=0.95)
• paired t-test
> t.test(x, y, alternative="two.sided", mu=0, paired=TRUE,
conf.level=0.95)
2 Independent Samples t-test
x: diagonal measurements for Genuine bank
notes
y: diagonal measurements for Counterfeit bank
notes
> x = swiss[Type=="Genuine","Diagonal"]
> y = swiss[Type=="Counterfeit","Diagonal"]
> t.test(x, y, alternative="greater", mu=0,
paired=FALSE, var.equal=TRUE)
2 Independent Samples t-test
> t.test(x, y, alternative="greater", mu=0,
paired=FALSE, var.equal=TRUE)
Two Sample t-test
data: x and y
t = 28.9149, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 1.948864      Inf
sample estimates:
mean of x mean of y
  141.517   139.450
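The result of t.test() can be stored, and its pieces extracted by name. The sketch below uses simulated samples rather than the Swiss notes data; the means and standard deviations are made up, so the numbers will not match the output above.

```r
set.seed(123)
# Simulated stand-ins for the two groups of diagonal measurements
x <- rnorm(100, mean = 141.5, sd = 0.45)
y <- rnorm(100, mean = 139.5, sd = 0.55)

res <- t.test(x, y, alternative = "greater", var.equal = TRUE)

res$statistic   # the t statistic
res$parameter   # degrees of freedom: 100 + 100 - 2 = 198
res$p.value     # the p-value
res$conf.int    # one-sided interval, so the upper limit is Inf
```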
Generating Random Numbers
R contains functions for generating random numbers from
many well-known distributions.
Random number from standard normal distribution:
> rnorm(1,mean=0,sd=1)
[1] 0.5308293
Vector of random numbers from uniform distribution:
> runif(3, min=0, max=1)
[1] 0.6578880 0.3261863 0.3093383
To reproduce results: set.seed()
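A minimal sketch of set.seed(): resetting the seed to the same value replays the same stream of random numbers. The seed value 2013 is arbitrary.

```r
set.seed(2013)
first <- rnorm(3)

set.seed(2013)          # same seed, so the same numbers come out again
second <- rnorm(3)

identical(first, second)   # TRUE

third <- rnorm(3)          # no reset: the stream simply continues
```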
Function Basics
if() statement
> n = rnorm(1)
> if (n < 0){
n = abs(n)
}
if() statement with else
> n = rnorm(1)
> if (n < 0){
n = abs(n)
} else {
n = 0
}
Function Basics
for() loop
> temp = rep(0,10)
> for (i in 1:10){
temp[i] = i+1
}
> temp
[1] 2 3 4 5 6 7 8 9 10 11
Function Basics
while() loop
> n = 1
> while (n < 10 ){
n = n+1
}
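Because R is vectorized, a simple for() loop can often be replaced by one expression. The sketch below compares the two, and adds a while() loop whose number of passes is not known in advance. The variable names are made up for illustration.

```r
# for() loop version, filling one element at a time
temp <- rep(0, 10)
for (i in 1:10) {
  temp[i] <- i + 1
}

# Vectorized version of the same calculation
vec <- 1:10 + 1
all(temp == vec)    # TRUE

# while() loop: count how many doublings it takes to pass 100
n <- 1
steps <- 0
while (n <= 100) {
  n <- n * 2
  steps <- steps + 1
}
steps   # 7, since 2^7 = 128 is the first power of 2 above 100
```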
Creating Functions
test.function = function(input arguments){
commands to execute
}
Creating Functions
For example, let’s define a new function average
to find the average of a set of numbers.
average = function(x){
n = length(x)
average = sum(x)/n
print(average)
}
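A variation on this function, as a sketch: instead of printing, it returns the result so the value can be stored and reused. The name average2 and the na.rm argument are made up for this example.

```r
average2 <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]      # drop missing values when asked to
  }
  sum(x) / length(x)       # the last expression is the return value
}

m1 <- average2(c(2, 4, 6))                    # 4
m2 <- average2(c(1, NA, 3), na.rm = TRUE)     # 2
m3 <- average2(c(1, NA, 3))                   # NA, missing values propagate
```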
Sourcing
After writing a function in a script file, bring it
into working memory using source().
source("pathname/test.function.R")
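To try sourcing without a script of your own, the sketch below writes a one-line function to a temporary file and sources it. The function square and the temporary file are made up for illustration.

```r
# Write a tiny script file to a temporary location
script <- tempfile(fileext = ".R")
writeLines("square <- function(x) x^2", script)

source(script)    # runs the file, which defines square() in the workspace
result <- square(4)   # 16

unlink(script)    # delete the temporary file
```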
