Welcome to the R Intro Workshop

Before we begin, please download the “SwissNotes.csv” and “cardiac.txt” files from the ISCC website, under the R workshop (more info).
www.iub.edu/~iscc

Introduction to R
Workshop in Methods from the Indiana Statistical Consulting Center
Thomas A. Jackson
February 15, 2013

Overview

The R Project for Statistical Computing
http://cran.r-project.org
“R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.”
- Description from CRAN Website

Benefits
R…
• is free
• is interactive: we can type something in and work with it
  ▫ How we analyze data can be broken into small steps
• is interpreted: we give it commands and it translates them into mathematical procedures or data management steps
• can be run in batch mode: nice because the analysis is documented in a script
• is a calculator: unlike other calculators, though, it lets you create variables and objects

Let’s Get R Started
• How to open R: Start Menu → Programs → Departmentally Supported → Stat/Math → R

Graphical User Interface (GUI)
Three Environments
• Command Window (aka Console)
• Script Window
• Plot Window

Command Window Basics
To quit: type q()
• Save workspace image? Answering yes moves the workspace from memory to the hard drive
Storing a variable in memory
• <- , -> , or =
• a <- 5 stores the number 5 in the object “a”
• pi -> b stores the number π = 3.141593 in “b”
• x = 1 + 2 stores the result of the calculation (3) in “x”
• “=” requires left-hand assignment (the object name must be on the left)
Try not to overwrite built-in names such as t, c, and pi!

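The assignment forms above can be tried directly in the console. The following is a minimal sketch; the object names masked and restored are hypothetical and exist only to show what happens when a built-in name such as pi is overwritten and then removed again.

```r
# Three ways to store a value in an object
a <- 5        # left-hand assignment with <-
pi -> b       # right-hand assignment: copies the built-in constant pi into b
x = 1 + 2     # "=" only works with the object name on the left

print(a)
print(x)

# Overwriting a built-in name hides the original value in the workspace...
pi <- 3
masked <- pi      # now holds 3 rather than 3.141593

# ...and removing the copy makes the built-in value visible again
rm(pi)
restored <- pi
```

Because rm(pi) only deletes the copy in the workspace, the built-in constant is never lost; the risk is the silent change in results while the copy exists.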
Command Window Basics
Printing to output
• Calculations that are not stored print to output
  > 3+5
  [1] 8
• Type the name to view a stored object
  > a
  [1] 5
• Use print()
  > print(a)
  [1] 5
View objects in workspace
• objects() or ls()

Command Window Basics
Clearing the console (command window)
• Mac: Edit → Clear Console, or Alt + Command + L
• Windows: Edit → Clear Console, or Ctrl + L
Removing variables from memory
• rm() or remove()
  > x <- 4
  > rm(x)
• rm(list = ls()) removes all variables

Script Window Basics
Saving syntax (code) in a new script
• Mac: File → New
• Windows: File → New Script
Documenting code
• # comments out everything after it on the line
Running code from Script Window
• Mac: Apple + Enter
• Windows: F5 or Ctrl + R

Working Directory
Obtaining working directory
• getwd()
• Mac: Misc → Get Working Directory
• Windows: File → Change dir...
Changing working directory
• setwd()
• Mac: Misc → Change Working Directory
• Windows: File → Change dir...

Path Names
Specify with forward slashes or double backslashes
Enclose in single or double quotation marks
Examples
• setwd("C:/Program Files/R/R-2.6.1")
• setwd('C:\\Program Files\\R\\R-2.6.1')

R Help
Helpful commands
• If you know the function name: help() or ?
  > help(log)
  > ?exp
• If you do not know the function name: help.search() or ??
  > help.search("anova")
  > ??regression

Documentation
Elements of a documentation file
• Function{Package}
• Description
• Usage: What your code should look like; “=” gives the default value of an argument
• Arguments: Inputs to the function
• Details
• Value: What the function will return
• See Also: Related functions
• Examples

Online Resources
• CRAN Website: http://cran.r-project.org/
• R Seek: http://www.rseek.org/
• Quick-R tutorial: http://www.statmethods.net/
• R Tutor: http://www.r-tutor.com/
• UCLA: http://www.ats.ucla.edu/stat/r/
• R listservs
• Google
Google tip: include “[R]” (instead of just “R”) with search topic to help filter out non-R websites

Additional Packages
Over 2,500 listed on the CRAN website!
• Use with caution
• The initial download of R includes: base, graphics, stats, utils
1) Installing a package:
• Mac: Packages & Data → Package Installer. Use Package Search to locate the package and press ‘Install Selected’
• Windows: Packages → Install Packages. Locate the desired package and press ‘OK’
• install.packages("MASS")
2) Using an installed package: You MUST call it into active memory with library()
  > library(MASS)

Data Structures
R has several basic types (or “classes”) of data:
• Numeric - Numbers
• Character - Strings (letters, words, etc.)
• Logical - TRUE or FALSE
• Vector
• Matrix
• Array
• Data Frame
• List
NOTE: There are other classes, but these are the most common. Understanding the differences will save you some headaches.

Data Structures
Find class of data
• Unknown class: class()
• Check particular class: is.classname()
  > a <- 5
  > class(a)
  [1] "numeric"
  > is.character(a)
  [1] FALSE
Change class: as.classname()
  > as.character(a)
  [1] "5"

Vectors
Combine items into a vector: c()
  > c(1,2,3,4,5,6)
  [1] 1 2 3 4 5 6
Repeat a number or sequence of numbers: rep()
  > rep(1,5)
  [1] 1 1 1 1 1
  > rep(c(2,5,7), times = 3)
  [1] 2 5 7 2 5 7 2 5 7

Vectors
Sequence generation: seq()
  > seq(1,5)
  [1] 1 2 3 4 5
  > seq(1,5, by = .5)
  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Try 1:10 or 10:1

Matrices
Create matrix: matrix()
• 6 x 1 matrix: matrix(1:6, ncol = 1)
• 2 x 3 matrix: matrix(1:6, nrow = 2, ncol = 3)
• 2 x 3 matrix filling across rows first: matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
Create matrix of more than two dimensions (array): array()

Lists
Create a list: list()
• Holds vectors, matrices, arrays, etc. of varying lengths
• Objects in the list can be named or unnamed
  > list(matrix(0, 2, 2), y = rep(c("A", "B"), each = 2))
  [[1]]
       [,1] [,2]
  [1,]    0    0
  [2,]    0    0

  $y
  [1] "A" "A" "B" "B"
Data Frame: specialized list that holds variables of the same length

Data Frames
Create a data frame: data.frame()
• Like a matrix, holds specified number of rows and columns
  > x <- 1:4
  > y <- rep(c("A", "B"), each = 2)
  > data.frame(x,y)
    x y
  1 1 A
  2 2 A
  3 3 B
  4 4 B
• Unnamed variables get assigned names
  > data.frame(1:2, c("A", "B"))
    X1.2 c..A....B..
  1    1           A
  2    2           B

Basic Operations
• Arithmetic: +, -, *, /
• Order of operations: ()
• Exponentiation: ^, exp()
• Other: log(), sqrt()
• Evaluate standard Normal density curve at x = 3
  > x <- 3
  > 1/sqrt(2*pi)*exp(-(x^2)/2)
  [1] 0.004431848

Vectorization
R is great at vectorizing operations
• Feed a matrix or vector into an expression
• Receive an object of similar dimension as output
For example, evaluate at x = 0,1,2,3
  > x <- c(0,1,2,3)
  > 1/sqrt(2*pi)*exp(-(x^2)/2)
  [1] 0.398942280 0.241970725 0.053990967 0.004431848

Logical Operations
• Compare: ==, >, <, >=, <=, !=
  > a <- c(1,1,2,4,3,1)
  > a == 2
  [1] FALSE FALSE TRUE FALSE FALSE FALSE
• And: & or &&
• Or: | or ||
• Find location of TRUEs: which()
  > which(a == 1)
  [1] 1 2 6

Subsetting
  > a <- 1:5
  > b <- matrix(1:12, nrow = 3)
Use square brackets []
• Pick range of elements: a[1:3]
• Pick particular elements: a[c(1,3,5)]
• Do not include elements: a[-c(1,4)]

Subsetting (cont.)
Use commas in more than one dimension (matrices & data frames)
• Pick particular elements: b[1:2,2:4]
• Give all rows and specified columns: b[,1:2]
• Give all columns and specified rows: b[1:2,]
• Note: b[2] treats the matrix as a vector and then gives the specified element

Reading External Data Files
SwissNotes.csv Data set
• Compiled by Bernard Flury
• Contains measurements on 200 Swiss Bank Notes
• 100 genuine and 100 counterfeit notes

Reading External Data Files (cont.)
Most general function: read.table()
read.table(file, header = FALSE, sep = "", …)
• Creates a data frame
• File name must be in quotes, single or double
• File name is case sensitive
• Include the full path if the file is not in the working directory
  > read.table("C:/Users/jacksota/Desktop/SwissNotes.csv", header = TRUE, sep = ",")
Don’t know the file path? Try: file.choose()
  > read.table(file.choose(), header = TRUE, sep = ",")
• sep defines the separator, e.g. "," or "\t" or ""
• header indicates variable names should be read from first row

Reading External Data Files
For comma delimited files: read.csv()
For tab delimited files: read.delim()
For Minitab, SPSS, SAS, STATA, etc. data: foreign package
• Contains functions to read a variety of file formats
• Functions operate like read.table()
• Contains functions for writing data into these file formats

Data Frame Hints
Identify variable names in data frame: names()
  > data1 <- read.table("SwissNotes.csv", sep = ",", header = TRUE)
  > names(data1)
  [1] "Length" "LeftHeight" "RightHeight" "LowerInner.Frame"
  [5] "UpperInner.Frame" "Diagonal" "Type"
Assign names to data frame variables
  > names(data1) <- c("Length", "LeftHeight", "RightHeight", "LowerInner.Frame", "UpperInner.Frame", "Diagonal", "Type")
Note: names are strings and MUST be contained in quotes

Data Frame Hints (cont.)
Create objects out of each data frame variable: attach()
In the Swiss Note data, to refer to Type as its own object
  > attach(data1)
  > Type
  [1] Genuine Genuine Genuine ….

Data Frame Hints (cont.)
Remove attached objects from workspace: detach()
  > detach(data1)
  > Type
  Error: object 'Type' not found
Note: Type is still part of the original data frame, but is no longer a separate object.

plot() function
plot() is the primary plotting function
Calling plot() will open a new plotting window
Documentation: ?plot
For complete list of graphical parameters to manipulate: ?par

plot() function
Let’s visualize the SwissNotes.csv data.
After loading the data into R (the remaining slides call the data frame swiss), attach the data frame using attach(swiss).
Let’s try a scatter plot of LeftHeight by RightHeight.
  > plot(LeftHeight, RightHeight)

plot() function
Change symbols: Option pch=. See ?par for details.
  > plot(LeftHeight, RightHeight, pch=2)

plot() Function
Change symbol color: Option col=
Specify by number or by name: col=2 or col="red"
Hint: Type palette() to see colors associated with number
Type colors() to see all possible colors
  > plot(LeftHeight, RightHeight, col="red")
What types of points can we get?

plot() Function
Change plot type: Option type=
• "p" for points
• "l" for lines
• "b" for both
• "c" for the lines part alone of "b"
• "o" for both overplotted
• "h" for histogram-like (or high-density) vertical lines
• "s" for stair steps
• "S" for other steps, see the Details section of ?plot
• "n" for no plotting

plot() Function
Points with lines… works better on a sorted list of points
  > plot(LeftHeight, RightHeight, type="o")

Scatterplots for Multiple Groups
Use plot() with points() to plot different groups in the same plot
Genuine notes vs. Counterfeit notes
  > plot(LeftHeight[Type=="Genuine"], RightHeight[Type=="Genuine"], col="red")
  > points(LeftHeight[Type=="Counterfeit"], RightHeight[Type=="Counterfeit"], col="blue")

Axis Labels and Plot Titles
The plot() command call has options to
• Specify x-axis label: xlab = "X Label"
• Specify y-axis label: ylab = "Y Label"
• Specify plot title: main = "Main Title"
• Specify subtitle: sub = "Subtitle"

Axis Labels and Plot Titles
  > plot(LeftHeight[Type=="Genuine"], RightHeight[Type=="Genuine"], col="red", main="Plot of Bank Note Heights", sub="Measurements are in mm", xlab="Height of Left Side", ylab="Height of Right Side")
  > points(LeftHeight[Type=="Counterfeit"], RightHeight[Type=="Counterfeit"], col="blue")

Legends
  > legend("topleft", c("Genuine Notes", "Counterfeit Notes"), pch=c(21,21), col=c("red","blue"))

Adding Lines
To add straight lines to plot: abline()
abline() refers to the standard equation for a line: y = bx + a
• Horizontal line: abline(h= )
• Vertical line: abline(v= )
• Otherwise: abline(a= , b= ) or abline(coef=c(a,b))

Adding Lines
  > abline(coef=c(21.7104,0.8319))

Histograms
Histograms are another popular plotting option.
  > hist(Length)

pairs() Function
Using the SwissNote Data
  > pairs(swiss)

Boxplots
To create boxplots: boxplot()
Specify one or more variables to plot.
  > boxplot(swiss$Length)
  > boxplot(swiss[,2:3])

Boxplots
Use a formula specification for side-by-side boxplots.
Note: boxplot() has many options, e.g. notches. See ?boxplot.
  > boxplot(Length~Type, notch=TRUE, data=swiss)

Mean or Average
• mean()
  > mean(swiss[,"Length"])
  > mean(swiss)   # worked in older R; current versions return NA with a warning, so use colMeans() on the numeric columns
• rowMeans()
  > rowMeans(swiss[,1:6])
• colMeans()
  > colMeans(swiss[,1:6])

Variability
• Variance: var()
  > var(swiss[,"Length"])
  > var(swiss[,1:6])
• Covariance: cov()
  > cov(swiss[,1:6])
• Correlation: cor()
  > cor(swiss[,1:6])

Five-number Summary
  > summary(swiss[1:3])
       Length        LeftHeight     RightHeight
   Min.   :213.8   Min.   :129.0   Min.   :129.0
   1st Qu.:214.6   1st Qu.:129.9   1st Qu.:129.7
   Median :214.9   Median :130.2   Median :130.0
   Mean   :214.9   Mean   :130.1   Mean   :130.0
   3rd Qu.:215.1   3rd Qu.:130.4   3rd Qu.:130.2
   Max.   :216.3   Max.   :131.0   Max.   :131.1

Creating Tables
table() produces crosstabs of factors or categorical variables
Using the cardiac data:
  > table(cardiac[,7:9])
  , , newMI = 0

        chestpain
  gender   0   1
       F   6  10
       M   4   8

  , , newMI = 1

        chestpain
  gender   0   1
       F 100 222
       M  62 146

Univariate t-tests
t.test() produces 1- and 2-sample (paired or independent) t-tests.
• 1-sample t-test
  > t.test(x, alternative="two.sided", mu=0, conf.level=0.95)
• 2 independent samples t-test
  > t.test(x, y, alternative="two.sided", mu=0, paired=FALSE, conf.level=0.95)
• paired t-test
  > t.test(x, y, alternative="two.sided", mu=0, paired=TRUE, var.equal=TRUE, conf.level=0.95)

2 Independent Samples t-test
x: diagonal measurements for Genuine bank notes
y: diagonal measurements for Counterfeit bank notes
  > x = swiss[Type=="Genuine","Diagonal"]
  > y = swiss[Type=="Counterfeit","Diagonal"]
  > t.test(x, y, alternative="greater", mu=0, paired=FALSE, var.equal=TRUE)

2 Independent Samples t-test
  > t.test(x, y, alternative="greater", mu=0, paired=FALSE, var.equal=TRUE)

          Two Sample t-test

  data:  x and y
  t = 28.9149, df = 198, p-value < 2.2e-16
  alternative hypothesis: true difference in means is greater than 0
  95 percent confidence interval:
   1.948864      Inf
  sample estimates:
  mean of x mean of y
    141.517   139.450

Generating Random Numbers
R contains functions for generating random numbers from many well-known distributions.
Random number from standard normal distribution:
  > rnorm(1, mean=0, sd=1)
  [1] 0.5308293
Vector of random numbers from uniform distribution:
  > runif(3, min=0, max=1)
  [1] 0.6578880 0.3261863 0.3093383
To reproduce results: set.seed()

Function Basics
if() statement
  > n = rnorm(1)
  > if (n < 0) {
      n = abs(n)
    }
if() statement with else
  > n = rnorm(1)
  > if (n < 0) {
      n = abs(n)
    } else {
      n = 0
    }

Function Basics
for() loop
  > temp = rep(0,10)
  > for (i in 1:10) {
      temp[i] = i+1
    }
  > temp
  [1] 2 3 4 5 6 7 8 9 10 11

Function Basics
while() loop
  > n = 1
  > while (n < 10) {
      n = n+1
    }

Creating Functions
  test.function = function(input arguments){
    commands to execute
  }

Creating Functions
For example, let’s define a new function average to find the average of a set of numbers.
  average = function(x){
    n = length(x)
    average = sum(x)/n
    print(average)
  }

Sourcing
After writing a function in a script file, bring it into working memory using source().
  source("pathname/test.function.R")
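The last two slides can be combined into one small sketch: write a function to a script file, source() it, and call it. Everything named here is an assumption for illustration: average2 is a hypothetical variation on average() that returns its result rather than printing it, and tempfile() stands in for a real path such as "pathname/test.function.R".

```r
# Write a small function definition to a temporary script file.
# tempfile() is a stand-in for a real script path on your computer.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "average2 <- function(x) {",
  "  n <- length(x)   # number of values",
  "  sum(x) / n       # the last expression evaluated is the return value",
  "}"
), script_path)

# Bring the function into working memory
source(script_path)

# Use it on reproducible random draws
set.seed(1)                  # fixes the random number stream
draws <- rnorm(10, mean = 0, sd = 1)
m <- average2(draws)

# Compare with the built-in mean()
same_as_mean <- isTRUE(all.equal(m, mean(draws)))
```

Returning the value instead of printing it lets the result be stored (as in m above) and reused in later calculations.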