R: Because the names of other stat programs don’t make sense so why should this one?
ALEXANDER C. LOPILATO

Outline
The three Ws of R: What, Where, and Why
Commonly used operators
Formatting your data for R
Working with data in R
Exporting data from R

What is R?
“R is a language and environment for statistical computing and graphics.” (http://www.r-project.org)
It’s a programming language first and a statistical analysis tool second
Entirely syntax based, similar to SAS and SPSS syntax
Users can download “packages”, which are similar to SPSS modules

Where is R?
Available for download at: http://www.r-project.org/
Works on PCs, Macs, and Linux
Doesn’t require a ton of computer memory (I find that it runs smoother than both SAS and SPSS)

Why use R?
It’s an open-source project, aka FREE!
It’s gaining traction in both industry (Google, Facebook, & Kickstarter) and academia
Its combination of programming flexibility and statistical analysis capabilities makes it one of the more powerful data analysis programs out there

Commonly used operators
<-  : Assignment operator
#  : Comment operator
>, <, ==  : Comparison operators (each returns TRUE or FALSE)
|  : Logical OR operator
+, -, *, ^  : Mathematical operators

Formatting your data for R: A Brief Intro
R can read SAS, SPSS, STATA, txt files, and csv files
I recommend that you store your data in a csv file
  R can easily read csv files
  Csv files can be imported to and exported from SAS and SPSS
  Other statistical programs can easily read csv files
I write all of my code in notepad (more habit than anything else), but R has many different GUIs

Formatting your data for R: Three easy steps
1) Turn your data file into a csv file
2) Use the read.csv() function
   Dataset <- read.csv('Dataset location.csv')
3) Dataset is now a user-defined object (in this particular case it’s a dataframe in R) that contains all of your data

Formatting your data for R: Common Mistakes (That I’ve made 100 times over)
R treats \ (the backslash) as an escape character inside strings, so when you write the location of your dataset you have
to use either / or \\
R is case sensitive, so 'C:\\Dataset.csv' and 'C:\\dataset.csv' are not the same string in R speak, and Dataset and dataset are not the same object. Whether two paths that differ only in case point to the same file depends on your operating system, so match the case exactly
Always make sure you include the file extension (.csv, .txt, .whatever)!

Working with data in R: Things to check
I always check the dimensions of my dataset
  dim(Dataset)
  This returns two numbers: rows, then columns. Rows = number of cases and columns = number of variables
Check the names of your dataset
  names(Dataset)
Check the descriptive statistics for anything out of the ordinary:
  colMeans(Dataset[,1:10], na.rm=TRUE)
  sapply(Dataset[,1:10], sd, na.rm=TRUE)
Notice the brackets?

Working with data in R: Subsetting your dataset
First, begin thinking about your dataset as a matrix
Rows = cases and columns = variables
Dataset[5,1] means return the observation stored in row 5, column 1
Dataset[,1] means return all of the rows in column 1
Dataset[2,1:5] means return the observations in row 2, columns 1 through 5

Working with data in R: Subsetting your dataset
Alternatively, you can reference a column directly by using the $ operator:
  Dataset$Var1 will return the entire Var1 column from Dataset
What if I want to filter by some variable?
  ds.Female <- Dataset[Dataset$Var11 == 'Female',]
The above creates a dataframe called ds.Female that keeps only the cases where Var11 equals 'Female', so any case where Var11 equals 'Male' is filtered out

Working with data in R: Reverse Coding
What do you do if you have some variables that need to be reverse coded?
(1 + highest scale value - Variable) is the general formula for a scale that starts at 1
  Dataset$Var12 <- 8 - Dataset$Var10
  (the 8 implies that Var10 is scored from 1 to 7)
This does two things:
1) Creates another column in Dataset labeled Var12 and
2) Sets Var12 equal to 8 - Var10
Check with cor(Dataset$Var10, Dataset$Var12, use='complete.obs'), which should return -1

Working with data in R: Internal R Functions
mean(Dataset$Var1, na.rm=TRUE) is the same as mean(Dataset[,1], na.rm=TRUE) when Var1 is the first column
sd(Dataset$Var4, na.rm=TRUE)
min(Dataset$Var5, na.rm=TRUE) and max(Dataset$Var5, na.rm=TRUE)
cor(Dataset[,1:10], use='complete.obs')
  cor() only accepts numeric columns, so leave out text columns such as Var11

Working with data in R: Internal R Functions
modlm <- lm(Var2 ~ Var3, data=Dataset)
  Ordinary Least Squares Regression, regressing variable 2 on variable 3
modanova <- lm(Var4 ~ as.factor(Var11), data=Dataset)
  OLS Regression, regressing variable 4 on the categorical gender variable. This is an ANOVA!
modanova1 <- aov(Var4 ~ as.factor(Var11), data=Dataset)
  aov is R’s built-in ANOVA function
dif <- TukeyHSD(modanova1)
  Tukey’s Honestly Significant Difference Test

Exporting Data from R
write.csv(Dataset, 'Location.csv')
BOOM goes the dynamite

Thank you!
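
Appendix: the whole workflow in one script
The steps covered in these slides can be sketched end to end as follows. This is only an illustration under stated assumptions: the toy values are invented, the temporary file path is hypothetical, and the column names (Var4, Var10, Var11) are borrowed from the slides with Var10 assumed to be scored from 1 to 7.

```r
# Hypothetical toy data standing in for Dataset; values are invented
# purely for illustration and do not come from the slides.
Dataset <- data.frame(
  Var4  = c(3.1, 4.5, 2.8, 5.0, 3.9, 4.2),
  Var10 = c(1, 7, 4, 2, 6, NA),   # item on an assumed 1-7 scale, one missing value
  Var11 = c("Female", "Male", "Female", "Male", "Female", "Male")
)

# Export to csv and read it back. file.path() builds the path with forward
# slashes, which sidesteps the backslash problem entirely.
csv_path <- file.path(tempdir(), "Dataset.csv")
write.csv(Dataset, csv_path, row.names = FALSE)  # row.names = FALSE: no extra index column
Dataset <- read.csv(csv_path)

# Things to check
dim(Dataset)     # rows (cases), then columns (variables)
names(Dataset)

# Subsetting: keep only the rows where Var11 equals "Female"
ds.Female <- Dataset[Dataset$Var11 == "Female", ]

# Reverse coding: 1 + highest scale value - variable
Dataset$Var12 <- 8 - Dataset$Var10
cor(Dataset$Var10, Dataset$Var12, use = "complete.obs")

# The same one-way ANOVA fitted two ways
modanova  <- lm(Var4 ~ as.factor(Var11), data = Dataset)
modanova1 <- aov(Var4 ~ as.factor(Var11), data = Dataset)
anova(modanova)      # ANOVA table from the lm fit
summary(modanova1)   # ANOVA table from the aov fit
```

Comparing the two ANOVA tables at the end is a quick way to convince yourself that lm() with a categorical predictor and aov() are fitting the same model.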