You can import many data formats into R using various packages. You simply need to tell R where to find the data on your computer: the directory, or path. Therefore, before we start loading data into R, let’s first see how we can set and change the working directory.

1 Set the Working Directory

The working directory is the folder on your computer where R looks for files to read and saves files by default. You can simply think of it as the path to the folder you want to use.

1.1 In an R file (not R Notebook)

#view your current working directory
getwd()

#change the working directory to your project folder
setwd("/Users/yaoyao/Documents/PLSC309/Lab1")
#view what is inside this working directory
dir()
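
As a quick check of how these functions behave, the sketch below switches into a temporary folder and back again. It makes no assumption about your own folders: tempdir() is only a stand-in for a project folder, and old_wd, files_here and back_again are names chosen for illustration.

```r
# setwd() invisibly returns the previous working directory,
# so we can store it and switch back later
old_wd <- setwd(tempdir())

# create an empty file so that dir() has something to show
file.create("example.txt")
files_here <- dir()

# go back to where we started
setwd(old_wd)
back_again <- getwd()
```

Storing the old working directory before changing it is a convenient habit, because it lets a script leave things as it found them.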

1.2 In an R Notebook File

In an R Notebook, the working directory inside a notebook chunk is always the folder containing that .Rmd file. For example, for your lab assignment you created a Name_Lab1.Rmd file. If you saved that file inside a folder called “Lab1” under the “PLSC309” folder, the working directory is the full path to the “Lab1” folder.

What happens if you want to set the working directory elsewhere in R Notebook?

# Only changes working directory in this code chunk.  
setwd("/Users/yaoyao/Documents/PLSC309")
The working directory was changed to /Users/yaoyao/Documents/PLSC309 inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the the working directory for notebook chunks.

The warning message above tells us that setwd() changes the working directory only in that one code chunk. For the rest of the notebook, the working directory is still the folder where you saved your .Rmd file.

To change the working directory for all the code chunks in a notebook file, set the knitr root.dir option, as the warning suggests. You have to do this in the setup chunk, which begins with {r setup} instead of {r}:

#change working directory for all the code chunks in a notebook 
knitr::opts_knit$set(root.dir = normalizePath("/Users/yaoyao/Documents/PLSC309/Lab1")) 

This is not very intuitive! You can avoid changing the working directory altogether if you always put the relevant files in the same folder as the .Rmd Notebook file.
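
If a file does live somewhere else, one alternative to changing the working directory is to build the full path and pass it to the import function. A minimal sketch, in which the folder name, the file name and the values are all made up for illustration:

```r
# file.path() joins folder and file names with the right separator
project_dir <- file.path(tempdir(), "Lab1")        # hypothetical project folder
dir.create(project_dir, showWarnings = FALSE)

data_path <- file.path(project_dir, "scores.csv")  # hypothetical file name
write.csv(data.frame(score = c(52, 33)), data_path, row.names = FALSE)

# check that the file is where we think it is before reading it
found <- file.exists(data_path)
scores <- read.csv(data_path)
```

Checking file.exists() first tends to give a clearer answer than the error message from a failed import.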


2 Import Data

2.1 Install packages

To use the packages that import different data formats, you first need to install them. You only need to install a package once.

File Type                               Package       Function
Excel (.xlsx)                           openxlsx      read.xlsx
Stata (.dta, versions 12 and earlier)   foreign       read.dta
Stata (.dta, versions 13 and 14)        readstata13   read.dta13
SPSS (.sav)                             foreign       read.spss
Comma separated (.csv)                  none          read.csv
Tab delimited (.tab or .txt)            none          read.table
R data files (.Rdata)                   none          load

#Install necessary packages, you only need to do this once
install.packages("openxlsx")
install.packages("foreign")
install.packages("readstata13")
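
Because a package only needs to be installed once, it can be handy to check what is already there before installing. The sketch below is one possible way to do that; missing_packages() and install_if_missing() are hypothetical helper names, not functions from any package.

```r
# Return the packages in pkgs that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing (this needs an internet connection)
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(to_install)
}

# "stats" and "utils" ship with R, so nothing should be missing here
already_there <- missing_packages(c("stats", "utils"))
```

For the packages in this helpsheet, the call would be install_if_missing(c("openxlsx", "foreign", "readstata13")).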

2.2 Load packages and import data

Although you only need to install a package once, you need to load it with the library() function at the beginning of every R session, before you start your analysis.

2.2.1 Read Excel file

# You need to load a package with library() before using it in R.
library(openxlsx) 

# If your data is in the current working directory, you can simply do:
data <- read.xlsx("filename.xlsx")

# If your data is not in the current working directory, you need to give R the path to the data, such as 
data <- read.xlsx("/Users/yaoyao/Documents/PLSC309/Lab1/filename.xlsx") # do not run

Now you might see why a Notebook treats the folder containing your current .Rmd file as the working directory. It is a good habit to keep all the relevant files for one project in one folder.

2.2.2 Read a comma-separated (.csv) file

# If your data is in the current working directory, you can simply do:
ANES <- read.csv("cleaned_ANES.csv")
# If your data is not in the current working directory, you need to give the path to the data, such as
turnout <- read.csv("/Users/yaoyao/Documents/PLSC309/Lab1/turnout.csv")

2.2.3 Read an R data (.Rdata) file

load("filename.Rdata")
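
Unlike the other import functions, load() is not assigned to a name: it restores the saved objects under the names they had when they were saved. The round trip below is a small sketch of that behavior; the object, the file name and the values are made up for illustration.

```r
# a small object to save (made-up values)
scores <- c(52, 33, 56)

# save() writes one or more objects, together with their names, to a file
rdata_path <- file.path(tempdir(), "scores.Rdata")
save(scores, file = rdata_path)

# remove the object, then bring it back with load()
rm(scores)
restored <- load(rdata_path)  # load() returns the names of the loaded objects
```

After load(), the object scores is available again, and restored holds its name as a character string.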

2.2.4 Read a Stata (.dta, versions 12 and earlier) file

library(foreign)
data<-read.dta("filename.dta")

2.2.5 Read a Stata (.dta, versions 13 and 14) file

The “foreign” package does not support Stata data after version 12. We can use the “readstata13” package to import data from later versions of Stata instead.

library(readstata13)
# Read from Stata versions 13 and 14
data <- read.dta13("filename.dta")

2.2.6 Read an SPSS (.sav) file

library(foreign)
data <- read.spss("filename.sav", to.data.frame = TRUE)

2.2.7 Read a tab-delimited (.tab or .txt) file

data <- read.table("JoP_R_data.txt",sep="\t")

2.3 Arguments within read function

In some cases, in addition to the path and file name, you need to give the import function more information about the data before it can be loaded into R correctly. You can find the arguments and their defaults in the R documentation using the help() function.

help("read.table")
?read.table

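# The help page lists all the arguments of the function and their defaults.
# The usage lines below are copied from the help page and are not meant to be run.
# In R 4.0.0 and later, the help page shows stringsAsFactors = FALSE as the default.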
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)
# Practice and review which arguments are available in different data import functions
?read.csv
?read.xlsx
?read.spss

2.3.1 Some useful arguments within read function

1. header

The header argument tells R whether to treat the first row of the data as the variable names. In the examples above, you have probably noticed that the default header argument in read.table is FALSE, while the default in read.csv is TRUE. This means that, by default, the read.table() function will not treat the top row of the data as variable names. If the first row of your data contains the variable names, you’d want to change the default from FALSE to TRUE:

# If the first row of your data contains the variable names, you need to set the header argument to TRUE in read.table. You do not need to do this in read.csv, as TRUE is already the default there.
read.table(file, header = TRUE)
# If the first row of your csv data does not contain the variable names, you need to change the default of header in read.csv to FALSE
read.csv(file, header = FALSE)

2. sep

The sep argument tells R how the values in the raw data are separated. The default in read.table is sep = "", which tells R to separate values on one or more white spaces. For example, under this default, “Penn State” will be treated as two values and put into two columns in a data frame.

If your data is not white space separated, you need to change this default accordingly. In the “Penn State” example, “Penn State” is a single value and should be put into one column instead of two. In this case, the values are separated by tabs, not by every white space. If your data is tab separated, you need to change the sep argument to "\t":

read.table(file, sep="\t")

If your data is comma separated, the sep argument should be “,” which is the default in read.csv.

3. to.data.frame in read.spss

By default, the read.spss() function returns the data as a list rather than as a data frame. However, as most of our analyses work on a data frame, you’d want to change this default when you read an SPSS file:

read.spss(file, to.data.frame = TRUE)
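
The header and sep arguments described above can be tried out without any file on disk, because read.table() and read.csv() also accept a text argument that reads from a character string. The sketch below uses made-up values; the object names are chosen for illustration only.

```r
# two small "files" held in character strings (made-up values);
# the text argument lets read.table()/read.csv() read from a string
tab_text <- "school\tscore\nPenn State\t52\nPittsburgh\t14"
csv_text <- "school,score\nPenn State,52"

# header: with header = TRUE the first row becomes the variable names
with_header <- read.table(text = tab_text, sep = "\t", header = TRUE)
# with header = FALSE the first row is kept as an ordinary row of data
no_header <- read.csv(text = csv_text, header = FALSE)

# sep: the default sep = "" splits on white space, so "Penn State"
# becomes two values; sep = "\t" keeps it together
split_on_space <- read.table(text = "Penn State\t52")
split_on_tab   <- read.table(text = "Penn State\t52", sep = "\t")
```

Comparing ncol(split_on_space) with ncol(split_on_tab), and names(with_header) with names(no_header), shows the effect of each argument.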

2.4 Create data frame in R

#Create your variables
PSUScore <- c(52,33,56,21,45,31)
RivalScore <- c(0,14,0,19,14,7)
Win <- c(1,1,1,1,1,1)
Against <- c("Akron","Pittsburgh","Georgia State","Iowa","Indiana","Northwestern")

#Turn those variables into a data frame
NittanyLions2017 <- data.frame(Against,PSUScore,RivalScore,Win)
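
After creating a data frame, it is worth checking that it has the shape you expect. The sketch below repeats the vectors from above so that it runs on its own, and then shows what happens when the vectors do not line up; n_games, col_names and bad are names chosen for illustration.

```r
# the same vectors as above, repeated so this block runs on its own
PSUScore   <- c(52, 33, 56, 21, 45, 31)
RivalScore <- c(0, 14, 0, 19, 14, 7)
Win        <- c(1, 1, 1, 1, 1, 1)
Against    <- c("Akron", "Pittsburgh", "Georgia State",
                "Iowa", "Indiana", "Northwestern")

NittanyLions2017 <- data.frame(Against, PSUScore, RivalScore, Win)

# check the result: one row per game, one column per variable
str(NittanyLions2017)
n_games   <- nrow(NittanyLions2017)
col_names <- names(NittanyLions2017)

# 6 scores but only 4 win indicators: data.frame() stops with an error,
# which try() catches here so that the script keeps running
bad <- try(data.frame(PSUScore, Win = c(1, 1, 1, 1)), silent = TRUE)
```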

3 View data

In addition to the View() function, there are many other useful functions for getting information on your data.

3.1 View some basic features

# list all the objects in your current working environment
ls()
# list the variable names
names(ANES)
# dimension of your data
dim(ANES)
# view the first 2 rows in the data
head(ANES,n=2)
# view the last 2 rows in the data
tail(ANES,n=2)
# class of an object
class(ANES)
class(ANES$NPR)
# view summary statistics of a variable
summary(ANES$NPR)

3.2 Subset data

# hand-pick rows and columns of a data frame
ANES[c(1,2),c(2,3)] # first two rows, second and third columns
# exclude certain rows and columns
ANES[-1, -c(1,2,3)] # exclude the first row and the first three columns 
# select more than one variable
ANES[c("NPR","rush","hannity")]
# select based on the values of observations
ANES[which(ANES$NPR==1),]
# select based on multiple criteria
ANES[which(ANES$Dem.Pres.cand.FT >=20 & ANES$GOP.Pres.cand.FT >= 20),] # select the respondents whose feeling thermometer scores towards Clinton AND Trump are BOTH greater than or equal to 20 
ANES[which(ANES$Dem.Pres.cand.FT >=20 | ANES$GOP.Pres.cand.FT >= 20),] # select the respondents whose feeling thermometer score towards Clinton OR Trump is greater than or equal to 20 
# subset using the subset function (variable names can be used directly inside subset)
subset(ANES, Dem.Pres.cand.FT >= 20 | GOP.Pres.cand.FT >= 20) 
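
The which() wrapper in the examples above matters when the variable has missing values. The sketch below uses a toy data frame with made-up values to compare the three ways of selecting rows; all object names are chosen for illustration.

```r
# a toy data frame with one missing score (made-up values)
toy <- data.frame(id = 1:4, score = c(10, 25, NA, 40))

# plain logical indexing keeps an extra row of NAs for the missing score
with_na_row <- toy[toy$score >= 20, ]

# which() returns only the positions that are TRUE, so the NA is dropped
with_which <- toy[which(toy$score >= 20), ]

# subset() also drops rows where the condition is NA
with_subset <- subset(toy, score >= 20)
```

Here with_na_row has three rows, one of them entirely NA, while with_which and with_subset contain only the two rows that meet the condition.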

3.3 Sort data

You can sort your data on one or more variables, and you can set the order to be ascending or descending.

# the order function returns a permutation which rearranges its first argument into ascending or descending order. The default is ascending. 
order(ANES$Dem.Pres.cand.FT)
# change into descending order using the decreasing argument inside order()
order(ANES$Dem.Pres.cand.FT, decreasing = TRUE)
# Sort the ANES data in descending order of Dem.Pres.cand.FT
ordered.ANES <- ANES[order(ANES$Dem.Pres.cand.FT,decreasing = TRUE),]
# Note that after ordering, the NAs are placed at the end by default, whether the order is ascending or descending
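
To see what order() returns and where the NAs end up, the sketch below sorts a toy data frame with made-up values. It also sorts on two variables at once, which the examples above do not show; all object names are chosen for illustration.

```r
# a toy data frame with one missing score (made-up values)
toy <- data.frame(name  = c("a", "b", "c", "d"),
                  score = c(30, NA, 10, 30),
                  age   = c(50, 20, 40, 25))

# positions that would sort score in ascending order; the NA comes last
asc_positions <- order(toy$score)

# descending order; the NA still comes last
desc_positions <- order(toy$score, decreasing = TRUE)

# sort on two variables: score first, then age to break ties
sorted_toy <- toy[order(toy$score, toy$age), ]
```

Note that order() returns row positions, not sorted values, which is why it is used inside the square brackets to rearrange the rows.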

4 Manipulate data

The data sets you have been using in class have already been cleaned. They are in the correct format and contain everything you need for the class analysis. However, when you use other people’s data or download data online, you might need to combine several variables to calculate the variable you are interested in, correct the class of a variable, or recode the values in the data.

4.1 Create new variables

Sometimes, you need to use multiple variables in a data set to create the variable you are interested in. Instead of calculating the value every time you need it, it is more convenient to create the variable once and add it to the data set.

# Add a new variable Diff.FT to the ANES data frame
ANES$Diff.FT <- ANES$GOP.Pres.cand.FT - ANES$Dem.Pres.cand.FT
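
The calculation is done row by row, and a missing value in either variable produces a missing result. A small sketch with a toy data frame standing in for ANES (the values and the shortened variable names are made up):

```r
# a toy data frame standing in for ANES (made-up values)
toy <- data.frame(GOP.FT = c(70, 15, NA),
                  Dem.FT = c(30, 85, 60))

# the subtraction is done row by row
toy$Diff.FT <- toy$GOP.FT - toy$Dem.FT

# a missing value in either variable gives a missing difference
n_missing <- sum(is.na(toy$Diff.FT))
```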

4.2 Recode values

4.2.1 Change the class

A data set generally contains variables of different classes, such as numeric, character, factor and logical. When analyzing the data, you sometimes need to convert a variable from one class to another.

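# check whether a variable is of a certain class (returns TRUE or FALSE)
is.numeric(ANES$Dem.Pres.cand.FT)
is.character(ANES$Dem.Pres.cand.FT)
is.factor(ANES$Dem.Pres.cand.FT)
# convert a variable to the class you need for the analysis
# (assign the result back to a variable to keep the conversion)
as.numeric(ANES$Dem.Pres.cand.FT)
as.character(ANES$Dem.Pres.cand.FT)
as.factor(ANES$Dem.Pres.cand.FT)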
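
One conversion deserves extra care: calling as.numeric() directly on a factor returns the internal level codes rather than the numbers shown in the labels. The sketch below uses made-up values to show the difference; the object names are chosen for illustration.

```r
# years stored as text and then turned into a factor (made-up values)
year_factor <- factor(c("2016", "2012", "2016"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(year_factor)

# converting to character first recovers the original numbers
years <- as.numeric(as.character(year_factor))

# check the class before and after converting
class_before <- class(year_factor)
class_after  <- class(years)
```

Printing codes and years side by side makes the difference easy to spot.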

4.2.2 Change the value

When doing analysis, we often need to recode some of the values. For example, in survey data, gender might be coded as “female” or “male” and we want to recode it as 0 and 1, or the other way around. We can use the recode function from the package car.

First, install and load the package car:

# Install the car package, you only need to install it once
install.packages("car")

# Load the car package
library(car)

Second, try the recode function!

  • Change all the 1s into “female” and all the 0s into “male” in the female variable in the ANES data
ANES$female <- recode(ANES$female, "1 = 'female'; 0 = 'male'")
  • Change a range of values into one value: recode feeling thermometer scores below 25 into 1 (meaning “very negative”), 25 to 49 into 2 (meaning “somewhat negative”), 50 to 74 into 3 (meaning “somewhat positive”), and 75 to 100 into 4 (meaning “very positive”)
# Storing the result in a new variable keeps the original scores; the name Dem.FT.cat is only an example
ANES$Dem.FT.cat <- recode(ANES$Dem.Pres.cand.FT, "0:24=1; 25:49=2; 50:74=3; 75:100=4; else=NA")
  • After recoding the values, always double check the recoded variable
table(ANES$female)
table(ANES$Dem.FT.cat)
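
As an alternative to car's recode(), the same four groups can be sketched in base R with cut(). This is a different technique from the one above, shown here on made-up scores; ft, ft_group and check are names chosen for illustration.

```r
# made-up thermometer scores, including a missing value
ft <- c(0, 24, 25, 49, 50, 74, 75, 100, NA)

# breaks at 0, 25, 50, 75 and 100; right = FALSE makes each interval
# include its lower bound, and include.lowest = TRUE keeps the score 100
ft_group <- cut(ft, breaks = c(0, 25, 50, 75, 100),
                right = FALSE, include.lowest = TRUE, labels = FALSE)

# cross-tabulate the original and recoded values to double check
check <- table(ft, ft_group, useNA = "ifany")
```

With labels = FALSE, cut() returns the group numbers 1 to 4, and missing scores stay missing.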

Sometimes, you might only want to change one specific cell. For example, you may find out the true value of a cell that is currently NA. You can recode that cell by locating its row.

# For example, I find that the first missing value in the feeling thermometer towards Clinton should be 39
ANES$Dem.Pres.cand.FT[which(is.na(ANES$Dem.Pres.cand.FT))[1]] <- 39
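
The indexing in the line above works in two steps, which the sketch below spells out on a short vector of made-up scores; score and na_positions are names chosen for illustration.

```r
# made-up scores with two missing values
score <- c(10, NA, 30, NA)

# which(is.na()) gives the positions of all the missing values
na_positions <- which(is.na(score))

# [1] picks the first of those positions, and only that cell is replaced
score[na_positions[1]] <- 39
```

The second missing value is left untouched, because only the first position was selected.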

5 Export Data

Note that the changes you make in R do not change the data files in your folder. Therefore, after recoding or adding variables, you will want to save the updated data to your computer so that you do not need to repeat the data manipulation. Using the same packages, you can export the data you created or worked on in R in various formats. However, the csv format is recommended.

5.1 Save data as csv file

# write the data to the current working directory
write.csv(ANES,"updatedANES.csv")
# write the data to other folders in your computer
write.csv(ANES,"path/updatedANES.csv")
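
By default, write.csv() also writes the row names as an extra first column, which shows up as a column called X when the file is read back in. The sketch below compares the default with row.names = FALSE, using a toy data frame and made-up file names.

```r
# a toy data frame (made-up values)
toy <- data.frame(team = c("Akron", "Iowa"), score = c(52, 21))

default_path     <- file.path(tempdir(), "with_rownames.csv")
no_rownames_path <- file.path(tempdir(), "without_rownames.csv")

# by default write.csv() also writes the row names as a first column
write.csv(toy, default_path)
# row.names = FALSE leaves them out
write.csv(toy, no_rownames_path, row.names = FALSE)

# read both files back and compare the column names
names_default <- names(read.csv(default_path))
names_clean   <- names(read.csv(no_rownames_path))
```

If you plan to read the file back into R, row.names = FALSE usually gives you the same columns you started with.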

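5.2 Save data in other formats

# Excel (write.xlsx comes from the openxlsx package)
library(openxlsx)
write.xlsx(ANES,"updatedANES.xlsx")
# Stata (write.dta comes from the foreign package)
library(foreign)
write.dta(ANES,"updatedANES.dta")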

Helpsheet made for PLSC 309 by Yaoyao Dai

