In our research, we often need information from multiple data sets, and we need to merge those data sets into one data frame for future analysis. For example, when we replicate one of Inglehart’s research questions on whether regime type influences the mass support for democracy, the information we interested in are in three different data sets: the world value survey (WVS), the European Value Study(EVS), and the Freedom House Score. The WVS and EVS data each contains public opinion data of a subset of countries in the world. Combining these two would give us a larger sample of countries. The Freedom House data contains the regime types.

To answer Inglehart’s research question, we want to have a merged data frame with country names, citizens’ opinion towards democracy in each country, and each country’s corresponding regime type. We will practice merging data vertically and horizontally in this example.

1 Load the data sets

library(openxlsx) 
WVS <- read.xlsx("WVS5.xlsx")
EVS <- read.xlsx("EVS2008.xlsx")
FH <- read.xlsx("FreedomHouse2017.xlsx")
# View and get familar with the date sets you will be using
dim(WVS)
dim(EVS)
dim(FH)
names(WVS)
names(EVS)
names(FH)
summary(WVS)
summary(EVS)
summary(FH)

2 Merge Vertically (add rows/observations)

Merging data vertically means that we add more observations to the current data. In other words, we stack two data sets. In our case, we want to merge the WVS data and EVS data vertically, because these two data sets contain the same variables but with different observations (countries). We can use the function rbind() to add the new data to the current turnout data (merge them vertically). You should be carefully about the variable names and order in the new data set. The names and order of the variable of the second data set should match exactly with the first one.

2.1 View the names of the data sets that you want to stack

names(WVS)
names(EVS)

We can see that EVS contains one extra variable: “Country_Abrv”. The order of the rest variables are the same but with slightly different names. We need to fix these two problems in the next step.

2.2 Fix any variables that do not match

#delete the first collum of EVS, the extra variable that we are not gonna use
EVS <- EVS[,-1]
#change variable names in WVS to match EVS
names(WVS)<-names(EVS)
#before you merge, examine the two datasets again to make sure that they have the same variables and variable names 
names(WVS)
names(EVS)
#you can also use set operation to check are there any variable names of WVS that is not in the names of EVS, vice versa:
setdiff(names(WVS),names(EVS))

2.3 Check whether there are overlapping observations

#Intersection returns the observations appeared in both variables
intersect(WVS$Country,EVS$Country)
#Alternatively you can also use is.element function, which is also a set operation. It returns logic values showing whether the observation in X variable also appears in the Y variable
is.element(WVS$Country,EVS$Country)

There are 20 observations appeared in both data set. We need to delete the extra entries (in one of the data set):

WVS.unique <- WVS[-which(is.element(WVS$Country,EVS$Country)),]
# Double check 
dim(WVS.unique) # We should have deleted 20 observations and kept all the variables. The correct observations should be 56-20 = 36.

2.4 Check whether the variables are on the same scale

In fact if you check closely in the beginning of this document when we load and summarize the data, you should already notice that the variables are not on the same scale: EVS is the percentage.

summary(WVS.unique)
summary(EVS)

2.5 Convert one of the data to be on the same scale

WVS.unique[,c(2:9)] <- WVS.unique[,c(2:9)]*100
#check again
summary(WVS.unique)

2.6 Stack WVS and EVS, create a new dataframe

PublicOpinion <- rbind(WVS.unique,EVS)
# Double check
dim(PublicOpinion) # 46 + 36
summary(PublicOpinion)

3 Merge horizontally (add columns/variables)

Now we want to merge the PublicOpion data we just created with the Freedom house score. You can consider the horizontal merging as adding variables to the data, or adding additional characteristics/information about each country in this case. To merge data this way, the new data you want to merge/add to the current data must share at least one variable with the current data for r to match observations. In our example, we want to match the two data sets by country name.

#check the variable name for country. In our example, the two data sets happen to have the same variable name of the country variable.
names(FH)
names(PublicOpinion)
#Check how many countries each data have
length(FH$Country)
length(PublicOpinion$Country)

3.1 Merge Data

#Merge by country name 
merged.all <- merge(PublicOpinion,FH,by="Country",
                all = TRUE) # Keep all the unique country names
merged.intersect <- merge(PublicOpinion,FH,by="Country",
                all = FALSE) # Only merge and return countries that appear in both data sets
merged.x <- merge(PublicOpinion,FH,by="Country",
                all.x = TRUE) # Keep all the countries in data X, the PublicOpinion data in the example
merged.y <- merge(PublicOpinion,FH,by="Country",
                all.y = TRUE) # Keep all the countries in data Y, the Freedom House score

#If the Index variable has different names in the two datasets, you can do
merged.a <- merge(PublicOpinion,FH,
                  by.x="Country",by.y="Country", 
                all = TRUE)

#Check the merged data
dim(merged.all)
dim(merged.both)
dim(merged.x) # should match the number of observation in PublicOpinion
dim(merged.y) # should match the number of observation in FH

3.2 Diagnostic

Each of the two data sets contains country names that are not in the other data set. There are two possible reasons: 1) some countries are missing in either one of them; 2) some countries are spelled differently

setdiff(PublicOpinion$Country,FH$Country)

# Change those country names that are spelled differently 
library(car)
PublicOpinion$Country<-recode(PublicOpinion$Country,"'Argenti'='Argentina';'BurkiFaso'='Burkina Faso';'Chi'='China';'GreatBritain'='United Kingdom';'NewZealand'='New Zealand';'SouthAfrica'='South Africa';'SouthKorea'='South Korea';'TrinidadandTobago'='Trinidad and Tobago';'UnitedStates'='United States of America';'Vietm'='Vietnam'")

#Check whether there are countries appeared more than once
length(unique(PublicOpinion$Country))
table(PublicOpinion$Country)[which(table(PublicOpinion$Country)>1)] 
#Delete the extra entry

PublicOpinion<-PublicOpinion[-which(PublicOpinion$Country=="United Kingdom")[1],]

3.3 Merge Data

merged <- merge(PublicOpinion,FH,by="Country",
                all.x = TRUE)
#Check whether you have succeed
dim(PublicOpinion)
dim(merged)  # The merged data should have the same number of observations as in PublicOpinion, because we set all.x = TRUE. We set the merge function to include all the variable in X dataframe. 

Helpsheet made for PLSC 309 by Yaoyao Dai

---
title: "Merge Data in R"
output:
  html_notebook: 
    theme: cerulean
    number_sections: TRUE
    toc: TRUE
    toc_depth: 2
    toc_float: TRUE
---

***
In our research, we often need information from multiple data sets, and we need to merge those data sets into one data frame for future analysis. For example, when we replicate one of Inglehart's research questions on whether regime type influences the mass support for democracy, the information we interested in are in three different data sets: the world value survey (WVS), the European Value Study(EVS), and the Freedom House Score. The WVS and EVS data each contains public opinion data of a subset of countries in the world. Combining these two would give us a larger sample of countries. The Freedom House data contains the regime types. 

To answer Inglehart's research question, we want to have a merged data frame with country names, citizens' opinion towards democracy in each country, and each country's corresponding regime type. We will practice merging data vertically and horizontally in this example.

# Load the data sets

```{r}
library(openxlsx) 
WVS <- read.xlsx("WVS5.xlsx")
EVS <- read.xlsx("EVS2008.xlsx")
FH <- read.xlsx("FreedomHouse2017.xlsx")
```

```{r}
# View and get familar with the date sets you will be using
dim(WVS)
dim(EVS)
dim(FH)
names(WVS)
names(EVS)
names(FH)
summary(WVS)
summary(EVS)
summary(FH)
```

# Merge Vertically (add rows/observations)

Merging data vertically means that we add more observations to the current data. In other words, we stack two data sets. In our case, we want to merge the WVS data and EVS data vertically, because these two data sets contain the same variables but with different observations (countries). We can use the function rbind() to add the new data to the current turnout data (merge them vertically). You should be carefully about the variable names and order in the new data set. The names and order of the variable of the second data set should match exactly with the first one.

##View the names of the data sets that you want to stack
```{r}
names(WVS)
names(EVS)
```
We can see that EVS contains one extra variable: "Country_Abrv". The order of the rest variables are the same but with slightly different names. We need to fix these two problems in the next step.

##Fix any variables that do not match 
```{r}
#delete the first collum of EVS, the extra variable that we are not gonna use
EVS <- EVS[,-1]
#change variable names in WVS to match EVS
names(WVS)<-names(EVS)
#before you merge, examine the two datasets again to make sure that they have the same variables and variable names 
names(WVS)
names(EVS)
#you can also use set operation to check are there any variable names of WVS that is not in the names of EVS, vice versa:
setdiff(names(WVS),names(EVS))
```

##Check whether there are overlapping observations
```{r}
#Intersection returns the observations appeared in both variables
intersect(WVS$Country,EVS$Country)
#Alternatively you can also use is.element function, which is also a set operation. It returns logic values showing whether the observation in X variable also appears in the Y variable
is.element(WVS$Country,EVS$Country)
```
There are 20 observations appeared in both data set. We need to delete the extra entries (in one of the data set):
```{r}
WVS.unique <- WVS[-which(is.element(WVS$Country,EVS$Country)),]
# Double check 
dim(WVS.unique) # We should have deleted 20 observations and kept all the variables. The correct observations should be 56-20 = 36.
```

##Check whether the variables are on the same scale 

In fact if you check closely in the beginning of this document when we load and summarize the data, you should already notice that the variables are not on the same scale: EVS is the percentage. 

```{r}
summary(WVS.unique)
summary(EVS)
```

##Convert one of the data to be on the same scale
```{r}
WVS.unique[,c(2:9)] <- WVS.unique[,c(2:9)]*100
#check again
summary(WVS.unique)
```


##Stack WVS and EVS, create a new dataframe
```{r}
PublicOpinion <- rbind(WVS.unique,EVS)
# Double check
dim(PublicOpinion) # 46 + 36
summary(PublicOpinion)
```

# Merge horizontally (add columns/variables)

Now we want to merge the PublicOpion data we just created with the Freedom house score. You can consider the horizontal merging as adding variables to the data, or adding additional characteristics/information about each country in this case. To merge data this way, the new data you want to merge/add to the current data must share at least one variable with the current data for r to match observations. In our example, we want to match the two data sets by country name. 

```{r}
#check the variable name for country. In our example, the two data sets happen to have the same variable name of the country variable.
names(FH)
names(PublicOpinion)
#Check how many countries each data have
length(FH$Country)
length(PublicOpinion$Country)
```

##Merge Data
```{r}
#Merge by country name 
merged.all <- merge(PublicOpinion,FH,by="Country",
                all = TRUE) # Keep all the unique country names
merged.intersect <- merge(PublicOpinion,FH,by="Country",
                all = FALSE) # Only merge and return countries that appear in both data sets
merged.x <- merge(PublicOpinion,FH,by="Country",
                all.x = TRUE) # Keep all the countries in data X, the PublicOpinion data in the example
merged.y <- merge(PublicOpinion,FH,by="Country",
                all.y = TRUE) # Keep all the countries in data Y, the Freedom House score

#If the Index variable has different names in the two datasets, you can do
merged.a <- merge(PublicOpinion,FH,
                  by.x="Country",by.y="Country", 
                all = TRUE)

#Check the merged data
dim(merged.all)
dim(merged.both)
dim(merged.x) # should match the number of observation in PublicOpinion
dim(merged.y) # should match the number of observation in FH
```

##Diagnostic
Each of the two data sets contains country names that are not in the other data set. There are two possible reasons: 1) some countries are missing in either one of them; 2) some countries are spelled differently 

```{r}
setdiff(PublicOpinion$Country,FH$Country)

# Change those country names that are spelled differently 
library(car)
PublicOpinion$Country<-recode(PublicOpinion$Country,"'Argenti'='Argentina';'BurkiFaso'='Burkina Faso';'Chi'='China';'GreatBritain'='United Kingdom';'NewZealand'='New Zealand';'SouthAfrica'='South Africa';'SouthKorea'='South Korea';'TrinidadandTobago'='Trinidad and Tobago';'UnitedStates'='United States of America';'Vietm'='Vietnam'")

#Check whether there are countries appeared more than once
length(unique(PublicOpinion$Country))
table(PublicOpinion$Country)[which(table(PublicOpinion$Country)>1)] 
#Delete the extra entry

PublicOpinion<-PublicOpinion[-which(PublicOpinion$Country=="United Kingdom")[1],]
```

##Merge Data 
```{r}
merged <- merge(PublicOpinion,FH,by="Country",
                all.x = TRUE)
#Check whether you have succeed
dim(PublicOpinion)
dim(merged)  # The merged data should have the same number of observations as in PublicOpinion, because we set all.x = TRUE. We set the merge function to include all the variable in X dataframe. 
```

***
Helpsheet made for PLSC 309 by Yaoyao Dai
