You are currently viewing R – Join Two or Multiple DataFrames

We will see how to perform the join operation on two or multiple DataFrames in R using merge() function. There are different join types including inner join, left join, right join, and outer join. If there are more than two dataframes to be joined, then you can use reduce() method available in tidyverse library. It will also support the above four types of joins. In this article, we will see how to join two or multiple dataframes in R with examples.

Advertisements

1. Quick Examples

If you are in a hurry, below are quick examples of how to join two or multiple R dataframes.


#Perform inner join on my_dataframe1 and my_dataframe2 based on id column
print(merge(x= my_dataframe1,y= my_dataframe2, by = "id"))

#Perform left join on my_dataframe1 and my_dataframe2 based on id column
print(merge( x=my_dataframe1, y=my_dataframe2, by = "id",all.x = TRUE))

#Perform right join on my_dataframe1 and my_dataframe2 based on id column
print(merge( x=my_dataframe1, y=my_dataframe2, by = "id",all.y = TRUE))

#Perform outer join on my_dataframe1 and my_dataframe2 based on id column
print(merge( x=my_dataframe1, y=my_dataframe2, by = "id",all = TRUE))

#Perform cross join on my_dataframe1 and my_dataframe2 based on id column
print(merge( x=my_dataframe1, y=my_dataframe2, by = NULL))

#Load the tidyverse - package
library(tidyverse)
  
  #Join three dataframes based on id column by performing inner join
print(list(my_dataframe1,my_dataframe2,my_dataframe3) %>% reduce(inner_join, by='id'))

  #Join three dataframes based on id column by performing left join
print(list(my_dataframe1,my_dataframe2,my_dataframe3) %>% reduce(left_join, by='id'))

#Join three dataframes based on id column by performing right join
print(list(my_dataframe1,my_dataframe2,my_dataframe3) %>% reduce(right_join, by='id'))

#Join three dataframes based on id column by performing outer join
print(list(my_dataframe1,my_dataframe2,my_dataframe3) %>% reduce(full_join, by='id'))

Let’s create three dataframes that are required to explain the different join types.


#create dataframe with 3 rows and 2 columns
my_dataframe1=data.frame(id=c(2,1,3),gender=c('m','m','f'))

#display dataframe
print(my_dataframe1)

#Create dataframe with 3 rows and 2 columns
my_dataframe2=data.frame(id=c(4,3,1),name=c('shyam','sravani','vamsi'))

#Display dataframe
print(my_dataframe2)

#Create dataframe with 3 rows and 2 columns
my_dataframe3=data.frame(id=c(4,3,2),marks=c(79,80,97))

#Display dataframe
print(my_dataframe3)

Data in three dataframes:


# Output

# Dataframe 1
  id gender
1  2      m
2  1      m
3  3      f

# Dataframe 2
  id    name
1  4   shyam
2  3 sravani
3  1   vamsi

# Dataframe 3
  id marks
1  4    79
2  3    80
3  2    97

So, there are 2 columns in each dataframe, and the id column is common in all three dataframes. Now, we will see different types of joins that are performed on the R dataframes based on the id column.

2. Join Two R DataFrames

merge() in R is used to Join two dataframes and perform different kinds of joins. Let’s see them one by one.

2.1. Inner Join

Inner Join is also known as Natural Join used to join two dataframes. It will join only the matched rows from both the dataframes based on the column specified.

Syntax:


#Syntax for merge()
merge(x= my_dataframe1,y= my_dataframe2, by = "column")

where,

  1. my_dataframe1 is the first dataframe specified by x.
  2. my_dataframe2 is the second dataframe specified by y.
  3. the column is the name of the column such that both the dataframes are joined based on this column specified using by.

Example:

Perform Inner/Natural Join on the first two dataframes.


#Perform inner join on my_dataframe1 and my_dataframe2 based on id column
print(merge(x= my_dataframe1,y= my_dataframe2, by = "id"))

Output:


# Output
  id gender    name
1  1      m   vamsi
2  3      f sravani 

After performing Join, we can see that there are only two ids 1 and 3 that are common in the first two dataframes. So Inner join returned the rows related to id’s 1 and 3.

2.2. Left Join

Left Join is also known as Left Outer Join used to join two R dataframes. It will return all rows from the first dataframe and only matched rows from the second dataframe with respect to the first dataframe. It will return <NA> for unmatched values in the second dataframe.

Syntax:


# Syntax
merge(x= my_dataframe1,y= my_dataframe2, by = "column",all.x = TRUE)

where,

  1. my_dataframe1 is the first dataframe specified by x.
  2. my_dataframe2 is the second dataframe specified by y.
  3. the column is the name of the column such that both the dataframes are joined based on this column specified using by.
  4. all.x specifies the first dataframe – my_dataframe1 which will take boolean value TRUE to return all rows from my_dataframe1.

Example:

Perform Left Join on the first two dataframes.


#perform left join on my_dataframe1 and my_dataframe2 based on id column
print(merge( x=my_dataframe1, y=my_dataframe2, by = "id",all.x = TRUE))

Output:


# Output
  id gender    name
1  1      m   vamsi
2  2      m    <NA>
3  3      f sravani

After performing Left Join, we can see that id-2 is not there in the second dataframe. hence <NA> is replaced under the second dataframe name column.

2.3. Right Join

Right Join is also known as Right Outer Join used to join two dataframes. It will return all rows from the second dataframe and only matched rows from the first dataframe with respect to the second dataframe. It will return <NA> for unmatched values in the first dataframe.

Syntax:


# Syntax
merge(x= my_dataframe1,y= my_dataframe2, by = "column",all.y = TRUE)

where,

  1. my_dataframe1 is the first dataframe specified by x.
  2. my_dataframe2 is the second dataframe specified by y.
  3. the column is the name of the column such that both the dataframes are joined based on this column specified using by.
  4. all.y specifies the second dataframe – my_dataframe2 which will take boolean value TRUE to return all rows from my_dataframe2.

Example:

Perform Right Join on the first two dataframes.


#perform right join on my_dataframe1 and my_dataframe2 based on id column
print(merge( x=my_dataframe1, y=my_dataframe2, by = "id",all.y = TRUE))

Output:


# Output
  id gender    name
1  1      m   vamsi
2  3      f sravani
3  4   <NA>   shyam

After performing Right Join, we can see that id-4 is not there in the first dataframe. Hence <NA> is replaced under the first dataframe gender column.

2.4. Outer Join

Outer Join is also known as Full Join used to join two dataframes. It will return all matched and unmatched rows from both the dataframes. It will return <NA> for unmatched values in both the dataframes.

Syntax:


# Syntax
merge(x= my_dataframe1,y= my_dataframe2, by = "column",all = TRUE)

where,

  1. my_dataframe1 is the first dataframe specified by x.
  2. my_dataframe2 is the second dataframe specified by y.
  3. the column is the name of the column such that both the dataframes are joined based on this column specified using by.
  4. all specifies two dataframes that will take boolean value TRUE to return all rows from my_dataframe1 and my_dataframe2.

Example:

Perform Outer Join on the first two dataframes.


#perform outer join on my_dataframe1 and my_dataframe2 based on id column
print(merge( x=my_dataframe1, y=my_dataframe2, by = "id",all = TRUE))

Output:


# Output
  id gender   name
1  1      m   vamsi
2  2      m    <NA>
3  3      f sravani
4  4   <NA>   shyam

After performing Outer Join, we can see that all rows from both the dataframes were returned.

2.5. Cross Join

Cross Join is used to join two R dataframes. It will perform join in each row of the first dataframe with all rows in the second dataframe.

Syntax:


# Syntax
merge(x= my_dataframe1,y= my_dataframe2, by = NULL)

where,

  1. my_dataframe1 is the first dataframe specified by x.
  2. my_dataframe2 is the second dataframe specified by y.
  3. by parameter takes NULL

Example:

Perform Cross Join on the first two dataframes.


#perform cross join on my_dataframe1 and my_dataframe2 based on id column
print(merge( x=my_dataframe1, y=my_dataframe2, by = NULL))

Output:


# Output
  id.x gender id.y    name
1    2      m    4   shyam
2    1      m    4   shyam
3    3      f    4   shyam
4    2      m    3 sravani
5    1      m    3 sravani
6    3      f    3 sravani
7    2      m    1   vamsi
8    1      m    1   vamsi
9    3      f    1   vamsi

After performing Cross Join, we can see that each row in the first dataframe is joined with all rows in the second dataframe.

3. Join Multiple R DataFrames

To join more than two (multiple) R dataframes, then reduce() is used. It is available in the tidyverse package which will convert all the dataframes to a list and join the dataframes based on the column. It is similar to the above joins.

tidyverse is a third-party library hence, in order to use tidyverse library, you need to first install it by using install.packages('tidyverse'). Once installation completes, load the tidyverse library in order to use this reduce() method. To load a library in R use library("tidyverse").

Syntax:


# Syntax
list(my_dataframe1,my_dataframe2,............) %>% reduce(keyword, by='column')

reduce() takes two parameters.

  1. keyword – It is the join type keyword that specifies the type of join
  2. column – Column name in which dataframes are joined based on this column.

Use below values for keyword (join types).

  • To perform an inner join – the inner_join keyword is used.
  • To perform left join – the left_join keyword is used.
  • To perform right join – the right_join keyword is used.
  • To perform an outer join – the full_join keyword is used.

Example:

In this example, we will perform all types of joins on three dataframes such that dataframes are joined based on the id column.


#load the tidyverse - package
library(tidyverse)
  
  #Join three dataframes based on id column by performing inner join
print(list(my_dataframe1,my_dataframe2,my_dataframe3) %>% reduce(inner_join, by='id'))

  #Join three dataframes based on id column by performing left join
print(list(my_dataframe1,my_dataframe2,my_dataframe3) %>% reduce(left_join, by='id'))

#Join three dataframes based on id column by performing right join
print(list(my_dataframe1,my_dataframe2,my_dataframe3) %>% reduce(right_join, by='id'))

#Join three dataframes based on id column by performing outer join
print(list(my_dataframe1,my_dataframe2,my_dataframe3) %%>% reduce(full_join, by='id'))

Output:


# Output
# Inner Join
  id gender    name marks
1  3      f sravani    80

# Left Join
  id gender    name marks
1  2      m    <NA>    97
2  1      m   vamsi    NA
3  3      f sravani    80

# Right join
  id gender    name marks
1  3      f sravani    80
2  4   <NA>   shyam    79
3  2   <NA>    <NA>    97

# Full Outer Join
  id gender    name marks
1  2      m    <NA>    97
2  1      m   vamsi    NA
3  3      f sravani    80
4  4   <NA>   shyam    79

4. Conclusion

In this article, you have learned how to perform different types of join on two or multiple R dataframes. If there are two dataframes, then merge() is used, otherwise, you can go with reduce() method.

Reference

  1. merge() in R
  2. reduce()