R Join (Merge) on Multiple Columns

Hot to join or merge on multiple columns in R? To join data frames on multiple columns in R use either base merge() function or use dplyr functions. Using the dplyr functions is the best approach as it runs faster than the R base approach. dplyr package provides several functions to join R data frames and all these supports joining on multiple columns.

1. Quick Examples

Following are quick examples of joining data frames on multiple columns.


# Quick Examples

# Using dplyr
library(dplyr)
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))

# Using dplyr when columns are same
df2 <- emp_df %>% inner_join( dept_df, 
                              by=c('dept_id','dept_branch_id'))

# Using merge
df2 <- merge(x=emp_df,y=dept_df, by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))

# Using merge when columns are same
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))

Let’s create two Data Frames with multiple column names same on both. In the below example dept_id and dept_branch_id are same on both emp_df and dept_df data frames.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r join multiple columns

2. Using dplyr to Join Multiple Columns in R

Using join functions from dplyr package is the best approach to join data frames on multiple columns in R, all dplyr join functions inner_join(), left_join(), right_join(), full_join(), anti_join(), semi_join() support joining on multiple columns. In the below example I will cover using the inner_join().

2.1 Syntax

Following is the syntax of inner_join() and a similar syntax is used for other joins in the dplyr package.


# Syntax
inner_join(df1, df2, by=c('x1'='y1', 'x2'='y2'))

Here,

  • The value in the x1 column of df1 matches the value in the y1 column of df2.
  • The value in the x2 column of df1 matches the value in the y2 column of df2.

2.2 Join Multiple Columns Example

In order to use dplyr, you have to install it first using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in dplyr package take data.frame as a first argument. When we use dplyr package, we mostly use the infix operator %>% from magrittr, it passes the left-hand side of the operator to the first argument of the right-hand side of the operator. For example, x %>% f(y) converted into f(x, y) so the result from the left-hand side is then “piped” into the right-hand side. 


# Load dplyr package
library(dplyr)

# Join on multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))
df2

Yields below output.

joining multiple columns

Since we have two data.frames shares the same joining column names, you can simply write the above statement as.


# Load dplyr package
library(dplyr)

# join on multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id','dept_branch_id'))
df2

3. Using base merge() to Join Multiple Columns

Using merge() function from the R base can also be used to perform joining on multiple columns of data frame. To do so you need to create a vector for by.x with the columns you wanted to join on and create a similar vector for by.y.

3.1 Syntax


# Syntax of 
merge(x=df1,y=df2, by.x=c("x_col1","x_col2"), by.y=c("y_col1","y_col2"))

Here,

The value in the x_col1 column of df1 matches the value in the y_col1 column of df2.
The value in the x_col2 column of df1 matches the value in the y_col2 column of df2.

3.2 Merge Multiple Columns Example

In this merge example, emp_df is considered a left table, and dept_df is considered a right table and this performs the inner join on these data frame tables. In case you wanted to use other joins with merge() refer to R join data frames.


# R join multiple columns
df2 <- merge(x=emp_df,y=dept_df, by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))
df2

Yields the same output as above.

Similarly, when you are joining on same column names on both data frames use.


# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))
             
df2

4. Conclusion

In this article, you have learned how to join or merge data frames on multiple columns using R base merge() function and join functions from dplyr package. Using dplyr approach is the best to use when you are joining on larger datasets as it performs efficiently over the R base.

Related Articles

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing R Join (Merge) on Multiple Columns