You are currently viewing R Join (Merge) on Multiple Columns

How to join or merge multiple columns of the data frames in R? To join data frames on multiple columns in R use either the base merge() function or the dplyr functions. Using the dplyr functions is the best approach as it runs faster than the R-based approach. dplyr package provides several functions to join R data frames, which supports joining on multiple columns.

Advertisements

1. Quick Examples

Following are quick examples of joining data frames on multiple columns.


# Quick Examples of R joining on multiple columns 

# Example 1: Using dplyr
library(dplyr)
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))

# Example 2: Using dplyr when columns are same
df2 <- emp_df %>% inner_join( dept_df, 
                              by=c('dept_id','dept_branch_id'))

# Example 3: Using merge
df2 <- merge(x=emp_df,y=dept_df, by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))

# Example 4R Split Column into Multiple Columns of DataFrame: Using merge when columns are same
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))

Let’s create two Data Frames with multiple column names same on both. In the below example dept_id and dept_branch_id are the same on both emp_df and dept_df data frames.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r join multiple columns

2. Using dplyr to Join Multiple Columns in R

Using join functions from dplyr package is the best approach to join data frames on multiple columns in R, all dplyr join functions inner_join(), left_join(), right_join(), full_join(), anti_join(), semi_join() support joining on multiple columns. In the below example I will cover using the inner_join().

2.1 Syntax

Following is the syntax of inner_join() and a similar syntax is used for other joins in the dplyr package.


# Syntax
inner_join(df1, df2, by=c('x1'='y1', 'x2'='y2'))

Here,

  • The value in the x1 column of df1 matches the value in the y1 column of df2.
  • The value in the x2 column of df1 matches the value in the y2 column of df2.

2.2 Join Multiple Columns Example

To use dplyr, you have to install it first using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in dplyr package take data.frame as a first argument. When we use dplyr package, we mostly use the infix operator %>% from magrittr, it passes the left-hand side of the operator to the first argument of the right-hand side of the operator. For example, x %>% f(y) converted into f(x, y) so the result from the left-hand side is then “piped” into the right-hand side. 


# Load dplyr package
library(dplyr)

# Join on multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))
df2

Yields below output.

joining multiple columns

Since we have two data.frames share the same joining column names, you can simply write the above statement as.


# Load dplyr package
library(dplyr)

# join on multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id','dept_branch_id'))
df2

3. Using base merge() to Join Multiple Columns

Using merge() function from the R base can also be used to perform joining on multiple columns of the data frames. To do so you need to create a vector for by.x with the columns you want to join and create a similar vector for by.y.

3.1 Syntax


# Syntax of 
merge(x=df1,y=df2, by.x=c("x_col1","x_col2"), by.y=c("y_col1","y_col2"))

Here,

The value in the x_col1 column of df1 matches the value in the y_col1 column of df2.
The value in the x_col2 column of df1 matches the value in the y_col2 column of df2.

3.2 Merge Multiple Columns Example

In this merge example, emp_df is considered a left table, and dept_df is considered a right table and this performs the inner join on these data frame tables. In case you want to use other joins with merge() refer to R join data frames.


# R join multiple columns
df2 <- merge(x=emp_df,y=dept_df, by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))
df2

Yields the same output as above.

Similarly, when you are joining on same column names on both data frames use.


# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))
             
df2

4. Conclusion

In this article, you have learned how to join or merge data frames on multiple columns using the R base merge() function and join functions from the dplyr package. Using the dplyr approach is the best to use when you are joining on larger datasets as it performs efficiently over the R base.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium