You are currently viewing R Join (Merge) on Multiple Columns

How to join or merge multiple columns of the data frames in R? You can use the R base merge() function or the dplyr functions for joining data frames on multiple columns. Using the dplyr functions is the best approach as it runs faster than the R-based approach. dplyr package provides several functions to join R data frames, which supports joining on multiple columns.

Advertisements

1. Quick Examples

Following are quick examples of joining data frames on multiple columns.


# Quick Examples of R joining on multiple columns 

# Example 1: Using dplyr
library(dplyr)
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))

# Example 2: Using dplyr when columns are same
df2 <- emp_df %>% inner_join( dept_df, 
                              by=c('dept_id','dept_branch_id'))

# Example 3: Using merge
df2 <- merge(x=emp_df,y=dept_df, by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))

# Example 4R Split Column into Multiple Columns of DataFrame: Using merge when columns are same
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))

Let’s create two Data Frames with multiple column names same on both.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r join multiple columns

2. Using dplyr to Join Multiple Columns in R

Using join functions from the dplyr package is the best approach to join data frames on multiple columns in R, all dplyr join functions inner_join(), left_join(), right_join(), full_join(), anti_join(), semi_join() support joining on multiple columns. In the below example I will cover using the inner_join().

2.1 Syntax

Following is the syntax of inner_join() and a similar syntax is used for other joins in the dplyr package.


# Syntax
inner_join(df1, df2, by=c('x1'='y1', 'x2'='y2'))

Here,

  • The value in the x1 column of df1 matches the value in the y1 column of df2.
  • The value in the x2 column of df1 matches the value in the y2 column of df2.

2.2 Join Multiple Columns Example

Before using the dplyr functions we need to install it first using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in the dplyr package accept a data.frame as their first argument. When using the dplyr package, we often employ the infix operator %>% from the magrittr package. This operator takes the value on the left and uses it as the first argument for the function on the right. For instance, x %>% f(y) is transformed into f(x, y), indicating that the output from the left side is “piped” into the function on the right side.


# Load dplyr package
library(dplyr)

# Join on multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))
df2

Yields below output.

joining multiple columns

Since we have two data.frames share the same joining column names, you can simply write the above statement as.


# Load dplyr package
library(dplyr)

# join on multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id','dept_branch_id'))
df2

3. Using base merge() to Join Multiple Columns

The merge() function from the R base can also be used to join multiple data frame columns. To do so you need to create a vector for by.x with the columns you want to join and create a similar vector for by.y.

3.1 Syntax


# Syntax of 
merge(x=df1,y=df2, by.x=c("x_col1","x_col2"), by.y=c("y_col1","y_col2"))

Here,

The value in the x_col1 column of df1 matches the value in the y_col1 column of df2.
The value in the x_col2 column of df1 matches the value in the y_col2 column of df2.

3.2 Merge Multiple Columns Example

In this merge example, emp_df is considered a left table, and dept_df is considered a right table and this performs the inner join on these data frame tables. In case you want to use other joins with merge() refer to R join data frames.


# R join multiple columns
df2 <- merge(x=emp_df,y=dept_df, by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))
df2

Yields the same output as above.

Similarly, when you are joining on same column names on both data frames use.


# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))
             
df2

4. Conclusion

In this article, you have learned how to join or merge data frames on multiple columns using both the base R merge() function and the join functions from the dplyr package. The dplyr approach is recommended for larger datasets due to its superior performance compared to base R functions.