You are currently viewing R Join on Different Column Names

How to perform join/merge on different column names in R? To join data frames on the different columns in R use either base merge() function or use dplyr functions. Using the dplyr functions is the best approach as it runs faster than the R base approach. dplyr package provides several functions to join R data frames and all these supports merge on the different column names.

1. Quick Examples of Joining on Different Columns

Following are quick examples of joining/merging data frames on different column names.


# Quick Examples

# Using dplyr
library(dplyr)
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('emp_dept_id'='dept_id', 
             'emp_dept_branch_id'='dept_branch_id'))
df2

# Using merge
df2 <- merge(x=emp_df,y=dept_df, 
             by.x=c("emp_dept_id","emp_dept_branch_id"), 
             by.y=c("dept_id","dept_branch_id"))
df2

Let’s create two Data Frames with column names different on both and will use these to perform the merge operation. In the below example, I will be joining emp_dept_id from emp_df with dept_id form dept_df and emp_dept_branch_id with dept_branch_id.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  emp_dept_id=c(10,20,10,10,40,50),
  emp_dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r join different column names

2. Using dplyr to Join Different Column Names in R

Using join functions from dplyr package is the best approach to joining data frames on different column names in R, all dplyr functions like inner_join(), left_join(), right_join(), full_join(), anti_join(), semi_join() support joining on different columns. In the below example I will cover using the inner_join().

2.1 Syntax

Following is the syntax of inner_join() and a similar syntax is used for other joins in the dplyr package.


# Syntax
inner_join(df1, df2, by=c('x1'='y1', 'x2'='y2'))

Here,

  • The value in the x1 column of df1 matches the value in the y1 column of df2.
  • The value in the x2 column of df1 matches the value in the y2 column of df2.

2.2 Join Different Column Names Example

In order to use dplyr, you have to install it first using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in dplyr package take data.frame as a first argument. When we use dplyr package, we mostly use the infix operator %>% from magrittr, it passes the left-hand side of the operator to the first argument of the right-hand side of the operator. For example, x %>% f(y) converted into f(x, y) so the result from the left-hand side is then “piped” into the right-hand side. 


# Load dplyr package
library(dplyr)

# Join on different column names
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('emp_dept_id'='dept_id', 
             'emp_dept_branch_id'='dept_branch_id'))
df2

Yields below output.

join dataframe different column names

3. Using merge() to Join Different Column Names

Using merge() function from the R base can also be used to perform joining on different column names. To do so you need to create a vector for by.x with the columns you wanted to join on and create a similar vector for by.y.

3.1 Syntax


# Syntax of 
merge(x=df1,y=df2, by.x=c("x_col1","x_col2"), by.y=c("y_col1","y_col2"))

Here,

The value in the x_col1 column of df1 matches the value in the y_col1 column of df2.
The value in the x_col2 column of df1 matches the value in the y_col2 column of df2.

3.2 Merge Different Column Names Example

In this example, emp_df is considered a left table, and dept_df is considered a right table and this performs the inner join on these tables, in case you wanted to use other joins with merge() refer to R join data frames.


# Using merge on different column names
df2 <- merge(x=emp_df,y=dept_df, 
             by.x=c("emp_dept_id","emp_dept_branch_id"), 
             by.y=c("dept_id","dept_branch_id"))
df2

Yields the same output as above.

4. Conclusion

In this article, you have learned how to join/merge data frames on different column names using R base merge() function and join functions from dplyr package. Using dplyr approach is the best to use when you are joining on larger datasets as it performs efficiently over the R base.

Related Articles

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium