How to do Inner Join in R?

How can you perform an inner join on data frames in R? You can use the merge() function, the inner_join() function from dplyr, or the reduce() function from tidyverse. The inner_join() function in dplyr is typically the most efficient option because it generally runs faster than the base R methods. This package provides a variety of functions for joining data frames in R.

In R, Inner join or natural join is the default join and it’s mostly used in joining data frames, it is used to join data.frames on a specified single or multiple columns, and where column values don’t match the rows get dropped from both data.frames (emp & dept).

Quick Examples of Inner Join

Here are some quick examples of how to perform inner joins on data frames.


# Quick Examples

# Inner join
df2 <- merge(x=emp_df,y=dept_df, 
             by="dept_id")

# Inner join on multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))

# Inner join on different columns
df2 <- merge(x=emp_df,y=dept_df, 
      by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))

# Load dplyr package
library(dplyr)

# Using dplyr - inner join multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - inner join on different columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))

# Load tidyverse package
library(tidyverse)

# Inner Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(inner_join, by='dept_id')
df2

Let’s build two data frames. In this case, both emp_df and dept_df have the dept_id and dept_branch_id columns.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

Perform Inner Join by R Base

The R base package includes a merge() function which performs an inner join on two or more data frames. With this function, you pass the left data frame as x, the right one as y, and then set the by parameter to specify which column(s) to join on.

Here’s an example where we perform an inner join using the dept_id column from the emp_df and dept_df data frames. Here by default, it uses all=FALSE. This join is similar to a set intersection operation.


# inner join
df2 <- merge(x=emp_df,y=dept_df, 
                by="dept_id")
df2

The output shown below demonstrates that if the data frames you are merging have the same column names not involved in the join condition, the merge() function adds suffixes x and y to these columns in the resulting data frame. In the example below, note how the dept_branch_id column is displayed.

Perform Inner Join on Multiple Columns

To execute an inner join on multiple columns with identical names in both data frames, pass all those column names into the by parameter. Moreover, I have written a comprehensive article that explains various methods for joining multiple columns.


# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"))
             
df2

Yields below output.

Perform Inner Join in Another Way

Alternatively, you can use the R base merge() function to perform an inner join operation on different columns of two data frames. To do that, you can use the by.x and by.y parameters to specify which columns to match from each data frame.


# R join multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
      by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"))
df2

Because given data frames have identical column names, we get the same result as before. I have prepared another article where I describe how to join data frames with different column names.

Using inner_join() function

The most efficient way to perform an inner join on two data frames is by using the inner_join() function from the dplyr package. To utilize dplyr, ensure it’s installed by running install.packages(‘dplyr’), and then load it with library(dplyr).

In the dplyr package, functions generally require a data frame as a starting parameter. If you want to work with dplyr, the %>% operator from the magrittr package is commonly used. This operator “pipes” the output from the left part into the starting parameter of the function on the right part. For example, x %>% f(y) translates to f(x, y), so the output from the left side is automatically passed to the right side.

Perform Inner Join on Multiple Columns

You can use the inner_join() function to perform an inner join on two data frames based on two common columns producing a new data frame that contains only the rows where both data frames have matching values for these columns.


# Load dplyr package
library(dplyr)

# Using dplyr - inner join multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
            by=c('dept_id','dept_branch_id'))
df2

Yields below output.

Perform Inner Join on Different Column Names

Alternatively, You can use the inner_join() function to perform an inner join on two data frames based on two different columns, returning a new data frame with rows where the specified keys match in both original data frames.


# Load dplyr package
library(dplyr)

# join on multiple columns
df2 <- emp_df %>% inner_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))df2

Yields the same output as above.

Perform Inner Join using reduce() from tidyverse

Finally, you can use the reduce() function is a part of the tidyverse package to join multiple data frames. To perform an inner join operation using the reduce() function, you need to use the inner_join keyword.


# Load tidyverse package
library(tidyverse)

# Inner Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(inner_join, by='dept_id')
df2

Conclusion

This article has shown how to conduct an inner join on two data frames using the R base merge() function, the inner_join() function from the dplyr package, and the reduce() function from the tidyverse package. When dealing with larger datasets, the dplyr method is preferable because it offers better performance compared to the base R approach.

Quick Examples of Inner Join

Perform Inner Join by R Base

Perform Inner Join on Multiple Columns

Perform Inner Join in Another Way

Using inner_join() function

Perform Inner Join on Multiple Columns

Perform Inner Join on Different Column Names

Perform Inner Join using reduce() from tidyverse

Conclusion

Related Articles