How can you perform an inner join on data frames in R? You can use the merge()
function, the inner_join()
function from dplyr
, or the reduce()
function from tidyverse
. The inner_join()
function in dplyr
is typically the most efficient option because it generally runs faster than the base R methods. This package provides a variety of functions for joining data frames in R.
In R, Inner join or natural join is the default join and it’s mostly used in joining data frames, it is used to join data.frames on a specified single or multiple columns, and where column values don’t match the rows get dropped from both data.frames (emp
& dept
).
Quick Examples of Inner Join
Here are some quick examples of how to perform inner joins on data frames.
# Quick Examples
# Inner join
df2 <- merge(x=emp_df,y=dept_df,
by="dept_id")
# Inner join on multiple columns
df2 <- merge(x=emp_df,y=dept_df,
by=c("dept_id","dept_branch_id"))
# Inner join on different columns
df2 <- merge(x=emp_df,y=dept_df,
by.x=c("dept_id","dept_branch_id"),
by.y=c("dept_id","dept_branch_id"))
# Load dplyr package
library(dplyr)
# Using dplyr - inner join multiple columns
df2 <- emp_df %>% inner_join( dept_df,
by=c('dept_id','dept_branch_id'))
# Using dplyr - inner join on different columns
df2 <- emp_df %>% inner_join( dept_df,
by=c('dept_id'='dept_id',
'dept_branch_id'='dept_branch_id'))
# Load tidyverse package
library(tidyverse)
# Inner Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(inner_join, by='dept_id')
df2
Let’s build two data frames. In this case, both emp_df
and dept_df
have the dept_id
and dept_branch_id
columns.
# Create emp Data Frame
emp_df=data.frame(
emp_id=c(1,2,3,4,5,6),
name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
superior_emp_id=c(-1,1,1,2,2,2),
dept_id=c(10,20,10,10,40,50),
dept_branch_id= c(101,102,101,101,104,105)
)
# Create dept Data Frame
dept_df=data.frame(
dept_id=c(10,20,30,40),
dept_name=c("Finance","Marketing","Sales","IT"),
dept_branch_id= c(101,102,103,104)
)
emp_df
dept_df
Yields below output.
Perform Inner Join by R Base
The R base package includes a merge()
function which performs an inner join on two or more data frames. With this function, you pass the left data frame as x
, the right one as y
, and then set the by
parameter to specify which column(s) to join on.
Here’s an example where we perform an inner join using the dept_id
column from the emp_df
and dept_df
data frames. Here by default, it uses all=FALSE
. This join is similar to a set intersection operation.
# inner join
df2 <- merge(x=emp_df,y=dept_df,
by="dept_id")
df2
The output shown below demonstrates that if the data frames you are merging have the same column names not involved in the join condition, the merge()
function adds suffixes x
and y
to these columns in the resulting data frame. In the example below, note how the dept_branch_id
column is displayed.
Perform Inner Join on Multiple Columns
To execute an inner join on multiple columns with identical names in both data frames, pass all those column names into the by
parameter. Moreover, I have written a comprehensive article that explains various methods for joining multiple columns.
# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df,
by=c("dept_id","dept_branch_id"))
df2
Yields below output.
Perform Inner Join in Another Way
Alternatively, you can use the R base merge() function to perform an inner join operation on different columns of two data frames. To do that, you can use the by.x
and by.y
parameters to specify which columns to match from each data frame.
# R join multiple columns
df2 <- merge(x=emp_df,y=dept_df,
by.x=c("dept_id","dept_branch_id"),
by.y=c("dept_id","dept_branch_id"))
df2
Because given data frames have identical column names, we get the same result as before. I have prepared another article where I describe how to join data frames with different column names.
Using inner_join() function
The most efficient way to perform an inner join on two data frames is by using the inner_join()
function from the dplyr
package. To utilize dplyr
, ensure it’s installed by running install.packages(‘dplyr’), and then load it with library(dplyr)
.
In the dplyr package, functions generally require a data frame as a starting parameter. If you want to work with dplyr
, the %>%
operator from the magrittr
package is commonly used. This operator “pipes” the output from the left part into the starting parameter of the function on the right part. For example, x %>% f(y)
translates to f(x, y)
, so the output from the left side is automatically passed to the right side.
Perform Inner Join on Multiple Columns
You can use the inner_join() function to perform an inner join on two data frames based on two common columns producing a new data frame that contains only the rows where both data frames have matching values for these columns.
# Load dplyr package
library(dplyr)
# Using dplyr - inner join multiple columns
df2 <- emp_df %>% inner_join( dept_df,
by=c('dept_id','dept_branch_id'))
df2
Yields below output.
Perform Inner Join on Different Column Names
Alternatively, You can use the inner_join() function to perform an inner join on two data frames based on two different columns, returning a new data frame with rows where the specified keys match in both original data frames.
# Load dplyr package
library(dplyr)
# join on multiple columns
df2 <- emp_df %>% inner_join( dept_df,
by=c('dept_id'='dept_id',
'dept_branch_id'='dept_branch_id'))df2
Yields the same output as above.
Perform Inner Join using reduce() from tidyverse
Finally, you can use the reduce()
function is a part of the tidyverse
package to join multiple data frames. To perform an inner join operation using the reduce()
function, you need to use the inner_join
keyword.
# Load tidyverse package
library(tidyverse)
# Inner Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(inner_join, by='dept_id')
df2
Conclusion
This article has shown how to conduct an inner join on two data frames using the R base merge()
function, the inner_join()
function from the dplyr package, and the reduce()
function from the tidyverse package. When dealing with larger datasets, the dplyr method is preferable because it offers better performance compared to the base R approach.
Related Articles
- R Semi Join
- R Anti Join
- R Outer Join
- R Right Join
- R Left Join
- R Semi Join on Two Data Frames
- Nested For Loop in R
- Break and Next (Continue) Statements in R