To perform an outer join on data frames in R, you can use the merge()
function, full_join()
from the dplyr
package, or reduce()
from tidyverse
. Among these, the dplyr
approach is preferable because it performs better than base R methods. The dplyr package includes various functions designed to join data frames in R.
The outer join returns all rows from both DataFrames, where the join expression doesn’t match it returns NA on respective columns, note that the outer join is also called a full outer join.
Brief Examples of Full Outer Join
Below are examples of performing the full outer join on data frames.
# Quick Examples
# Outer join
df2 <- merge(x=emp_df,y=dept_df,
by="dept_id", all=TRUE)
# Outer join on multiple columns
df2 <- merge(x=emp_df,y=dept_df,
by=c("dept_id","dept_branch_id"), all=TRUE)
# Outer join on different columns
df2 <- merge(x=emp_df,y=dept_df,
by.x=c("dept_id","dept_branch_id"),
by.y=c("dept_id","dept_branch_id"),
all=TRUE)
# Load dplyr package
library(dplyr)
# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df,
by=c('dept_id','dept_branch_id'))
# Using dplyr - outer join on different columns
df2 <- emp_df %>% full_join( dept_df,
by=c('dept_id'='dept_id',
'dept_branch_id'='dept_branch_id'))
# Load tidyverse package
library(tidyverse)
# Outer Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2
Let’s initiate two data frames for this example. The emp_df
and dept_df
data frames both contain the columns dept_id
and dept_branch_id
.
# Create emp Data Frame
emp_df=data.frame(
emp_id=c(1,2,3,4,5,6),
name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
superior_emp_id=c(-1,1,1,2,2,2),
dept_id=c(10,20,10,10,40,50),
dept_branch_id= c(101,102,101,101,104,105)
)
# Create dept Data Frame
dept_df=data.frame(
dept_id=c(10,20,30,40),
dept_name=c("Finance","Marketing","Sales","IT"),
dept_branch_id= c(101,102,103,104)
)
emp_df
dept_df
Yields below output.
Perform R Outer Join
The R base merge()
function that allows to perform an outer join, or a full outer join, on multiple data frames.
In this example, we will perform an outer join on the dept_id
column from the emp_df
and dept_df
data frames, using the all=TRUE
parameter.
# Outer join
df2 <- merge(x=emp_df,y=dept_df,
by="dept_id", all=TRUE)
df2
If you perform a join operation and there are columns with the same names that don’t participate in the join condition, the result will add suffixes like x
and y
to distinguish them. In this example, observe how the dept_branch_id
column gets suffixed with x
and y
in the result.
There’s no record in the dept
column for the dept_id
value of 50
in our dataset, resulting in NA
entries in the dept
column. Similarly, the emp
columns display NA
for the dept_id
value of 30
because there’s no corresponding record in the emp
dataset.
Perform Outer Join on Several Columns
When we run an outer join on two or more columns with matching names in both data frames, provide a list of all the column names to the by
parameter. I have also written a detailed article that explains different methods to join on multiple columns.
# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df,
by=c("dept_id","dept_branch_id"),
all=TRUE)
df2
Yields below output.
Perform Outer Join on Columns of Different
To perform an outer join on data frames that have different column names, you can specify which columns to use for the join with the by.x
and by.y
parameters. This allows to merging of the data frames on specific columns, even if they have different names.
# R outer join multiple columns
df2 <- merge(x=emp_df,y=dept_df,
by.x=c("dept_id","dept_branch_id"),
by.y=c("dept_id","dept_branch_id"),
all=TRUE)
df2
Hence, given data frame contains the common column names, it generates a similar result as mentioned earlier.
Use full_join() Function to Perform Outer Join
You can implement an outer join in R, using the full_join()
method of the dplyr
package. First, ensure that dplyr
is installed with install.packages(‘dplyr’), then load it into your R session with library(dplyr)
.
All functions of dplyr take a df
as their first parameter. The magrittr
package provides the %>%
infix operator, commonly used with dplyr
, which allows you to “pipe” data from one operation to the next. This operator passes the result on its left to the first parameter of the function on its right. So, x %>% f(y)
is equivalent to f(x, y)
.
Perform Outer Join on Several Columns
You can use dplyr
package to perform an outer join on multiple columns from two data frames, creating a new data frame having all rows from both given data frames, with missing values filled with NA
.
# Load dplyr package
library(dplyr)
# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df,
by=c('dept_id','dept_branch_id'))
df2
Yields below output. Since both columns are used in the join, we do not see any duplicate columns.
Perform Outer Join on Different Columns
Similarly, you can use the dplyr
package to perform the outer join on different columns of two data frames. It will return a new data frame that contains all rows from both given data frames, with missing values filled with NA
.
# Load dplyr package
library(dplyr)
# Join on different column names
df2 <- emp_df %>% full_join( dept_df,
by=c('dept_id'='dept_id',
'dept_branch_id'='dept_branch_id'))
df2
Yields the same output as above.
Implement R Outer Join by reduce()
To join multiple data frames with the tidyverse package, you can use the reduce()
function. To create an outer join, you need to specify using full_join
argument. To perform a full outer join on multiple data frames, simply provide all the data frames as a list to this function.
# Load tidyverse package
library(tidyverse)
# Outer Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2
Conclusion
In this article, I have explained different methods for performing a full outer join on two data frames in R, including the base R merge()
function, the full_join()
function from the dplyr package, and the reduce()
function from the tidyverse package. The dplyr method is generally more efficient for joining larger datasets compared to base R methods.
Related Articles
- R Semi Join
- R Anti Join
- R Right Join
- R Left Join
- R Inner Join
- Join data frames with different column names