To perform an outer join on data frames in R, you can use the merge()
function, full_join()
from the dplyr
package, or reduce()
from tidyverse
. Among these, the dplyr
approach is recommended because it performs better than base R methods. The dplyr package includes various functions designed to join data frames in R.
The outer join returns all rows from both DataFrames, where the join expression doesn’t match it returns NA on respective columns, note that the outer join is also called a full outer join.
Quick Examples of Full Outer Join
Below are quick examples of performing the full outer join on data frames.
# Quick Examples
# Outer join
df2 <- merge(x=emp_df,y=dept_df,
by="dept_id", all=TRUE)
# Outer join on multiple columns
df2 <- merge(x=emp_df,y=dept_df,
by=c("dept_id","dept_branch_id"), all=TRUE)
# Outer join on different columns
df2 <- merge(x=emp_df,y=dept_df,
by.x=c("dept_id","dept_branch_id"),
by.y=c("dept_id","dept_branch_id"),
all=TRUE)
# Load dplyr package
library(dplyr)
# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df,
by=c('dept_id','dept_branch_id'))
# Using dplyr - outer join on different columns
df2 <- emp_df %>% full_join( dept_df,
by=c('dept_id'='dept_id',
'dept_branch_id'='dept_branch_id'))
# Load tidyverse package
library(tidyverse)
# Outer Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2
Let’s initiate two data frames for this example. The emp_df
and dept_df
data frames both contain the columns dept_id
and dept_branch_id
.
# Create emp Data Frame
emp_df=data.frame(
emp_id=c(1,2,3,4,5,6),
name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
superior_emp_id=c(-1,1,1,2,2,2),
dept_id=c(10,20,10,10,40,50),
dept_branch_id= c(101,102,101,101,104,105)
)
# Create dept Data Frame
dept_df=data.frame(
dept_id=c(10,20,30,40),
dept_name=c("Finance","Marketing","Sales","IT"),
dept_branch_id= c(101,102,103,104)
)
emp_df
dept_df
Yields below output.
Perform R Outer Join
The R base package includes a merge()
function that allows to perform an outer join, or a full outer join, on two or more data frames.
In this example, we will perform an outer join on the dept_id
column from the emp_df
and dept_df
data frames, using the all=TRUE
parameter.
# Outer join
df2 <- merge(x=emp_df,y=dept_df,
by="dept_id", all=TRUE)
df2
If you perform a join operation and there are columns with the same names that don’t participate in the join condition, the result will add suffixes like x
and y
to distinguish them. In this example, observe how the dept_branch_id
column gets suffixed with x
and y
in the result.
From our emp
dataset’s dept_id
with value 50 doesn’t have a record on dept
hence dept columns have NA and dept_id
30 doesn’t have a record in emp
hence you see NA on emp columns. Below is the result of the above Join expression.
Perform Outer Join on Several Columns
To execute an outer join on multiple columns with matching names in both data frames, provide a list of all the column names to the by
parameter. I have also written a detailed article that explains different methods to join on multiple columns.
# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df,
by=c("dept_id","dept_branch_id"),
all=TRUE)
df2
Yields below output.
Outer Join with Different Column Names
If you need to perform an outer join on data frames that have different column names, you can specify which columns to use for the join with the by.x
and by.y
parameters. This allows to merge the data frames on specific columns, even if they have different names.
# R outer join multiple columns
df2 <- merge(x=emp_df,y=dept_df,
by.x=c("dept_id","dept_branch_id"),
by.y=c("dept_id","dept_branch_id"),
all=TRUE)
df2
Because our data frame contains the same column names, it generates the same result as mentioned earlier. I have written another article where I describe how to join data frames with different column names.
Use full_join() Function to Perform Outer Join
To perform an outer join in R, you can use the full_join()
function from the dplyr
package. First, ensure that dplyr
is installed with install.packages(‘dplyr’), then load it into your R session with library(dplyr)
.
All functions in dplyr
take a data frame as their first argument. The magrittr
package provides the %>%
infix operator, commonly used with dplyr
, which allows you to “pipe” data from one operation to the next. This operator passes the result on its left to the first argument of the function on its right. So, x %>% f(y)
is equivalent to f(x, y)
.
Perform Outer Join on Several Columns
You can use dplyr
package to perform an outer join on multiple columns from two data frames, creating a new data frame that contains all rows from both given data frames, with missing values filled with NA
.
# Load dplyr package
library(dplyr)
# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df,
by=c('dept_id','dept_branch_id'))
df2
Yields below output. Since we are using both columns on the join we are not seeing duplicate columns.
Perform Outer Join on Different Columns
Similarly, you can use the dplyr
package to perform the outer join on different columns of two data frames. It will return a new data frame that contains all rows from both given data frames, with missing values filled with NA
.
# Load dplyr package
library(dplyr)
# Join on different column names
df2 <- emp_df %>% full_join( dept_df,
by=c('dept_id'='dept_id',
'dept_branch_id'='dept_branch_id'))
df2
Yields the same output as above.
Perform R Outer Join using reduce() from tidyverse
To join multiple data frames with the tidyverse package, you can use the reduce()
function. To create an outer join, you need to specify using full_join
argument. To perform a full outer join on multiple data frames, simply provide all the data frames as a list to the reduce() function.
# Load tidyverse package
library(tidyverse)
# Outer Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2
Conclusion
In this article, I have explained different methods for performing a full outer join on two data frames in R, including the base R merge()
function, the full_join()
function from the dplyr package, and the reduce()
function from the tidyverse package. The dplyr approach is generally more efficient for joining larger datasets compared to base R methods.