You are currently viewing R Outer Join of Data Frames

To perform an outer join on data frames in R, you can use the merge() function, full_join() from the dplyr package, or reduce() from tidyverse. Among these, the dplyr approach is recommended because it performs better than base R methods. The dplyr package includes various functions designed to join data frames in R.

The outer join returns all rows from both DataFrames, where the join expression doesn’t match it returns NA on respective columns, note that the outer join is also called a full outer join.

Quick Examples of Full Outer Join

Below are quick examples of performing the full outer join on data frames.


# Quick Examples

# Outer join
df2 <- merge(x=emp_df,y=dept_df, 
             by="dept_id", all=TRUE)

# Outer join on multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
        by=c("dept_id","dept_branch_id"), all=TRUE)

# Outer join on different columns
df2 <- merge(x=emp_df,y=dept_df, 
      by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"),
      all=TRUE)

# Load dplyr package
library(dplyr)

# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - outer join on different columns
df2 <- emp_df %>% full_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))

# Load tidyverse package
library(tidyverse)

# Outer Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2

Let’s initiate two data frames for this example. The emp_df and dept_df data frames both contain the columns dept_id and dept_branch_id.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r outer join

Perform R Outer Join

The R base package includes a merge() function that allows to perform an outer join, or a full outer join, on two or more data frames.

In this example, we will perform an outer join on the dept_id column from the emp_df and dept_df data frames, using the all=TRUE parameter.


# Outer join
df2 <- merge(x=emp_df,y=dept_df, 
          by="dept_id", all=TRUE)
df2

If you perform a join operation and there are columns with the same names that don’t participate in the join condition, the result will add suffixes like x and y to distinguish them. In this example, observe how the dept_branch_id column gets suffixed with x and y in the result.

From our emp dataset’s dept_id with value 50 doesn’t have a record on dept hence dept columns have NA and dept_id 30 doesn’t have a record in emp hence you see NA on emp columns. Below is the result of the above Join expression.

r outer join

Perform Outer Join on Several Columns

To execute an outer join on multiple columns with matching names in both data frames, provide a list of all the column names to the by parameter. I have also written a detailed article that explains different methods to join on multiple columns.


# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"),
             all=TRUE)
             
df2

Yields below output.

r outer join multiple columns

Outer Join with Different Column Names

If you need to perform an outer join on data frames that have different column names, you can specify which columns to use for the join with the by.x and by.y parameters. This allows to merge the data frames on specific columns, even if they have different names.


# R outer join multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
                by.x=c("dept_id","dept_branch_id"), 
                by.y=c("dept_id","dept_branch_id"),
                all=TRUE)
df2

Because our data frame contains the same column names, it generates the same result as mentioned earlier. I have written another article where I describe how to join data frames with different column names.

Use full_join() Function to Perform Outer Join

To perform an outer join in R, you can use the full_join() function from the dplyr package. First, ensure that dplyr is installed with install.packages(‘dplyr’), then load it into your R session with library(dplyr).

All functions in dplyr take a data frame as their first argument. The magrittr package provides the %>% infix operator, commonly used with dplyr, which allows you to “pipe” data from one operation to the next. This operator passes the result on its left to the first argument of the function on its right. So, x %>% f(y) is equivalent to f(x, y).

Perform Outer Join on Several Columns

You can use dplyr package to perform an outer join on multiple columns from two data frames, creating a new data frame that contains all rows from both given data frames, with missing values filled with NA.


# Load dplyr package
library(dplyr)

# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df, 
            by=c('dept_id','dept_branch_id'))
df2

Yields below output. Since we are using both columns on the join we are not seeing duplicate columns.

Perform Outer Join on Different Columns

Similarly, you can use the dplyr package to perform the outer join on different columns of two data frames. It will return a new data frame that contains all rows from both given data frames, with missing values filled with NA.


# Load dplyr package
library(dplyr)

# Join on different column names
df2 <- emp_df %>% full_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))
df2

Yields the same output as above.

Perform R Outer Join using reduce() from tidyverse

To join multiple data frames with the tidyverse package, you can use the reduce() function. To create an outer join, you need to specify using full_join argument. To perform a full outer join on multiple data frames, simply provide all the data frames as a list to the reduce() function.


# Load tidyverse package
library(tidyverse)

# Outer Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2

Conclusion

In this article, I have explained different methods for performing a full outer join on two data frames in R, including the base R merge() function, the full_join() function from the dplyr package, and the reduce() function from the tidyverse package. The dplyr approach is generally more efficient for joining larger datasets compared to base R methods.