You are currently viewing R Outer Join of Data Frames

How to do outer join on data frames in R? To perform outer join or full outer join use either merge() function, dplyr full_join() function, or use reduce() from tidyverse. Using the dplyr function is the best approach as it runs faster than the R base approach. dplyr package provides several functions to join data frames in R.

In R, the outer join returns all rows from both DataFrames, where the join expression doesn’t match it returns NA on respective columns, note that the outer join is also called a full outer join.

1. Quick Examples of Full Outer Join

Following are quick examples of performing the full outer join on data frames.


# Quick Examples

# Outer join
df2 <- merge(x=emp_df,y=dept_df, 
             by="dept_id", all=TRUE)

# Outer join on multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
        by=c("dept_id","dept_branch_id"), all=TRUE)

# Outer join on different columns
df2 <- merge(x=emp_df,y=dept_df, 
      by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"),
      all=TRUE)

# Load dplyr package
library(dplyr)

# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - outer join on different columns
df2 <- emp_df %>% full_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))

# Load tidyverse package
library(tidyverse)

# Outer Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2

Let’s create two Data Frames, in the below example dept_id and dept_branch_id columns exists on both emp_df and dept_df data frames.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r outer join

2. Outer Join using merge()

R base provides a merge() function that is used to perform an outer join or full outer join on two, three or more (multiple) data frames.

The following example performs an outer join on the column dept_id column on emp_df and dept_df data frames. Use all=TRUE.


# Outer join
df2 <- merge(x=emp_df,y=dept_df, 
          by="dept_id", all=TRUE)
df2

Yields below output. if you have the same column names that are not used in the join condition, it suffixes the x and y to the columns on the result. In the below example check dept_branch_id.

From our emp dataset’s dept_id with value 50 doesn’t have a record on dept hence dept columns have NA and dept_id 30 doesn’t have a record in emp hence you see NA on emp columns. Below is the result of the above Join expression.

r outer join

2.1 Outer Join on Multiple Columns

To perform an outer join on multiple columns with the same names on both data frames, use all the column names as a list to by param. I have also created a dedicated article where I have explained how to perform join on multiple columns using several ways.


# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"),
             all=TRUE)
             
df2

Yields below output.

r outer join multiple columns

2.2 Outer Join Different Column Names

Sometimes you will have data frames with different column names and you wanted to perform an outer join on these columns, to do so specify the column names from both data frames with params by.x and by.y.


# R outer join multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
                by.x=c("dept_id","dept_branch_id"), 
                by.y=c("dept_id","dept_branch_id"),
                all=TRUE)
df2

Since our data frame has the same column names, it results in the same output as above, I have created another article where I have explained how to perform join on different column names.

3. Using dplyr to Perform Outer Join in R

Using the full_join() function from the dplyr package is the best approach to performing the outer join on two data frames. In order to use dplyr, you have to install it first using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in dplyr package take data.frame as a first argument. When we use dplyr package, we mostly use the infix operator %>% from magrittr, it passes the left-hand side of the operator to the first argument of the right-hand side of the operator. For example, x %>% f(y) converted into f(x, y) so the result from the left-hand side is then “piped” into the right-hand side. 

3.1 Multiple Columns with dplyr package


# Load dplyr package
library(dplyr)

# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df, 
            by=c('dept_id','dept_branch_id'))
df2

Yields below output. Since we are using both columns on the join we are not seeing duplicate columns.

3.2 Using Different Column Names with dplyr package


# Load dplyr package
library(dplyr)

# Join on different column names
df2 <- emp_df %>% full_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))
df2

Yields the same output as above.

4. Using tidyverse Package

By using reduce() function from tidyverse package you can perform join on multiple data frames, to perform outer join use full_join keyword. If you wanted to do full outer join on multiple data frames, pass all data frames as a list to reduce() function.


# Load tidyverse package
library(tidyverse)

# Outer Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2

5. Conclusion

In this article, you have learned how to perform a full outer join on two data frames using the R base merge() function, full_join() functions from the dplyr package, and reduce() from the tidyverse package. Using dplyr approach is the best to use when you are joining on larger datasets as it performs efficiently over the R base.

Related Articles

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium