You are currently viewing R Outer Join of Data Frames

To perform an outer join on data frames in R, you can use the merge() function, full_join() from the dplyr package, or reduce() from tidyverse. Among these, the dplyr approach is preferable because it performs better than base R methods. The dplyr package includes various functions designed to join data frames in R.

The outer join returns all rows from both DataFrames, where the join expression doesn’t match it returns NA on respective columns, note that the outer join is also called a full outer join.

Brief Examples of Full Outer Join

Below are examples of performing the full outer join on data frames.


# Quick Examples

# Outer join
df2 <- merge(x=emp_df,y=dept_df, 
             by="dept_id", all=TRUE)

# Outer join on multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
        by=c("dept_id","dept_branch_id"), all=TRUE)

# Outer join on different columns
df2 <- merge(x=emp_df,y=dept_df, 
      by.x=c("dept_id","dept_branch_id"), 
      by.y=c("dept_id","dept_branch_id"),
      all=TRUE)

# Load dplyr package
library(dplyr)

# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - outer join on different columns
df2 <- emp_df %>% full_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))

# Load tidyverse package
library(tidyverse)

# Outer Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2

Let’s initiate two data frames for this example. The emp_df and dept_df data frames both contain the columns dept_id and dept_branch_id.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r outer join

Perform R Outer Join

The R base merge() function that allows to perform an outer join, or a full outer join, on multiple data frames.

In this example, we will perform an outer join on the dept_id column from the emp_df and dept_df data frames, using the all=TRUE parameter.


# Outer join
df2 <- merge(x=emp_df,y=dept_df, 
          by="dept_id", all=TRUE)
df2

If you perform a join operation and there are columns with the same names that don’t participate in the join condition, the result will add suffixes like x and y to distinguish them. In this example, observe how the dept_branch_id column gets suffixed with x and y in the result.

There’s no record in the dept column for the dept_id value of 50 in our dataset, resulting in NA entries in the dept column. Similarly, the emp columns display NA for the dept_id value of 30 because there’s no corresponding record in the emp dataset.

r outer join

Perform Outer Join on Several Columns

When we run an outer join on two or more columns with matching names in both data frames, provide a list of all the column names to the by parameter. I have also written a detailed article that explains different methods to join on multiple columns.


# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df, 
             by=c("dept_id","dept_branch_id"),
             all=TRUE)
             
df2

Yields below output.

r outer join multiple columns

Perform Outer Join on Columns of Different

To perform an outer join on data frames that have different column names, you can specify which columns to use for the join with the by.x and by.y parameters. This allows to merging of the data frames on specific columns, even if they have different names.


# R outer join multiple columns
df2 <- merge(x=emp_df,y=dept_df, 
                by.x=c("dept_id","dept_branch_id"), 
                by.y=c("dept_id","dept_branch_id"),
                all=TRUE)
df2

Hence, given data frame contains the common column names, it generates a similar result as mentioned earlier.

Use full_join() Function to Perform Outer Join

You can implement an outer join in R, using the full_join()method of the dplyr package. First, ensure that dplyr is installed with install.packages(‘dplyr’), then load it into your R session with library(dplyr).

All functions of dplyr take a df as their first parameter. The magrittr package provides the %>% infix operator, commonly used with dplyr, which allows you to “pipe” data from one operation to the next. This operator passes the result on its left to the first parameter of the function on its right. So, x %>% f(y) is equivalent to f(x, y).

Perform Outer Join on Several Columns

You can use dplyr package to perform an outer join on multiple columns from two data frames, creating a new data frame having all rows from both given data frames, with missing values filled with NA.


# Load dplyr package
library(dplyr)

# Using dplyr - outer join multiple columns
df2 <- emp_df %>% full_join( dept_df, 
            by=c('dept_id','dept_branch_id'))
df2

Yields below output. Since both columns are used in the join, we do not see any duplicate columns.

Perform Outer Join on Different Columns

Similarly, you can use the dplyr package to perform the outer join on different columns of two data frames. It will return a new data frame that contains all rows from both given data frames, with missing values filled with NA.


# Load dplyr package
library(dplyr)

# Join on different column names
df2 <- emp_df %>% full_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))
df2

Yields the same output as above.

Implement R Outer Join by reduce()

To join multiple data frames with the tidyverse package, you can use the reduce() function. To create an outer join, you need to specify using full_join argument. To perform a full outer join on multiple data frames, simply provide all the data frames as a list to this function.


# Load tidyverse package
library(tidyverse)

# Outer Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(full_join, by='dept_id')
df2

Conclusion

In this article, I have explained different methods for performing a full outer join on two data frames in R, including the base R merge() function, the full_join() function from the dplyr package, and the reduce() function from the tidyverse package. The dplyr method is generally more efficient for joining larger datasets compared to base R methods.