You are currently viewing R Anti Join of Data Frames

You can perform anti-join or left anti-join on R data frames using the dplyr anti_join() function and reduce() function from the tidyverse package. You can use various functions of the dplyr package to perform join operations on the data frame in R.

Advertisements

R Anti-join is the opposite of the semi-join and returns only the rows from the first data frame that do not have a matching entry in the second data frame, based on specified columns.

Let’s create two Data Frames named emp_df and dept_df and having common columns named dept_id and dept_branch_id.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r anti join

2. Using dplyr anti_join() in R

The anti_join() function of the R dplyr package is the best approach to perform the anti-join operations between two data frames. Before going to use the dplyr package you need to install it using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in the dplyr package require the data frame as the first argument. When using the dplyr package, we mostly use the infix operator %>% which takes the output from one function and uses it as the first argument to the next function in the chain. This allows us to perform a sequence of operations on data frames with ease.

2.1 Anti-Join on Multiple Columns

An anti-join returns all the rows from the left data frame that are not present in the right data frame, based on specified columns. When we perform an anti-join operation on two data frames having the same multiple column names, you can pass a list of specified column names into by parameter it will return the new data frame containing all rows from the left data frame that do not have matching rows in the right data frame based on the specified columns.


# Load dplyr package
library(dplyr)

# Using dplyr - anti join multiple columns
df2 <- emp_df %>% anti_join( dept_df, 
            by=c('dept_id','dept_branch_id'))
df2

Yields below output.

r anti join multiple columns

2.2 Anti Join on Different Column Names

An alternative method to get the desired output is performing an anti-join operation on different column names of two data frames. To do this, we need to specify a by parameter with a named vector where the names correspond to columns from the first data frame and the values correspond to columns from the second data frame. This will return a new data frame containing all rows from the left data frame that do not have matching rows in the right data frame based on the specified columns.


# Load dplyr package
library(dplyr)

# Join on different column names
df2 <- emp_df %>% anti_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))
df2

Yields the same output as above.

3. Using tidyverse Package

You can use the anti_join() function from the tidyverse package to perform an anti-join operation on multiple data frames in R. To do this, you can pass the anti_join keyword and a by parameter(that specifies a column name to join on) into the reduce() function. The resulting data frame will contain all the rows from the first data frame that do not have matching rows in the second data frame.

Before going to use the dplyr package you need to install it using install.packages(‘tidyverse’) and load it using library(tidyverse).


# Load tidyverse package
library(tidyverse)

# Anti Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(anti_join, by='dept_id')
df2

Yields the same output as above.

4. Conclusion

In this article, I have explained how to use anti_join() and reduce() functions to perform anti-join operations on two data frames in R using the dplyr and tidyverse packages.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium