You are currently viewing R Semi Join on Two Data Frames

Hot to do semi-join in R data frames? To perform semi join use either dplyr semi_join() function, or use reduce() from tidyverse. dplyr package provides several functions to join data frames in R.

R Semi join is similar to inner join difference being semi-join returns all columns from the left DataFrame and ignores all columns from the right dataset. In other words, this join returns columns from the only left dataset for the records match in the right dataset on join expression, records not matched on join expression are ignored from both left and right datasets.

1. Quick Examples of Semi Join

Following are quick examples of performing the semi join on data frames.


# Quick Examples

# Load dplyr package
library(dplyr)

# Using dplyr - semi join multiple columns
df2 <- emp_df %>% semi_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - semi join on different columns
df2 <- emp_df %>% semi_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))

# Load tidyverse package
library(tidyverse)

# semi Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(semi_join, by='dept_id')
df2

Let’s create two Data Frames, in the below example dept_id and dept_branch_id columns exist on both emp_df and dept_df data frames.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r semi join

2. Using dplyr semi_join() in R

Using the semi_join() function from the R dplyr package is the best approach to performing the semi join on two data frames. In order to use dplyr, you have to install it first using install.packages(‘dplyr’) and load it using library(dplyr).

All functions in dplyr package take data.frame as a first argument. When we use dplyr package, we mostly use the infix operator %>% from magrittr, it passes the left-hand side of the operator to the first argument of the right-hand side of the operator. For example, x %>% f(y) converted into f(x, y) so the result from the left-hand side is then “piped” into the right-hand side. 

2.1 Semi Join on Multiple Columns

Semi join in R returns columns from only the left dataset for the records match in the right dataset on join expression, records not matched on join expression are ignored from both left and right datasets. To perform a semi join on multiple columns with the same names on both R data frames, use all the column names as a list to by param. I have also created a dedicated article where I have explained how to perform join on multiple columns using several ways.


# Load dplyr package
library(dplyr)

# Using dplyr - semi join multiple columns
df2 <- emp_df %>% semi_join( dept_df, 
            by=c('dept_id','dept_branch_id'))
df2

Yields below output.

r semi join multiple columns

2.2 Semi Join on Different Column Names

Sometimes you will have data frames with different column names and you wanted to perform an outer join on these columns, to do so specify the join conditions using by param.


# Load dplyr package
library(dplyr)

# Join on different column names
df2 <- emp_df %>% semi_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))
df2

Yields the same output as above.

3. Using tidyverse Package

By using reduce() function from tidyverse package you can perform join on multiple data frames, to perform semi join use semi_join keyword in R. If you wanted to do semi-join on multiple data frames, pass all data frames as a list to reduce() function.


# Load tidyverse package
library(tidyverse)

# Semi Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(semi_join, by='dept_id')
df2

4. Conclusion

In this article, you have learned how to perform a semi-join on two data frames using semi_join() functions from the R dplyr package, and reduce() from the tidyverse package.

Related Articles

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium