You are currently viewing R Semi Join on Two Data Frames

How do you perform a semi-join on two data frames in R? You can use the semi_join() function from the dplyr package, which offers a set of functions for data manipulation and joining in R. Alternatively, you can use the reduce() function from the tidyverse package.

Advertisements

R Semi join is similar to inner join difference being semi-join returns all columns from the left data frame and ignores all columns from the right dataset. In other words, this join returns columns from the only left dataset for the records match in the right dataset on the join expression, records not matched on the join expression are ignored from both left and right datasets.

Key Points-

  • A semi-join keeps only the rows from the first data frame that match records in the second data frame, according to a specified condition or key, without adding any columns from the second data frame.
  • Unlike inner joins, semi-joins do not combine columns from both datasets. They retain only the columns from the first dataset.
  • Semi-joins are useful for filtering datasets based on records that match another dataset’s specific criteria.
  • These joins are effective for identifying common records between two data frames.
  • To use the semi-join() function, you need to have the dplyr package installed in your R environment.
  • The reduce() function from the tidyverse package can be used to semi-join multiple data frames in a list.

1. Quick Examples of Performing Semi-Join Operations in R

Below are quick examples of performing the semi-join operations on data frames.


# Quick Examples

# Load dplyr package
library(dplyr)

# Using dplyr - semi join multiple columns
df2 <- emp_df %>% semi_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - semi join on different columns
df2 <- emp_df %>% semi_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))

# Load tidyverse package
library(tidyverse)

# semi Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(semi_join, by='dept_id')
df2

Let’s create two data frames having common columns and perform a semi-join operation,


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r semi join

2. Perform Semi Join using dplyr in R

The semi_join() function from the R dplyr package is an effective way to perform a semi-join on two data frames. To use dplyr, you will need to install it with install.packages(‘dplyr’) and then load it into your R environment with library(dplyr).

Every function in dplyr requires a data frame as its first argument. When working with dplyr, the infix operator, provided by the magrittr package, is commonly used to “pipe” the result from the left side to the first argument of the function on the right side. For instance, the expression x %>% f(y) is equivalent to f(x, y), which means that the output from the left side is passed into the function on the right side.

2.1 Semi-Join on Multiple Columns

A semi-join is a joining operation in R which extracts rows from a left data frame that have corresponding matches in a right data frame, based on a specified condition and ignores unmatched records from both data frames.

When we perform a semi-join operation using the semi_join() on multiple common columns of both R data frames, you can use the by parameter to specify these columns as a list. It returns the resulting data frame which contains only columns of the left data frame.


# Load dplyr package
library(dplyr)

# Using dplyr - semi join multiple columns
df2 <- emp_df %>% semi_join( dept_df, 
            by=c('dept_id','dept_branch_id'))
df2

Yields below output.

r semi join multiple columns

2.2 Perform Semi-Join in another Way

Alternatively, you can perform a semi-join operation on different columns of two data frames. For that, you need to specify the joining conditions with the by parameter.


# Load dplyr package
library(dplyr)

# Join on different column names
df2 <- emp_df %>% semi_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))
df2

Yields the same output as above.

3. Perform by Applying reduce() function

To join multiple data frames in R, you can use the reduce() function which is part of the tidyverse package along with the semi_join keyword. If you want to perform a semi-join on a list of data frames, you can pass them to the reduce() function. This will return a new data frame containing all the columns from the left data frame with matching records from the right data frame.


# Load tidyverse package
library(tidyverse)

# Use reduce() perform semi Join data frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(semi_join, by='dept_id')
df2

Yields the same output as above.

4. Conclusion

In this article, I have explained how to perform a semi-join operation on two data frames using semi_join() functions from the R dplyr package, and reduce() from the tidyverse package. Also explained how to extract specific subsets of data based on matching records in another dataset.