Site icon Spark By {Examples}

R Semi Join on Two Data Frames

r semi join

How to perform semi-join on two data frames in R? To perform a semi-join operation you can use the semi_join() function from the dplyr, or the reduce() function from the tidyverse package. dplyr package, which provides a set of functions for manipulating and joining data frames in R.

R Semi join is similar to inner join difference being semi-join returns all columns from the left data frame and ignores all columns from the right dataset. In other words, this join returns columns from the only left dataset for the records match in the right dataset on the join expression, records not matched on the join expression are ignored from both left and right datasets.

Key Points-

1. Quick Examples of Performing Semi-Join Operations in R

Below are quick examples of performing the semi-join operations on data frames.


# Quick Examples

# Load dplyr package
library(dplyr)

# Using dplyr - semi join multiple columns
df2 <- emp_df %>% semi_join( dept_df, 
           by=c('dept_id','dept_branch_id'))

# Using dplyr - semi join on different columns
df2 <- emp_df %>% semi_join( dept_df, 
        by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))

# Load tidyverse package
library(tidyverse)

# semi Join  data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(semi_join, by='dept_id')
df2

Let’s create two data frames having common columns and perform a semi-join operation,


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r semi join

2. Perform Semi Join using dplyr in R

The semi_join() function from the R dplyr package is an effective way to perform a semi-join on two data frames. To use dplyr, you will need to install it with install.packages(‘dplyr’) and then load it into your R environment with library(dplyr).

Every function in dplyr requires a data frame as its first argument. When working with dplyr, the infix operator, provided by the magrittr package, is commonly used to “pipe” the output from the left side to the first argument of the function on the right side. For example, the expression x %>% f(y) is equivalent to f(x, y), which means that the output from the left side is passed into the function on the right side.

2.1 Semi-Join on Multiple Columns

A semi-join is a joining operation in R which extracts rows from a left data frame that have corresponding matches in a right data frame, based on a specified condition and ignores unmatched records from both data frames.

When we perform a semi-join operation using the semi_join() on multiple common columns of both R data frames, you can use the by parameter to specify these columns as a list. It returns the resulting data frame which contains only columns of the left data frame.


# Load dplyr package
library(dplyr)

# Using dplyr - semi join multiple columns
df2 <- emp_df %>% semi_join( dept_df, 
            by=c('dept_id','dept_branch_id'))
df2

Yields below output.

r semi join multiple columns

2.2 Semi-Join on Different Column Names

If you have data frames with different column names and you need to perform an semi join on those columns, you can set the joining conditions with the by parameter.


# Load dplyr package
library(dplyr)

# Join on different column names
df2 <- emp_df %>% semi_join( dept_df, 
        by=c('dept_id'='dept_id', 
             'dept_branch_id'='dept_branch_id'))
df2

Yields the same output as above.

3. Using tidyverse Package

To join multiple data frames in R, you can use the reduce() function from the tidyverse package along with the semi_join keyword. If you want to perform a semi-join on a list of data frames, you can pass them to the reduce() function. This will return a new data frame containing all the columns from the left data frame with matching records from the right data frame.


# Load tidyverse package
library(tidyverse)

# Use reduce() perform semi Join data frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(semi_join, by='dept_id')
df2

Yields the same output as above.

4. Conclusion

In this article, I have explained how to perform a semi-join operation on two data frames using semi_join() functions from the R dplyr package, and reduce() from the tidyverse package. Also explained how to extract specific subsets of data based on matching records in another dataset.

Exit mobile version