R Join or Merge Data Frames

By using the merge() function we can perform join on data frames in R programming. This merge() function supports all basic SQL Join Types like inner join, left or left outer join, right or right outer join, full outer join and cross join. If there are more than two data frames to be joined, then you can use reduce() method available in tidyverse library, this package also supports all other joins in R programming.

To perform Anti and Semi joins use dplyr package functions anti_join() and semi_join(). In this tutorial, you will learn different ways and methods to join Data Frames using R examples. Please access Join on Multiple DataFrames in case you wanted to join more than two DataFrames.

R base uses merge() join to perform most of the join by changing the values to parameters all and by.

JoinR Basedplyr Package
Inner Joinall=FALSEinner_join()
Left or Left Outer Joinall.x=TRUEleft_join()
Right or Right Outer Joinall.y=TRUEright_join()
Full Outer Joinall=TRUEfull_join()
Cross Joinby=NULL,
all=FALSE
Not Available
Left Anti JoinNot Availableanti_join()
Left Semi JoinNot Availablesemi_join()
R Join Data Frames

1. merge() Syntax

Below is the merge() syntax that is used to join data.frames.


# merge() Syntax
merge(x, y, …)
# S3 method for default
merge(x, y, …)
# S3 method for data.frame
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
      incomparables = NULL, …)

The rest of the tutorial explains Join Types using the above syntaxes

Before we jump into R Join examples, first, let’s create an emp and dept DataFrame’s. here, column emp_id is unique on emp and dept_id is unique on the dept dataset and dept_id from emp has a reference to dept_id on dept.


# Create Emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT")
)

df
dept_df

This print emp and dept DataFrame to the console.

r join data.frame

2. Inner Join

In R, Inner join or natural join is the default join and it’s mostly used joining data frames, it is used to join data.frames on a specified column, and where column values don’t match the rows get dropped from both data.frames (emp & dept). Here by default, it uses all=FALSE. This join is similar to a set intersection.


# R Inner Join
df2 <- merge(x = emp_df, y = dept_df, by = "dept_id")
print(df2)

When we apply Inner join on our datasets, it drops dept_id 50 from emp and dept_id 30 from dept datasets. Below is the result of the above Join expression.

r inner join

Similarly, use inner_join() from dplyr package to get the same result.


# Import dplyr
library(dplyr)

# inner join using dplyr package
df2 = emp_df %>% inner_join(dept_df, by = "dept_id")

3. Left Outer Join

R left join returns all rows from the left data.frame regardless of the match found on the right data.frame when join expression doesn’t match, it assigns NA for that record and drops records from right where match not found.

Use all.x=TRUE to perform left outer join in R.


# R left outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all.x=TRUE)
df2

From our dataset, dept_id 50 doesn’t have a record on dept dataset hence, this record contains NA on dept columns (dept_name & dept_id). and dept_id 30 from dept dataset dropped from the results. Below is the result of the above Join expression.

Similarly, use left_join() from dplyr package to get the same result.


# Import dplyr
library(dplyr)

# left join using dplyr package
df2 = emp_df %>% left_join(dept_df, by = "dept_id")

4. Right Outer Join

Use all.y=TRUE to perform right outer join, this is the opposite of left join, here it returns all rows from the right data.frame regardless of math found on the left dataset, when join expression doesn’t match, it assigns NA for that record and drops records from left where match not found.

Use all.y=TRUE to perform right outer join


# R right outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all.y=TRUE)
df2

From our example, the right dataset dept_id 30 doesn’t have it on the left dataset emp hence, this record contains NA on emp columns. and dept_id 50 dropped as a match not found on left. Below is the result of the above Join expression.

r right outer join

Similarly, use right_join() from dplyr package to get the same result.


# Import dplyr
library(dplyr)

# right join using dplyr package
df2 = emp_df %>% right_join(dept_df, by = "dept_id")

5. Full Outer Join

Use all=TRUE to perform full outer join, it returns all rows from both DataFrames, where join expression doesn’t match it returns NA on respective record columns.


# R full outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all=TRUE)
df2

From our emp dataset’s dept_id with value 50 doesn’t have a record on dept hence dept columns have NA and dept_id 30 doesn’t have a record in emp hence you see NA on emp columns. Below is the result of the above Join expression.

r full outer join

Similarly, use full_join() from dplyr package to get the same result.


# Import dplyr
library(dplyr)

# full join using dplyr package
df2 = emp_df %>% full_join(dept_df, by = "dept_id")

6. Cross Join

Use by=NULL to perform cross join. A Cross Join also known as cartesian join which performs every row of one dataframe is being joined with every other row of another dataframe.


# R Cross join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = NULL)
df2

7. Left Semi Join

R semi join is similar to inner join difference being left semi join returns all columns from the left DataFrame and ignores all columns from the right dataset. In other words, this join returns columns from the only left dataset for the records match in the right dataset on join expression, records not matched on join expression are ignored from both left and right datasets.

The same result can be achieved using select on the result of the inner join however, using this join would be efficient. To perform left semi join in R use the semi_join() function from the dplyr package.


# Import dplyr
library(dplyr)

# left semi join
df2 = emp_df %>% semi_join(dept_df, by = "dept_id")
df2

Below is the result of the above join expression.

left semi join

8. Left Anti Join

R anti join does the exact opposite of the left semi join, left anti join returns only columns from the left DataFrame for non-matched records. To perform left anti join in R use the anti_join() function from the dplyr package.

In other words, it selects all rows from the left data frame that are not present in the right data frame (similar to left df – right df).


# Import dplyr package
library(dplyr)

# left anti join data.frames
df2 = emp_df %>% anti_join(dept_df, by = "dept_id")
df2

Yields below output

left anti join

9. Source Code | R Example


# Create Emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT")
)

# inner join data.frames
df2 <- merge(x = emp_df, y = dept_df, by = "dept_id")

# left outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all.x=TRUE)

# right outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all.y=TRUE)

# full outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all=TRUE)

# cross outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = NULL)

# Import dplyr package
library(dplyr)
# left semi join data.frames
df2 = emp_df %>% semi_join(dept_df, by = "dept_id")
df2

# left anti join data.frames
df2 = emp_df %>% anti_join(dept_df, by = "dept_id")
df2

The examples explained here are available at the GitHub project for reference.

Conclusion

In this tutorial, you have learned how to perform R join DataFrames using merge() function and dplyr package. Also, learned how to perform INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, and CROSS joins with examples.

Related Articles

References:

Happy Learning !!

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing R Join or Merge Data Frames