You are currently viewing R Join or Merge Data Frames

By using the merge() function we can perform join on data frames in R programming. This merge() function supports all basic SQL Join Types like inner join, left or left outer join, right or right outer join, full outer join, and cross join. If there are more than two data frames to be joined, then you can use reduce() method available in tidyverse library, this package also supports all other joins in R programming.

To perform Anti and semi-joins use the dplyr package functions anti_join() and semi_join(). In this tutorial, you will learn different ways and methods to join Data Frames using R examples. Please access Join on Multiple DataFrames in case you want to join more than two DataFrames.

R base uses merge() join to perform most of the join by changing the values to parameters all and by.

JoinR Basedplyr Package
Inner Joinall=FALSEinner_join()
Left or Left Outer Joinall.x=TRUEleft_join()
Right or Right Outer Joinall.y=TRUEright_join()
Full Outer Joinall=TRUEfull_join()
Cross Joinby=NULL,
all=FALSE
Not Available
Left Anti JoinNot Availableanti_join()
Left Semi JoinNot Availablesemi_join()
R Join Data Frames

1. merge() Syntax

Below is the merge() syntax used to join the data.frames.


# Syntax of merge() function
merge(x, y, …)
# S3 method for default
merge(x, y, …)
# S3 method for data.frame
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
      incomparables = NULL, …)

The rest of the tutorial explains Join Types using the above syntaxes.

Before we jump into R Join examples, first, let’s create an emp and dept Data frames. here, the column emp_id is unique on emp and dept_id is unique in the dept dataset and dept_id from emp has a reference to dept_id on dept.


# Create Emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT")
)

df
dept_df

This print emp and dept DataFrame to the console.

r join

2. Inner Join

In R, Inner join or natural join is the default join and it’s mostly used in joining data frames, it is used to join data.frames on a specified column, and where column values don’t match the rows get dropped from both data frames (emp & dept). Here by default, it uses all = FALSE. This join is similar to a set intersection.


# R Inner Join
df2 <- merge(x = emp_df, y = dept_df, by = "dept_id")
print(df2)

When we apply Inner Join to our datasets, it drops dept_id 50 from emp and dept_id 30 from dept datasets. Below is the result of the above Join expression.

r inner join

Similarly, use inner_join() from the dplyr package to get the same result.


# Import dplyr
library(dplyr)

# inner join using dplyr package
df2 = emp_df %>% inner_join(dept_df, by = "dept_id")

3. Left Outer Join

R left join is used to perform left joining on two data frames based on the common columns of single/multiple and return the data frame having all rows from the left data frame and any matching rows from the right data frame. Unmatched rows from the right data frame will have NA values in the columns from the right data frame.

Use all.x=TRUE to perform the left outer join in R.


# R left outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all.x=TRUE)
df2

From our dataset, dept_id 50 doesn’t have a record on dept dataset hence, this record contains NA on dept columns (dept_name & dept_id). and dept_id 30 from dept dataset dropped from the results. Below is the result of the above Join expression.

r join

Similarly, use left_join() from the dplyr package to get the same result.


# Import dplyr
library(dplyr)

# left join using dplyr package
df2 = emp_df %>% left_join(dept_df, by = "dept_id")

4. Right Outer Join

Use all.y=TRUE to perform the right outer join, this is the opposite of left join, here it returns all rows from the right data.frame regardless of the match found on the left dataset, when the join expression doesn’t match, it assigns NA for that record and drops records from the left where a match is not found.

Use all.y=TRUE to perform the right outer join


# R right outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all.y=TRUE)
df2

From our example, the right dataset dept_id 30 doesn’t have it on the left dataset emp hence, this record contains NA on emp columns. and dept_id 50 dropped as a match not found on the left. Below is the result of the above Join expression.

r right outer join

Similarly, use right_join() from dplyr package to get the same result.


# Import dplyr
library(dplyr)

# right join using dplyr package
df2 = emp_df %>% right_join(dept_df, by = "dept_id")

5. Full Outer Join

Use all=TRUE to perform full outer join, it returns all rows from both DataFrames, where join expression doesn’t match it returns NA on respective record columns.


# R full outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all=TRUE)
df2

From our emp dataset’s dept_id with value 50 doesn’t have a record on dept hence dept columns have NA and dept_id 30 doesn’t have a record in emp hence you see NA on emp columns. Below is the result of the above Join expression.

r full outer join

Similarly, use full_join() from dplyr package to get the same result.


# Import dplyr
library(dplyr)

# full join using dplyr package
df2 = emp_df %>% full_join(dept_df, by = "dept_id")

6. Cross Join

Use by=NULL to perform cross join. A Cross Join, also known as cartesian join, performs every row of one dataframe, is being joined with every other row of another dataframe.


# R Cross join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = NULL)
df2

7. Left Semi Join

R semi join is similar to inner join difference being left semi join returns all columns from the left DataFrame and ignores all columns from the right dataset. In other words, this join returns columns from the only left dataset for the records match in the right dataset on join expression, records not matched on join expression are ignored from both left and right datasets.

The same result can be achieved using select on the result of the inner join however, using this join would be efficient. To perform left semi join in R use the semi_join() function from the dplyr package.


# Import dplyr
library(dplyr)

# left semi join
df2 = emp_df %>% semi_join(dept_df, by = "dept_id")
df2

Below is the result of the above join expression.

r left semi join

8. Left Anti Join

R anti join does the exact opposite of the left semi join, left anti join returns only columns from the left DataFrame for non-matched records. To perform left anti join in R use the anti_join() function from the dplyr package.

In other words, it selects all rows from the left data frame that are not present in the right data frame (similar to left df – right df).


# Import dplyr package
library(dplyr)

# left anti join data.frames
df2 = emp_df %>% anti_join(dept_df, by = "dept_id")
df2

Yields below output

r left anti join

9. Source Code | R Example


# Create Emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT")
)

# inner join data.frames
df2 <- merge(x = emp_df, y = dept_df, by = "dept_id")

# left outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all.x=TRUE)

# right outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all.y=TRUE)

# full outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = "dept_id", all=TRUE)

# cross outer join data.frames
df2 <- merge(x = emp_df, y = dept_df, 
             by = NULL)

# Import dplyr package
library(dplyr)
# left semi join data.frames
df2 = emp_df %>% semi_join(dept_df, by = "dept_id")
df2

# left anti join data.frames
df2 = emp_df %>% anti_join(dept_df, by = "dept_id")
df2

The examples explained here are available at the GitHub project for reference.

Conclusion

In this tutorial, you have learned how to perform R join DataFrames using merge() function and dplyr package. Also, learned how to perform INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, and CROSS joins with examples.

Related Articles

References:

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium