You are currently viewing R merge() Function


In R, the merge() function is commonly used to combine/merge two or more data frames based on common columns or key variables. This function supports various types of joins, such as inner join, left join, right join, and outer join, for users can allow to join the data. By default, the merge() function finds common columns. But make sure you can specify exactly which columns to use for merging. This function is similar to the join function in a Relational Database Management System (RDMS).

In this article, I will explain the merge() function and using its syntax, parameters, and usage how we can implement the below join operations on data frames that have common columns.

Types of Join operations-

  • Inner Join- (all = FALSE, default): An inner join only keeps the rows with matching values in the specified columns from both data frames. Rows containing non-matching values are omitted.
  • Left Join- (all.x = TRUE): A left join keeps all rows from the left (first) data frame and the matching rows from the right (second) data frame. Rows having non-matching values in the specified columns of the left data frame will be replaced with NA values for the corresponding columns from the right data frame.
  • Right Join- (all.y = TRUE): A right join retains all rows from the right (second) data frame and the matching rows from the left (first) data frame. In cases of non-matching values in the specified columns of the right data frame, NA values will appear in the corresponding columns from the left data frame.
  • Outer Join (all = TRUE): An outer join keeps all rows from both data frames and joins them based on matching values in the specified columns. Any non-matching values will be replaced with NA values in the corresponding columns of both data frames.

Syntax of the merge() Function

Following is the syntax of the merge() function.


# Syntax of the merge() function
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE, 
      incomparables = NULL, …)

Parameters

  • x, y: Data frames or objects to be merged.
  • by: A character vector specifying the common column(s) to merge on. If not specified, it uses the intersection of the column names x and y.
  • by.x, by.y: Specify the columns from x and y to use for merging if the column names are different.
  • all: Logical. If TRUE, performs a full outer join, including rows from both x and y even if there is no match. The default value is FALSE.
  • all.x, all.y: Logical. If TRUE, include all rows from x or y, respectively, even if there is no match in the other data frame. Default is all.
  • suffixes: A character vector of length 2 specifying the suffixes to be appended to the overlapping column names in x and y.
  • sort: Logical. If TRUE, sort the merged data frame by the columns used for merging.
  • incomparables: Values that cannot be matched. If not NULL, a vector of values that cannot be matched between x and y.
  • no.dups: Logical indicating that suffixes are appended in more cases to avoid duplicated column names in the result.
  • ...: Additional arguments passed to or from other methods.

Return Value

The merge() function returns a data frame that combines the input data frames based on the specified columns. The return value is a new data frame that includes columns from both input data frames and matching rows based on the values in the specified common column.

Merge the Data Frames using various Join Operations

Base R merge() function is used to perform the various join operations based on the specified common column. The by parameter specifies the key column, and additional parameters (all.x, all.y, all) control the type of join. Before diving into joining the operations we need to create two data frames.

Let’s create two Data Frames having common columns. In the below example dept_id and dept_branch_id are the same on both emp_df and dept_df data frames.


# Create emp Data Frame
emp_df=data.frame(
  emp_id=c(1,2,3,4,5,6),
  name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
  superior_emp_id=c(-1,1,1,2,2,2),
  dept_id=c(10,20,10,10,40,50),
  dept_branch_id= c(101,102,101,101,104,105)
)

# Create dept Data Frame
dept_df=data.frame(
  dept_id=c(10,20,30,40),
  dept_name=c("Finance","Marketing","Sales","IT"),
  dept_branch_id= c(101,102,103,104)
)

emp_df
dept_df

Yields below output.

r merge function

Perform Inner join using merge() function in R

To perform an inner join operation on given Data frames along with a specified column it can return only the rows where there is a match in both data frames based on the specified key column. Non-matching rows are excluded from both data frames.


# Perform inner join operaton based on dept_id
inner_join_df <- merge(x = emp_df, y = dept_df, by = "dept_id")
inner_join_df

Yields below output.

r merge function

The above code only includes rows where dept_id is present in both emp_df and dept_df.

Left Join using Merge() function in R

To perform a left join on given data frames it can return all rows from the left data frame and the matched rows from the right data frame based on the specified key column. If there is no match, the columns from the right data frame will contain NA values.


# Perform left join operaton based on dept_id
left_join_df <- merge(x = emp_df, y = dept_df, by = "dept_id", all.x = TRUE)
left_join_df

# Output:
# > left_join_df
#   dept_id emp_id     name superior_emp_id dept_branch_id.x dept_name dept_branch_id.y
# 1      10      1    Smith              -1              101   Finance              101
# 2      10      3 Williams               1              101   Finance              101
# 3      10      4    Jones               2              101   Finance              101
# 4      20      2     Rose               1              102 Marketing              102
# 5      40      5    Brown               2              104        IT              104
# 6      50      6    Brown               2              105      <NA>               NA

In the resulting data frame, left_join_df, will contain all the columns from both emp_df and dept_df, merged based on the dept_id column, with all rows from emp_df and matching rows from dept_df. If there are no matches, columns from dept_df will have NA values in the corresponding positions.

Right Join using merge()

A right join operation is similar to a left join but includes all rows from the right and matched rows from the left data frame. Non-matching rows from the left data frame will have NA values in columns from the left data frame.


# Perform right join operaton based on dept_id
right_join_df <- merge(x = emp_df, y = dept_df, by = "dept_id", all.y = TRUE)
right_join_df

# Output:
> right_join_df
#   dept_id emp_id     name superior_emp_id dept_branch_id.x dept_name dept_branch_id.y
# 1      10      1    Smith              -1              101   Finance              101
# 2      10      3 Williams               1              101   Finance              101
# 3      10      4    Jones               2              101   Finance              101
# 4      20      2     Rose               1              102 Marketing              102
# 5      30     NA     <NA>              NA               NA     Sales              103
# 6      40      5    Brown               2              104        IT              104

In the resulting data frame, right_join_df, will contain all the columns from both emp_df and dept_df, merged based on the dept_id column, with all rows from dept_df and matching rows from emp_df. If there are no matches, columns from emp_df will have NA values in the corresponding positions.

Outer Join using merge() Function in R

You can perform an outer join on given data frames to return all rows from both. If there is a matching row based on the specified key column, the columns populate with values. If there are no matching rows will have NA values.


# Perform outer join operaton based on dept_id
outer_join_df <- merge(x = emp_df, y = dept_df, by = "dept_id", all = TRUE)
outer_join_df

# Output:
# > outer_join_df
#   dept_id emp_id     name superior_emp_id dept_branch_id.x dept_name dept_branch_id.y
# 1      10      1    Smith              -1              101   Finance              101
# 2      10      3 Williams               1              101   Finance              101
# 3      10      4    Jones               2              101   Finance              101
# 4      20      2     Rose               1              102 Marketing              102
# 5      30     NA     <NA>              NA               NA     Sales              103
# 6      40      5    Brown               2              104        IT              104
# 7      50      6    Brown               2              105      <NA>               NA

The above code returns all rows from both emp_df and dept_df and non-matching rows from either side will have NA values in the respective columns.

Conclusion

In this article, I have explained, the merge() function in R is a powerful tool for combining/merging data frames based on common columns. The function supports various types of join operations, including inner join, left join, right join, and outer join. Users can specify the columns for merging and control the behavior of the join through parameters such as all, all.x, all.y, and others.

Happy Learning!!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium