How to do left join on data frames in R? To perform left join use either merge()
function of base R or left_join()
function from the dplyr package. The dplyr
function is generally preferred because it is more efficient than the base R approach. The dplyr
package offers various functions to join the data frame in R.
In R, the left join is used to perform left joining on two data frames based on the common columns of single/multiple and return the data frame having all rows from the left data frame and any matching rows from the right data frame. Unmatched rows from the right data frame will have NA values in the columns from the right data frame.
Key points-
- In a left join, all the rows from the left data frame are included in the result, regardless of whether there is a match in the right data frame or not. If there is no match in the right data frame, the corresponding columns from the right data frame will contain missing values (NA).
- Rows from the right data frame that have matching values in the specified join columns are included in the result. These rows are merged with the corresponding rows from the left data frame based on the common column(s).
- Left joins performs using the
merge()
function in base R or using functions from thedplyr
package, such asleft_join()
. The syntax may vary slightly between these approaches, but the fundamental concept remains the same. - When performing a left join, you need to specify the columns used for matching between the left and right data frames. These columns should have the same name in both data frames and contain the values used for the join.
1. Quick Examples of Performing Left Join in DatFrame
Following are quick examples of performing left join on data frames.
# Quick Examples of performing left join in dataframe
# Example 1: Left join
df2 <- merge(x=emp_df,y=dept_df,
by="dept_id", all.x=TRUE)
# Example 2: Left join on multiple columns
df2 <- merge(x=emp_df,y=dept_df,
by=c("dept_id","dept_branch_id"), all.x=TRUE)
# Example 3: Left join on different columns
df2 <- merge(x=emp_df,y=dept_df,
by.x=c("dept_id","dept_branch_id"),
by.y=c("dept_id","dept_branch_id"),
all.x=TRUE)
# Example 4: Load dplyr package
library(dplyr)
# Example 5: Using dplyr - left join multiple columns
df2 <- emp_df %>% left_join( dept_df,
by=c('dept_id','dept_branch_id'))
# Example 6: Using dplyr - left join on different columns
df2 <- emp_df %>% left_join( dept_df,
by=c('dept_id'='dept_id',
'dept_branch_id'='dept_branch_id'))
# Example 7: Load tidyverse package
library(tidyverse)
# Left outer Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(left_join, by='dept_id')
df2
Let’s create two Data Frames, in the below example dept_id
and dept_branch_id
columns exist on both emp_df
and dept_df
data frames.
# Create emp Data Frame
emp_df=data.frame(
emp_id=c(1,2,3,4,5,6),
name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
superior_emp_id=c(-1,1,1,2,2,2),
dept_id=c(10,20,10,10,40,50),
dept_branch_id= c(101,102,101,101,104,105)
)
# Create dept Data Frame
dept_df=data.frame(
dept_id=c(10,20,30,40),
dept_name=c("Finance","Marketing","Sales","IT"),
dept_branch_id= c(101,102,103,104)
)
emp_df
dept_df
Yields below output.
2. Left Join using merge()
R base provides a merge()
function that is used to perform a left join on two or more (multiple) data frames. This function takes x
and y
data frames as left and right respectively and finally specify the by
param with the column name you want to join.
The following example performs a left join on the column dept_id
on emp_df
and dept_df
column. To perform left join use all.x=TRUE
.
# Left join
df2 <- merge(x=emp_df,y=dept_df,
by="dept_id", all.x=TRUE)
df2
Yields below output. if you have the same column names that are not used in the join condition, it suffixes the x and y to the columns on the result. In the below example check dept_branch_id
.
From our dataset, dept_id
50 doesn’t have a record on dept
dataset hence, this record contains NA on dept
columns (dept_name & dept_id). and dept_id
30 from dept
dataset dropped from the results. Below is the result of the above Join expression.
2.1 Left Join on Multiple Columns
To perform a left join on multiple columns with the same names on both data frames, use all the column names as a list to by
param. I have also created a dedicated article where I have explained how to perform join on multiple columns using several ways.
# Using merge with same column names
df2 <- merge(x=emp_df,y=dept_df,
by=c("dept_id","dept_branch_id"),
all.x=TRUE)
df2
Yields below output.
2.2 Left Join Different Column Names
Sometimes you will have data frames with different column names and you want to perform a left join on these columns. To do so specify the column names from both data frames with params by.x
and by.y
.
# R left join multiple columns
df2 <- merge(x=emp_df,y=dept_df,
by.x=c("dept_id","dept_branch_id"),
by.y=c("dept_id","dept_branch_id"),
all.x=TRUE)
df2
Since our data frame has the same column names, it results in the same output as above, I have created another article explaining how to perform join on different column names.
3. Using dplyr to Perform Left Join in R
To perform a left join on two data frames, the best approach is to use the left_join()
function from the dplyr package. First, you need to install dplyr using install.packages(‘dplyr’) and load it with library(dplyr)
.
All functions in the dplyr package accept a data frame as their first argument. When using dplyr, it is common to utilize the infix operator %>%
from the magrittr package. This operator passes the left-hand side to the first argument of the function on the right-hand side. For example, x %>% f(y)
is equivalent to f(x, y)
, allowing the result from the left-hand side to be “piped” into the function on the right-hand side.
3.1 Multiple Columns with dplyr package
# Load dplyr package
library(dplyr)
# Using dplyr - left join multiple columns
df2 <- emp_df %>% left_join( dept_df,
by=c('dept_id','dept_branch_id'))
df2
Yields below output. Since we are using both columns on the join we are not seeing duplicate columns.
3.2 Using Different Column Names with dplyr package
# Load dplyr package
library(dplyr)
# join on multiple columns
df2 <- emp_df %>% left_join( dept_df,
by=c('dept_id'='dept_id',
'dept_branch_id'='dept_branch_id'))
df2
Yields the same output as above.
4. Using tidyverse Package
By using reduce()
function from the tidyverse package you can perform join on multiple data frames, to perform left join use the left_join keyword. If you want to the left join multiple data frames, pass all data frames as a list to the reduce() function.
# Load tidyverse package
library(tidyverse)
# Left Join data.frames
list_df = list(emp_df,dept_df)
df2 <- list_df %>% reduce(left_join, by='dept_id')
df2
5. Conclusion
In this article, you have learned how to perform a left join on two data frames using the R base merge() function, left_join() functions from the dplyr package, and reduce() from the tidyverse. Using the dplyr approach is preferable for joining larger datasets because it performs more efficiently than base R.
Related Articles
- R Join Multiple Data Frames
- R Semi Join of Data Frames
- R Anti Join of data frames
- R Outer Join of the data frame
- R Right Join of the data frame
- R Inner Join of the data frame