• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:19 mins read
You are currently viewing Differences between Pandas Join vs Merge

In this article, you will learn the difference between pandas join() vs merge() methods on pandas DataFrames with examples and use cases of each. pandas provide these two methods to combine DataFrames on columns and indexes however there are some differences between these two.

Advertisements

Pandas Join vs Merge Key Points

  • join() method is used to perform join on row indices and doesn’t support joining on columns unless setting column as an index.
  • join() by default performs left join.
  • merge() method is used to perform join on indices, columns, and a combination of these two.
  • merge() by default performs inner join.
  • Both these methods support inner, left, right, and outer join types. merge additionally supports the cross-join.

Pandas Join vs Merge Differences

Both methods join() and merge() are used to perform joining pandas DataFrames on columns meaning, it combines all columns from two or more DataFrames into a single DataFrame. The main difference between join vs merge would be; join() is used to combine two DataFrames on the index but not on columns whereas merge() is primarily used to specify the columns you want to join on, this also supports joining on indexes and combination of index and columns.

Both these methods support left on the column and right on the index however, merge additionally supports left on the index and right on the column.

There is another difference, join() by default performs left join whereas merge() default uses inner join.

Below are some similarities and differences between pandas join() vs merge() methods.

Join Featurejoin()merge()
innerYesYes
leftYesYes
rightYesYes
outerYesYes
crossXYes
Join on IndicesYesYes
Join on ColumnsXYes
Left on the index, right on columnYesYes
Left on the index, right on the columnXYes
pandas join vs merge

2. Pandas Join vs Merge Examples

Now, let’s explore the differences between join vs merge with examples. First, let’s create two DataFrames with one common column on both DataFrames.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

Yields below DataFrames

pandas join vs merge
Left DataFrame
pandas join vs merge
Right DataFrame

3. Join on Index vs Merge on Index

As I said above, join is used to perform combining DataFrames on row indexes, it is the most efficient way to use for indices. let’s see with an example.


# pandas left join two DataFrames by Index
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')

Yields below output. Notice that it performed left join on two DataFrames as left join is the default behavior of the join method.

pandas join vs merge

In order to do the same operation with merge(), you need to explicitly specify the left join using how param.


# Merge two DataFrames by index using pandas.merge()
df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
print(df3)

Yields below output.

pandas join vs merge

4. Join on Column vs Merge on Column

merge() allows us to use columns in order to combine DataFrames and by default, it uses inner join. Below example by default join on Courses the column as this is the only common column in both DataFrames.


# pandas merge - inner join by Column
df3=pd.merge(df1,df2)

Yields below output.

pandas difference join merge

You can also explicitly specify the column you want to join on


# Merge DataFrames by Column
df3=pd.merge(df1,df2, on='Courses')

if you have column names different on left and right DataFrames, you can specify the column names


# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')

To perform a similar operation with the join() method, you need to set the column to index usingset_index().


# Use join on column
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

5. Complete Example


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

# pandas left join two DataFrames by Index
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')

# Merge two DataFrames by index using pandas.merge()
df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
print(df3)

# pandas merge - inner join by Column
df3=pd.merge(df1,df2)
print(df3)

# Merge DataFrames by Column
df3=pd.merge(df1,df2, on='Courses')
print(df3)

# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')
print(df3)

# Use join on column
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

Frequently Asked Questions

What is the difference between Pandas merge and join operations?

Merge: Pandas merge() is a versatile method for combining two or more DataFrames based on common columns or indices. It allows you to specify the columns to join on and the type of join (inner, outer, left, or right).
Join: join() is a specific instance of the merge() method used for combining DataFrames based on their indices. It is a more concise way to perform inner and left joins when you want to join DataFrames on their indices.

What is the primary difference between Pandas join and merge?

The primary difference lies in the way they combine DataFrames. merge is a more general-purpose method that allows joining on arbitrary columns, while join is a DataFrame method specifically designed for combining DataFrames based on their indices.

How do the default behaviors of merge and join differ?

The default behavior of merge is to perform an inner join, and you can change it using the how parameter. On the other hand, the default behavior of join is a left join, where the index of the calling DataFrame is used to join with the index of the other DataFrame.

Can you join DataFrames with different index/column names using merge and join?

You can join DataFrames with different column names using merge by specifying the left_on and right_on parameters. However, join only works on indices, so it’s more restrictive in terms of aligning DataFrames based on their inde

When should I use merge and when should I use join?

Use merge when you need flexibility, want to join on columns with different names, or perform more complex join operations. Use join when you want to combine DataFrames based on their indices, and you prefer a more concise syntax for index-based joins.

How do they handle overlapping column names in the result?

In merge, you can specify suffixes for overlapping column names using the suffixes parameter. In join, if there are columns with the same name in both DataFrames, Pandas automatically appends _x and _y to differentiate them.

6. Conclusion

In this article, you have learned the key differences between pandas joining and merging. join() method is used to perform join on row indices and doesn’t support joining on columns unless setting column as an index. join() by default performs left join. merge() method is used to perform join on indices, columns, and a combination of these two. merge() by default performs inner join. Both these methods support inner, left, right, and outer join types. merge additionally supports cross-join.

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium