• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:18 mins read
You are currently viewing Pandas Merge DataFrames Explained Examples

pandas support pandas.merge() and DataFrame.merge() to merge DataFrames which is exactly similar to SQL join and supports different types of join inner, left, right, outer, cross. By default, it uses inner join where keys don’t match the rows get dropped from both DataFrames, and the result DataFrame contains rows that match on both.

Advertisements

In this article, I will explain how to merge DataFrames with examples like merging by columns, merging by index, merging on multiple columns, and different join types. In case, if you want to merge on Indexes use pandas join() which default supports joining on the index.

Pandas Merge Key Points

  • pandas.merge() and DataFrame.merge() work the same way that merges two or more DataFrames.
  • This works similarly to Database joins
  • Supports joins on rows and columns
  • When doing a join on columns, it ignores indexes.
  • When joining on the index, the resultant DataFrame contains indexes from sources.
  • When no params are used, by default join happens on all common columns.

1. Quick Examples of pandas merge() DataFrames

Below are some quick examples of the pandas merge() method that merges two or multiple DataFrames.


# Quick Examples of Merging pandas DataFrames

# Example 1: Pandas.merge()
df3=pd.merge(df1,df2)

# Example 2: DataFrame.merge()
df3=df1.merge(df2)

# Example 3: Merge by column
df3=pd.merge(df1,df2, on='Courses')

# Example 4: Merge on different colunn names
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')

# Example 5: Merge by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)

# Example 6: Merge by multiple columns
df3 = pd.merge(df3, df1,  how='left', left_on=['Col1','col2'], right_on = ['col1','col2'])

# Example 7: Merge by left join
df3=pd.merge(df1,df2, on='Courses', how='left')

# Example 8: Merge by right join
df3=pd.merge(df1,df2, on='Courses', how='right')

# Example 9: Merge by outer join
df3=pd.merge(df1,df2, on='Courses', how='outer')

2. Pandas Merge DataFrames

pandas.merge() and DataFrame.merge() are used to merge two DataFrames or multiple DataFrames. Both these methods work exactly the same and they also take a similar number of params. Merging DataFrames is nothing but joining DataFrames similar to Database join.

Similar to the Database join, the merge() method also supports several join types like left, right, inner, outer and cross. Use how param to specify the join type.

By default merge() uses inner join on columns that are present on both DataFrames. You can also explicitly specify the columns you want to join and join by row index.

Though DataFrame also has a join() method there are slight differences between merge() and join() methods.

Now, let’s see syntax and usage with examples. By using merge you can also merge series into DataFrame.

2.1 pandas.merge() Syntax

Below is the syntax and usage of pandas.merge() method. For the latest syntax refer to pandas.merge()


# Pandas.merge() Syntax
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

2.2 pandas.DataFrame.merge() Syntax

Below is the syntax and usage of pandas.DataFrame.merge() method.


# Pandas.DataFrame.merge() Syntax
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Note: Both these merge() methods are similar and take the same type of arguments.

3. Pandas Merge DataFrames Examples

Let’s create DataFrames of two to demonstrate how merge works.


# Create DataFrames
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
print("First DataFrame:", df1)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print("Second DataFrame:", df2)

Yields below output.

pandas merge DataFrames

3.1 pandas.merge() DataFrames

In my first example of merge(), I will use default params where it does inner join on the same columns presented on both DataFrames. From our DataFrame the common column is Courses. If you want to concat DataFrames use pandas.concat() method.


# Using pandas.merge()
df3= pd.merge(df1,df2)

# Using DataFrame.merge()
df3=df1.merge(df2)
print("After merging the two DataFrames:\n", df3)

Yields below output.

pandas merge DataFrames

3.2 Merge DataFrames on Columns

Sometimes it is required to specify the column names you want to merge, to do so use on param. For example df3=pd.merge(df1,df2, on='Courses').


# Merge DataFrames by Columns
df3=pd.merge(df1,df2, on='Courses')

Yields the same output as above first example. In case if you have different columns on the left and right and want to join on these use left_on and right_on params.


# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')
print(df3)

3.3 Merge DataFrames on Index

You can also Merge DataFrames by Index using left_index and right_index params. If you notice the above examples, it ignored the index from the merged DataFrame result. Joining on Index retains the indexes.

Note that you can also use the index from the left and columns from the right DataFrame and vice-versa to perform the merge.


# Merge DataFrames by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)
print(df3)

Yields


# Output:
   Courses_x    Fee Duration Courses_y  Discount
r1     Spark  20000   30days     Spark      2000
r3    Python  22000   35days    Python      1200

3.4 Merging DataFrames on Multiple Columns

Use param on with a list of column names when you want to merge DataFrames on multiple columns. It also supports different column names on the left and right DataFrames, below is an example.


# Use pandas.merge() to on multiple columns
df3 = pd.merge(df3, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df3)

# Output:
#   Courses    Fee Duration_x  Discount Duration_y
# 0   Spark  20000     30days      2000     30days
# 1  Python  22000     35days      1200     35days

3.5 Merging Using Different Join Types

Merge is similar the SQL join hence, it supports different join types inner, left, right, and outer. By default, merge uses inner join. Let’s see other join types in this section.

Left Join – It is used to perform left join on DataFrames, also called Left Outer Join that returns all rows from the left DataFrame regardless of the match found on the right DataFrame. When the join expression doesn’t match, it assigns null for that record and drops records from the right where the match is not found.


# Merge by left Join
df3=pd.merge(df1,df2, on='Courses', how='left')
print(df3)

# Output:
#    Courses    Fee Duration  Discount
# 0    Spark  20000   30days    2000.0
# 1  PySpark  25000   40days       NaN
# 2   Python  22000   35days    1200.0
# 3   pandas  30000   50days       NaN

Right Join – Also called Right Outer Join – This join is the opposite of left join, here it returns all rows from the right DataFrame regardless of math found on the left. When the join expression doesn’t match, it assigns null for that record and drops records from the left where a match is not found.


# Merge by right Join
df3=pd.merge(df1,df2, on='Courses', how='right')
print(df3)

# Output:
#   Courses      Fee Duration  Discount
# 0   Spark  20000.0   30days      2000
# 1    Java      NaN      NaN      2300
# 2  Python  22000.0   35days      1200
# 3      Go      NaN      NaN      2000

Outer Join – It is used to perform outer join on DataFrames, Also called Full Outer Join – Returns all rows from both DataFrames. Where the join expression doesn’t match it returns null on respective cells.


# Merge by outer Join
df3=pd.merge(df1,df2, on='Courses', how='outer')
print(df3)

# Outputs:
#    Courses      Fee Duration  Discount
# 0    Spark  20000.0   30days    2000.0
# 1  PySpark  25000.0   40days       NaN
# 2   Python  22000.0   35days    1200.0
# 3   pandas  30000.0   50days       NaN
# 4     Java      NaN      NaN    2300.0
# 5       Go      NaN      NaN    2000.0

5. Complete Examples


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

# pandas.merge()
df3=pd.merge(df1,df2)
print(df3)

# DataFrame.merge()
df3=df1.merge(df2)
print(df3)

# Merge by column
df3=pd.merge(df1,df2, on='Courses')
print(df3)

# Merge on different colunn names
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')
print(df3)

# Merge by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)
print(df3)

# Use pandas.merge() to on multiple columns
df3 = pd.merge(df3, df1,  how='left', left_on=['Col1','col2'], right_on = ['col1','col2'])
print(df3)

# Merge by left join
df3=pd.merge(df1,df2, on='Courses', how='left')
print(df3)

# Merge by right join
df3=pd.merge(df1,df2, on='Courses', how='right')
print(df3)

# Merge by outer join
df3=pd.merge(df1,df2, on='Courses', how='outer')
print(df3)
What is merging in Pandas?

Merging in Pandas refers to the process of combining two or more DataFrames into a single DataFrame based on a common column or index.

How do I merge two DataFrames in Pandas?

You can merge two DataFrames in Pandas using the pd.merge() function. This function takes the two DataFrames to be merged as its primary arguments and allows you to specify various options for how the merge should be performed.

Can I merge on multiple columns?

Yes, you can merge on multiple columns by passing a list of column names to the on parameter. For example, pd.merge(df1, df2, on=['col1', 'col2']) will merge based on both ‘col1’ and ‘col2’.

What if the column names are different in the two DataFrames?

If the column names are different, you can use the left_on and right_on parameters to specify the corresponding columns in the left and right DataFrames.

Conclusion

I hope you have learned the pandas merge() method that merges the DataFrame. Merging is a join operation that combines the columns from both DataFrames. Merge also supports joining on columns, index, and a combination of both.

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium