Pandas Merge DataFrames on Multiple Columns

  • Post author:
  • Post category:Pandas / Python
  • Post last modified:January 19, 2023
Spread the love

To merge two pandas DataFrames on multiple columns use pandas.merge() method. merge() is considered more versatile and flexible and we also have the same method in DataFrame.

In this article, I will explain how to merge two pandas DataFrames by multiple columns when columns on left and right DataFrames are the same and when column names are different.

1. Quick Examples of Merge DataFrames on Multiple Columns

If you are in a hurry, below are some quick examples of how to merge two pandas DataFrames on multiple columns.


# Below are quick example

# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)

# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])

# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])

# Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")

# Set value of on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")

Now, let’s create a DataFrame with a few rows and columns, execute these examples and validate results. First DataFrame contains column names Courses, Fee, Duration and second DataFrame contains column names Courses,Fee,Percentage.


import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
                    'Fee' : [20000,25000,30000,24000,40000],
                    'Duration':['30day','40days','60days','55days','50days']})

df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
                    'Fee': [20000,25000,30000,24000,40000],
                    'Percentage':['10%','20%','25%','20%','10%']})
 
print(df)
print(df1)

Yields below output.


   Courses    Fee Duration
0    Spark  20000    30day
1  PySpark  25000   40days
2   Python  30000   60days
3   pandas  24000   55days
4     Java  40000   50days

    Courses    Fee Percentage
0      Java  20000        10%
1   PySpark  25000        20%
2    Python  30000        25%
3    pandas  24000        20%
4  Hyperion  40000        10%

2. Merge Default Pandas DataFrame Without Any Key Column

You can pass two DataFrame to be merged to the pandas.merge() method. This collects all common columns in both DataFrames and replaces each common column in both DataFrame with a single one. It merges the DataFrames df and df1 assigns to merged_df.

By default, the merge() method applies join contains on all columns that are present on both DataFrames and uses inner join. We have the columns Courses and Fee common to both the DataFrames.


# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)
print(merged_df)

Yields below output.


   Courses    Fee Duration Percentage
0  PySpark  25000   40days        20%
1   Python  30000   60days        25%
2   pandas  24000   55days        20%

2. Use pandas.merge() to Multiple Columns

You can also explicitly specify the column names you wanted to use for joining. To use column names use on param of the merge() method. This also takes a list of names when you wanted to merge on multiple columns.


# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])
print(df2)

Yields same output as above.

3. Use pandas.merge() when Column Names Different

When you have column names on left and right are different and want to use these as a join column, use left_on and right_on parameters. This also takes a list of column names as values to merge on multiple columns. The left_on will be set to the name of the column in the left DataFrame and right_on will be set to the name of the column in the right DataFrame.


# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df2)

Yields below output.


   Courses    Fee Duration Percentage
0    Spark  20000    30day        NaN
1  PySpark  25000   40days        20%
2   Python  30000   60days        25%
3   pandas  24000   55days        20%
4     Java  40000   50days        NaN

6. Complete Example For Merge DataFrames on Multiple Columns


import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
                    'Fee' : [20000,25000,30000,24000,40000],
                    'Duration':['30day','40days','60days','55days','50days']})

df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
                    'Fee': [20000,25000,30000,24000,40000],
                    'Percentage':['10%','20%','25%','20%','10%']})
 
print(df)
print(df1)

# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)
print(merged_df)

# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df2)

# Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")
print(merged_df)

# Set value of on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")
print(merged_df)

Conclusion

In this article, I have explained how to merge two pandas DataFrames on multiple columns using pandas.merge() method with examples.

Happy Learning !!

References

Leave a Reply

You are currently viewing Pandas Merge DataFrames on Multiple Columns