Pandas Merge DataFrames on Multiple Columns

  • Post author:
  • Post category:Pandas
  • Post last modified:December 1, 2023

To merge two pandas DataFrames on multiple columns use pandas.merge() method. merge() is considered more versatile and flexible and we also have the same method in DataFrame.

In this article, I will explain how to merge two Pandas DataFrames by multiple columns when columns on the left and right DataFrames are the same and when column names are different.

1. Quick Examples of Merging DataFrames on Multiple Columns

If you are in a hurry, below are some quick examples of merging two pandas DataFrames on multiple columns.


# Below are some quick examples of merging DataFrames on multiple columns

# Example 1: Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)

# Example 2: Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])

# Example 3: Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])

# Example 4: Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")

# Example 5: Set value of on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")

Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. The first DataFrame contains column names 'Courses''Fee', 'Duration' and the second DataFrame contains column names 'Courses', 'Fee', 'Percentage'.


# Create Pandas DataFrame
import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
                    'Fee' : [20000,25000,30000,24000,40000],
                    'Duration':['30day','40days','60days','55days','50days']})

df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
                    'Fee': [20000,25000,30000,24000,40000],
                    'Percentage':['10%','20%','25%','20%','10%']})
 
print("First DataFrame:\n", df)
print("Second DataFrame:\n", df1)

Yields below output.

pandas merge multiple columns

2. Merge Default Pandas DataFrame Without Any Key Column

You can pass two DataFrames to be merged into the pandas.merge() method. This function collects all common columns in both DataFrames and replaces each common column in both DataFrames with a single one. It merges the DataFrames df and df1 assigned to merged_df.

By default, the merge() method applies join contains on all columns that are present on both DataFrames and uses inner join. We have the columns Courses and Fee common to both the DataFrames.


# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df, df1)
print("After merging the DataFrames:\n", merged_df)

Yields below output.

pandas merge multiple columns

3. Use pandas.merge() to Multiple Columns

You can also explicitly specify the column names you want to use for joining. To specify column names use on param of the merge() method. This also takes a list of names when you want to merge multiple columns.


# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])
print("After merging the DataFrames:\n", df2)

Yields the same output as above.

4. Use pandas.merge() when Column Names Different

When you have column names on the left and right that are different and want to use these as a join column, use left_on and right_on parameters. This also takes a list of column names as values to merge on multiple columns. The left_on will be set to the name of the column in the left DataFrame and right_on will be set to the name of the column in the right DataFrame.


# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print("After merging the DataFrames:\n", df2)

Yields below output.


# Output:
# After merging the DataFrames:
   Courses    Fee Duration Percentage
0    Spark  20000    30day        NaN
1  PySpark  25000   40days        20%
2   Python  30000   60days        25%
3   pandas  24000   55days        20%
4     Java  40000   50days        NaN

5. Merge Pandas DataFrame Based on Single Column

If you want to merge DataFrames based on a single key column, you can simply pass the column name as a string to the on parameter. For example:


# Set value on parameter to specify the key value for merging in pandas
merged_df = pd.merge(df, df1, on="Courses")
print(merged_df)

Yields below output.


   Courses  Fee_x Duration  Fee_y Percentage
0  PySpark  25000   40days  25000        20%
1   Python  30000   60days  30000        25%
2   pandas  24000   55days  24000        20%
3     Java  40000   50days  20000        10%

6. Complete Example For Merge DataFrames on Multiple Columns


import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
                    'Fee' : [20000,25000,30000,24000,40000],
                    'Duration':['30day','40days','60days','55days','50days']})

df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
                    'Fee': [20000,25000,30000,24000,40000],
                    'Percentage':['10%','20%','25%','20%','10%']})
print(df)
print(df1)

# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)
print(merged_df)

# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df2)

# Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")
print(merged_df)

# Set value on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")
print(merged_df)

Frequently Asked Questions on Pandas merge multiple columns

How do I merge two DataFrames based on a single column?

To merge two DataFrames based on a single column, you can use the merge() function and specify the on parameter with the column name. For example, merged_df = pd.merge(df1, df2, on='common_column')

How can I merge based on multiple columns?

You can merge DataFrames based on multiple columns. Use the on parameter with a list of column names. For example, merged_df = pd.merge(df1, df2, on=['column1', 'column2'])

How do I merge on different columns in each DataFrame?

If the column names differ in the two DataFrames, you can use the left_on and right_on parameters. For example, merged_df = pd.merge(df1, df2, left_on='column1_df1', right_on='column2_df2')

What if I want to merge based on the index?

You can use the left_index and right_index parameters to merge on the left and right DataFrames’ indices. For example, merged_df = pd.merge(df1, df2, left_index=True, right_index=True)

How can I concatenate DataFrames along columns?

You can use the concat() function to concatenate DataFrames along columns. For example, concatenated_df = pd.concat([df1, df2], axis=1)

Conclusion

In this article, I have explained how to merge two pandas DataFrames on multiple columns using pandas.merge() method with examples.

Happy Learning !!

References

Malli

Malli is an experienced technical writer with a passion for translating complex Python concepts into clear, concise, and user-friendly articles. Over the years, he has written hundreds of articles in Pandas, NumPy, Python, and takes pride in ability to bridge the gap between technical experts and end-users.

Leave a Reply

You are currently viewing Pandas Merge DataFrames on Multiple Columns