To merge two pandas DataFrames on multiple columns, you can use the merge()
function and specify the columns to join on using the on
parameter. This function is considered more versatile and flexible and we also have the same method in DataFrame.
In this article, I will explain how to merge two Pandas DataFrames by multiple columns when columns on the left and right DataFrames are the same and when column names are different.
Key Points –
- Pandas provides the
merge()
function to combine DataFrames based on common columns. - Merging on multiple columns is achieved by passing a list of column names to the
on
parameter. - Multi-column merges in pandas provide more precise control over data integration by requiring multiple columns to match for a row to be included in the result.
- Merging on multiple columns can be useful for complex data integration tasks where a single column match might not be sufficient.
Quick Examples of Merging DataFrames on Multiple Columns
Following are quick examples of merging two DataFrames on multiple columns.
# Quick examples of merging dataframes on multiple columns
# Example 1: Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)
# Example 2: Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])
# Example 3: Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
# Example 4: Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")
# Example 5: Set value of on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")
To run some examples of merging pandas DataFrames on multiple columns, let’s create a Pandas DataFrame.
# Create Pandas DataFrame
import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
'Fee' : [20000,25000,30000,24000,40000],
'Duration':['30day','40days','60days','55days','50days']})
df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
'Fee': [20000,25000,30000,24000,40000],
'Percentage':['10%','20%','25%','20%','10%']})
print("First DataFrame:\n", df)
print("Second DataFrame:\n", df1)
Yields below output.
Merge Default Pandas DataFrame Without Any Key Column
You can pass two DataFrames to be merged into the pandas.merge() method. This function collects all common columns in both DataFrames and replaces each common column in both DataFrames with a single one. It merges the DataFrames df and df1 assigned to merged_df
.
By default, the merge()
method performs an inner join on all columns that are common to both DataFrames. We have the columns Courses
and Fee
common to both the DataFrames.
# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df, df1)
print("After merging the DataFrames:\n", merged_df)
Output.
Use pandas.merge() to Multiple Columns
You can also explicitly specify the column names to use for the join. To specify column names use on
param of the merge()
method. This also takes a list of names when you want to merge multiple columns.
# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])
print("After merging the DataFrames:\n", df2)
# Output:
# After merging the DataFrames:
# Courses Fee Duration Percentage
# 0 PySpark 25000 40days 20%
# 1 Python 30000 60days 25%
# 2 pandas 24000 55days 20%
Use pandas.merge() when Column Names Different
When you have column names on the left and right that are different and want to use these as a join column, use left_on
and right_on
parameters. This also takes a list of column names as values to merge on multiple columns. The left_on
will be set to the name of the column in the left DataFrame and right_on
will be set to the name of the column in the right DataFrame.
# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1, how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print("After merging the DataFrames:\n", df2)
# Output:
# After merging the DataFrames:
# Courses Fee Duration Percentage
# 0 Spark 20000 30day NaN
# 1 PySpark 25000 40days 20%
# 2 Python 30000 60days 25%
# 3 pandas 24000 55days 20%
# 4 Java 40000 50days NaN
Merge Pandas DataFrame Based on Single Column
If you want to merge DataFrames based on a single key column, you can simply pass the column name as a string to the on
parameter. For example:
# Set value on parameter
# To specify the key value for merging in pandas
merged_df = pd.merge(df, df1, on="Courses")
print(merged_df)
Yields below output.
Courses Fee_x Duration Fee_y Percentage
0 PySpark 25000 40days 25000 20%
1 Python 30000 60days 30000 25%
2 pandas 24000 55days 24000 20%
3 Java 40000 50days 20000 10%
Complete Example For Merge DataFrames
import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
'Fee' : [20000,25000,30000,24000,40000],
'Duration':['30day','40days','60days','55days','50days']})
df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
'Fee': [20000,25000,30000,24000,40000],
'Percentage':['10%','20%','25%','20%','10%']})
print(df)
print(df1)
# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)
print(merged_df)
# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1, how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df2)
# Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")
print(merged_df)
# Set value on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")
print(merged_df)
FAQ on Merge Multiple Columns
To merge two DataFrames based on a single column, you can use the merge()
function and specify the on
parameter with the column name. For example, merged_df = pd.merge(df1, df2, on='common_column')
You can use the left_index
and right_index
parameters to merge on the left and right DataFrames’ indices.
You can merge DataFrames on their indices in pandas. You can achieve this by setting the left_index
and right_index
parameters to True
. This is useful when you need to merge DataFrames based on their row labels rather than columns.
Merging will align DataFrames on the specified columns. Rows with matching values in the columns will be combined, while rows without matches will be handled according to the how
parameter (e.g., kept in an outer join).
Conclusion
In conclusion, the pandas.merge()
method is a versatile tool for merging DataFrames on multiple columns. By specifying the columns to merge on, you can combine DataFrames in various ways to suit your data analysis needs.
Happy Learning !!
Related Article
- Combine Two Pandas DataFrames With Examples
- Convert Pandas DataFrame to Dictionary (Dict)
- Iterate Over Columns of Pandas DataFrame
- Pandas Get Total | Sum of Column
- Pandas Merge Multiple DataFrame
- Pandas Merge DataFrames on Index
- Pandas Concat Two DataFrames Explained
- Pandas combine two Series
- Pandas Sort by Column Values DataFrame
- Pandas iterate over the columns Of DataFrame