In pandas, when merging two DataFrames with columns that have the same names, you can use the suffixes
parameter to differentiate between columns from each DataFrame. This avoids column name collisions in the resulting DataFrame.
In this article, I will explain how suffixes are string labels used to differentiate overlapping column names when merging two DataFrames.
Key Points –
- The suffixes parameter is used to handle column name collisions when merging two DataFrames with overlapping column names.
- If
suffixes
is not specified, pandas automatically use the suffixes_left
and_right
for columns from the left and right DataFrames, respectively. - If multiple columns have the same name, all of them will receive the specified suffixes, preserving their uniqueness.
- The use of suffixes is applicable across different types of merges (
inner
,outer
,left
,right
), helping to maintain clarity regardless of the merge operation. - Clear suffixes enhance the readability of the resulting DataFrame, making it easier for users to identify the source of each column.
Usage of Pandas Merge Suffixes
The suffixes
parameter in the pandas.merge()
function is a powerful tool that helps manage column name collisions when merging two DataFrames that have overlapping column names (other than the merge key).
Let’s create DataFrames of two to demonstrate how merge suffixes function.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print("First DataFrames:\n", df1)
print("Second DataFrames:\n", df2)
Yields below output.
Suffixes as _left and _right
To merge the two DataFrames (df1
and df2
) using the Courses
column as the key, while applying custom suffixes (_left
for df1
and _right
for df2
), you can use the pd.merge()
function with the suffixes
argument.
# Merging the dataframes on 'Courses' with suffixes '_left' and '_right'
merged_df = pd.merge(df1, df2, on='Courses', how='outer', suffixes=('_left', '_right'))
print("Merged DataFrame with suffixes:\n", merged_df)
Here,
Courses
that are common to both dataframes (e.g., “Spark”, “Python”) have their respective columns from both dataframes.- For rows where there is no matching course, you get
NaN
values in the columns that didn’t have a match.
Different Columns, No Suffixes Needed
Alternatively, when merging two DataFrames that have no overlapping column names (except the merge key), suffixes are not needed because there are no column name collisions. In this case, each column remains unique in the merged DataFrame, so specifying suffixes becomes unnecessary.
# Merge the DataFrames on the 'Courses' column without suffixes
# (since there are no overlapping columns)
df3= pd.merge(df1, df2, on='Courses', how='outer')
print("Merged DataFrame without suffixes (different columns):\n", df3)
Here, no suffixes are needed because there are no conflicting column names, except for the index used for merging. This example yields the above output.
Custom Suffixes with Different Column Names
When merging DataFrames that have different column names but you still want to apply custom suffixes, you can specify suffixes for the overlapping columns. This is more relevant when merging on a key column but also if there are other columns that have the same name in both DataFrames.
# Merge the DataFrames on the 'Courses' column with custom suffixes for overlapping columns
df3 = pd.merge(df1, df2, on='Courses', how='outer', suffixes=('_df1', '_df2'))
print("Merged DataFrame with custom suffixes '_df1' and '_df2':\n", df3)
Here,
- The
Courses
column is the key for merging, and since it’s present in both DataFrames, it is included as is. - The
Fee
andDuration
columns fromdf1
appear without suffixes since they are unique to that DataFrame. - The overlapping
Discount
columns are suffixed with_df1
(fromdf1
) and_df2
(fromdf2
). - Rows that don’t match in either DataFrame are filled with
NaN
for the missing values.
This example yields the above output.
Suffixes with Inner Join on Multiple Columns
Similarly, to perform an inner join on multiple columns while applying suffixes to overlapping column names, you can use the pd.merge()
function with the on
parameter specifying the list of columns you want to join on, and the suffixes
parameter for any overlapping columns.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Fee' :[20000,25000,22000,30000],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
# Merge the DataFrames on multiple columns with custom suffixes
df3 = pd.merge(df1, df2, on=['Courses', 'Fee'], how='inner', suffixes=('_df1', '_df2'))
print("Merged DataFrame with inner join on multiple columns and custom suffixes:\n", df3)
# Output:
# Merged DataFrame with inner join on multiple columns and custom suffixes:
# Courses Fee Duration Discount
# 0 Spark 20000 30days 2000
# 1 Python 22000 35days 1200
Here,
- The merge is performed on the
Courses
andFee
columns. - Only the rows where both
Courses
andFee
match indf1
anddf2
are included in the result. - The
Value
columns from both DataFrames are distinguished by the suffixes_df1
and_df2
, which clarify their origin.
FAQ on Pandas Merge Suffixes
Suffixes are string labels added to overlapping column names when merging two DataFrames. They help distinguish between columns that share the same name in both DataFrames.
You can specify suffixes using the suffixes
parameter in the pd.merge()
function.
If you don’t specify suffixes and there are overlapping column names, Pandas automatically appends _x
and _y
to distinguish them. For example, if both DataFrames have a column named Discount
, the resulting DataFrame will have Discount_x
and Discount_y
.
You can still use custom suffixes even if the columns are different. However, they will only apply to overlapping column names. If there are no overlapping names, the suffixes will not affect the merged DataFrame.
Suffixes only apply to column names. If there are overlapping index names, they will not be modified automatically. You may need to manage overlapping indices manually.
Conclusion
In this discussion, we explored various scenarios of merging DataFrames using Pandas
, focusing particularly on the use of suffixes
to handle overlapping column names.
Happy Learning!!
Related Articles
- Pandas Join Two DataFrames
- Pandas Left Join Explained By Examples
- Differences between Pandas Join vs Merge
- How to Merge Series into Pandas DataFrame
- Pandas Concatenate Two Columns
- Pandas Join Explained With Examples
- Pandas Outer Join Explained By Examples
- Pandas Join DataFrames on Columns
- Pandas Merge DataFrames Explained Examples
- Pandas Combine Two DataFrames With Examples
- Pandas Merge Indicator Explained With Examples