• Post author:
  • Post category:Pandas
  • Post last modified:December 5, 2024
  • Reading time:14 mins read
You are currently viewing Pandas Merge Suffixes Explained Examples

In pandas, when merging two DataFrames with columns that have the same names, you can use the suffixes parameter to differentiate between columns from each DataFrame. This avoids column name collisions in the resulting DataFrame.

Advertisements

In this article, I will explain how suffixes are string labels used to differentiate overlapping column names when merging two DataFrames.

Key Points –

  • The suffixes parameter is used to handle column name collisions when merging two DataFrames with overlapping column names.
  • If suffixes is not specified, pandas automatically use the suffixes _left and _right for columns from the left and right DataFrames, respectively.
  • If multiple columns have the same name, all of them will receive the specified suffixes, preserving their uniqueness.
  • The use of suffixes is applicable across different types of merges (inner, outer, left, right), helping to maintain clarity regardless of the merge operation.
  • Clear suffixes enhance the readability of the resulting DataFrame, making it easier for users to identify the source of each column.

Usage of Pandas Merge Suffixes

The suffixes parameter in the pandas.merge() function is a powerful tool that helps manage column name collisions when merging two DataFrames that have overlapping column names (other than the merge key).

Let’s create DataFrames of two to demonstrate how merge suffixes function.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print("First DataFrames:\n", df1)
print("Second DataFrames:\n", df2)

Yields below output.

pandas merge suffixes

Suffixes as _left and _right

To merge the two DataFrames (df1 and df2) using the Courses column as the key, while applying custom suffixes (_left for df1 and _right for df2), you can use the pd.merge() function with the suffixes argument.


# Merging the dataframes on 'Courses' with suffixes '_left' and '_right'
merged_df = pd.merge(df1, df2, on='Courses', how='outer', suffixes=('_left', '_right'))
print("Merged DataFrame with suffixes:\n", merged_df)

Here,

  • Courses that are common to both dataframes (e.g., “Spark”, “Python”) have their respective columns from both dataframes.
  • For rows where there is no matching course, you get NaN values in the columns that didn’t have a match.
pandas merge suffixes

Different Columns, No Suffixes Needed

Alternatively, when merging two DataFrames that have no overlapping column names (except the merge key), suffixes are not needed because there are no column name collisions. In this case, each column remains unique in the merged DataFrame, so specifying suffixes becomes unnecessary.


# Merge the DataFrames on the 'Courses' column without suffixes 
# (since there are no overlapping columns)
df3= pd.merge(df1, df2, on='Courses', how='outer')
print("Merged DataFrame without suffixes (different columns):\n", df3)

Here, no suffixes are needed because there are no conflicting column names, except for the index used for merging. This example yields the above output.

Custom Suffixes with Different Column Names

When merging DataFrames that have different column names but you still want to apply custom suffixes, you can specify suffixes for the overlapping columns. This is more relevant when merging on a key column but also if there are other columns that have the same name in both DataFrames.


# Merge the DataFrames on the 'Courses' column with custom suffixes for overlapping columns
df3 = pd.merge(df1, df2, on='Courses', how='outer', suffixes=('_df1', '_df2'))
print("Merged DataFrame with custom suffixes '_df1' and '_df2':\n", df3)

Here,

  • The Courses column is the key for merging, and since it’s present in both DataFrames, it is included as is.
  • The Fee and Duration columns from df1 appear without suffixes since they are unique to that DataFrame.
  • The overlapping Discount columns are suffixed with _df1 (from df1) and _df2 (from df2).
  • Rows that don’t match in either DataFrame are filled with NaN for the missing values.

This example yields the above output.

Suffixes with Inner Join on Multiple Columns

Similarly, to perform an inner join on multiple columns while applying suffixes to overlapping column names, you can use the pd.merge() function with the on parameter specifying the list of columns you want to join on, and the suffixes parameter for any overlapping columns.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Fee' :[20000,25000,22000,30000],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

# Merge the DataFrames on multiple columns with custom suffixes
df3 = pd.merge(df1, df2, on=['Courses', 'Fee'], how='inner', suffixes=('_df1', '_df2'))
print("Merged DataFrame with inner join on multiple columns and custom suffixes:\n", df3)

# Output:
# Merged DataFrame with inner join on multiple columns and custom suffixes:
#    Courses    Fee Duration  Discount
# 0   Spark  20000   30days      2000
# 1  Python  22000   35days      1200

Here,

  • The merge is performed on the Courses and Fee columns.
  • Only the rows where both Courses and Fee match in df1 and df2 are included in the result.
  • The Value columns from both DataFrames are distinguished by the suffixes _df1 and _df2, which clarify their origin.

FAQ on Pandas Merge Suffixes

What are suffixes in Pandas merges?

Suffixes are string labels added to overlapping column names when merging two DataFrames. They help distinguish between columns that share the same name in both DataFrames.

How do I specify suffixes when merging DataFrames?

You can specify suffixes using the suffixes parameter in the pd.merge() function.

What happens if I don’t specify suffixes?

If you don’t specify suffixes and there are overlapping column names, Pandas automatically appends _x and _y to distinguish them. For example, if both DataFrames have a column named Discount, the resulting DataFrame will have Discount_x and Discount_y.

Can I use custom suffixes when the column names are different?

You can still use custom suffixes even if the columns are different. However, they will only apply to overlapping column names. If there are no overlapping names, the suffixes will not affect the merged DataFrame.

Do suffixes apply to index names?

Suffixes only apply to column names. If there are overlapping index names, they will not be modified automatically. You may need to manage overlapping indices manually.

Conclusion

In this discussion, we explored various scenarios of merging DataFrames using Pandas, focusing particularly on the use of suffixes to handle overlapping column names.

Happy Learning!!