• Post author:
  • Post category:Pandas
  • Post last modified:December 3, 2024
  • Reading time:20 mins read
You are currently viewing Pandas Merge Indicator Explained With Examples

In Pandas, the merge() function has an option called indicator that adds a special column to the output DataFrame, which indicates the source of each row. This is useful for understanding how the merge operation affected the data. The column created is typically named _merge and contains information on whether each row was found in the left DataFrame only, the right DataFrame only, or both.

Advertisements

In this article, I will explain the Pandas merge indicator by using its syntax, parameters, and usage, when you use the indicator parameter in a Pandas merge() function, it returns a merged DataFrame that includes an extra column to identify the source of each row. This additional column, which is named _merge by default (or a custom name if provided), can contain one of three string values for each row.

Key Points –

  • The indicator parameter in pd.merge() adds a column that tracks the source of each row in the merged DataFrame.
  • The default indicator column is named _merge unless customized by the user.
  • The _merge column contains three values: left_only, right_only, and both.
  • left_only indicates rows that appear only in the left DataFrame.
  • right_only indicates rows that appear only in the right DataFrame.
  • The indicator parameter is particularly useful in outer joins for analyzing data overlap and differences.
  • You can rename the indicator column by passing a string value to the indicator parameter.

Syntax of Pandas Merge Indicator

Let’s know the syntax of the pandas merge indicator.


# Syntax of pandas merge indicator
pd.merge(left, right, how='type_of_join', on='column_name', indicator=True)

Parameters

Following are the parameters of the pandas merge indicator.

  • left – The first DataFrame to merge.
  • right – The second DataFrame to merge.
  • how – Type of join to perform ('left', 'right', 'inner', 'outer').
  • on – Column or index level(s) on which to merge the DataFrames.
  • indicator -Set to True to add a default _merge column that tracks the origin of each row. Alternatively, pass a string to rename the indicator column.

Usage of Pandas Merge Indicator

The indicator parameter, when set to True, adds a new column to the merged DataFrame. This column indicates the source of each row, showing whether it was found in the left DataFrame, the right DataFrame, or both.

To run some examples of the pandas merge indicator, let’s create two DataFrame.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print("First DataFrames:\n", df1)
print("Second DataFrames:\n", df2)

Yields below output.

pandas merge indicator

Basic Outer Join with Indicator

To perform a basic outer join with an indicator using the two DataFrames df1 and df2, you can use the merge() function from Pandas. The how='outer' argument specifies that you want an outer join, and the indicator=True argument adds a new column to the output DataFrame, indicating the source of each row.


# Performing an outer join with indicator
merged_df = pd.merge(df1, df2, on='Courses', how='outer', indicator=True)
print("Merged DataFrame with Indicator:\n", merged_df)

Here,

  • on='Courses': This specifies that the join should be performed on the Courses column.
  • how='outer': This specifies that all rows from both DataFrames should be included, with NaN filled in where there are no matches.
  • indicator=True: This adds a _merge column that shows whether the row was found in left_only, right_only, or both.
pandas merge indicator

Inner Join with Indicator

Alternatively, to perform an inner join with an indicator using the two DataFrames df1 and df2, you can again use the merge() function from Pandas, but this time with how='inner'. The indicator will still show which rows were found in both DataFrames.


# Perform an inner join with indicator
df3 = pd.merge(df1, df2, on='Courses', how='inner', indicator=True)
print("Inner Merged DataFrame with Indicator:\n", df3)

# Output:
# Inner Merged DataFrame with Indicator:
#   Courses    Fee Duration  Discount _merge
# 0   Spark  20000   30days      2000   both
# 1  Python  22000   35days      1200   both

Here,

  • on='Courses': This specifies that the join should be performed on the Courses column.
  • how='inner': This specifies that only the rows that have matching values in both DataFrames will be included.
  • indicator=True: This adds a _merge column to indicate the source of each row.

Right Join with Indicator

To perform a right join with an indicator using the two DataFrames df1 and df2, you can use the merge() function from Pandas, specifying how='right'. This will include all rows from the right DataFrame (df2), and the indicator will show the source of each row.


# Perform a right join with indicator
df3 = pd.merge(df1, df2, on='Courses', how='right', indicator=True)
print("Right Merged DataFrame with Indicator:\n", df3)

# Output:
# Right Merged DataFrame with Indicator:
#   Courses      Fee Duration  Discount      _merge
# 0   Spark  20000.0   30days      2000        both
# 1  Python  22000.0   35days      1200        both
# 2    Java      NaN      NaN      2300  right_only
# 3      Go      NaN      NaN      2000  right_only

Here,

  • on='Courses': Specifies that the join should be performed on the Courses column.
  • how='right' :This indicates that all rows from the right DataFrame (df2) should be included, with matching rows from the left DataFrame (df1).
  • indicator=True : Adds a _merge column indicating where each row comes from.

Merging on Multiple Columns with Indicator

To merge two DataFrames on multiple columns with an indicator using Pandas, you can specify a list of column names for the on parameter in the merge() function.


import pandas as pd

# Original DataFrames
technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas"],
    'Fee': [20000, 25000, 22000, 30000],
    'Duration': ['30days', '40days', '35days', '50days'],
    'Course_Name': ['Big Data', 'Data Science', 'Programming', 'Library']
}
index_labels = ['r1', 'r2', 'r3', 'r4']
df1 = pd.DataFrame(technologies, index=index_labels)

technologies2 = {
    'Courses': ["Spark", "Java", "Python", "Go"],
    'Discount': [2000, 2300, 1200, 2000],
    'Course_Name': ['Big Data', 'Programming', 'Programming', 'Programming']
}
index_labels2 = ['r1', 'r6', 'r3', 'r5']
df2 = pd.DataFrame(technologies2, index=index_labels2)

# Perform an outer join on multiple columns with indicator
df3 = pd.merge(df1, df2, on=['Courses', 'Course_Name'], how='outer', indicator=True)
print("Merged DataFrame with Indicator:\n", df3)

# Output:
# Merged DataFrame with Indicator:
#    Courses      Fee Duration    Course_Name  Discount      _merge
# 0    Spark  20000.0   30days      Big Data    2000.0        both
# 1  PySpark  25000.0   40days  Data Science       NaN   left_only
# 2   Python  22000.0   35days   Programming    1200.0        both
# 3   pandas  30000.0   50days       Library       NaN   left_only
# 4     Java      NaN      NaN   Programming    2300.0  right_only
# 5       Go      NaN      NaN   Programming    2000.0  right_only

Here,

  • on=['Courses', 'Course_Name']: Specifies that the join should be performed on both Courses and Type columns.
  • how='outer': This means that all rows from both DataFrames will be included, and unmatched rows will be filled with NaN.
  • indicator=True: Adds a _merge column indicating where each row comes from.

Merging with Different Key Columns and Indicator

Similarly, to merge two DataFrames with different key columns while including an indicator in the output, you can specify the column names for each DataFrame using the left_on and right_on parameters in the merge() function.


import pandas as pd

# First DataFrame
technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas"],
    'Fee': [20000, 25000, 22000, 30000],
    'Duration': ['30days', '40days', '35days', '50days']
}
index_labels = ['r1', 'r2', 'r3', 'r4']
df1 = pd.DataFrame(technologies, index=index_labels)

# Second DataFrame with different key column
technologies2 = {
    'Course_Name': ["Spark", "Java", "Python", "Go"],
    'Discount': [2000, 2300, 1200, 2000]
}
index_labels2 = ['r1', 'r6', 'r3', 'r5']
df2 = pd.DataFrame(technologies2, index=index_labels2)

# Perform an outer join with different key columns and indicator
df3 = pd.merge(df1, df2, left_on='Courses', right_on='Course_Name', how='outer', indicator=True)
print("Merged DataFrame with Different Key Columns and Indicator:\n", df3)

# Output:
# Merged DataFrame with Different Key Columns and Indicator:
#    Courses      Fee Duration Course_Name  Discount      _merge
# 0    Spark  20000.0   30days       Spark    2000.0        both
# 1  PySpark  25000.0   40days         NaN       NaN   left_only
# 2   Python  22000.0   35days      Python    1200.0        both
# 3   pandas  30000.0   50days         NaN       NaN   left_only
# 4      NaN      NaN      NaN        Java    2300.0  right_only
# 5      NaN      NaN      NaN          Go    2000.0  right_only

Here,

  • left_on='Courses' : Specifies that the key column from the left DataFrame (df1) to join on is Courses.
  • right_on='Course_Name' : Specifies that the key column from the right DataFrame (df2) to join on is Course_Name.
  • how='outer': Indicates that all rows from both DataFrames should be included in the result, with NaN where there are no matches.
  • indicator=True : Adds a _merge column indicating where each row comes from.

FAQ on Pandas Merge Indicator

How do I use the indicator feature?

The indicator feature in the pandas.merge() function is a great way to understand the origin of rows when merging two DataFrames. This feature adds a column to the resulting DataFrame that shows whether each row is found in.

Can I customize the name of the indicator column?

You can customize the name of the indicator column in Pandas’ merge() function by passing a string value to the indicator parameter. Instead of the default column name _merge, the custom name you provide will be used.

Can I filter rows based on the _merge column?

You can filter rows based on the _merge column created by the indicator=True parameter in the Pandas merge() function. The _merge column contains three possible values.

Is it possible to use indicator with other join types?

The indicator parameter in Pandas’ merge() function works seamlessly with all join types: inner, outer, left, and right. The _merge column will reflect the relationship of rows between the two DataFrames based on the join type you choose.

What happens if the indicator column name already exists in the DataFrame?

If the column name already exists, Pandas will raise a ValueError. To avoid conflicts, use a unique column name or rename the existing column before merging.

Conclusion

In summary, the Pandas merge indicator is a valuable feature for tracking the origin of rows in merged DataFrames. By utilizing the indicator parameter in the merge() function, you can easily identify whether rows originate from the left DataFrame, the right DataFrame, or both.

Happy Learning!!

References