In Pandas, the merge()
function has an option called indicator
that adds a special column to the output DataFrame, which indicates the source of each row. This is useful for understanding how the merge operation affected the data. The column created is typically named _merge
and contains information on whether each row was found in the left DataFrame only, the right DataFrame only, or both.
In this article, I will explain the Pandas merge indicator by using its syntax, parameters, and usage, when you use the indicator
parameter in a Pandas merge()
function, it returns a merged DataFrame that includes an extra column to identify the source of each row. This additional column, which is named _merge
by default (or a custom name if provided), can contain one of three string values for each row.
Key Points –
- The
indicator
parameter inpd.merge()
adds a column that tracks the source of each row in the merged DataFrame. - The default indicator column is named
_merge
unless customized by the user. - The
_merge
column contains three values:left_only
,right_only
, andboth
. left_only
indicates rows that appear only in the left DataFrame.right_only
indicates rows that appear only in the right DataFrame.- The
indicator
parameter is particularly useful in outer joins for analyzing data overlap and differences. - You can rename the indicator column by passing a string value to the
indicator
parameter.
Syntax of Pandas Merge Indicator
Let’s know the syntax of the pandas merge indicator.
# Syntax of pandas merge indicator
pd.merge(left, right, how='type_of_join', on='column_name', indicator=True)
Parameters
Following are the parameters of the pandas merge indicator.
left
– The first DataFrame to merge.right
– The second DataFrame to merge.how
– Type of join to perform ('left'
,'right'
,'inner'
,'outer'
).on
– Column or index level(s) on which to merge the DataFrames.indicator
-Set toTrue
to add a default_merge
column that tracks the origin of each row. Alternatively, pass a string to rename the indicator column.
Usage of Pandas Merge Indicator
The indicator
parameter, when set to True
, adds a new column to the merged DataFrame. This column indicates the source of each row, showing whether it was found in the left DataFrame, the right DataFrame, or both.
To run some examples of the pandas merge indicator, let’s create two DataFrame.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print("First DataFrames:\n", df1)
print("Second DataFrames:\n", df2)
Yields below output.
Basic Outer Join with Indicator
To perform a basic outer join with an indicator using the two DataFrames df1
and df2
, you can use the merge()
function from Pandas. The how='outer'
argument specifies that you want an outer join, and the indicator=True
argument adds a new column to the output DataFrame, indicating the source of each row.
# Performing an outer join with indicator
merged_df = pd.merge(df1, df2, on='Courses', how='outer', indicator=True)
print("Merged DataFrame with Indicator:\n", merged_df)
Here,
on='Courses'
: This specifies that the join should be performed on theCourses
column.how='outer'
: This specifies that all rows from both DataFrames should be included, withNaN
filled in where there are no matches.indicator=True
: This adds a_merge
column that shows whether the row was found inleft_only
,right_only
, orboth
.
Inner Join with Indicator
Alternatively, to perform an inner join with an indicator using the two DataFrames df1
and df2
, you can again use the merge()
function from Pandas, but this time with how='inner'
. The indicator will still show which rows were found in both DataFrames.
# Perform an inner join with indicator
df3 = pd.merge(df1, df2, on='Courses', how='inner', indicator=True)
print("Inner Merged DataFrame with Indicator:\n", df3)
# Output:
# Inner Merged DataFrame with Indicator:
# Courses Fee Duration Discount _merge
# 0 Spark 20000 30days 2000 both
# 1 Python 22000 35days 1200 both
Here,
on='Courses'
: This specifies that the join should be performed on theCourses
column.how='inner'
: This specifies that only the rows that have matching values in both DataFrames will be included.indicator=True
: This adds a_merge
column to indicate the source of each row.
Right Join with Indicator
To perform a right join with an indicator using the two DataFrames df1
and df2
, you can use the merge()
function from Pandas, specifying how='right'
. This will include all rows from the right DataFrame (df2
), and the indicator will show the source of each row.
# Perform a right join with indicator
df3 = pd.merge(df1, df2, on='Courses', how='right', indicator=True)
print("Right Merged DataFrame with Indicator:\n", df3)
# Output:
# Right Merged DataFrame with Indicator:
# Courses Fee Duration Discount _merge
# 0 Spark 20000.0 30days 2000 both
# 1 Python 22000.0 35days 1200 both
# 2 Java NaN NaN 2300 right_only
# 3 Go NaN NaN 2000 right_only
Here,
on='Courses'
: Specifies that the join should be performed on theCourses
column.how='right'
:This indicates that all rows from the right DataFrame (df2
) should be included, with matching rows from the left DataFrame (df1
).indicator=True
: Adds a_merge
column indicating where each row comes from.
Merging on Multiple Columns with Indicator
To merge two DataFrames on multiple columns with an indicator using Pandas, you can specify a list of column names for the on
parameter in the merge()
function.
import pandas as pd
# Original DataFrames
technologies = {
'Courses': ["Spark", "PySpark", "Python", "pandas"],
'Fee': [20000, 25000, 22000, 30000],
'Duration': ['30days', '40days', '35days', '50days'],
'Course_Name': ['Big Data', 'Data Science', 'Programming', 'Library']
}
index_labels = ['r1', 'r2', 'r3', 'r4']
df1 = pd.DataFrame(technologies, index=index_labels)
technologies2 = {
'Courses': ["Spark", "Java", "Python", "Go"],
'Discount': [2000, 2300, 1200, 2000],
'Course_Name': ['Big Data', 'Programming', 'Programming', 'Programming']
}
index_labels2 = ['r1', 'r6', 'r3', 'r5']
df2 = pd.DataFrame(technologies2, index=index_labels2)
# Perform an outer join on multiple columns with indicator
df3 = pd.merge(df1, df2, on=['Courses', 'Course_Name'], how='outer', indicator=True)
print("Merged DataFrame with Indicator:\n", df3)
# Output:
# Merged DataFrame with Indicator:
# Courses Fee Duration Course_Name Discount _merge
# 0 Spark 20000.0 30days Big Data 2000.0 both
# 1 PySpark 25000.0 40days Data Science NaN left_only
# 2 Python 22000.0 35days Programming 1200.0 both
# 3 pandas 30000.0 50days Library NaN left_only
# 4 Java NaN NaN Programming 2300.0 right_only
# 5 Go NaN NaN Programming 2000.0 right_only
Here,
on=['Courses', 'Course_Name
']
: Specifies that the join should be performed on both Courses and Type columns.how='outer'
: This means that all rows from both DataFrames will be included, and unmatched rows will be filled withNaN
.indicator=True
: Adds a_merge
column indicating where each row comes from.
Merging with Different Key Columns and Indicator
Similarly, to merge two DataFrames with different key columns while including an indicator in the output, you can specify the column names for each DataFrame using the left_on
and right_on
parameters in the merge()
function.
import pandas as pd
# First DataFrame
technologies = {
'Courses': ["Spark", "PySpark", "Python", "pandas"],
'Fee': [20000, 25000, 22000, 30000],
'Duration': ['30days', '40days', '35days', '50days']
}
index_labels = ['r1', 'r2', 'r3', 'r4']
df1 = pd.DataFrame(technologies, index=index_labels)
# Second DataFrame with different key column
technologies2 = {
'Course_Name': ["Spark", "Java", "Python", "Go"],
'Discount': [2000, 2300, 1200, 2000]
}
index_labels2 = ['r1', 'r6', 'r3', 'r5']
df2 = pd.DataFrame(technologies2, index=index_labels2)
# Perform an outer join with different key columns and indicator
df3 = pd.merge(df1, df2, left_on='Courses', right_on='Course_Name', how='outer', indicator=True)
print("Merged DataFrame with Different Key Columns and Indicator:\n", df3)
# Output:
# Merged DataFrame with Different Key Columns and Indicator:
# Courses Fee Duration Course_Name Discount _merge
# 0 Spark 20000.0 30days Spark 2000.0 both
# 1 PySpark 25000.0 40days NaN NaN left_only
# 2 Python 22000.0 35days Python 1200.0 both
# 3 pandas 30000.0 50days NaN NaN left_only
# 4 NaN NaN NaN Java 2300.0 right_only
# 5 NaN NaN NaN Go 2000.0 right_only
Here,
left_on='Courses'
: Specifies that the key column from the left DataFrame (df1
) to join on isCourses
.right_on='Course_Name'
: Specifies that the key column from the right DataFrame (df2
) to join on isCourse_Name
.how='outer'
: Indicates that all rows from both DataFrames should be included in the result, withNaN
where there are no matches.indicator=True
: Adds a_merge
column indicating where each row comes from.
FAQ on Pandas Merge Indicator
The indicator
feature in the pandas.merge()
function is a great way to understand the origin of rows when merging two DataFrames. This feature adds a column to the resulting DataFrame that shows whether each row is found in.
You can customize the name of the indicator column in Pandas’ merge()
function by passing a string value to the indicator
parameter. Instead of the default column name _merge
, the custom name you provide will be used.
You can filter rows based on the _merge
column created by the indicator=True
parameter in the Pandas merge()
function. The _merge
column contains three possible values.
The indicator
parameter in Pandas’ merge()
function works seamlessly with all join types: inner, outer, left, and right. The _merge
column will reflect the relationship of rows between the two DataFrames based on the join type you choose.
If the column name already exists, Pandas will raise a ValueError
. To avoid conflicts, use a unique column name or rename the existing column before merging.
Conclusion
In summary, the Pandas merge indicator is a valuable feature for tracking the origin of rows in merged DataFrames. By utilizing the indicator
parameter in the merge()
function, you can easily identify whether rows originate from the left DataFrame, the right DataFrame, or both.
Happy Learning!!
Related Articles
- Pandas Join Two DataFrames
- Pandas Concatenate Two Columns
- Pandas Join DataFrames on Columns
- Pandas Join Explained With Examples
- Pandas Left Join Explained By Examples
- Differences between Pandas Join vs Merge
- How to Merge Series into Pandas DataFrame
- Pandas Outer Join Explained By Examples
- Pandas Merge Suffixes Explained Examples
- Pandas Merge DataFrames Explained Examples
- Pandas Combine Two DataFrames With Examples