In Pandas, an outer join merges two DataFrames based on a common column or index and includes all rows from both DataFrames, filling in missing values with NaN (Not a Number) where data is unavailable.
In this article, I will explain Pandas outer joins the significance of this operation, and provide a comprehensive guide on performing outer joins using Pandas DataFrames, whether it’s on single or multiple columns. Given that Pandas offers various methods for combining DataFrames.
Quick Examples of Outer Join
If you are in a hurry, below are some quick examples of how to the Pandas full Outer Join DataFrames.
# Quick eamples of outer join
# Example 1: Pandas outer join two DataFrames by Index
df3 = df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
# Example 2: Pandas.merge() by Column
df3 = pd.merge(df1,df2, how='outer')
# Example 3: DataFrame.merge() by Column
df3 = df1.merge(df2, how='outer')
# Example 4: Merge DataFrames by Column
df3 = pd.merge(df1,df2, on='Courses', how='outer')
# Example 5: When column names are different
df3 = pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')
# Example 6: By using concat()
df3 = pd.concat([df1,df2],axis=1,join='outer')
To run some examples of Pandas Outer Join, let’s create a DataFrames with one column in common on both DataFrames. I will use these to demonstrate Outer Join.
# Create DataFrames
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print("Create DataFrame1:\n:", df1)
print("Create DataFrame2:\n", df2)
Yields below output.
Using join() Function
The pandas.DataFrame.join() method includes a how
parameter, allowing for the execution of outer joins between DataFrames. For more details, including syntax, usage, and additional examples, consult the documentation for pandas.join()
.
# Pandas join two DataFrames
df3 = df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print("After joining DataFrames by using outer join:\n", df3)
In the above examples, df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
performs an outer join on the index of both DataFrames, using the lsuffix
and rsuffix
parameters to handle potential column name conflicts and the how
parameter set to ‘outer’ to perform an outer join.
Using merge() Function
An outer join using the merge()
function combines rows from both DataFrames, including all rows from both, and filling in NaNs for missing values. When using the merge()
function without specifying additional parameters, it performs an inner join on all common columns by default. To execute an outer join, simply set the how
parameter to ‘outer’.
# Pandas.merge()
df3=pd.merge(df1,df2, how='outer')
print("After joining DataFrames by using outer join:\n", df3)
# DataFrame.merge()
df3=df1.merge(df2, how='outer')
print("After joining DataFrames by using outer join:\n", df3)
Yields below output.
# Output:
After joining DataFrames by using outer join:
Courses Fee Duration Discount
0 Spark 20000.0 30days 2000.0
1 PySpark 25000.0 40days NaN
2 Python 22000.0 35days 1200.0
3 pandas 30000.0 50days NaN
4 Java NaN NaN 2300.0
5 Go NaN NaN 2000.0
Additionally, you have the option to explicitly specify the column names for merging.
# Merge DataFrames by Columns
df3=pd.merge(df1,df2, on='Courses', how='outer')
print("After joining DataFrames by using outer join:\n", df3)
If you want to combine column names that are different on two pandas DataFrames, you can use the merge()
function with different column names, but you’ll need to specify which columns to merge on explicitly.
# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')
print("After joining DataFrames by using outer join:\n", df3)
In the above example, left_on
specifies the column name from the left DataFrame (df1
), and right_on
specifies the column name from the right DataFrame (df2
). The merge()
function combines the DataFrames based on these specified column names.
By using pandas concat()
pandas concat() method is the least used to join two DataFrames. It is mainly used to append DataFrames Rows. however, it can also be used to join pandas DataFrames and specify outer
for join
param to perform the outer join.
# By using concat()
df3=pd.concat([df1,df2],axis=1,join='outer')
print("After joining DataFrames by using outer join:\n", df3)
Yields below output.
# Output:
# After joining DataFrames by using outer join:
Courses Fee Duration Courses Discount
r1 Spark 20000.0 30days Spark 2000.0
r2 PySpark 25000.0 40days NaN NaN
r3 Python 22000.0 35days Python 1200.0
r4 pandas 30000.0 50days NaN NaN
r6 NaN NaN NaN Java 2300.0
r5 NaN NaN NaN Go 2000.0
Complete Examples of Outer Join
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print(df1)
print(df2)
# Pandas join two DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)
# Pandas.merge()
df3=pd.merge(df1,df2, how='outer')
# DataFrame.merge()
df3=df1.merge(df2, how='outer')
# Merge DataFrames by Columns
df3=pd.merge(df1,df2, on='Courses', how='outer')
# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')
print(df3)
# By using concat()
df3=pd.concat([df1,df2],axis=1,join='outer')
print(df3)
FAQ on Outer Join
An outer join in Pandas combines rows from both DataFrames, including all rows from both DataFrames, and fills in missing values with NaN where there is no match.
You can perform an outer join in Pandas using the merge()
function with the parameter how='outer'
or by using the join()
method with how='outer'
.
In an inner join, only the rows with matching values in both DataFrames are included in the result. In contrast, an outer join includes all rows from both DataFrames, filling in missing values with NaN where there is no match.
You can specify columns for an outer join in Pandas using the on
, left_on
, and right_on
parameters of the merge()
function, or by setting the index and using the join()
method.
You can perform an outer join on multiple columns by passing a list of column names to the on
, left_on
, and right_on
parameters of the merge()
function.
Conclusion
In this article, you have learned the Pandas library offers powerful tools for merging and joining DataFrames, including the merge()
function. By default, it performs inner joins on all common columns between DataFrames. However, you can customize the merge behavior by specifying parameters such as how
for different join types (e.g., outer join), on
specifying the column to join on, and left_on
and right_on
to explicitly specify column names. With these capabilities, Pandas provides flexibility for data integration and analysis across multiple datasets.
Related Articles
- Pandas Merge Multiple DataFrames
- Pandas Add Column Names to DataFrame
- Differences between Pandas Join vs Merge
- Pandas Left Join Explained By Examples
- Pandas Merge Two DataFrames