• Post author:
  • Post category:Pandas
  • Post last modified:May 14, 2024
  • Reading time:14 mins read
You are currently viewing Pandas Outer Join Explained By Examples

In Pandas, an outer join merges two DataFrames based on a common column or index and includes all rows from both DataFrames, filling in missing values with NaN (Not a Number) where data is unavailable.

Advertisements

In this article, I will explain Pandas outer joins the significance of this operation, and provide a comprehensive guide on performing outer joins using Pandas DataFrames, whether it’s on single or multiple columns. Given that Pandas offers various methods for combining DataFrames.

Quick Examples of Outer Join

If you are in a hurry, below are some quick examples of how to the Pandas full Outer Join DataFrames.


# Quick eamples of outer join 

# Example 1: Pandas outer join two DataFrames by Index
df3 = df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')

# Example 2: Pandas.merge() by Column
df3 = pd.merge(df1,df2, how='outer')

# Example 3: DataFrame.merge() by Column
df3 = df1.merge(df2, how='outer')

# Example 4: Merge DataFrames by Column
df3 = pd.merge(df1,df2, on='Courses', how='outer')

# Example 5: When column names are different
df3 = pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')

# Example 6: By using concat()
df3 = pd.concat([df1,df2],axis=1,join='outer')

To run some examples of Pandas Outer Join, let’s create a DataFrames with one column in common on both DataFrames. I will use these to demonstrate Outer Join.


# Create DataFrames
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print("Create DataFrame1:\n:", df1)
print("Create DataFrame2:\n", df2)

Yields below output.

pandas outer join dataframes

Using join() Function

The pandas.DataFrame.join() method includes a how parameter, allowing for the execution of outer joins between DataFrames. For more details, including syntax, usage, and additional examples, consult the documentation for pandas.join().


# Pandas join two DataFrames
df3 = df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print("After joining DataFrames by using outer join:\n", df3)

In the above examples, df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer') performs an outer join on the index of both DataFrames, using the lsuffix and rsuffix parameters to handle potential column name conflicts and the how parameter set to ‘outer’ to perform an outer join.

pandas outer join dataframes

Using merge() Function

An outer join using the merge() function combines rows from both DataFrames, including all rows from both, and filling in NaNs for missing values. When using the merge() function without specifying additional parameters, it performs an inner join on all common columns by default. To execute an outer join, simply set the how parameter to ‘outer’.


# Pandas.merge()
df3=pd.merge(df1,df2, how='outer')
print("After joining DataFrames by using outer join:\n", df3)

# DataFrame.merge()
df3=df1.merge(df2, how='outer')
print("After joining DataFrames by using outer join:\n", df3)

Yields below output.


# Output:
After joining DataFrames by using outer join:
    Courses      Fee Duration  Discount
0    Spark  20000.0   30days    2000.0
1  PySpark  25000.0   40days       NaN
2   Python  22000.0   35days    1200.0
3   pandas  30000.0   50days       NaN
4     Java      NaN      NaN    2300.0
5       Go      NaN      NaN    2000.0

Additionally, you have the option to explicitly specify the column names for merging.


# Merge DataFrames by Columns
df3=pd.merge(df1,df2, on='Courses', how='outer')
print("After joining DataFrames by using outer join:\n", df3)

If you want to combine column names that are different on two pandas DataFrames, you can use the merge() function with different column names, but you’ll need to specify which columns to merge on explicitly.


# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')
print("After joining DataFrames by using outer join:\n", df3)

In the above example, left_on specifies the column name from the left DataFrame (df1), and right_on specifies the column name from the right DataFrame (df2). The merge() function combines the DataFrames based on these specified column names.

By using pandas concat()

pandas concat() method is the least used to join two DataFrames. It is mainly used to append DataFrames Rows. however, it can also be used to join pandas DataFrames and specify outer for join param to perform the outer join.


# By using concat()
df3=pd.concat([df1,df2],axis=1,join='outer')
print("After joining DataFrames by using outer join:\n", df3)

Yields below output.


# Output:
# After joining DataFrames by using outer join:
    Courses      Fee Duration Courses  Discount
r1    Spark  20000.0   30days   Spark    2000.0
r2  PySpark  25000.0   40days     NaN       NaN
r3   Python  22000.0   35days  Python    1200.0
r4   pandas  30000.0   50days     NaN       NaN
r6      NaN      NaN      NaN    Java    2300.0
r5      NaN      NaN      NaN      Go    2000.0

Complete Examples of Outer Join of DataFrames


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

# Pandas join two DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)

# Pandas.merge()
df3=pd.merge(df1,df2, how='outer')

# DataFrame.merge()
df3=df1.merge(df2, how='outer')

# Merge DataFrames by Columns
df3=pd.merge(df1,df2, on='Courses', how='outer')

# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')
print(df3)

# By using concat()
df3=pd.concat([df1,df2],axis=1,join='outer')
print(df3)

FAQ on Outer Join

What is an outer join in Pandas?

An outer join in Pandas combines rows from both DataFrames, including all rows from both DataFrames, and fills in missing values with NaN where there is no match.

How do you perform an outer join in Pandas?

You can perform an outer join in Pandas using the merge() function with the parameter how='outer' or by using the join() method with how='outer'.

What is the difference between inner join and outer join?

In an inner join, only the rows with matching values in both DataFrames are included in the result. In contrast, an outer join includes all rows from both DataFrames, filling in missing values with NaN where there is no match.

Can you specify columns for an outer join in Pandas?

You can specify columns for an outer join in Pandas using the on, left_on, and right_on parameters of the merge() function, or by setting the index and using the join() method.

Can you perform an outer join on multiple columns?

You can perform an outer join on multiple columns by passing a list of column names to the on, left_on, and right_on parameters of the merge() function.

Conclusion

In this article, you have learned the Pandas library offers powerful tools for merging and joining DataFrames, including the merge() function. By default, it performs inner joins on all common columns between DataFrames. However, you can customize the merge behavior by specifying parameters such as how for different join types (e.g., outer join), on specifying the column to join on, and left_on and right_on to explicitly specify column names. With these capabilities, Pandas provides flexibility for data integration and analysis across multiple datasets.

References