Pandas Outer Join Explained By Examples

In this pandas outer join article, I will explain the significance of outer join and how to do it with pandas DataFrames on multiple columns. Since pandas support several methods to combine DataFrames, I will cover all these with examples.

Outer join is also called Full Outer Join that returns all rows from both pandas DataFrames. Where join expression doesn’t match it returns null on respective cells.

1. Quick Examples of Pandas Outer Join DataFrames

Below are some quick examples of pandas Full Outer Join DataFrames.


# pandas outer join two DataFrames by Index
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')

# pandas.merge() by Column
df3=pd.merge(df1,df2, how='outer')

# DataFrame.merge() by Column
df3=df1.merge(df2, how='outer')

# Merge DataFrames by Column
df3=pd.merge(df1,df2, on='Courses', how='outer')

# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')

# By using concat()
df3=pd.concat([df1,df2],axis=1,join='outer')

First, let’s create a DataFrames with one column in common on both DataFrames. I will use these to demonstrate Outer Join with examples


# Create DataFrames
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

Yields below output

pandas outer join DataFrames

2. Pandas Outer Join using join()

panads.DataFrame.join() method supports how param to join DataFrames by outer join. It also supports different params, refer to pandas join() for syntax, usage, and more examples.


# pandas join two DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)

Yields below output.


   Courses_left      Fee Duration Courses_right  Discount
r1        Spark  20000.0   30days         Spark    2000.0
r2      PySpark  25000.0   40days           NaN       NaN
r3       Python  22000.0   35days        Python    1200.0
r4       pandas  30000.0   50days           NaN       NaN
r5          NaN      NaN      NaN            Go    2000.0
r6          NaN      NaN      NaN          Java    2300.0

3. Outer Join Using merge()

Using merge() you can do merging by columns, merging by index, merging on multiple columns, and different join types. By default, it joins on all common columns that exist on both DataFrames and performs an inner join, to do an outer join use how param with outer value.


# pandas.merge()
df3=pd.merge(df1,df2, how='outer')

# DataFrame.merge()
df3=df1.merge(df2, how='outer')

Yields below output.


  Courses    Fee Duration  Discount
0   Spark  20000   30days      2000
1  Python  22000   35days      1200

You can also specify the column names explicitly.


# Merge DataFrames by Columns
df3=pd.merge(df1,df2, on='Courses', how='outer')

In case if you wanted to combine column names that are different on two pandas DataFrames.


# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')
print(df3)

merge() also supports different params, refer to pandas merge() to learn syntax, usage with examples.

4. By using pandas concat()

pandas concat() method is the least used to join two DataFrames. It is mainly used to append DataFrames Rows. however, it can also be used to join pandas DataFrames and specify outer for join param to perform the outer join.


# By using concat()
df3=pd.concat([df1,df2],axis=1,join='outer')
print(df3)

Yields below output.


    Courses      Fee Duration Courses  Discount
r1    Spark  20000.0   30days   Spark    2000.0
r2  PySpark  25000.0   40days     NaN       NaN
r3   Python  22000.0   35days  Python    1200.0
r4   pandas  30000.0   50days     NaN       NaN
r6      NaN      NaN      NaN    Java    2300.0
r5      NaN      NaN      NaN      Go    2000.0

5. Complete Examples of Outer Join of DataFrames


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

# pandas join two DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)

# pandas.merge()
df3=pd.merge(df1,df2, how='outer')

# DataFrame.merge()
df3=df1.merge(df2, how='outer')

# Merge DataFrames by Columns
df3=pd.merge(df1,df2, on='Courses', how='outer')

# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses', how='outer')
print(df3)

# By using concat()
df3=pd.concat([df1,df2],axis=1,join='outer')
print(df3)

Conclusion

In this article, you have learned how to perform an outer join on DataFrams by using join(), merge() and concat() methods with explanations and examples. An outer join is also called Full Outer Join that returns all rows from both DataFrames. Where join expression doesn’t match it returns null on respective cells.

References

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing Pandas Outer Join Explained By Examples