Pandas Join Explained With Examples

pandas join() is similar to SQL join where it combines columns from multiple DataFrames based on row indices. In pandas join can be done only on indexes but not on columns. By default, it uses left join on the row index. If you wanted to join on columns you should use pandas.merge() method as this by default performs on columns

Pandas Join Key Points

  • You can join pandas Dataframes similar to joining tables in SQL.
  • pandas.DataFrame.join() method is used to join DataFrames.
  • By default it uses left join on row index. This is unlike merge() where it does inner join on common columns.
  • pandas also supports other methods like concat() and merge() to join DataFrames.

1. Quick Examples of Pandas Join

Below are some quick examples of pandas join() method


# Below are quick examples on join() method

# pandas default join
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")

# pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')

# pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')

# pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')

# pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')

Now, Let’s learn the syntax, usage, and execute these examples of pandas join.

2. Pandas Join Syntax & Usage

Below is the syntax of the pandas join() method.


# Syntax of pandas.DataFrame.join() method
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
  • other – Pass right DataFrame object or list of DataFrame objects.
  • on – Specify which index you wante to join on when you have multiple indexes.
  • how – Use to specify the join type. Accepts inner, left, right, outer.
  • lsuffix – Specify the left suffix string to column names
  • rsuffix – Specify the right suffix string to column names
  • sort – To specify the results to be sorted.

3. Pandas Join DataFrames Example

pandas join() method by default performs left join on row index. Let’s create two DataFrames and run the above examples to understand pandas join.


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

Yields below output.

pandas Join DataFrames

Now, perform the default join operation. Since we have a common column Courses on both DataFrames we have to specify the lsuffix (left suffix) or rsuffix (right suffix) param, not specifying these you will get an error.


# pandas join 
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(df3)

Yields below output. As specified by default it uses left join and performs join on row index.


   Courses_left    Fee Duration Courses_right  Discount
r1        Spark  20000   30days         Spark    2000.0
r2      PySpark  25000   40days           NaN       NaN
r3       Python  22000   35days        Python    1200.0
r4       pandas  30000   50days           NaN       NaN

4. Inner Join DataFrames

pandas inner join is mostly used join, It is used to join two DataFrames on indexes. When indexes don’t match the rows get dropped from both DataFrames.


# pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)

# Outputs
#   Courses_left    Fee Duration Courses_right  Discount
#r1        Spark  20000   30days         Spark      2000
#r3       Python  22000   35days        Python      1200

When we apply Inner join on our DataFrames, It drops DataFrame rows with indexes r2 and r4 from left DataFrame and drops indexes r5 and r6 from right DataFrame as these don’t have matches.

5. Right Join DataFrames

 Also called Right Outer Join – This join is the opposite of left join, here it returns all rows from the right DataFrame regardless of math found on the left. When the join expression doesn’t match, it assigns null for that record and drops records from left where match not found.


# pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
print(df3)

# Outputs
#   Courses_left      Fee Duration Courses_right  Discount
#r1        Spark  20000.0   30days         Spark      2000
#r6          NaN      NaN      NaN          Java      2300
#r3       Python  22000.0   35days        Python      1200
#r5          NaN      NaN      NaN            Go      2000

When we apply right join on our DataFrames, It gets all rows from right DataFrame but it assigns NaN to left DataFrame columns for rows that don’t have the match on right DataFrame.

6. Outer Join DataFrames

 Also called Full Outer Join – Returns all rows from both DataFrames. Where join expression doesn’t match it returns null on respective cells.


# pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)

# Outputs
#   Courses_left      Fee Duration Courses_right  Discount
#r1        Spark  20000.0   30days         Spark    2000.0
#r2      PySpark  25000.0   40days           NaN       NaN
#r3       Python  22000.0   35days        Python    1200.0
#r4       pandas  30000.0   50days           NaN       NaN
#r5          NaN      NaN      NaN            Go    2000.0
#r6          NaN      NaN      NaN          Java    2300.0

When we apply outer join on our DataFrames, It gets all rows from left and right but it assigns NaN to left and right DataFrame columns for rows that don’t have a match.

7. Pandas Join on Column

In case if you wanted to join on columns, use pandas.merge() method or set the column you wanted to join on to Index and use it. The below example demonstrates how to set the column to Index in pandas and use it for joining. df1.set_index(‘Courses’) is used to convert column to index.


# pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

# Outputs
#           Fee Duration  Discount
#Courses                          
#Spark    20000   30days      2000
#Python   22000   35days      1200

You can also try the below example where it persists the original index


# pandas join
df3=df1.join(df2.set_index('Courses'), how='inner', on='Courses')
print(df3)

8. Complete Example


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(df3)

# pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)

# pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
print(df3)

# pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)


# pandas join on column
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

# pandas join
df3=df1.join(df2.set_index('Courses'), how='inner', on='Courses')
print(df3)

Conclusion

In this article, you have learned to join DataFrames using join() method. By default join() method performs join on row indexes by using left join.

References

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing Pandas Join Explained With Examples