• Post author:
  • Post category:Pandas
  • Post last modified:May 24, 2024
  • Reading time:16 mins read
You are currently viewing Pandas Join Explained With Examples

Pandas join() is similar to SQL join where it combines columns from multiple DataFrames based on row indices. In pandas join can be done only on indexes but not on columns. By default, it uses the left join on the row index. If you want to join on columns you should use pandas.merge() method as this by default performs on columns

Advertisements

Key Points –

  • You can join pandas Dataframes similar to joining tables in SQL.
  • pandas.DataFrame.join() method is used to join DataFrames.
  • pandas also support other methods like concat() and merge() to join DataFrames.
  • Joining can be performed using the pd.merge() function or DataFrame methods like df.join().
  • Careful consideration of the join type and key columns is crucial to avoid unintended data duplication or loss during the joining process.

Quick Examples of Pandas Join

Following are quick examples of the pandas join() method.


# Quick examples on join() method

# Pandas default join
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")

# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')

# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')

# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')

# Pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')

Now, Let’s learn the syntax, and usage, then execute these examples of pandas joining.

Pandas Join Syntax & Usage

Below is the syntax of the pandas join() method.


# Syntax of pandas.DataFrame.join() method
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
  • other – Pass the right DataFrame object or list of DataFrame objects.
  • on – Specify which index you want to join on when you have multiple indexes.
  • how – Use to specify the join type. Accepts inner, left, right, outer.
  • lsuffix – Specify the left suffix string to column names
  • rsuffix – Specify the right suffix string to column names
  • sort – To specify the results to be sorted.

Pandas Join DataFrames Example

Pandas join() method by default performs left join on row index. Let’s create two DataFrames and run the above examples to understand pandas join.



import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
print("First DataFrame:\n", df1)
technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print("Second DataFRame:\n", df2)

Yields below output.

pandas join

Now, perform the default join operation. Since we have a common column Courses on both DataFrames we have to specify the lsuffix (left suffix) or rsuffix (right suffix) param, not specifying these you will get an error.


# Pandas join 
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(After joining two DataFrames:\n", df3)

Yields below output. As specified by default it uses left join and performs join on the row index.

pandas join

Inner Join DataFrames

Pandas inner join is mostly used join, It is used to join two DataFrames on indexes. When indexes don’t match, rows from both DataFrames are dropped.


# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)

# Outputs:
#   Courses_left    Fee Duration Courses_right  Discount
# r1        Spark  20000   30days         Spark      2000
# r3       Python   22000   35days        Python      1200

When we apply Inner join on our DataFrames, It drops DataFrame rows with indexes r2 and r4 from left DataFrame and drops indexes r5 and r6 from right DataFrame as these don’t have matches.

Right Join DataFrames

A right join returns all rows from the right DataFrame and the matched rows from the left DataFrame. If no match is found, NaN is returned for columns from the left DataFrame, and rows from the left DataFrame with no matching rows in the right DataFrame are dropped.


# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
print(df3)

# Outputs:
#   Courses_left      Fee Duration Courses_right  Discount
# r1        Spark  20000.0   30days         Spark      2000
# r6          NaN      NaN      NaN          Java      2300
# r3       Python  22000.0   35days        Python      1200
# r5          NaN      NaN      NaN            Go      2000

When we apply right join on our DataFrames, It gets all rows from right DataFrame but it assigns NaN to left DataFrame columns for rows that don’t have the match on right DataFrame.

Outer Join DataFrames

An outer join returns all rows from both DataFrames. If no match is found, NaN is returned for the missing side.


# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)

# Outputs:
#   Courses_left      Fee Duration Courses_right  Discount
# r1        Spark  20000.0   30days         Spark    2000.0
# r2      PySpark  25000.0   40days           NaN       NaN
# r3       Python  22000.0   35days        Python    1200.0
# r4       pandas  30000.0   50days           NaN       NaN
# r5          NaN      NaN      NaN            Go    2000.0
# r6          NaN      NaN      NaN          Java    2300.0

When we apply outer join on our DataFrames, It gets all rows from left and right but it assigns NaN to left and right DataFrame columns for rows that don’t have a match.

Pandas Join on Column

In case, if you want to join on columns, use pandas.merge() method or set the column you wanted to join on to Index and use it. The below example demonstrates how to set the column to Index in pandas and use it for joining. df1.set_index('Courses') is used to convert the column to an index.


# Pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

# Outputs:
#           Fee Duration  Discount
# Courses                          
# Spark    20000   30days      2000
# Python   22000   35days      1200

You can also try the below example where it persists the original index


# Pandas join
df3=df1.join(df2.set_index('Courses'), how='inner', on='Courses')
print(df3)

Complete Example


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(df3)

# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)

# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
print(df3)

# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)


# Pandas join on column
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

# Pandas join
df3=df1.join(df2.set_index('Courses'), how='inner', on='Courses')
print(df3)

Frequently Asked Questions

What is the purpose of joining DataFrames in Pandas?

Joining DataFrames in Pandas allows you to combine data from multiple sources based on common columns or indices. It helps in merging datasets to perform analysis, compare data, or create a single unified dataset.

Can you illustrate how to do a left join in Pandas?

Performing a left join in Pandas allows you to keep all the rows from the left DataFrame and include matching rows from the right DataFrame. Rows from the left DataFrame with no corresponding match in the right DataFrame will have NaN values in the columns from the right DataFrame.

How can I join DataFrames based on indices in Pandas?

To join DataFrames based on their indices in Pandas, you can use the join() method of one DataFrame, specifying the other DataFrame as the argument.

What should I do if I encounter duplicate keys during a join operation?

If you encounter duplicate keys during a join operation, you may need to specify suffixes for the overlapping column names using the suffixes parameter in pd.merge() or DataFrame.join(). Alternatively, you can rename the columns before joining to avoid conflicts.

Conclusion

In this article, I have explained how to join DataFrames using the join() method. By default join() method performs join on row indexes by using left join.

References