• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:17 mins read
You are currently viewing Pandas Join Explained With Examples

Pandas join() is similar to SQL join where it combines columns from multiple DataFrames based on row indices. In pandas join can be done only on indexes but not on columns. By default, it uses the left join on the row index. If you want to join on columns you should use pandas.merge() method as this by default performs on columns

Key Points –

  • You can join pandas Dataframes similar to joining tables in SQL.
  • pandas.DataFrame.join() method is used to join DataFrames.
  • By default, it uses the left join on the row index. This is unlike merge() where it does inner join on common columns.
  • pandas also support other methods like concat() and merge() to join DataFrames.
  • Joining can be performed using the pd.merge() function or DataFrame methods like df.join().
  • Careful consideration of the join type and key columns is crucial to avoid unintended data duplication or loss during the joining process.

Quick Examples of Pandas Join

Below are some quick examples of pandas join() method.


# Quick examples on join() method

# Pandas default join
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")

# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')

# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')

# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')

# Pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')

Now, Let’s learn the syntax, and usage, then execute these examples of pandas joining.

Pandas Join Syntax & Usage

Below is the syntax of the pandas join() method.


# Syntax of pandas.DataFrame.join() method
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
  • other – Pass the right DataFrame object or list of DataFrame objects.
  • on – Specify which index you want to join on when you have multiple indexes.
  • how – Use to specify the join type. Accepts inner, left, right, outer.
  • lsuffix – Specify the left suffix string to column names
  • rsuffix – Specify the right suffix string to column names
  • sort – To specify the results to be sorted.

Pandas Join DataFrames Example

Pandas join() method by default performs left join on row index. Let’s create two DataFrames and run the above examples to understand pandas join.



import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
print("First DataFrame:\n", df1)
technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print("Second DataFRame:\n", df2)

Yields below output.

pandas join

Now, perform the default join operation. Since we have a common column Courses on both DataFrames we have to specify the lsuffix (left suffix) or rsuffix (right suffix) param, not specifying these you will get an error.


# Pandas join 
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(After joining two DataFrames:\n", df3)

Yields below output. As specified by default it uses left join and performs join on the row index.

pandas join

Inner Join DataFrames

Pandas inner join is mostly used join, It is used to join two DataFrames on indexes. When indexes don’t match the rows get dropped from both DataFrames.


# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)

# Outputs:
#   Courses_left    Fee Duration Courses_right  Discount
# r1        Spark  20000   30days         Spark      2000
# r3       Python   22000   35days        Python      1200

When we apply Inner join on our DataFrames, It drops DataFrame rows with indexes r2 and r4 from left DataFrame and drops indexes r5 and r6 from right DataFrame as these don’t have matches.

Right Join DataFrames

Also called Right Outer Join – This join is the opposite of left join, here it returns all rows from the right DataFrame regardless of math found on the left. When the join expression doesn’t match, it assigns null for that record and drops records from left where match not found.


# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
print(df3)

# Outputs:
#   Courses_left      Fee Duration Courses_right  Discount
# r1        Spark  20000.0   30days         Spark      2000
# r6          NaN      NaN      NaN          Java      2300
# r3       Python  22000.0   35days        Python      1200
# r5          NaN      NaN      NaN            Go      2000

When we apply right join on our DataFrames, It gets all rows from right DataFrame but it assigns NaN to left DataFrame columns for rows that don’t have the match on right DataFrame.

Outer Join DataFrames

 Also called Full Outer Join – Returns all rows from both DataFrames. Where join expression doesn’t match it returns null on respective cells.


# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)

# Outputs:
#   Courses_left      Fee Duration Courses_right  Discount
# r1        Spark  20000.0   30days         Spark    2000.0
# r2      PySpark  25000.0   40days           NaN       NaN
# r3       Python  22000.0   35days        Python    1200.0
# r4       pandas  30000.0   50days           NaN       NaN
# r5          NaN      NaN      NaN            Go    2000.0
# r6          NaN      NaN      NaN          Java    2300.0

When we apply outer join on our DataFrames, It gets all rows from left and right but it assigns NaN to left and right DataFrame columns for rows that don’t have a match.

Pandas Join on Column

In case, if you want to join on columns, use pandas.merge() method or set the column you wanted to join on to Index and use it. The below example demonstrates how to set the column to Index in pandas and use it for joining. df1.set_index('Courses') is used to convert the column to an index.


# Pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

# Outputs:
#           Fee Duration  Discount
# Courses                          
# Spark    20000   30days      2000
# Python   22000   35days      1200

You can also try the below example where it persists the original index


# Pandas join
df3=df1.join(df2.set_index('Courses'), how='inner', on='Courses')
print(df3)

Complete Example


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(df3)

# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)

# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
print(df3)

# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)


# Pandas join on column
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

# Pandas join
df3=df1.join(df2.set_index('Courses'), how='inner', on='Courses')
print(df3)

Frequently Asked Questions on Pandas Join

What is the purpose of joining DataFrames in Pandas?

Joining DataFrames in Pandas allows you to combine data from multiple sources based on common columns or indices. It helps in merging datasets to perform analysis, compare data, or create a single unified dataset.

How do I perform an inner join using Pandas?

You can perform an inner join using the pd.merge() function with the how='inner' parameter.

Can you illustrate how to do a left join in Pandas?

Performing a left join in Pandas allows you to keep all the rows from the left DataFrame and include matching rows from the right DataFrame. Rows from the left DataFrame with no corresponding match in the right DataFrame will have NaN values in the columns from the right DataFrame.

How can I join DataFrames based on indices in Pandas?

To join DataFrames based on their indices in Pandas, you can use the join() method of one DataFrame, specifying the other DataFrame as the argument.

What should I do if I encounter duplicate keys during a join operation?

If you encounter duplicate keys during a join operation, you may need to specify suffixes for the overlapping column names using the suffixes parameter in pd.merge() or DataFrame.join(). Alternatively, you can rename the columns before joining to avoid conflicts.

Conclusion

In this article, you have learned to join DataFrames using the join() method. By default join() method performs join on row indexes by using left join.

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium