Pandas join()
is similar to SQL join where it combines columns from multiple DataFrames based on row indices. In pandas join can be done only on indexes but not on columns. By default, it uses the left join on the row index. If you want to join on columns you should use pandas.merge() method as this by default performs on columns
Key Points –
- You can join pandas Dataframes similar to joining tables in SQL.
- pandas.DataFrame.join() method is used to join DataFrames.
- pandas also support other methods like concat() and merge() to join DataFrames.
- Joining can be performed using the
pd.merge()
function or DataFrame methods likedf.join()
. - Careful consideration of the join type and key columns is crucial to avoid unintended data duplication or loss during the joining process.
Quick Examples of Pandas Join
Following are quick examples of the pandas join() method.
# Quick examples on join() method
# Pandas default join
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
# Pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
Now, Let’s learn the syntax, and usage, then execute these examples of pandas joining.
Pandas Join Syntax & Usage
Below is the syntax of the pandas join() method.
# Syntax of pandas.DataFrame.join() method
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
other
– Pass the right DataFrame object or list of DataFrame objects.on
– Specify which index you want to join on when you have multiple indexes.how
– Use to specify the join type. Accepts inner, left, right, outer.lsuffix
– Specify the left suffix string to column namesrsuffix
– Specify the right suffix string to column namessort
– To specify the results to be sorted.
Pandas Join DataFrames Example
Pandas join()
method by default performs left join on row index. Let’s create two DataFrames and run the above examples to understand pandas join.
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
print("First DataFrame:\n", df1)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print("Second DataFRame:\n", df2)
Yields below output.
Now, perform the default join operation. Since we have a common column Courses on both DataFrames we have to specify the lsuffix
(left suffix) or rsuffix
(right suffix) param, not specifying these you will get an error.
# Pandas join
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(After joining two DataFrames:\n", df3)
Yields below output. As specified by default it uses left join and performs join on the row index.
Inner Join DataFrames
Pandas inner
join is mostly used join, It is used to join two DataFrames on indexes. When indexes don’t match, rows from both DataFrames are dropped.
# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)
# Outputs:
# Courses_left Fee Duration Courses_right Discount
# r1 Spark 20000 30days Spark 2000
# r3 Python 22000 35days Python 1200
When we apply Inner join on our DataFrames, It drops DataFrame rows with indexes r2 and r4 from left DataFrame and drops indexes r5 and r6 from right DataFrame as these don’t have matches.
Right Join DataFrames
A right
join returns all rows from the right DataFrame and the matched rows from the left DataFrame. If no match is found, NaN is returned for columns from the left DataFrame, and rows from the left DataFrame with no matching rows in the right DataFrame are dropped.
# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
print(df3)
# Outputs:
# Courses_left Fee Duration Courses_right Discount
# r1 Spark 20000.0 30days Spark 2000
# r6 NaN NaN NaN Java 2300
# r3 Python 22000.0 35days Python 1200
# r5 NaN NaN NaN Go 2000
When we apply right join on our DataFrames, It gets all rows from right DataFrame but it assigns NaN to left DataFrame columns for rows that don’t have the match on right DataFrame.
Outer Join DataFrames
An outer join returns all rows from both DataFrames. If no match is found, NaN is returned for the missing side.
# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)
# Outputs:
# Courses_left Fee Duration Courses_right Discount
# r1 Spark 20000.0 30days Spark 2000.0
# r2 PySpark 25000.0 40days NaN NaN
# r3 Python 22000.0 35days Python 1200.0
# r4 pandas 30000.0 50days NaN NaN
# r5 NaN NaN NaN Go 2000.0
# r6 NaN NaN NaN Java 2300.0
When we apply outer join on our DataFrames, It gets all rows from left and right but it assigns NaN to left and right DataFrame columns for rows that don’t have a match.
Pandas Join on Column
In case, if you want to join on columns, use pandas.merge()
method or set the column you wanted to join on to Index and use it. The below example demonstrates how to set the column to Index in pandas and use it for joining. df1.set_index('Courses')
is used to convert the column to an index.
# Pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)
# Outputs:
# Fee Duration Discount
# Courses
# Spark 20000 30days 2000
# Python 22000 35days 1200
You can also try the below example where it persists the original index
# Pandas join
df3=df1.join(df2.set_index('Courses'), how='inner', on='Courses')
print(df3)
Complete Example
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Python","pandas"],
'Fee' :[20000,25000,22000,30000],
'Duration':['30days','40days','35days','50days'],
}
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)
technologies2 = {
'Courses':["Spark","Java","Python","Go"],
'Discount':[2000,2300,1200,2000]
}
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
print(df1)
print(df2)
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(df3)
# Pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)
# Pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
print(df3)
# Pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
print(df3)
# Pandas join on column
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)
# Pandas join
df3=df1.join(df2.set_index('Courses'), how='inner', on='Courses')
print(df3)
Frequently Asked Questions
Joining DataFrames in Pandas allows you to combine data from multiple sources based on common columns or indices. It helps in merging datasets to perform analysis, compare data, or create a single unified dataset.
Performing a left join in Pandas allows you to keep all the rows from the left DataFrame and include matching rows from the right DataFrame. Rows from the left DataFrame with no corresponding match in the right DataFrame will have NaN values in the columns from the right DataFrame.
To join DataFrames based on their indices in Pandas, you can use the join()
method of one DataFrame, specifying the other DataFrame as the argument.
If you encounter duplicate keys during a join operation, you may need to specify suffixes for the overlapping column names using the suffixes
parameter in pd.merge()
or DataFrame.join()
. Alternatively, you can rename the columns before joining to avoid conflicts.
Conclusion
In this article, I have explained how to join DataFrames using the join()
method. By default join()
method performs join on row indexes by using left join.
Related Articles
- Pandas Concat Two DataFrames
- Pandas Merge DataFrames on Index
- Pandas Merge Two DataFrames
- Pandas Add Column Names to DataFrame
- Pandas Concatenate Two Columns
- Pandas Merge Multiple DataFrames
- Differences between Join vs Merge
- Left Join Explained By Examples
- Pandas Merge Two DataFrames
- Join Two DataFrames