Pandas DataFrame join() method doesn’t support joining two DataFrames on columns as join() is used for indices. However, you can convert column to index and used it on join. The best approach would be using merge() method when you wanted to join on columns.
In this article, I will explain how to join pandas DataFrames on columns when columns on the left and right DataFrames are the same and when column names are different by using join() and merge() methods.
Key Points –
- Pandas allow joining DataFrames on specified columns using the
pd.merge()
function. - Inner, outer, left, and right joins can be performed using different parameters of the
pd.merge()
function. - Specifying the on parameter in
pd.merge()
allows joining DataFrames on specific columns. - Pandas supports joining DataFrames with different column names by specifying
left_on
andright_on
parameters.
1. Quick Examples of Pandas Join DataFrames on Columns
If you are in a hurry, below are some quick examples of how to join Pandas DataFrames on multiple columns.
# Below are the quick examples
# Pandas join on columns.
# Use merge() to get efficient results.
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
# Merge on common columns
merged_df = pd.merge(df,df1)
# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])
# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
# Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")
# Set value of on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")
Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. The first DataFrame contains column names Courses, Fee, Duration and the second DataFrame contains column names Courses
,Fee
,Percentage
.
# Create DataFrame
import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
'Fee' : [20000,25000,30000,24000,40000],
'Duration':['30day','40days','60days','55days','50days']})
df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
'Fee': [20000,25000,30000,24000,40000],
'Percentage':['10%','20%','25%','20%','10%']})
print(df)
print(df1)
Yields below output.
# Output:
Courses Fee Duration
0 Spark 20000 30day
1 PySpark 25000 40days
2 Python 30000 60days
3 pandas 24000 55days
4 Java 40000 50days
Courses Fee Percentage
0 Java 20000 10%
1 PySpark 25000 20%
2 Python 30000 25%
3 pandas 24000 20%
4 Hyperion 40000 10%
2. Pandas Join DataFrames on Columns
By default pandas join() method doesn’t support joining DataFrames on columns, but you can do this by converting the column you wish to join to index. To join on columns, the better approach would be using merge().
# Pandas join on columns
df3=df.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)
3. Join on All Common Columns of DataFrame
By default, the merge() method applies join contains on all columns that are present on both DataFrames and uses inner join.
# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)
print(merged_df)
Yields below output.
# Output:
Courses Fee Duration Percentage
0 PySpark 25000 40days 20%
1 Python 30000 60days 25%
2 pandas 24000 55days 20%
4. Join on Multiple Columns using merge()
You can also explicitly specify the column names you wanted to use for joining. To use column names use on
param. This also takes a list of names when you wanted to join on multiple columns.
# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])
print(df2)
Yields the same output as above.
5. When Column Names are Different
When you have column names on left and right are different and want to use these as a join column, use left_on
and right_on
parameters. This also takes a list of column names as values to join on multiple columns. The left_on
will be set to the name of the column in the left DataFrame and right_on
will be set to the name of the column in the right DataFrame.
# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1, how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df2)
Yields below output.
# Output:
Courses Fee Duration Percentage
0 Spark 20000 30day NaN
1 PySpark 25000 40days 20%
2 Python 30000 60days 25%
3 pandas 24000 55days 20%
4 Java 40000 50days NaN
6. Complete Example of Join DataFrames on Columns
import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
'Fee' : [20000,25000,30000,24000,40000],
'Duration':['30day','40days','60days','55days','50days']})
df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
'Fee': [20000,25000,30000,24000,40000],
'Percentage':['10%','20%','25%','20%','10%']})
print(df)
print(df1)
# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)
print(merged_df)
# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1, how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df2)
# Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")
print(merged_df)
# Set value of on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")
print(merged_df)
Frequently Asked Questions on Pandas Join DataFrames on Columns
The purpose of joining DataFrames in Pandas is to combine data from different sources based on common columns or indices. By joining DataFrames, you can create a unified dataset that incorporates relevant information from multiple sources, facilitating comprehensive data analysis, exploration, and manipulation.
Pandas supports inner, outer, left, and right joins, which determine how rows from the input DataFrames are combined based on the values in the specified columns.
You can use the pd.merge()
function and specify the DataFrames to join (left
and right
), the columns to join on (on
), and the type of join (how
, defaulting to ‘inner’).
Pandas allows you to join DataFrames with different column names by specifying the left_on
and right_on
parameters in the pd.merge()
function.
You can join multiple DataFrames together by chaining multiple pd.merge()
functions or by using the join()
method with DataFrame objects.
Conclusion
In this article, I have explained joining pandas DataFrames on columns with examples. DataFrame join() method doesn’t support joining two DataFrames on columns as join() is used for indices. However, you can convert column to index and used it on join. The best approach would be using merge() method when you wanted to join on columns.
Happy Learning !!
Related Articles
- Pandas Difference Between map, applymap and apply Methods
- Combine Two Pandas DataFrames With Examples
- Convert Pandas DataFrame to Dictionary (Dict)
- Iterate Over Columns of Pandas DataFrame
- Pandas Get Total | Sum of Column
- Differences between Pandas Join vs Merge
- Pandas Join Two DataFrames
- Pandas DataFrame insert() Function
- pandas.DataFrame.where() Examples
- Pandas Join Explained With Examples
- Pandas Outer Join Explained By Examples
- pandas rolling() Mean, Average, Sum Examples