• Post author:
  • Post category:Pandas
  • Post last modified:May 30, 2024
  • Reading time:13 mins read
You are currently viewing Pandas Join DataFrames on Columns

Pandas DataFrame join() method doesn’t support joining two DataFrames on columns as join() is used for indices. However, you can convert column to index and used it on join. The best approach would be using merge() method when you wanted to join on columns. There are several methods for joining DataFrames in Pandas, but the most commonly used ones are merge() and join().

Advertisements

In this article, I will explain join pandas DataFrames on columns when joining DataFrames where the column names are the same, or when they are different, you can still use the pd.merge() function with the left_on and right_on parameters.

Key Points –

  • Pandas allow joining DataFrames on specified columns using the pd.merge() function.
  • Inner, outer, left, and right joins can be performed using different parameters of the pd.merge() function.
  • Specifying the on parameter in pd.merge() allows joining DataFrames on specific columns.
  • Pandas supports joining DataFrames with different column names by specifying left_on and right_on parameters.

Quick Examples of Pandas Join DataFrames on Columns

If you are in a hurry, below are some quick examples of how to join Pandas DataFrames on columns.


# Quick wxamples of pandas join dataframes on columns

# Pandas join on columns
# Use merge() to get efficient results.
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')

# Merge on common columns
merged_df = pd.merge(df,df1)

# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])

# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])

# Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")

# Set value of on parameter to specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")

First, let’s create a Pandas DataFrame using data from a dictionary.


# Create DataFrame
import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
                    'Fee' : [20000,25000,30000,24000,40000],
                    'Duration':['30days','40days','60days','55days','50days']})

df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
                    'Fee': [20000,25000,30000,24000,40000],
                    'Percentage':['10%','20%','25%','20%','10%']})
 
print(df)
print(df1)

Yields below output.


# Output:
   Courses    Fee Duration
0    Spark  20000   30days
1  PySpark  25000   40days
2   Python  30000   60days
3   pandas  24000   55days
4     Java  40000   50days

    Courses    Fee Percentage
0      Java  20000        10%
1   PySpark  25000        20%
2    Python  30000        25%
3    pandas  24000        20%
4  Hyperion  40000        10%

Pandas Join DataFrames on Columns

By default pandas join() method doesn’t support joining DataFrames on columns, but you can do this by converting the column you wish to join to index. To join on columns, the better approach would be using merge().


# Pandas join on columns
df3=df.set_index('Courses').join(df2.set_index('Courses'), how='inner')
print(df3)

Join on All Common Columns of DataFrame

To perform a join on all common columns of two DataFrames, you can simply use the merge() function without specifying the on parameter.

The default behavior of the merge() method is to perform a join operation on all columns that exist in both DataFrames and use an inner join.


# Merge default pandas dataframe without any key column
merged_df = pd.merge(df,df1)
print(merged_df)

# Output:
#    Courses    Fee Duration Percentage
# 0  PySpark  25000   40days        20%
# 1   Python  30000   60days        25%
# 2   pandas  24000   55days        20%

Join on Multiple Columns using merge()

Joining on multiple columns using the merge() function means that you’re combining two DataFrames based on the values in more than one column. When you specify multiple columns in the on parameter of the merge() function, pandas look for rows where the values in all specified columns match between the two DataFrames. Only the rows that satisfy this condition will be included in the resulting DataFrame.


# Use pandas.merge() on multiple columns
df2 = pd.merge(df, df1, on=['Courses','Fee'])
print(df2)

In the above example, pd.merge() is used to join df and df1 DataFrames based on multiple columns (Courses and Fee). The resulting DataFrame df2 will contain rows where both Courses and Fee match in both DataFrames.

When Column Names are Different

When the column names are different between the DataFrames, but you still want to join them based on specific columns, you can use the pd.merge() function with the left_on and right_on parameters. This allows you to specify the columns from each DataFrame that should be used for the join operation.


# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df2)

# Output:
#    Courses    Fee Duration Percentage
# 0    Spark  20000   30days        NaN
# 1  PySpark  25000   40days        20%
# 2   Python  30000   60days        25%
# 3   pandas  24000   55days        20%
# 4     Java  40000   50days        NaN

Complete Example of Join DataFrames


import pandas as pd
df = pd.DataFrame({'Courses': ["Spark","PySpark","Python","pandas","Java"],
                    'Fee' : [20000,25000,30000,24000,40000],
                    'Duration':['30days','40days','60days','55days','50days']})

df1 = pd.DataFrame({'Courses': ["Java","PySpark","Python","pandas","Hyperion"],
                    'Fee': [20000,25000,30000,24000,40000],
                    'Percentage':['10%','20%','25%','20%','10%']})
 
print(df)
print(df1)

# Merge default pandas DataFrame without any key column
merged_df = pd.merge(df,df1)
print(merged_df)

# Use pandas.merge() to on multiple columns
df2 = pd.merge(df, df1,  how='left', left_on=['Courses','Fee'], right_on = ['Courses','Fee'])
print(df2)

# Merge Pandas DataFrames using left_on and right_on
merged_df = pd.merge(df, df1, left_on="Courses", right_on="Courses")
print(merged_df)

# Set value of on parameter 
# To specify the key value for merge in pandas
merged_df = pd.merge(df, df1, on="Courses")
print(merged_df)

Frequently Asked Questions on Join DataFrames on Columns

What is the purpose of joining DataFrames in Pandas?

The purpose of joining DataFrames in Pandas is to combine data from different sources based on common columns or indices. By joining DataFrames, you can create a unified dataset that incorporates relevant information from multiple sources, facilitating comprehensive data analysis, exploration, and manipulation.

What are the different types of joins supported by Pandas?

Pandas support inner, outer, left, and right joins, which determine how rows from the input DataFrames are combined based on the values in the specified columns.

Can I join multiple DataFrames together?

You can join multiple DataFrames together by chaining multiple pd.merge() functions or by using the join() method with DataFrame objects.

How to perform a join on columns in Pandas?

You can use the pd.merge() function to join DataFrames on columns. Specify the on parameter to specify the column(s) to join on.

Conclusion

In this article, I have explained joining pandas DataFrames on columns with examples. The DataFrame join() method doesn’t support joining two DataFrames on columns as join() is used for indices. However, you can convert the column to index and use it on join. The best approach would be using merge() method when you wanted to join on columns.

Happy Learning !!

References