Pandas Merge Multiple DataFrames

Pandas merge() function is used to merge multiple Dataframes. We can use either pandas.merge() or DataFrame.merge() to merge multiple Dataframes. Merging multiple Dataframes is similar to SQL join and supports different types of join innerleftrightoutercross.

In this article, we will learn how to merge multiple (three or more) Dataframes with examples.



import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)
technologies3 = {
    'Instructor': ["Jerry", "Tom", "Smith", "John"],
    'Courses': ["Spark", "Java", "Python", "Scala"]
}
index_labels3 = ['r1','r6','r3','r7']
df3 = pd.DataFrame(technologies3, index=index_labels3)

print(df1)
print(df2)
print(df3)

Yields below the output of three DataFrames.


  Courses    Fee Duration
r1    Spark  20000   30days
r2  PySpark  25000   40days
r3   Python  22000   35days
r4   pandas  30000   50days
   Courses  Discount
r1   Spark      2000
r6    Java      2300
r3  Python      1200
r5      Go      2000
   Instructor Courses
r1      Jerry   Spark
r6        Tom    Java
r3      Smith  Python
r7       John   Scala

1. pandas Merge on Multiple DataFrames Example

Method pandas.merge() and DataFrame.merge() are used to merge two or multiple DataFrames. Both these methods work exactly the same and they also take a similar number of params. Merging DataFrames is nothing but joining DataFrames similar to Database join.

Similar to the Database join, merge() method also supports several join types like leftrightinnerouter and cross. Use how param to specify the join type.

By default merge() uses inner join on columns that are present on both DataFrames. You can also explicitly specify the columns you wanted to join on and join by row index


df4 = pd.merge(pd.merge(df1,df2,on='Courses'),df3,on='Courses')
print(df4)

Yields below output. You can use the same approach to merge more than three DataFrames.


  Courses    Fee Duration  Discount Instructor
0   Spark  20000   30days      2000      Jerry
1  Python  22000   35days      1200      Smith

Alternatively, you can also use DataFrame.merge() to join multiple pandas DataFrames.


# By using DataFrame.merge()
df5 = df1.merge(df2,on='Courses').merge(df3,on='Courses')
print(df5)

Yields same output as above.

2.Join Multiple DataFrames Using Left Join

In the above example merge of three Dataframes is done on the “Courses ” column. If we don’t specify also the merge will be done on the “Courses” column, the default behavior(join on inner) because the only common column on three Dataframes is “Courses”.

Below is the example of merge() using left join


# -*- coding: utf-8 -*-
"""
author SparkByExamples.com
"""

#DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

import pandas as pd
technologies = {
    'Courses': ["Spark", "PySpark", "Python", "pandas"],
    'Fee': [20000, 25000, 22000, 30000],
    'Duration': ['30days', '40days', '35days', '50days'],
}
index_labels = ['r1', 'r2', 'r3', 'r4']
df1 = pd.DataFrame(technologies, index=index_labels)

technologies2 = {
    'Courses': ["Spark", "Java", "Python", "Go"],
    'Discount': [2000, 2300, 1200, 2000]
}
index_labels2 = ['r1', 'r6', 'r3', 'r5']
df2 = pd.DataFrame(technologies2, index=index_labels2)


technologies3 = {
    'Instructor': ["Jerry", "Tom", "Smith", "John"],
    'Courses': ["Spark", "Java", "Python", "Scala"]
}
index_labels3 = ['r1','r6','r3','r7']
df3 = pd.DataFrame(technologies3, index=index_labels3)

print(df1)
print(df2)
print(df3)

#df4 = pd.merge(pd.merge(df1,df2,on='Courses'),df3,on='Courses')
#print(df4)

#df5 = df1.merge(df2,on='Courses').merge(df3,on='Courses')
#print(df5)

df6 = df1.merge(df2,how ='left').merge(df3,how ='left')
print(df6)

The above example gives the following output


    Courses    Fee Duration
r1    Spark  20000   30days
r2  PySpark  25000   40days
r3   Python  22000   35days
r4   pandas  30000   50days
   Courses  Discount
r1   Spark      2000
r6    Java      2300
r3  Python      1200
r5      Go      2000
   Instructor Courses
r1      Jerry   Spark
r6        Tom    Java
r3      Smith  Python
r7       John   Scala

   Courses    Fee Duration  Discount Instructor
0    Spark  20000   30days    2000.0      Jerry
1  PySpark  25000   40days       NaN        NaN
2   Python  22000   35days    1200.0      Smith
3   pandas  30000   50days       NaN        NaN

Conclusion

So far we have learned how to merge pandas multiple Dataframes using pd.merge() and df. merge(). Merging is a join operation that combines the columns from multiple DataFrames based on conditions specified in the ‘on’ or ‘how’ clause. Merge also supports joining on columns, index, and combination of both.

Happy Learning !!

You May Also Like Reading

References

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

Pandas Merge Multiple DataFrames