Pandas Merge DataFrames Explained Examples

  • Post author:
  • Post category:Pandas / Python
  • Post last modified:January 27, 2022

pandas support pandas.merge() and DataFrame.merge() to merge DataFrames which is exactly similar to SQL join and supports different types of join inner, left, right, outer, cross. By default, it uses inner join where keys don’t match the rows get dropped from both DataFrames and the result DataFrame contains rows that match on both.

In this article, I will explain how to merge DataFrames with examples like merging by columns, merging by index, merging on multiple columns, and different join types. In case if you wanted to merge on Indexes use pandas join() which default supports joining on index.

Pandas Merge Key Points

  • pandas.merge() and DataFrame.merge() works same way that merges two or more DataFrames.
  • This works similar to Database joins
  • Supports joins on rows and columns
  • When doing join on columns, it ignores indexes.
  • When joining on index, resultant DataFrame contains indexes from sources.
  • When no params are used, by default join happends on all common columns.

1. Quick Examples of pandas merge() DataFrames

Below are some quick examples of the pandas merge() method that merges two or multiple DataFrames.


# Quick Examples of pandas merge DataFrames
# pandas.merge()
df3=pd.merge(df1,df2)

# DataFrame.merge()
df3=df1.merge(df2)

# Merge by column
df3=pd.merge(df1,df2, on='Courses')

# Merge on different colunn names
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')

# Merge by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)

# Merge by multiple columns
df3 = pd.merge(df3, df1,  how='left', left_on=['Col1','col2'], right_on = ['col1','col2'])

# Merge by left join
df3=pd.merge(df1,df2, on='Courses', how='left')

# Merge by right join
df3=pd.merge(df1,df2, on='Courses', how='right')

# Merge by outer join
df3=pd.merge(df1,df2, on='Courses', how='outer')

2. Pandas Merge DataFrames

Method pandas.merge() and DataFrame.merge() are used to merge two or multiple DataFrames. Both these methods work exactly the same and they also take a similar number of params. Merging DataFrames is nothing but joining DataFrames similar to Database join.

Similar to the Database join, merge() method also supports several join types like left, right, inner, outer and cross. Use how param to specify the join type.

By default merge() uses inner join on columns that are present on both DataFrames. You can also explicitly specify the columns you wanted to join on and join by row index.

Though DataFrame also has a join() method there are slight differences between merge() and join() methods.

Now, let’s see syntax and usage with examples. By using merge you can also merge series into DataFrame.

2.1 pandas.merge() Syntax

Below is the syntax and usage of pandas.merge() method. For the latest syntax refer to pandas.merge()


# pandas.merge() Syntax
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

2.2 pandas.DataFrame.merge() Syntax

Below is the syntax and usage of pandas.DataFrame.merge() method.


# pandas.DataFrame.merge() Syntax
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Note: Both these merge() methods are similar and take the same type of arguments.

3. Pandas Merge DataFrames Examples

Let’s create two DataFrame’s to demonstrate how merge works?


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

Yields below output

pandas merge dataframes

3.1 pandas.merge() DataFrames

In my first example of merge(), I will use default params where it does inner join on the same columns presented on both DataFrames. From our DataFrame’s the common column is Courses. If you want to concat DataFrames use pandas.concat() method.


# Using pandas.merge()
df3= pd.merge(df1,df2)

# Using DataFrame.merge()
df3=df1.merge(df2)

Yields below output.


  Courses    Fee Duration  Discount
0   Spark  20000   30days      2000
1  Python  22000   35days      1200

3.2 Merge DataFrames on Columns

Sometimes it is required to specify the column names you wanted to merge, to do so use on param. For example df3=pd.merge(df1,df2, on='Courses').


# Merge DataFrames by Columns
df3=pd.merge(df1,df2, on='Courses')

Yields same output as above first example. In case if you have different columns on left and right and want to join on these use left_on and right_on params.


# When column names are different
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')
print(df3)

3.3 Merge DataFrames on Index

You can also Merge DataFrames by Index using left_index and right_index params. If you notice the above examples, it ignored the index from the merged DataFrame result. Joining on Index retains the indexes.

Note that you can also use index from the left and columns from the right DataFrame and vice-versa to perform the merge.


# Merge DataFrames by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)
print(df3)

Yields


   Courses_x    Fee Duration Courses_y  Discount
r1     Spark  20000   30days     Spark      2000
r3    Python  22000   35days    Python      1200

3.4 Merging DataFrames on Multiple Columns

Use param on with a list of column names when you wanted to merge DataFrames by multiple columns. It also supports different column names on the left and right DataFrames, below is an example.


# Use pandas.merge() to on multiple columns
df3 = pd.merge(df3, df1,  how='left', left_on=['Col1','col2'], right_on = ['col1','col2'])
print(df3)

3.5 Merging Using Different Join Types

Merge is similar the SQL join hence, it supports different join types inner, left, right, outer. By defaults, merge uses inner join. Let’s see other join types in this section.

Left Join – It is used to perform left join on DataFrames, also called Left Outer Join that returns all rows from the left DataFrame regardless of match found on the right DataFrame. When the join expression doesn’t match, it assigns null for that record and drops records from right where match not found.


# Merge by left Join
df3=pd.merge(df1,df2, on='Courses', how='left')
print(df3)

# Outputs
#    Courses    Fee Duration  Discount
# 0    Spark  20000   30days    2000.0
# 1  PySpark  25000   40days       NaN
# 2   Python  22000   35days    1200.0
# 3   pandas  30000   50days       NaN

Right Join – Also called Right Outer Join – This join is the opposite of left join, here it returns all rows from the right DataFrame regardless of math found on the left. When the join expression doesn’t match, it assigns null for that record and drops records from left where match not found.


# Merge by right Join
df3=pd.merge(df1,df2, on='Courses', how='right')
print(df3)

# Outputs
#   Courses      Fee Duration  Discount
# 0   Spark  20000.0   30days      2000
# 1    Java      NaN      NaN      2300
# 2  Python  22000.0   35days      1200
# 3      Go      NaN      NaN      2000

Outer Join – It is used to perform outer join on DataFrames, Also called Full Outer Join – Returns all rows from both DataFrames. Where join expression doesn’t match it returns null on respective cells.


# Merge by outer Join
df3=pd.merge(df1,df2, on='Courses', how='outer')
print(df3)

# Outputs
#    Courses      Fee Duration  Discount
# 0    Spark  20000.0   30days    2000.0
# 1  PySpark  25000.0   40days       NaN
# 2   Python  22000.0   35days    1200.0
# 3   pandas  30000.0   50days       NaN
# 4     Java      NaN      NaN    2300.0
# 5       Go      NaN      NaN    2000.0

5. Complete Example


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Python","pandas"],
    'Fee' :[20000,25000,22000,30000],
    'Duration':['30days','40days','35days','50days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)

technologies2 = {
    'Courses':["Spark","Java","Python","Go"],
    'Discount':[2000,2300,1200,2000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)

print(df1)
print(df2)

# pandas.merge()
df3=pd.merge(df1,df2)
print(df3)

# DataFrame.merge()
df3=df1.merge(df2)
print(df3)

# Merge by column
df3=pd.merge(df1,df2, on='Courses')
print(df3)

# Merge on different colunn names
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')
print(df3)

# Merge by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)
print(df3)

# Use pandas.merge() to on multiple columns
df3 = pd.merge(df3, df1,  how='left', left_on=['Col1','col2'], right_on = ['col1','col2'])
print(df3)

# Merge by left join
df3=pd.merge(df1,df2, on='Courses', how='left')
print(df3)

# Merge by right join
df3=pd.merge(df1,df2, on='Courses', how='right')
print(df3)

# Merge by outer join
df3=pd.merge(df1,df2, on='Courses', how='outer')
print(df3)

Conclusion

I hope you have learned the pandas merge() method that merges the DataFrame. Merging is a join operation that combines the columns from both DataFrames. Merge also supports joining on columns, index, and combination of both.

References

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing Pandas Merge DataFrames Explained Examples