Pandas Difference Between Two DataFrames

The pandas DataFrame.compare() function shows the differences between two DataFrames, aligned either column by column or row by row. It is useful when two DataFrames hold nearly the same data with slight changes and we need to see exactly which values differ.

By default, the compare() function aligns the differences column-wise and returns them side by side. It can only compare DataFrames that are identically labeled, that is, DataFrames with the same shape, the same row index, and the same column labels. In this article, I will explain the compare() function, its syntax, and its parameters, and show how to compare two DataFrames with examples.
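
To illustrate the labeling requirement, here is a minimal sketch. The names left and right are illustrative and not part of this article's dataset, and resetting the index is only one possible fix, which assumes the rows correspond by position.

```python
import pandas as pd

# Two small frames whose row labels differ (illustrative data)
left = pd.DataFrame({"Fee": [20000, 25000]}, index=[0, 1])
right = pd.DataFrame({"Fee": [20000, 24000]}, index=[0, 2])

# compare() refuses DataFrames that are not identically labeled
try:
    left.compare(right)
    raised = False
except ValueError as err:
    raised = True
    print("compare() failed:", err)

# Assumption: rows match by position, so a fresh index makes the labels identical
diff = left.reset_index(drop=True).compare(right.reset_index(drop=True))
print(diff)
```

If the rows do not correspond by position, aligning the two DataFrames on a key column first is the safer choice.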

1. Quick Examples of Difference Between Two DataFrames

If you are in a hurry, below are some quick examples of differences between two Pandas DataFrames.


# Below are quick examples

# Example 1: Compare two DataFrames
diff = df.compare(df1)

# Example 2: Show equal values instead of NaN with keep_equal=True
diff = df.compare(df1, keep_equal=True)

# Example 3: Keep the original shape with keep_shape=True
diff = df.compare(df1, keep_shape=True)

# Example 4: Get differences of DataFrames keep equal values and shape
diff = df.compare(df1, keep_equal=True, keep_shape=True)

2. Syntax of DataFrame compare()

Following is the syntax of compare() function to find the differences of DataFrames.


# Syntax of compare() function
DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))

2.1 Parameters

Following are the parameters of the compare() function.

  • other: The DataFrame to compare with the given DataFrame.
  • align_axis: The axis on which the comparison is aligned. The default value is 1 (or ‘columns’), which aligns the differences horizontally, with columns drawn alternately from self and other. If it is set to 0 (or ‘index’), the differences are stacked vertically, with rows drawn alternately from self and other.
  • keep_shape: (bool) Default value is False. If it is True, all rows and columns are kept. Otherwise, only the rows and columns that contain different values are kept.
  • keep_equal: (bool) Default value is False. If it is True, equal values are kept in the result. Otherwise, equal values are shown as NaN.
  • result_names: (tuple) Default value is (‘self’, ‘other’). The names used to label the two DataFrames in the result. This parameter was added in pandas 1.5.0.
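
As a rough illustration of align_axis and result_names, the sketch below compares two small DataFrames three ways. The data is a shortened version of this article's example, the labels "old" and "new" are arbitrary choices, and result_names assumes pandas 1.5.0 or later.

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "NumPy"], "Fee": [20000, 25000]})
df1 = pd.DataFrame({"Courses": ["Spark", "Hadoop"], "Fee": [20000, 24000]})

# align_axis=1 (default): 'self'/'other' form the inner level of the columns
wide = df.compare(df1)

# align_axis=0: 'self'/'other' form the inner level of the row index
tall = df.compare(df1, align_axis=0)

# result_names (pandas 1.5.0+): replace 'self'/'other' with custom labels
named = df.compare(df1, result_names=("old", "new"))

print(wide)
print(tall)
print(named)
```

The wide layout is convenient for reading one row at a time, while the tall layout keeps the original column labels intact.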

2.2 Return Value

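It returns a DataFrame that holds the differences between the given DataFrames. With the default align_axis=1, the resulting DataFrame has MultiIndex columns with ‘self’ and ‘other’ at the innermost level of the column labels. With align_axis=0, ‘self’ and ‘other’ are at the innermost level of the row index instead.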
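
Because the result has two-level column labels, one column's pair of values can be selected by its outer label. The following is a small sketch with shortened sample data; fee_change is a hypothetical name for a derived Series, not something compare() returns.

```python
import pandas as pd

df = pd.DataFrame({"Courses": ["Spark", "NumPy", "PySpark"],
                   "Fee": [20000, 25000, 26000]})
df1 = pd.DataFrame({"Courses": ["Spark", "Hadoop", "PySpark"],
                    "Fee": [20000, 24000, 21000]})

diff = df.compare(df1)

# The columns are (original column, 'self' or 'other') pairs
print(diff.columns.tolist())

# Selecting the outer label returns a DataFrame with 'self' and 'other' columns
fee = diff["Fee"]

# Hypothetical follow-up: how much each differing fee changed
fee_change = fee["other"] - fee["self"]
print(fee_change)
```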

Create DataFrame

Now, let’s create two pandas DataFrames from Python dictionaries, where the columns are Courses, Fee, Duration, and Discount.


# Create DataFrame
import pandas as pd
technologies = ({
    'Courses':["Spark", "NumPy", "pandas", "Java", "PySpark"],
    'Fee' :[20000,25000,30000,22000,26000],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
               })
technologies1 = ({
    'Courses':["Spark", "Hadoop", "pandas", "Java", "PySpark"],
    'Fee' :[20000,24000,30000,22000,21000],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
               })
df = pd.DataFrame(technologies)
print("DataFrame1:\n", df)
df1 = pd.DataFrame(technologies1)
print("DataFrame2:\n", df1)    

Yields below output.


# Output:
DataFrame1:
    Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1    NumPy  25000   40days      2500
2   pandas  30000   35days      1500
3     Java  22000   60days      1200
4  PySpark  26000   50days      3000
DataFrame2:
    Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1   Hadoop  24000   40days      2500
2   pandas  30000   35days      1500
3     Java  22000   60days      1200
4  PySpark  21000   50days      3000

3. Usage of Pandas DataFrame.compare()

The pandas DataFrame.compare() function compares two DataFrames of the same size and labels and, by default, returns the differences column-wise. Set align_axis=0 to stack the differences row by row instead. To get a result with the same size as the original DataFrames, use the keep_shape parameter, and use the keep_equal parameter to show equal values instead of NaN in the resulting DataFrame.

Let’s use the compare() function on the given DataFrames to find the differences between them.


# Compare two DataFrames
diff = df.compare(df1)
print("Difference between two DataFrames:\n", diff)

Yields below output.


# Output:
# Difference between two DataFrames:
   Courses              Fee         
     self   other     self    other
1   NumPy  Hadoop  25000.0  24000.0
4     NaN     NaN  26000.0  21000.0

As we can see from the above, the differences are placed side by side in the resulting DataFrame. In row 4, only Fee differs, so the unchanged Courses value appears as NaN on both sides.

4. Use keep_equal to Get Pandas Difference

In the above example, equal values are shown as NaN in the resulting DataFrame. To show the actual values instead, set keep_equal=True in the compare() function. It replaces the NaN placeholders with the equal values from the given DataFrames.


# Show equal values instead of NaN with keep_equal=True
diff = df.compare(df1, keep_equal=True)
print(diff)

Yields below output.


# Output:
   Courses             Fee       
      self    other   self  other
1    NumPy   Hadoop  25000  24000
4  PySpark  PySpark  26000  21000

5. Using keep_shape to Get Pandas Differences

If we want the result to have the same size as the original DataFrames, we can set keep_shape=True in the compare() function. It returns a DataFrame with all rows and columns, where equal values are shown as NaN. For example,


# Set keep_shape=True to keep the original shape
diff = df.compare(df1, keep_shape=True)
print(diff)

Yields below output.


# Output:
  Courses              Fee          Duration       Discount      
     self   other     self    other     self other     self other
0     NaN     NaN      NaN      NaN      NaN   NaN      NaN   NaN
1   NumPy  Hadoop  25000.0  24000.0      NaN   NaN      NaN   NaN
2     NaN     NaN      NaN      NaN      NaN   NaN      NaN   NaN
3     NaN     NaN      NaN      NaN      NaN   NaN      NaN   NaN
4     NaN     NaN  26000.0  21000.0      NaN   NaN      NaN   NaN

6. Using keep_equal & keep_shape

Set both keep_shape and keep_equal to True in the compare() function to return a same-sized DataFrame that also shows the equal values of the given DataFrames.


# Get differences of DataFrames, keeping equal values and shape
diff = df.compare(df1, keep_equal=True, keep_shape=True)
print(diff)

Yields below output.


# Output:
   Courses             Fee        Duration         Discount      
      self    other   self  other     self   other     self other
0    Spark    Spark  20000  20000   30days  30days     1000  1000
1    NumPy   Hadoop  25000  24000   40days  40days     2500  2500
2   pandas   pandas  30000  30000   35days  35days     1500  1500
3     Java     Java  22000  22000   60days  60days     1200  1200
4  PySpark  PySpark  26000  21000   50days  50days     3000  3000

7. Conclusion

In this article, I have explained the DataFrame.compare() function, its syntax, and its parameters, and how to use it to compare two DataFrames with examples.
