Compare Two DataFrames Row by Row

The pandas DataFrame.compare() function compares two DataFrames and returns only the values that differ. It is useful when two DataFrames hold the same data with slight changes and we need to see exactly what changed. By default (align_axis=1), compare() places the differences side by side as columns; setting align_axis=0 stacks them row by row instead. It can compare only DataFrames that have the same shape and identical row indexes and column labels.

In this article, I will explain the compare() function, its syntax, and its parameters, and show with examples how to compare two DataFrames row by row.

1. Quick Examples of Comparing Two DataFrames Row by Row

If you are in a hurry, below are some quick examples of comparing two DataFrames row by row.


# Below are quick examples

# Example 1: Compare two DataFrames row by row
diff = df.compare(df1, align_axis = 0)

# Example 2: Set keep_equal=True to show equal values instead of NaN
diff = df.compare(df1, keep_equal=True, align_axis = 0)

# Example 3: Set keep_shape=True to keep all rows and columns
diff = df.compare(df1, keep_shape = True, align_axis = 0)

# Example 4: Get differences of DataFrames keep equal values and shape
diff = df.compare(df1, keep_equal=True, keep_shape = True, align_axis = 0)

2. Syntax of Pandas df.compare()

Following is the syntax of pandas compare() function.


# Following is the syntax of compare() function
DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))

2.1 Parameters

Following are the parameters of the compare() function.

  • other: The DataFrame object to compare with the calling DataFrame.
  • align_axis: The axis along which the differences are aligned. The default is 1 (or 'columns'), which aligns the differences horizontally, with columns drawn alternately from self and other. With 0 (or 'index'), the differences are stacked vertically, with rows drawn alternately from self and other.
  • keep_shape: (bool) Default is False. If True, all rows and columns are kept. Otherwise, only the rows and columns that contain differences are kept.
  • keep_equal: (bool) Default is False. If True, equal values are kept in the result. Otherwise, equal values are shown as NaN.
  • result_names: (tuple) Default is ('self', 'other'). The names used to label the two DataFrames in the result.
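As a minimal sketch of the result_names parameter, the example below labels the two sides 'old' and 'new' instead of 'self' and 'other'. The frames old_df and new_df and the label names are made up for illustration, and the sketch assumes a pandas version that supports result_names (it was added in pandas 1.5.0).

```python
import pandas as pd

# Hypothetical frames for illustration; only row 1 of Fee differs
old_df = pd.DataFrame({"Fee": [20000, 25000], "Discount": [1000, 2500]})
new_df = pd.DataFrame({"Fee": [20000, 24000], "Discount": [1000, 2500]})

# Label the two sides "old" and "new" instead of the default "self" and "other"
labeled = old_df.compare(new_df, align_axis=0, result_names=("old", "new"))
print(labeled)
```

The inner level of the row index now reads old and new, which can be easier to follow when the two DataFrames are snapshots taken at different times.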

2.2 Return Value

It returns a DataFrame whose elements are the differences between the two DataFrames. The result has a MultiIndex with 'self' and 'other' at the innermost level: on the columns by default (align_axis=1), or on the row index when align_axis=0.
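To see where the 'self' and 'other' labels end up, here is a small sketch that runs the same comparison with both values of align_axis. The frames left and right are hypothetical and are not the DataFrames used later in this article.

```python
import pandas as pd

# Hypothetical frames for illustration; only row 1 of Fee differs
left = pd.DataFrame({"Fee": [20000, 25000], "Discount": [1000, 2500]})
right = pd.DataFrame({"Fee": [20000, 24000], "Discount": [1000, 2500]})

# Default align_axis=1: 'self' and 'other' become the inner column level
wide = left.compare(right)
print(wide.columns.tolist())

# align_axis=0: 'self' and 'other' become the inner row-index level
tall = left.compare(right, align_axis=0)
print(tall.index.tolist())
```

In the first result the two sides sit next to each other as columns under Fee; in the second they are stacked as two rows under the original row label 1.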

Create DataFrame

Now, let’s create two pandas DataFrames from Python dictionaries, with the columns Courses, Fee, Duration, and Discount.


# Create DataFrame
import pandas as pd
technologies = ({
    'Courses':["Spark", "NumPy", "pandas", "Java", "PySpark"],
    'Fee' :[20000,25000,30000,22000,26000],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
               })
technologies1 = ({
    'Courses':["Spark", "Hadoop", "pandas", "Java", "PySpark"],
    'Fee' :[20000,24000,30000,22000,21000],
    'Duration':['30days','40days','35days','60days','50days'],
    'Discount':[1000,2500,1500,1200,3000]
               })
df = pd.DataFrame(technologies)
print("DataFrame1:\n", df)
df1 = pd.DataFrame(technologies1)
print("DataFrame2:\n", df1)    

Yields below output.


DataFrame1:
    Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1    NumPy  25000   40days      2500
2   pandas  30000   35days      1500
3     Java  22000   60days      1200
4  PySpark  26000   50days      3000
DataFrame2:
    Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1   Hadoop  24000   40days      2500
2   pandas  30000   35days      1500
3     Java  22000   60days      1200
4  PySpark  21000   50days      3000

3. Usage of Pandas DataFrame.compare() Function

With align_axis=0, the pandas DataFrame.compare() function compares two DataFrames of the same shape and labels and stacks the unequal values row by row in the returned DataFrame. By default, it aligns the differences column by column. To get a result that keeps all rows and columns of the inputs, use the keep_shape parameter; to show equal values instead of NaN in the result, use the keep_equal parameter.
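As a quick sketch of the same-shape requirement, the snippet below tries to compare two DataFrames with different numbers of rows. The frames base and shorter are hypothetical; in the pandas versions I am aware of, this raises a ValueError, which the sketch catches and prints.

```python
import pandas as pd

# Hypothetical frames for illustration; 'shorter' has one row fewer than 'base'
base = pd.DataFrame({"Fee": [20000, 25000]})
shorter = pd.DataFrame({"Fee": [20000]})

try:
    base.compare(shorter)
    error_message = None
except ValueError as exc:
    # compare() refuses DataFrames whose labels do not match
    error_message = str(exc)

print(error_message)
```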

Let’s use the compare() function on df and df1 with align_axis=0 to find the differences between the two DataFrames row by row.


# Comparing the two DataFrames row by row
diff = df.compare(df1, align_axis = 0)
print("Difference between two DataFrames:\n", diff)

Yields below output.


# Output:
# Difference between two DataFrames:
         Courses      Fee
1 self    NumPy  25000.0
  other  Hadoop  24000.0
4 self      NaN  26000.0
  other     NaN  21000.0

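As we can see from the above, the differences are stacked vertically in the resulting DataFrame: each differing row appears twice, once for self (df) and once for other (df1). Values that are equal in both DataFrames, such as Courses in row 4, are shown as NaN.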
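Once the differences are stacked this way, the row MultiIndex can be used to pull out which rows changed. The following is a minimal sketch that rebuilds only the Courses and Fee columns of df and df1; the variable names changed_rows and self_side are my own.

```python
import pandas as pd

# Reduced copies of the article's DataFrames (Courses and Fee only)
df = pd.DataFrame({
    'Courses': ["Spark", "NumPy", "pandas", "Java", "PySpark"],
    'Fee': [20000, 25000, 30000, 22000, 26000],
})
df1 = pd.DataFrame({
    'Courses': ["Spark", "Hadoop", "pandas", "Java", "PySpark"],
    'Fee': [20000, 24000, 30000, 22000, 21000],
})
diff = df.compare(df1, align_axis=0)

# Level 0 of the row MultiIndex holds the original row labels
changed_rows = diff.index.get_level_values(0).unique().tolist()
print(changed_rows)

# Select only the 'self' side (values from df) using the inner index level
self_side = diff.xs('self', level=1)
print(self_side)
```

Here changed_rows contains the labels 1 and 4, matching the rows shown in the output above.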

4. Pass keep_equal into compare() & Compare

As we can see from the above, equal values are shown as NaN in the resulting DataFrame. To show the actual values instead, set keep_equal=True when calling compare(). The NaN placeholders are then replaced with the equal values from the given DataFrames.


# Show equal values instead of NaN with keep_equal=True
diff = df.compare(df1, keep_equal=True, align_axis = 0)
print(diff)

Yields below output.


# Output:
         Courses    Fee
1 self     NumPy  25000
  other   Hadoop  24000
4 self   PySpark  26000
  other  PySpark  21000
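The outputs above also differ in how Fee is displayed: 25000.0 without keep_equal and 25000 with it. A short sketch of why, using reduced copies of df and df1: masking equal values with NaN converts the integer column to a float dtype, while keep_equal=True leaves the integers untouched. The names masked and filled are my own.

```python
import pandas as pd

# Reduced copies of the article's DataFrames (Courses and Fee only)
df = pd.DataFrame({
    'Courses': ["Spark", "NumPy", "pandas", "Java", "PySpark"],
    'Fee': [20000, 25000, 30000, 22000, 26000],
})
df1 = pd.DataFrame({
    'Courses': ["Spark", "Hadoop", "pandas", "Java", "PySpark"],
    'Fee': [20000, 24000, 30000, 22000, 21000],
})

# Equal values are masked with NaN, so Fee is converted to a float dtype
masked = df.compare(df1, align_axis=0)
print(masked['Fee'].dtype)

# Equal values are kept, so Fee stays an integer dtype
filled = df.compare(df1, align_axis=0, keep_equal=True)
print(filled['Fee'].dtype)
```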

5. Pass keep_shape into compare() & Compare Pandas Row by Row

To get a result that keeps all rows and columns of the original DataFrames, set keep_shape=True when calling compare(). It returns a DataFrame that includes every row and column, with equal values shown as NaN. For example,


# Set keep_shape=True to keep all rows and columns
diff = df.compare(df1, keep_shape = True, align_axis = 0)
print(diff)

Yields below output.


# Output:
        Courses      Fee Duration  Discount
0 self      NaN      NaN      NaN       NaN
  other     NaN      NaN      NaN       NaN
1 self    NumPy  25000.0      NaN       NaN
  other  Hadoop  24000.0      NaN       NaN
2 self      NaN      NaN      NaN       NaN
  other     NaN      NaN      NaN       NaN
3 self      NaN      NaN      NaN       NaN
  other     NaN      NaN      NaN       NaN
4 self      NaN  26000.0      NaN       NaN
  other     NaN  21000.0      NaN       NaN
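To make the effect of keep_shape concrete, the sketch below compares the shapes of the two results. It uses reduced copies of df and df1 with three columns (Courses, Fee, and Discount); the names compact and full are my own.

```python
import pandas as pd

# Reduced copies of the article's DataFrames; Discount is identical in both
df = pd.DataFrame({
    'Courses': ["Spark", "NumPy", "pandas", "Java", "PySpark"],
    'Fee': [20000, 25000, 30000, 22000, 26000],
    'Discount': [1000, 2500, 1500, 1200, 3000],
})
df1 = pd.DataFrame({
    'Courses': ["Spark", "Hadoop", "pandas", "Java", "PySpark"],
    'Fee': [20000, 24000, 30000, 22000, 21000],
    'Discount': [1000, 2500, 1500, 1200, 3000],
})

# Only the differing rows (1 and 4) and differing columns (Courses, Fee) remain
compact = df.compare(df1, align_axis=0)
print(compact.shape)

# Every row and column is kept; each of the 5 rows appears as self and other
full = df.compare(df1, align_axis=0, keep_shape=True)
print(full.shape)
```

With align_axis=0 the row count doubles because every original row is listed once for self and once for other.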

6. Pass keep_equal & keep_shape into compare()

Set both keep_shape and keep_equal to True when calling compare(). It returns a DataFrame that keeps all rows and columns and shows the equal values of the given DataFrames instead of NaN.


# Get differences of DataFrames keep equal values and shape
diff = df.compare(df1, keep_equal=True, keep_shape = True, align_axis = 0)
print(diff)

Yields below output.


# Output:
         Courses    Fee Duration  Discount
0 self     Spark  20000   30days      1000
  other    Spark  20000   30days      1000
1 self     NumPy  25000   40days      2500
  other   Hadoop  24000   40days      2500
2 self    pandas  30000   35days      1500
  other   pandas  30000   35days      1500
3 self      Java  22000   60days      1200
  other     Java  22000   60days      1200
4 self   PySpark  26000   50days      3000
  other  PySpark  21000   50days      3000

7. Conclusion

In this article, I have explained the DataFrame.compare() function, its syntax, and its parameters, and shown with examples how to compare two DataFrames row by row using align_axis=0.
