  • Post category:Pandas
  • Post last modified:December 6, 2024
Pandas DataFrame corrwith() Method

In pandas, the DataFrame corrwith() method is used to compute the pairwise correlation between rows or columns of two DataFrame objects. This method can be particularly useful when you want to compare the similarity of two datasets by measuring the correlation of their corresponding rows or columns.


In this article, I will explain the Pandas DataFrame corrwith() method, covering its syntax, parameters, and usage, and how it returns a Series containing the correlation coefficients. This makes it easy to interpret the degree of correlation between corresponding columns or rows of the input DataFrames.

Key Points –

  • The corrwith() method computes the pairwise correlation between rows or columns of two DataFrame objects, returning a Series of correlation coefficients.
  • The drop parameter can be set to True to omit from the result any labels that are not present in both objects; with the default drop=False, those labels appear in the result as NaN.
  • corrwith() can be used to compute the correlation of each column or row in a DataFrame with a given Series, offering flexibility in comparing datasets.
  • When working with large datasets, be mindful of performance, as computing correlations can be computationally intensive, especially with the kendall and spearman methods.

Pandas DataFrame corrwith() Introduction

Following is the syntax of the Pandas DataFrame corrwith() method.


# Syntax of Pandas DataFrame corrwith()
DataFrame.corrwith(other, axis=0, drop=False, method='pearson', numeric_only=False)

Parameters of the DataFrame corrwith()

Following are the parameters of the DataFrame corrwith() function.

  • other – DataFrame or Series. The object to compute the correlation with.
  • axis – {0 or ‘index’, 1 or ‘columns’}, default 0
    • If 0 or 'index', correlate matching columns; the result is indexed by column label.
    • If 1 or 'columns', correlate matching rows; the result is indexed by row label.
  • drop – bool, default False. If True, omit from the result any label that is not present in both objects. If False, such labels appear in the result with NaN.
  • method – {‘pearson’, ‘kendall’, ‘spearman’} or callable, default ‘pearson’
    • pearson – Standard correlation coefficient.
    • kendall – Kendall Tau correlation coefficient.
    • spearman – Spearman rank correlation.
    • callable – A function that takes two 1-D arrays and returns a float.
  • numeric_only – bool, default False. If True, include only float, int, or boolean data. Available in pandas 1.5.0 and later.
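Besides the three named methods, the method parameter can also be given a callable that receives two 1-D arrays and returns a single number. The following is a minimal sketch under that assumption; pearson_via_numpy is a hypothetical helper written only for this illustration, and it simply reproduces the built-in Pearson result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

def pearson_via_numpy(a, b):
    # Hypothetical helper: takes two 1-D arrays and returns one float
    return np.corrcoef(a, b)[0, 1]

# Pass the function itself instead of a method name
custom = df.corrwith(df1, method=pearson_via_numpy)
builtin = df.corrwith(df1, method='pearson')

print(custom)
print("Matches built-in Pearson:", np.allclose(custom, builtin))
```

A callable is mainly useful when you need a similarity measure that pandas does not provide by name.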

Return Value

It returns a Series of correlation coefficients, indexed by the column labels (axis=0) or the row labels (axis=1).

Usage of Pandas DataFrame corrwith() Method

The pandas.DataFrame.corrwith() function is used to compute pairwise correlation between rows or columns of two DataFrame objects or between a DataFrame and a Series. This function is useful in various scenarios, such as data analysis, feature selection, and anomaly detection.

To run some examples of pandas DataFrame corrwith() function, let’s create two Pandas DataFrames using data from Python dictionaries, with columns A, B, and C.


# Create DataFrame 
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
print("Create first DataFrame:\n",df)

df1 = pd.DataFrame({'A':[2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})
print("Create Second DataFrame:\n",df1)

This yields the below output.

[Figure: Pandas corrwith]

To compute the correlation between corresponding columns of two DataFrames, you can use the corrwith() method.


# Correlation between corresponding columns
column_correlation = df.corrwith(df1)
print("Column-wise correlation:\n", column_correlation)

In the above example, the corrwith() method returns a Series with the correlation coefficient for each pair of same-named columns in df and df1. For this data, columns A and B are perfectly correlated (1.0), while column C has a weak negative correlation (about -0.298).
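One way to see what the column-wise call does is to correlate each pair of same-named columns by hand with Series.corr() and compare. This is only an illustrative check, not how pandas computes the result internally; the variable names manual and result are chosen for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

# Correlate each pair of same-named columns one at a time
manual = pd.Series({col: df[col].corr(df1[col]) for col in df.columns})

# Let corrwith() do the same in a single call
result = df.corrwith(df1)

print(manual.round(6))
print("Same values:", np.allclose(manual, result))
```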

Using Row-wise Correlation

Alternatively, to calculate the row-wise correlation between two DataFrames in pandas, you can use the corrwith() method with axis=1. This approach computes the correlation between corresponding rows across the DataFrames.


# Compute row-wise correlation
row_correlation = df.corrwith(df1, axis=1)
print("Row-wise correlation:\n", row_correlation)

# Output:
# Row-wise correlation:
#  0   -0.400732
# 1   -0.423415
# 2   -0.810885
# 3   -0.563621
# dtype: float64

Here,

  • df.corrwith(df1, axis=1) calculates the correlation between corresponding rows of df and df1.
  • The result, row_correlation, is a Series where each value represents the correlation coefficient between the corresponding rows of df and df1.
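If axis=1 feels unintuitive, it can help to think of it as transposing both DataFrames and then running the default column-wise correlation. The sketch below checks that the two approaches agree for the sample data; by_rows and by_transpose are names used only here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

# Row-wise correlation using the axis parameter
by_rows = df.corrwith(df1, axis=1)

# The same computation after turning rows into columns
by_transpose = df.T.corrwith(df1.T)

print(by_rows.round(6))
print("Same values:", np.allclose(by_rows.values, by_transpose.values))
```

Note that each row here has only three values (A, B, and C), so every coefficient is based on three points.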

Using Kendall Tau Correlation

To compute the Kendall Tau correlation between corresponding rows or columns of two DataFrames in pandas, you can pass method='kendall' to the corrwith() method. Kendall Tau correlation is a measure of ordinal association between two measured quantities.


# Compute row-wise Kendall Tau correlation
kendall_correlation = df.corrwith(df1, axis=1, method='kendall')
print("Row-wise Kendall Tau correlation:\n", kendall_correlation)

# Output:
# Row-wise Kendall Tau correlation:
#  0   -0.333333
# 1   -0.333333
# 2   -0.333333
# 3   -0.816497
# dtype: float64

Here,

  • df.corrwith(df1, axis=1, method='kendall') calculates the Kendall Tau correlation between corresponding rows of df and df1.
  • The method='kendall' parameter specifies that Kendall Tau correlation should be used.
  • The result, kendall_correlation, is a Series where each value represents the Kendall Tau correlation coefficient between the corresponding rows of df and df1.
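To make the numbers above less opaque, here is a hand-rolled sketch that counts concordant and discordant pairs for a single pair of rows. The values in the output above are consistent with the Tau-b variant, which adjusts for ties, so that is what this sketch implements. kendall_tau_b is a hypothetical helper written for this illustration and is not part of pandas.

```python
from itertools import combinations
from math import sqrt

import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

def kendall_tau_b(x, y):
    """Hypothetical helper: Kendall Tau-b for two equal-length sequences."""
    concordant = discordant = ties_x = ties_y = 0
    # Compare every pair of observations once
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1      # both move in the same direction
        elif dx * dy < 0:
            discordant += 1      # they move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / sqrt((n_pairs - ties_x) * (n_pairs - ties_y))

tau_row0 = kendall_tau_b(df.iloc[0].tolist(), df1.iloc[0].tolist())
tau_row3 = kendall_tau_b(df.iloc[3].tolist(), df1.iloc[3].tolist())
print(round(tau_row0, 6))  # -0.333333
print(round(tau_row3, 6))  # -0.816497
```

Row 3 differs from the others because df1 has a tie in that row (8 appears twice), which changes the denominator.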

Spearman Rank Correlation

To calculate the Spearman rank correlation between corresponding rows or columns of two DataFrames in pandas, you can pass method='spearman' to the corrwith() method. Spearman correlation evaluates the monotonic relationship between two variables, which is based on the ranks of the data rather than the raw data values.


# Compute column-wise Spearman rank correlation
spearman_correlation = df.corrwith(df1, method='spearman')
print("Column-wise Spearman rank correlation:\n", spearman_correlation)

# Output:
# Column-wise Spearman rank correlation:
#  A    1.0
# B    1.0
# C   -0.4
# dtype: float64

Here,

  • df.corrwith(df1, method='spearman') calculates the Spearman rank correlation between corresponding columns of df and df1.
  • The method='spearman' parameter specifies that Spearman rank correlation should be used.
  • The result, spearman_correlation, is a Series where each value represents the Spearman rank correlation coefficient between the corresponding columns of df and df1.
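Since Spearman correlation is the Pearson correlation of the ranks, the result above can be reproduced by ranking each column first and then calling corrwith() with the default method. This is a small sketch of that equivalence for the sample data; ranked is a name used only here.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

# Replace each value by its rank within its column, then use Pearson
ranked = df.rank().corrwith(df1.rank())
print(ranked.round(6))
```

For column C, the ranks are [1, 2, 3, 4] in df and [4, 1, 3, 2] in df1, which gives -0.4.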

Correlation with a Series

Similarly, to compute the correlation between each column of a DataFrame and a Series in pandas, you can use the corrwith() method. This allows you to assess how each column in the DataFrame relates to the values in the Series. The Series is matched to the DataFrame by index label, not by position.


# Create DataFrame 
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})

# Create a Series
ser = pd.Series([3, 7, 12, 8])

# Compute correlation with Series
series_correlation = df.corrwith(ser, axis=0)
print("Correlation with Series:\n", series_correlation)

# Output:
# Correlation with Series:
#  A    0.69843
# B    0.69843
# C    0.69843
# dtype: float64

Here,

  • df.corrwith(ser, axis=0) computes the correlation between each column of the DataFrame df and the Series ser.
  • The axis=0 parameter specifies that the correlation should be computed column-wise.
  • ser is a Series with values [3, 7, 12, 8].
  • The result, series_correlation, is a Series where each value represents the correlation coefficient between the corresponding column of df and the Series ser.
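Values are paired by index label rather than by position, so reordering the Series does not change the result as long as the labels stay attached to the same values. The sketch below illustrates this with a reversed copy of ser; reversed_ser is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
ser = pd.Series([3, 7, 12, 8])

# Same labels and values, but stored in the opposite order (index 3, 2, 1, 0)
reversed_ser = ser.iloc[::-1]

original = df.corrwith(ser)
reordered = df.corrwith(reversed_ser)

print(original.round(6))
print("Same values:", np.allclose(original, reordered))
```

All three columns give the same coefficient here because A, B, and C are linear transformations of one another.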

Dropping Labels with Missing Data

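Finally, the drop parameter controls which labels appear in the result. It does not remove missing values (NaN); those are always excluded pair by pair when each coefficient is computed. Instead, drop=True omits labels (rows or columns) that are not present in both objects, which would otherwise appear in the result as NaN.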


import pandas as pd
import numpy as np

# Create a DataFrame with missing values
df = pd.DataFrame({'A': [5, 10, np.nan, 20], 'B': [2, 4, 6, 8], 'C': [3, np.nan, 7, 9]})

# Create another DataFrame
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

# Compute column-wise correlation; NaN values are skipped pairwise
df2 = df.corrwith(df1, drop=True)
print("Column-wise correlation with drop=True:\n", df2)

# Output:
# Column-wise correlation with drop=True:
#  A    1.000000
# B    1.000000
# C   -0.963123
# dtype: float64

Here,

  • df is a DataFrame with missing values (np.nan).
  • df1 is another DataFrame without missing values.
  • df.corrwith(df1, drop=True) computes the column-wise correlation between df and df1. For each column, rows where either value is NaN are left out of the calculation, so columns A and C are each computed from three rows.
  • Because df and df1 share the same column labels (A, B, and C), drop=True has no visible effect here; drop=False returns the same result.
  • The result, df2, is a Series where each value represents the correlation coefficient between the corresponding columns of df and df1.
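The effect of drop only becomes visible when the two objects do not share all of their labels. The following sketch uses two hypothetical DataFrames, left and right, where column C exists only in the first and column D only in the second.

```python
import pandas as pd

left = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
right = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'D': [15, 3, 12, 8]})

# Default: labels found in only one DataFrame are reported as NaN
kept = left.corrwith(right)
print("drop=False:\n", kept)

# drop=True: only labels present in both DataFrames are returned
dropped = left.corrwith(right, drop=True)
print("drop=True:\n", dropped)
```

With drop=False the result contains A, B, C, and D, with NaN for C and D; with drop=True it contains only A and B.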

Frequently Asked Questions on Pandas DataFrame corrwith() Method

What does the corrwith() method do in Pandas?

The corrwith() method computes the correlation coefficients between corresponding columns or rows of two DataFrames or between a DataFrame and a Series.

How do you use corrwith() in Pandas?

You can use corrwith() by calling it on a DataFrame and passing another DataFrame or Series as an argument. It calculates correlations either column-wise (axis=0) or row-wise (axis=1) based on your choice.

How does drop=True work in corrwith()?

Setting drop=True in corrwith() omits from the result any labels (rows or columns) that are not present in both objects. With the default drop=False, those labels are included in the result with NaN.

Can corrwith() handle missing data?

Yes. corrwith() excludes missing values (NaN) pair by pair when computing each coefficient, regardless of the drop setting. The drop parameter only affects labels that are not shared by both objects.

What does the return value of corrwith() represent?

The return value of corrwith() is a Series containing the correlation coefficients between corresponding columns or rows of the input DataFrames or between a DataFrame and a Series.

Conclusion

In this article, I have explained the Pandas DataFrame corrwith() function, covering its syntax, parameters, and usage, and how it returns a Series. This Series contains the correlation coefficients between the corresponding columns or rows of the input DataFrames or between the DataFrame and the Series.

Happy Learning!!
