In pandas, the DataFrame corrwith() method computes the pairwise correlation between the rows or columns of two DataFrame objects. It is particularly useful when you want to compare two datasets by measuring how strongly their corresponding rows or columns are correlated.
In this article, I will explain the Pandas DataFrame corrwith() method, covering its syntax, parameters, and usage, and how it returns a Series containing the correlation coefficients. This makes it easy to interpret the degree of correlation between corresponding columns or rows of the input DataFrames.
Key Points –
- The corrwith() method computes the pairwise correlation between the rows or columns of two DataFrame objects and returns a Series of correlation coefficients.
- The drop parameter, when set to True, omits from the result any labels that are not present in both objects. Missing values (NaN) inside the data are skipped pair by pair regardless of this setting.
- corrwith() can also compute the correlation of each column or row of a DataFrame with a given Series, offering flexibility in comparing datasets.
- When working with large datasets, be mindful of performance, as computing correlations can be computationally intensive, especially with the kendall and spearman methods.
Pandas DataFrame corrwith() Introduction
Following is the syntax of the Pandas DataFrame corrwith() method.
# Syntax of Pandas DataFrame corrwith()
DataFrame.corrwith(other, axis=0, drop=False, method='pearson')
Parameters of the DataFrame corrwith()
Following are the parameters of the DataFrame corrwith() method.
- other – DataFrame or Series. The object to compute the correlation with.
- axis – {0 or 'index', 1 or 'columns'}, default 0.
  - If 0 or 'index', compute the correlation column-wise.
  - If 1 or 'columns', compute the correlation row-wise.
- drop – bool, default False. If True, drop missing indices from the result, that is, labels that do not appear in both objects.
- method – {'pearson', 'kendall', 'spearman'}, default 'pearson'.
  - pearson – Standard correlation coefficient.
  - kendall – Kendall Tau correlation coefficient.
  - spearman – Spearman rank correlation.
Return Value
It returns a Series of correlation coefficients, indexed by the column labels (axis=0) or the row labels (axis=1) that were compared.
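As a quick illustration of the axis parameter and the return value, here is a minimal sketch. The frames left and right are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical frames used only for illustration
left = pd.DataFrame({"x": [1, 2, 3, 4], "y": [4, 3, 2, 1], "z": [1, 3, 2, 4]})
right = pd.DataFrame({"x": [2, 4, 6, 8], "y": [1, 2, 3, 4], "z": [4, 1, 3, 2]})

by_column = left.corrwith(right)        # axis=0: one coefficient per column label
by_row = left.corrwith(right, axis=1)   # axis=1: one coefficient per row label

print(type(by_column).__name__)   # Series
print(list(by_column.index))      # ['x', 'y', 'z']
print(list(by_row.index))         # [0, 1, 2, 3]
```

The index of the returned Series tells you which axis was compared: column labels for the default axis=0, row labels for axis=1.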
Usage of Pandas DataFrame corrwith() Method
The pandas.DataFrame.corrwith() method computes the pairwise correlation between rows or columns of two DataFrame objects, or between a DataFrame and a Series. It is useful in scenarios such as data analysis, feature selection, and anomaly detection.
To run some examples of the pandas DataFrame corrwith() method, let's create two Pandas DataFrames from Python dictionaries, each with columns A, B, and C.
# Create DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
print("Create first DataFrame:\n",df)
df1 = pd.DataFrame({'A':[2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})
print("Create Second DataFrame:\n",df1)
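Yields below output.
# Output:
# Create first DataFrame:
#     A  B  C
# 0   5  2  3
# 1  10  4  5
# 2  15  6  7
# 3  20  8  9
# Create Second DataFrame:
#    A   B   C
# 0  2   5  15
# 1  4   7   3
# 2  6   9  12
# 3  8  11   8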
To compute the correlation between corresponding columns of two DataFrames, call the corrwith() method on one DataFrame and pass the other as the argument.
# Correlation between corresponding columns
column_correlation = df.corrwith(df1)
print("Column-wise correlation:\n", column_correlation)
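In the above example, the corrwith() method returns a Series with the correlation coefficient for each corresponding column in df and df1. Columns A and B are perfectly linearly related across the two DataFrames, so their coefficients are 1.0. This example yields the below output.
# Output:
# Column-wise correlation:
# A 1.000000
# B 1.000000
# C -0.298142
# dtype: float64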
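To make the column-wise behavior concrete, the sketch below rebuilds the same result by calling Series.corr() once per shared column. The name manual is just an illustrative variable, and this describes the observable result, not necessarily how pandas computes it internally.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

# One Pearson coefficient per column label shared by both DataFrames
manual = pd.Series({col: df[col].corr(df1[col]) for col in df.columns})
builtin = df.corrwith(df1)

# The two results should agree up to floating-point noise
print((manual - builtin).abs().max() < 1e-9)
```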
Using Row-wise Correlation
Alternatively, to calculate the row-wise correlation between two DataFrames in pandas, you can use the corrwith() method with axis=1. This approach computes the correlation between corresponding rows across the DataFrames.
# Compute row-wise correlation
row_correlation = df.corrwith(df1, axis=1)
print("Row-wise correlation:\n", row_correlation)
# Output:
# Row-wise correlation:
# 0 -0.400732
# 1 -0.423415
# 2 -0.810885
# 3 -0.563621
# dtype: float64
Here,
- df.corrwith(df1, axis=1) calculates the correlation between corresponding rows of df and df1.
- The result, row_correlation, is a Series indexed by row label. Each value is the Pearson correlation between a row of df and the same row of df1, computed across the columns A, B, and C.
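One way to convince yourself of what axis=1 does is to transpose both DataFrames and run the default column-wise comparison. The following is a small sketch of that check; via_transpose is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

row_wise = df.corrwith(df1, axis=1)

# After transposing, the original rows become columns,
# so the default axis=0 compares the same pairs of values
via_transpose = df.T.corrwith(df1.T)

print((row_wise - via_transpose).abs().max() < 1e-9)
```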
Using Kendall Tau Correlation
To compute the Kendall Tau correlation between corresponding rows or columns of two DataFrames in pandas, pass method='kendall' to the corrwith() method. Kendall Tau correlation is a measure of ordinal association between two measured quantities.
# Compute row-wise Kendall Tau correlation
kendall_correlation = df.corrwith(df1, axis=1, method='kendall')
print("Row-wise Kendall Tau correlation:\n", kendall_correlation)
# Output:
# Row-wise Kendall Tau correlation:
# 0 -0.333333
# 1 -0.333333
# 2 -0.333333
# 3 -0.816497
# dtype: float64
Here,
- df.corrwith(df1, axis=1, method='kendall') calculates the Kendall Tau correlation between corresponding rows of df and df1.
- The method='kendall' parameter specifies that Kendall Tau correlation should be used.
- The result, kendall_correlation, is a Series where each value represents the Kendall Tau correlation coefficient between the corresponding rows of df and df1.
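To see where these numbers come from, here is a hand-rolled sketch that counts concordant and discordant pairs for each row. It assumes the tau-b variant, which adjusts for ties, and kendall_tau_b is a hypothetical helper written for this illustration, not a pandas function.

```python
import math
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

def kendall_tau_b(x, y):
    """Illustrative O(n^2) Kendall tau-b; assumes tau-b is the variant wanted."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied in both: counted nowhere
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1           # both move in the same direction
        else:
            discordant += 1           # they move in opposite directions
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")

# Apply the helper to each pair of corresponding rows
manual = pd.Series({i: kendall_tau_b(df.loc[i], df1.loc[i]) for i in df.index})
print(manual.round(6))
```

For the first three rows there is one concordant and two discordant pairs, giving (1 - 2) / 3. In the last row, df1 has the value 8 twice, and the tie correction gives -2 / sqrt(6), which rounds to -0.816497.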
Spearman Rank Correlation
To calculate the Spearman rank correlation between corresponding rows or columns of two DataFrames in pandas, pass method='spearman' to the corrwith() method. Spearman correlation evaluates the monotonic relationship between two variables and is based on the ranks of the data rather than the raw values.
# Compute column-wise Spearman rank correlation
spearman_correlation = df.corrwith(df1, method='spearman')
print("Column-wise Spearman rank correlation:\n", spearman_correlation)
# Output:
# Column-wise Spearman rank correlation:
# A 1.0
# B 1.0
# C -0.4
# dtype: float64
Here,
- df.corrwith(df1, method='spearman') calculates the Spearman rank correlation between corresponding columns of df and df1.
- The method='spearman' parameter specifies that Spearman rank correlation should be used.
- The result, spearman_correlation, is a Series where each value represents the Spearman rank correlation coefficient between the corresponding columns of df and df1.
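Because Spearman correlation is the Pearson correlation of the ranks, the same numbers can be obtained by ranking each column first and then using the default method. The following is a minimal sketch of that equivalence; ranked is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})

# rank() replaces each value by its rank within its column;
# the default Pearson correlation is then computed on those ranks
ranked = df.rank().corrwith(df1.rank())
print(ranked.round(6))
```

For column C, the ranks are [1, 2, 3, 4] in df and [4, 1, 3, 2] in df1, which gives the -0.4 shown in the output above.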
Correlation with a Series
Similarly, to compute the correlation between each column of a DataFrame and a Series in pandas, you can pass the Series to the corrwith() method. This allows you to assess how each column in the DataFrame relates to the values in the Series.
# Create DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
# Create a Series
ser = pd.Series([3, 7, 12, 8])
# Compute correlation with Series
series_correlation = df.corrwith(ser, axis=0)
print("Correlation with Series:\n", series_correlation)
# Output:
# Correlation with Series:
# A 0.69843
# B 0.69843
# C 0.69843
# dtype: float64
Here,
- df.corrwith(ser, axis=0) computes the correlation between each column of the DataFrame df and the Series ser.
- The axis=0 parameter specifies that the correlation should be computed column-wise.
- ser is a Series with values [3, 7, 12, 8].
- The result, series_correlation, is a Series where each value represents the correlation coefficient between the corresponding column of df and the Series ser. All three values are equal because A, B, and C are each an increasing linear function of the same sequence, and Pearson correlation is unchanged by such transformations.
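One detail worth keeping in mind is that the Series is matched to the DataFrame by index label, not by position. The sketch below illustrates this; reversed_ser and shifted are hypothetical variables made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 10, 15, 20], 'B': [2, 4, 6, 8], 'C': [3, 5, 7, 9]})
ser = pd.Series([3, 7, 12, 8])

# Same labels in reverse order: values are paired by label, so the result is unchanged
reversed_ser = ser.iloc[::-1]
same_result = df.corrwith(reversed_ser)
print(same_result.round(5))

# Labels 10..13 do not exist in df, so there are no pairs to correlate
shifted = pd.Series([3, 7, 12, 8], index=[10, 11, 12, 13])
no_overlap = df.corrwith(shifted)
print(no_overlap)
```

If a Series comes from another source, it can be worth checking that its index matches the DataFrame's index before calling corrwith(); otherwise the result may be all NaN.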
Dropping Labels with Missing Data
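Finally, the drop parameter of the corrwith() method controls which labels appear in the result. With the default drop=False, a label that exists in only one of the two objects is kept in the result with a value of NaN; with drop=True, such labels are omitted. Missing values (NaN) inside the data are handled separately: they are skipped pair by pair when each coefficient is computed, whatever the value of drop.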
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
df = pd.DataFrame({'A': [5, 10, np.nan, 20], 'B': [2, 4, 6, 8], 'C': [3, np.nan, 7, 9]})
# Create another DataFrame
df1 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 11], 'C': [15, 3, 12, 8]})
# Compute column-wise correlation; NaN values are skipped pair by pair
df2 = df.corrwith(df1, drop=True)
print("Column-wise correlation with drop=True:\n", df2)
# Output:
# Column-wise correlation with drop=True:
# A 1.000000
# B 1.000000
# C -0.963123
# dtype: float64
Here,
- df is a DataFrame with missing values (np.nan) in columns A and C.
- df1 is another DataFrame without missing values.
- df.corrwith(df1, drop=True) computes the column-wise correlation between df and df1. For each column, rows where either value is NaN are left out of the calculation, so A and C are each computed from three rows and B from four.
- The drop=True parameter does not remove columns that contain NaN values; it only omits labels that are absent from one of the two objects. Both DataFrames share the columns A, B, and C here, so drop=False gives the same result.
- The result, df2, is a Series where each value represents the correlation coefficient between the corresponding columns of df and df1.
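The effect of drop only becomes visible when the two objects do not share all of their labels. Here is a minimal sketch with hypothetical frames left and right, where column C exists only in the first and column D only in the second.

```python
import pandas as pd

# Hypothetical frames: 'C' exists only in left, 'D' only in right
left = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1], 'C': [1, 3, 2, 4]})
right = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [1, 2, 3, 4], 'D': [9, 7, 8, 6]})

kept = left.corrwith(right)                 # drop=False: unmatched labels stay as NaN
dropped = left.corrwith(right, drop=True)   # only labels present in both objects

print(kept)
print(dropped)
```

With the default setting, C and D appear in the result as NaN because there is nothing to pair them with; with drop=True, only A and B remain.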
Frequently Asked Questions on Pandas DataFrame corrwith() Method
The corrwith() method computes the correlation coefficients between corresponding columns or rows of two DataFrames, or between a DataFrame and a Series.
You can use corrwith() by calling it on a DataFrame and passing another DataFrame or Series as an argument. It calculates correlations either column-wise (axis=0) or row-wise (axis=1), based on your choice.
Setting drop=True in corrwith() omits from the result any labels (rows or columns) that are not present in both objects. With the default drop=False, such labels appear in the result as NaN.
corrwith() handles missing data (NaN values) by skipping, for each pair being correlated, the positions where either value is missing. This happens regardless of the drop setting.
The return value of corrwith() is a Series containing the correlation coefficients between corresponding columns or rows of the input DataFrames, or between a DataFrame and a Series.
Conclusion
In this article, I have explained the Pandas DataFrame corrwith() method, covering its syntax, parameters, and usage, and how it returns a Series. This Series contains the correlation coefficients between the corresponding columns or rows of the input DataFrames, or between the DataFrame and the Series.
Happy Learning!!