Pandas DataFrame corr() Method

In Pandas, the corr() method is used to calculate pairwise correlation of columns, excluding NA/null values. This method is useful when you want to understand the linear relationship between numerical variables in your DataFrame.

Quick Examples of Pandas DataFrame corr()

If you are in a hurry, below are some quick examples of the Pandas DataFrame corr() method.


# Quick examples of pandas dataframe corr()

# Calculate the correlation matrix
# numeric_only=True skips non-numeric columns
correlation_matrix = df.corr(numeric_only=True)

# Compute the correlation matrix
# Using pearson correlation
corr_matrix = df.corr(method='pearson', numeric_only=True)

# Calculate Spearman correlation coefficients
corr_matrix = df.corr(method='spearman', numeric_only=True)

# Calculate Kendall's tau correlation coefficients
corr_matrix = df.corr(method='kendall', numeric_only=True)

Pandas DataFrame corr() Introduction

Let’s look at the syntax of the Pandas DataFrame corr() method.


# Syntax of Pandas DataFrame corr()
DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)

Parameters of the DataFrame corr()

Following are the parameters of the DataFrame corr() method.

method – This parameter specifies the method of correlation to be used. It accepts one of the three strings below, or a callable that takes two 1-D arrays and returns a float.
- pearson – Default method, computes the standard Pearson correlation coefficient.
- kendall – Computes the Kendall Tau correlation coefficient.
- spearman – Computes the Spearman rank correlation coefficient.
min_periods – This parameter specifies the minimum number of observations required per pair of columns to have a valid result. If not provided, it defaults to 1. According to the pandas documentation, it currently applies only to Pearson and Spearman correlation.
numeric_only – Specifies whether only float, int, or boolean columns should be used in the operation. By default, it is set to False (since pandas 2.0), so pass numeric_only=True when the DataFrame contains non-numeric columns.
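
To make min_periods more concrete, here is a minimal sketch. It uses a small made-up DataFrame (the names sample, a, and b and their values are illustrative only, not part of the example used in this article). Column b is missing in two rows, so only three rows have a value in both columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is missing in two of the five rows
sample = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, np.nan, np.nan, 8.0, 10.0],
})

# Three complete pairs are available, so min_periods=3 gives a coefficient
loose = sample.corr(min_periods=3)

# Requiring four complete pairs leaves the a/b entry as NaN
strict = sample.corr(min_periods=4)

print(loose.loc['a', 'b'])
print(strict.loc['a', 'b'])
```

Raising min_periods is one way to avoid reading too much into a coefficient that rests on only a handful of overlapping rows.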

Return Value

It returns a DataFrame containing the pairwise correlation coefficients of the columns.
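
As a quick illustration of the return value, the sketch below builds a DataFrame from the Fee and Discount values used in this article and inspects what corr() gives back. The variable names result and fee_vs_discount are chosen here just for illustration.

```python
import pandas as pd

# Same Fee and Discount values as the example used in this article
df = pd.DataFrame({
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000],
})

result = df.corr()

# The result is a square DataFrame labelled by column name on both axes
print(type(result).__name__)
print(result.shape)

# A single coefficient can be read with label-based indexing
fee_vs_discount = result.loc['Fee', 'Discount']
print(round(fee_vs_discount, 6))
```

Because the result is an ordinary DataFrame, it can be sliced, rounded, or passed to a plotting function like any other.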

Basic Correlation Matrix

To compute and display the basic correlation matrix for the given DataFrame, you can use the corr() method from Pandas.

First, let’s create a Pandas DataFrame using data from a Python dictionary, where the columns are Courses, Fee, and Discount.


# Create a pandas DataFrame
import pandas as pd
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000]
}
df = pd.DataFrame(technologies)
print("Original DataFrame:\n", df)

# Output:
# Original DataFrame:
#     Courses    Fee  Discount
# 0    Spark  25000       800
# 1  PySpark  22000      1300
# 2   Hadoop  26000      2000
# 3   Python  23000      1500
# 4   Pandas  30000      1000

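Now that we have our DataFrame, we can use the corr() method to compute the correlation matrix for the numerical columns (Fee and Discount). Because Courses holds text and numeric_only defaults to False, pass numeric_only=True so that corr() skips it; otherwise pandas 2.0 and later raise an error.


# Calculate the correlation matrix for the numeric columns
correlation_matrix = df.corr(numeric_only=True)
print("Correlation matrix:\n", correlation_matrix)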

In the above example, the correlation matrix shows the pairwise correlation between Fee and Discount. The diagonal values are 1.000000 because every column is perfectly correlated with itself. The off-diagonal value of -0.210225 indicates a weak negative linear relationship between Fee and Discount.
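
When only one pair of columns matters, Series.corr() is a lighter alternative to building the full matrix. The sketch below reuses the Fee and Discount values from this article; the variable names r and direction are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000],
})

# Series.corr() returns a single float for one pair of columns
r = df['Fee'].corr(df['Discount'])
print(round(r, 6))

# The sign gives the direction of the linear relationship
direction = 'negative' if r < 0 else 'positive'
print(direction)
```

The magnitude of the coefficient is what tells you how strong the relationship is; the sign only tells you its direction.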

Using Pearson Correlation

Alternatively, to explicitly use the Pearson correlation method when computing the correlation matrix in Pandas, specify method='pearson' within the corr() method.


# Compute the correlation matrix 
# Using pearson correlation
corr_matrix = df.corr(method='pearson', numeric_only=True)
print("Pearson Correlation Coefficients:\n", corr_matrix)

# Output:
# Pearson Correlation Coefficients:
#                 Fee  Discount
# Fee       1.000000 -0.210225
# Discount -0.210225  1.000000

In the above example, we use the corr() method on the DataFrame df (with columns Courses, Fee, and Discount) with method='pearson' to compute the Pearson correlation coefficients. The resulting corr_matrix is a DataFrame where each cell holds the correlation coefficient between two columns.
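
To see where the Pearson coefficient comes from, the following sketch recomputes it by hand with NumPy from the textbook formula (sum of cross-products of the deviations divided by the square root of the product of the sums of squared deviations). This is only an illustration of the formula, not how pandas implements it internally; the variable names are chosen for this example.

```python
import numpy as np

fee = np.array([25000, 22000, 26000, 23000, 30000], dtype=float)
discount = np.array([800, 1300, 2000, 1500, 1000], dtype=float)

# Deviations of each value from its column mean
fee_dev = fee - fee.mean()
discount_dev = discount - discount.mean()

# Pearson r = sum of cross-products / sqrt(product of sums of squares)
r = (fee_dev * discount_dev).sum() / np.sqrt(
    (fee_dev ** 2).sum() * (discount_dev ** 2).sum()
)
print(round(r, 6))
```

The printed value should agree with the off-diagonal entry of the Pearson matrix shown above.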

Using Spearman Correlation

To calculate Spearman’s rank correlation coefficients for all columns in a Pandas DataFrame, you can use the corr() method with method='spearman'.


# Calculate Spearman correlation coefficients
corr_matrix = df.corr(method='spearman', numeric_only=True)
print("Spearman correlation coefficients:\n", corr_matrix)

# Output:
# Spearman correlation coefficients:
#            Fee  Discount
# Fee       1.0      -0.1
# Discount -0.1       1.0

In the above example, we use the corr() method on the DataFrame df with method='spearman' to compute Spearman’s rank correlation coefficients. The resulting corr_matrix is a DataFrame where each cell holds the Spearman’s rank correlation coefficient between two columns. Here, the value of -0.1 indicates almost no monotonic relationship between Fee and Discount.
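
Spearman’s coefficient can be understood as the Pearson coefficient applied to ranks instead of raw values. The sketch below checks this on the Fee and Discount values from this article; the names ranks and pearson_on_ranks are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000],
})

# Replace each value with its rank within the column (1 = smallest)
ranks = df.rank()

# Pearson correlation of the ranks ...
pearson_on_ranks = ranks.corr(method='pearson')

# ... should agree with Spearman correlation of the raw values
spearman = df.corr(method='spearman')

print(pearson_on_ranks.loc['Fee', 'Discount'])
print(spearman.loc['Fee', 'Discount'])
```

Because only the ordering of the values matters, Spearman’s coefficient is less sensitive to outliers than Pearson’s.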

Using Kendall Correlation

To calculate Kendall’s tau correlation coefficients for all columns in a Pandas DataFrame, you can use the corr() method with method='kendall'.


# Calculate Kendall's tau correlation coefficients
corr_matrix = df.corr(method='kendall', numeric_only=True)
print("Kendall's tau correlation coefficients:\n", corr_matrix)

# Output:
# Kendall's tau correlation coefficients:
#            Fee  Discount
# Fee       1.0       0.0
# Discount  0.0       1.0

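In the above example, we use the corr() method on the DataFrame df with method='kendall' to compute Kendall’s tau correlation coefficients. A value of 0.0 means that the pairs of rows where Fee and Discount move in the same direction are exactly balanced by the pairs where they move in opposite directions.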
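
Kendall’s tau is based on counting concordant and discordant pairs of rows. The sketch below does that count by hand for the Fee and Discount values from this article. It uses the simple formula that assumes no tied values (which holds for this data); the tie-corrected variant used by library implementations is assumed to reduce to the same result in that case.

```python
from itertools import combinations

fee = [25000, 22000, 26000, 23000, 30000]
discount = [800, 1300, 2000, 1500, 1000]

concordant = 0
discordant = 0

# Compare every pair of rows once
for i, j in combinations(range(len(fee)), 2):
    # Positive product: both columns move in the same direction
    product = (fee[i] - fee[j]) * (discount[i] - discount[j])
    if product > 0:
        concordant += 1
    elif product < 0:
        discordant += 1

n_pairs = len(fee) * (len(fee) - 1) // 2

# Without ties, tau is the normalised difference of the two counts
tau = (concordant - discordant) / n_pairs
print(concordant, discordant, tau)
```

With five rows there are ten pairs to compare, and the two counts come out equal, which is why the coefficient is zero.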

Frequently Asked Questions on Pandas DataFrame corr() Method

What is the corr() method used for?

The corr() method is used to calculate the correlation between the columns of a DataFrame in pandas, which is a popular data manipulation library in Python. Correlation measures the strength and direction of the linear relationship between two variables.

How do I calculate Pearson correlation coefficients using corr()?

To calculate Pearson correlation coefficients, simply use the corr() method without specifying the method parameter, as 'pearson' is the default.

How do I calculate Kendall’s tau correlation coefficients using corr()?

Specify the method parameter as 'kendall' to calculate Kendall’s tau correlation coefficients.

How does corr() handle missing values (NaNs)?

The corr() method automatically excludes NA/null values in the computation. The exclusion is pairwise: for each pair of columns, only the rows where both columns have a value are used, so different cells of the matrix can be based on different numbers of observations.
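
The sketch below illustrates this behaviour on a small made-up DataFrame (the names sample, a, b, and c and their values are hypothetical). It compares the a/b entry of the full matrix with the result of first dropping the rows where either a or b is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in different rows
sample = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0, 5.0],
    'b': [2.0, 1.0, 3.0, np.nan, 6.0],
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
})

# corr() drops incomplete rows separately for each pair of columns
pairwise = sample.corr()

# Equivalent manual step for the a/b pair: keep rows where both are present
ab_only = sample[['a', 'b']].dropna().corr()

print(pairwise.loc['a', 'b'])
print(ab_only.loc['a', 'b'])
```

Both printed values should match, since the a/b coefficient is computed from the three rows where neither column is missing.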

Can I use the corr() method on a DataFrame with non-numeric columns?

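The corr() method computes correlations only for numeric columns. Since pandas 2.0, numeric_only defaults to False, so a DataFrame containing text columns raises an error unless you pass numeric_only=True or select the numeric columns first.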
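
Here is a minimal sketch of two ways to handle a text column, using the DataFrame from this article. It assumes a pandas version that supports the numeric_only argument of corr() (1.5 or later, to the best of our knowledge); the names by_flag and by_select are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000],
})

# Option 1: let corr() skip non-numeric columns
by_flag = df.corr(numeric_only=True)

# Option 2: select numeric columns first, then correlate
by_select = df.select_dtypes(include='number').corr()

print(list(by_flag.columns))
print(by_select)
```

Both options leave Courses out of the result; the second one also works when you want explicit control over which columns are included.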

Conclusion

In this article, you have learned the Pandas DataFrame corr() method, including its syntax, parameters, and usage, and how to find the correlation between DataFrame columns using the Pearson, Kendall, and Spearman methods.

Happy Learning!!

Reference

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html