  • Post last modified:July 31, 2024
Pandas DataFrame corr() Method

In Pandas, the corr() method is used to calculate pairwise correlation of columns, excluding NA/null values. This method is useful when you want to understand the linear relationship between numerical variables in your DataFrame.


In this article, I will explain the Pandas DataFrame corr() method, including its syntax, parameters, and usage, and how it returns a DataFrame showing the correlation coefficients between the columns.

Key Points –

  • The corr() method is used to compute the pairwise correlation of columns in a DataFrame, excluding NA/null values.
  • It supports three correlation methods: pearson (default), kendall, and spearman.
  • The method returns a DataFrame containing the correlation coefficients between the columns.
  • The method parameter specifies the correlation method, and the min_periods parameter specifies the minimum number of observations required per pair of columns to produce a valid result.
  • NA/null values are excluded pairwise: for each pair of columns, only the rows where both values are present are used in the calculation.

Quick Examples of Pandas DataFrame corr()

If you are in a hurry, below are some quick examples of the Pandas DataFrame corr() method.


# Quick examples of pandas dataframe corr()

# Calculate the correlation matrix
# (numeric_only=True skips non-numeric columns such as Courses)
correlation_matrix = df.corr(numeric_only=True)

# Compute the correlation matrix
# Using pearson correlation
corr_matrix = df.corr(method='pearson', numeric_only=True)

# Calculate Spearman correlation coefficients
corr_matrix = df.corr(method='spearman', numeric_only=True)

# Calculate Kendall's tau correlation coefficients
corr_matrix = df.corr(method='kendall', numeric_only=True)

Pandas DataFrame corr() Introduction

Let’s look at the syntax of the Pandas DataFrame corr() method.


# Syntax of Pandas DataFrame corr()
DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)

Parameters of the DataFrame corr()

Following are the parameters of the DataFrame corr() function.

  • method – This parameter specifies the correlation method to use. It accepts one of three built-in names, or a callable that takes two 1-D arrays and returns a float.
    • pearson – Default method, computes the standard Pearson correlation coefficient.
    • kendall – Computes the Kendall Tau correlation coefficient.
    • spearman – Computes the Spearman rank correlation coefficient.
  • min_periods – This parameter specifies the minimum number of observations required per pair of columns to have a valid result. If not provided, it defaults to 1.
  • numeric_only – Specifies whether only float, int, or boolean columns should be used in the operation. By default, it is set to False (pandas 2.0 and later), so a DataFrame with non-numeric columns such as strings raises an error unless you pass numeric_only=True.
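
As a minimal sketch of how min_periods behaves, the example below uses a small made-up DataFrame (the columns A and B and their values are assumptions for illustration, not data from this article) in which only two rows have a value in both columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only rows 0 and 1 have values in both A and B
df_nan = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, np.nan, 5.0],
    'B': [2.0, 4.0, np.nan, 8.0, np.nan]
})

# Default min_periods=1: the two complete pairs are enough for a result
loose = df_nan.corr()

# Require at least 3 paired observations: A vs B becomes NaN
strict = df_nan.corr(min_periods=3)

print("min_periods=1:\n", loose)
print("min_periods=3:\n", strict)
```

Raising min_periods is a simple way to avoid reporting a coefficient that rests on only a handful of overlapping observations.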

Return Value

It returns a DataFrame containing the pairwise correlation coefficients of the columns.
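
Because the result is an ordinary DataFrame, it can be inspected and indexed like any other. The short sketch below assumes a numeric-only DataFrame named df_num (a name chosen here for illustration) built from the Fee and Discount values used later in this article.

```python
import pandas as pd

# Numeric columns only, so no extra arguments are needed
df_num = pd.DataFrame({
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000]
})

result = df_num.corr()

print(type(result))   # a pandas DataFrame
print(result.shape)   # one row and one column per numeric column

# Look up a single coefficient by row and column label
fee_vs_discount = result.loc['Fee', 'Discount']
print(round(fee_vs_discount, 6))
```

The matrix is symmetric, so result.loc['Fee', 'Discount'] and result.loc['Discount', 'Fee'] hold the same value.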

Basic Correlation Matrix

To compute and display the basic correlation matrix for the given DataFrame, you can use the corr() method from Pandas.

First, let’s create a Pandas DataFrame using data from a Python dictionary, where the columns are Courses, Fee, and Discount.


# Create a pandas DataFrame
import pandas as pd

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000]
}
df = pd.DataFrame(technologies)
print("Original DataFrame:\n", df)

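Yields the output below.

# Output:
# Original DataFrame:
#     Courses    Fee  Discount
# 0    Spark  25000       800
# 1  PySpark  22000      1300
# 2   Hadoop  26000      2000
# 3   Python  23000      1500
# 4   Pandas  30000      1000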

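Now that we have our DataFrame, we can use the corr() method to compute the correlation matrix for the numerical columns (Fee and Discount). Because Courses holds strings and numeric_only defaults to False, pass numeric_only=True so that this column is skipped; otherwise pandas 2.0 and later raises a ValueError.


# Calculate the correlation matrix
# numeric_only=True excludes the string column Courses
correlation_matrix = df.corr(numeric_only=True)
print("Correlation matrix:\n", correlation_matrix)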

In the above example, the correlation matrix shows the pairwise correlation between Fee and Discount. The diagonal values are 1.000000 because each column is perfectly correlated with itself. The off-diagonal value of -0.210225 indicates a weak negative linear relationship between Fee and Discount: courses with a higher fee tend, slightly, to have a lower discount.
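
To see where the value -0.210225 comes from, here is a small sketch that applies the textbook Pearson formula by hand and compares it with pandas. The variable names (fee_dev, r_manual, and so on) are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

fee = pd.Series([25000, 22000, 26000, 23000, 30000])
discount = pd.Series([800, 1300, 2000, 1500, 1000])

# Deviations of each value from its column mean
fee_dev = fee - fee.mean()
discount_dev = discount - discount.mean()

# Pearson r = sum of cross-products / sqrt(product of the sums of squares)
r_manual = (fee_dev * discount_dev).sum() / np.sqrt(
    (fee_dev ** 2).sum() * (discount_dev ** 2).sum()
)

# Series.corr() computes the same coefficient for a single pair of columns
r_pandas = fee.corr(discount)

print(round(r_manual, 6), round(r_pandas, 6))
```

Both numbers should agree with the off-diagonal entry of the correlation matrix.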


Using Pearson Correlation

Alternatively, to explicitly use the Pearson correlation method when computing the correlation matrix in Pandas, specify method='pearson' in the corr() method.


# Compute the correlation matrix
# Using pearson correlation (numeric_only=True skips the Courses column)
corr_matrix = df.corr(method='pearson', numeric_only=True)
print("Pearson Correlation Coefficients:\n", corr_matrix)

# Output:
# Pearson Correlation Coefficients:
#                 Fee  Discount
# Fee       1.000000 -0.210225
# Discount -0.210225  1.000000

In the above example, we use the DataFrame df with three columns: Courses, Fee, and Discount. We call the corr() method on df with method='pearson' to compute the Pearson correlation coefficients for the numeric columns. The resulting corr_matrix is a DataFrame where each cell holds the correlation coefficient between two columns; it matches the default output because Pearson is the default method.

Using Spearman Correlation

To calculate Spearman’s rank correlation coefficients for all columns in a Pandas DataFrame, you can use the corr() method with method='spearman'.


# Calculate Spearman correlation coefficients
# numeric_only=True skips the string column Courses
corr_matrix = df.corr(method='spearman', numeric_only=True)
print("Spearman correlation coefficients:\n", corr_matrix)

# Output:
# Spearman correlation coefficients:
#            Fee  Discount
# Fee       1.0      -0.1
# Discount -0.1       1.0

In the above example, we use the corr() method on the DataFrame df with method='spearman' to compute Spearman’s rank correlation coefficients. Each cell of the resulting corr_matrix holds the Spearman’s rank correlation coefficient between two columns. The value of -0.1 between Fee and Discount indicates a very weak negative monotonic relationship.
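
Spearman’s coefficient is the Pearson coefficient computed on the ranks of the values rather than on the values themselves. The sketch below checks this on the Fee and Discount data; df_num and ranks are names introduced here for illustration.

```python
import pandas as pd

df_num = pd.DataFrame({
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000]
})

# Replace each value with its rank within its column (1 = smallest)
ranks = df_num.rank()

# Pearson correlation of the ranks ...
pearson_on_ranks = ranks.corr(method='pearson')

# ... should match Spearman correlation of the original values
spearman = df_num.corr(method='spearman')

print(ranks)
print(pearson_on_ranks)
print(spearman)
```

Because only the ordering matters, Spearman’s coefficient is less sensitive to outliers such as the 30000 fee than Pearson’s.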

Using Kendall Correlation

To calculate Kendall’s tau correlation coefficients for all columns in a Pandas DataFrame, you can use the corr() method with method='kendall'.


# Calculate Kendall's tau correlation coefficients
# numeric_only=True skips the string column Courses
corr_matrix = df.corr(method='kendall', numeric_only=True)
print("Kendall's tau correlation coefficients:\n", corr_matrix)

# Output:
# Kendall's tau correlation coefficients:
#            Fee  Discount
# Fee       1.0       0.0
# Discount  0.0       1.0

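In the above example, we use the corr() method on the DataFrame df with method='kendall' to compute Kendall’s tau correlation coefficients. The value of 0.0 between Fee and Discount means that the concordant and discordant pairs of observations balance out, so Kendall’s tau finds no monotonic association.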
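
The 0.0 can be reproduced by counting pairs by hand. The following is a rough sketch in plain Python using the simple (concordant - discordant) / total-pairs formula, which assumes there are no tied values (true for this data); it is not how pandas computes the coefficient internally.

```python
from itertools import combinations

fee = [25000, 22000, 26000, 23000, 30000]
discount = [800, 1300, 2000, 1500, 1000]

concordant = 0
discordant = 0

# Compare every pair of rows once
for i, j in combinations(range(len(fee)), 2):
    direction = (fee[i] - fee[j]) * (discount[i] - discount[j])
    if direction > 0:
        concordant += 1   # both columns move in the same direction
    elif direction < 0:
        discordant += 1   # the columns move in opposite directions

n_pairs = len(fee) * (len(fee) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

print(concordant, discordant, tau_manual)
```

As far as I know, pandas delegates method='kendall' to SciPy, so SciPy needs to be installed for the corr() call above to work.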

Frequently Asked Questions on Pandas DataFrame corr() Method

What is the corr() method used for?

The corr() method is used to calculate the correlation between the columns of a DataFrame in pandas, which is a popular data manipulation library in Python. Correlation measures the strength and direction of the linear relationship between two variables.

How do I calculate Pearson correlation coefficients using corr()?

To calculate Pearson correlation coefficients, simply use the corr() method without specifying the method parameter, as it is the default.

How do I calculate Kendall’s tau correlation coefficients using corr()?

Specify the method parameter as 'kendall' to calculate Kendall’s tau correlation coefficients.

How does corr() handle missing values (NaNs)?

The corr() method automatically excludes NA/null values from the computation. The exclusion is pairwise: for each pair of columns, a row is skipped only if at least one of the two values in that row is missing, so different pairs can be based on different rows.
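
As an illustration, the sketch below compares this default behavior with dropping incomplete rows first. The DataFrame df_nan and its columns A, B, and C are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only column C has missing values
df_nan = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [2.0, 1.0, 4.0, 3.0, 6.0],
    'C': [np.nan, 1.0, 2.0, np.nan, 3.0]
})

# Default: A vs B uses all 5 rows, A vs C uses the 3 rows where C is present
pairwise = df_nan.corr()

# Dropping incomplete rows first: every pair uses the same 3 rows
listwise = df_nan.dropna().corr()

print("corr():\n", pairwise)
print("dropna().corr():\n", listwise)
```

The A vs B coefficient differs between the two results, while A vs C is the same, because the missing values in C only affect the pairs that involve C.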

Can I use the corr() method on a DataFrame with non-numeric columns?

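The corr() method only computes correlations for numeric columns. In pandas 2.0 and later, non-numeric columns are not excluded automatically because numeric_only defaults to False, so calling corr() on a DataFrame with string columns raises an error. Pass numeric_only=True, or select the numeric columns first, to exclude them.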
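
A few equivalent ways to handle a non-numeric column are sketched below, using the same data as this article. The names corr_a, corr_b, and corr_c are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000]
})

# Option 1: let corr() skip non-numeric columns
corr_a = df.corr(numeric_only=True)

# Option 2: keep only numeric columns before calling corr()
corr_b = df.select_dtypes(include='number').corr()

# Option 3: name the columns explicitly
corr_c = df[['Fee', 'Discount']].corr()

print(corr_a)
```

All three return the same 2x2 matrix for Fee and Discount; which one to use is mostly a matter of how explicit you want the column selection to be.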

Conclusion

In this article, you have learned the Pandas DataFrame corr() method, including its syntax, parameters, and usage, and how to find the correlation between DataFrame columns using the Pearson, Kendall, and Spearman methods.

Happy Learning!!
