  • Post category:Pandas
  • Post last modified:November 11, 2024

The pandas.DataFrame.corr() function computes the correlation between two or more columns of a DataFrame. Correlation measures the strength and direction of the relationship between two quantitative variables. It is denoted by r and takes values between -1 and +1. A positive value of r indicates a positive association, and a negative value of r indicates a negative association.


In this article, I will explain how to get the correlation between two columns with several examples.

Key Points –

  • The corr() method is used to calculate the Pearson correlation coefficient between numerical columns in a DataFrame, which measures the linear relationship between columns.
  • Calling df.corr() on a DataFrame returns a correlation matrix that shows the pairwise correlation between all numeric columns.
  • By default, corr() calculates the Pearson correlation, which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
  • Pandas corr() also supports other correlation methods, such as Kendall and Spearman, which can be specified using the method parameter.
  • You can compute the correlation between specific columns by selecting them before applying corr(), or by indexing into the correlation matrix.
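  • The corr() method only works with numeric data. Since pandas 2.0, numeric_only defaults to False, so a DataFrame with non-numeric columns raises an error unless you pass numeric_only=True or select the numeric columns first; older versions silently ignored non-numeric columns.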
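As a minimal sketch of the last point, assuming a small illustrative DataFrame with one text column and two numeric columns, the two usual ways of restricting corr() to numeric data give the same matrix. The variable names m1 and m2 are only for illustration.

```python
import pandas as pd

# Illustrative data: one text column and two numeric columns
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1500, 1000, 1200, 800, 1300],
})

# Option 1: ask corr() to skip non-numeric columns
m1 = df.corr(numeric_only=True)

# Option 2: keep only the numeric columns yourself, then call corr()
m2 = df.select_dtypes(include='number').corr()

print(m1)
print(list(m2.columns))
```

Both options leave Courses out of the result, so the matrix only contains Fee and Discount.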

Quick Examples of Correlation of Columns

If you are in a hurry, below are some quick examples of computing the correlation between columns in pandas.


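# Quick examples of correlation of columns

# Example 1: Correlation between two columns of DataFrame
corr = df['Fee'].corr(df['Discount'])

# Example 2: Correlation between all the numeric columns of DataFrame
# (numeric_only=True skips the text columns; needed since pandas 2.0)
df2 = df.corr(numeric_only=True)

# Example 3: Convert a column to float64 before computing the correlation
df['Fee'] = df['Fee'].astype(np.float64)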

Now, let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Discount, and Duration.


# Create a pandas DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Discount':[1500,1000,1200,800,1300],
    'Duration':['30days','50days','30days','35days','40days']
          }
df = pd.DataFrame(technologies)
print("Create DataFrame:\n",df)

This prints the DataFrame, which has five rows and the columns Courses, Fee, Discount, and Duration.

DataFrame corr() correlation Syntax

Following is the syntax of the DataFrame.corr() function.


# Syntax of corr() (signature as of pandas 2.x)
DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)
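Here, method selects the coefficient ('pearson', 'kendall', 'spearman', or a callable), and min_periods is the minimum number of observations a pair of columns must share to produce a valid result. The following is a small sketch of min_periods, assuming hypothetical columns a and b where b has missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 'b' has two missing values
data = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, np.nan, np.nan, 8.0, 11.0],
})

# Rows with a missing value in either column are ignored for that pair,
# so 'a' vs 'b' is computed from the 3 complete rows
loose = data.corr(min_periods=1)
print(loose)

# Requiring at least 4 complete rows turns the 'a' vs 'b' entry into NaN
strict = data.corr(min_periods=4)
print(strict)
```

Raising min_periods is a simple way to avoid reporting coefficients that rest on very few observations.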

Correlation Between Two Columns of DataFrame

You can get the correlation between two columns of a pandas DataFrame by calling Series.corr() on one column and passing the other column as the argument; this returns a single coefficient. DataFrame.corr(), by contrast, computes the pairwise correlation of all columns in the DataFrame. For example, let’s see what the correlation between Fee and Discount is.


# Correlation between two columns of DataFrame
corr=df['Fee'].corr(df['Discount'])
print("Correlation between two columns:\n",corr)

Running this prints a value of about -0.35 for the correlation between Fee and Discount. This indicates that the two columns are weakly correlated in a negative direction: in this small sample, higher fees tend to go with slightly lower discounts.
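To see where this number comes from, here is a short sketch that computes Pearson's r by hand from the same Fee and Discount values and compares it with Series.corr(). The names dx, dy, and r_manual are only illustrative.

```python
import numpy as np
import pandas as pd

fee = pd.Series([22000, 25000, 23000, 24000, 26000])
discount = pd.Series([1500, 1000, 1200, 800, 1300])

# Deviations of each value from its column mean
dx = fee - fee.mean()
dy = discount - discount.mean()

# Pearson's r: sum of products of deviations, divided by the
# square root of the product of the sums of squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r_manual, 6))            # -0.351123
print(round(fee.corr(discount), 6))  # -0.351123
```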

Correlation Between All the Columns of DataFrame

You can also get the correlation between all the columns of a pandas DataFrame. For this, apply corr() function on the entire DataFrame which will result in a DataFrame of pair-wise correlation values between all the columns.

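Note that by default, the corr() function returns Pearson’s correlation. Because this DataFrame also has text columns (Courses and Duration), pass numeric_only=True; since pandas 2.0, the call raises an error otherwise.


# Correlation between all the numeric columns of DataFrame.
df2=df.corr(numeric_only=True)
print(df2)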

Yields below output.


# Output:
               Fee  Discount
Fee       1.000000 -0.351123
Discount -0.351123  1.000000

When applied to an entire DataFrame, the corr() function returns a DataFrame of pair-wise correlations between the columns. We can see that there’s a weak negative correlation between Fee and Discount. Also, notice that the values on the diagonal are 1s because each column is perfectly correlated with itself.

Other Example

corr() needs columns with a numeric dtype. If a column such as Fee holds numbers but is stored with a non-numeric dtype (for example object), it can be left out of the result. In that case, convert the column to np.float64 first.


# Other example: convert Fee to float64 so corr() can use it
df['Fee'] = df['Fee'].astype(np.float64)
print(df)
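The following sketch illustrates this under one assumption: a hypothetical version of the data in which Fee was loaded as text. With numeric_only=True the text column is dropped from the matrix, and it reappears after the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical case: Fee was loaded as text, so its dtype is not numeric
df = pd.DataFrame({
    'Fee': ['22000', '25000', '23000', '24000', '26000'],
    'Discount': [1500, 1000, 1200, 800, 1300],
})
print(df.dtypes)

# With numeric_only=True the text column is left out of the matrix
before = df.corr(numeric_only=True)
print(list(before.columns))   # ['Discount']

# Convert the column to float64 so corr() can use it
df['Fee'] = df['Fee'].astype(np.float64)
after = df.corr(numeric_only=True)
print(list(after.columns))    # ['Fee', 'Discount']
```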

Complete Examples of Correlation Between Two Columns


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Discount':[1500,1000,1200,800,1300],
    'Duration':['30days','50days','30days','35days','40days']
          }
df = pd.DataFrame(technologies)
print(df)

# Correlation between two columns of DataFrame
corr=df['Fee'].corr(df['Discount'])
print(corr)

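# Correlation between all the numeric columns of DataFrame
df2=df.corr(numeric_only=True)
print(df2)

# Other example: convert Fee to float64, then recompute
df['Fee'] = df['Fee'].astype(np.float64)
df2 = df.corr(numeric_only=True)
print(df2)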

Frequently Asked Questions on Pandas Correlation of Columns

How do I calculate the correlation matrix for all columns in a DataFrame using Pandas?

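To calculate the correlation matrix for all columns in a DataFrame using Pandas, call the corr() method on the DataFrame, for example df.corr(). If the DataFrame contains non-numeric columns, pass numeric_only=True so that only the numeric columns are used.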

How can I interpret the values in the correlation matrix?

The values in the correlation matrix range from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation. Values in between represent varying degrees of correlation.

Can I calculate the correlation between specific columns in a DataFrame?

You can calculate the correlation between specific columns in a DataFrame in Pandas. After calculating the correlation matrix using the corr() method, you can extract individual correlation values for specific columns or column pairs.
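As a brief sketch using the same sample data as this article, the value for one pair can be looked up in the matrix by label, and it agrees with the result of correlating the two columns directly. The names from_matrix and direct are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1500, 1000, 1200, 800, 1300],
    'Duration': ['30days', '50days', '30days', '35days', '40days'],
})

# Route 1: compute the full matrix, then look up one pair by label
matrix = df.corr(numeric_only=True)
from_matrix = matrix.loc['Fee', 'Discount']

# Route 2: correlate the two columns directly
direct = df['Fee'].corr(df['Discount'])

print(round(from_matrix, 6), round(direct, 6))
```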

What if I want to visualize the correlation matrix?

Visualizing the correlation matrix is a common practice to gain insights into the relationships between different columns in a DataFrame. You can use data visualization libraries like Seaborn and Matplotlib to create a heatmap of the correlation matrix.
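A minimal heatmap sketch using only Matplotlib is shown below. It assumes Matplotlib is installed, uses the non-interactive Agg backend so it runs without a display, and writes to a hypothetical file name, correlation_heatmap.png. Seaborn's heatmap() function can draw a similar plot from the same matrix.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1500, 1000, 1200, 800, 1300],
})
matrix = df.corr()

fig, ax = plt.subplots()
# Fix the color scale to the full range of a correlation coefficient
image = ax.imshow(matrix.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")

# Label both axes with the column names
ax.set_xticks(range(len(matrix.columns)))
ax.set_xticklabels(matrix.columns)
ax.set_yticks(range(len(matrix.index)))
ax.set_yticklabels(matrix.index)

fig.colorbar(image, ax=ax, label="Pearson r")
fig.savefig("correlation_heatmap.png")  # hypothetical output file name
```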

Are there other correlation methods available in Pandas?

The corr method supports different correlation methods, such as Pearson (default), Kendall, and Spearman. You can specify the method using the method parameter.
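For instance, the sketch below compares Pearson and Spearman on this article's Fee and Discount values, and checks that Spearman matches Pearson computed on the ranks. Kendall is selected the same way with method='kendall'; as far as I know, pandas relies on SciPy for it, so it is left out here.

```python
import pandas as pd

df = pd.DataFrame({
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1500, 1000, 1200, 800, 1300],
})

# Same call, different coefficient
pearson = df.corr(method='pearson').loc['Fee', 'Discount']
spearman = df.corr(method='spearman').loc['Fee', 'Discount']

# Spearman's coefficient is Pearson's r applied to the ranks
ranked = df.rank()
spearman_by_rank = ranked['Fee'].corr(ranked['Discount'])

print(round(pearson, 6))           # -0.351123
print(round(spearman, 6))          # -0.3
print(round(spearman_by_rank, 6))  # -0.3
```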

Can correlation be used to imply causation between variables?

Correlation cannot be used to imply causation between variables. Correlation measures the statistical association or relationship between two variables, indicating whether changes in one variable are associated with changes in another. However, correlation does not provide information about the direction of causation or whether one variable causes the other.

Conclusion

In this article, you have learned how to get the correlation between two columns with Series.corr() and between all numeric columns with DataFrame.corr(), and how to read the positive and negative values these methods return.
