The pandas DataFrame.corr() function returns the pairwise correlation between two or more numeric columns in a DataFrame. Correlation measures the strength and direction of the linear relationship between two quantitative variables. It is denoted by r and takes values between -1 and +1. A positive value of r indicates a positive association, and a negative value of r indicates a negative association.
In this article, I will explain how to get the correlation between two columns with several examples.
Key Points –
- The corr() method calculates the Pearson correlation coefficient between numerical columns in a DataFrame, which measures the linear relationship between columns.
- Calling df.corr() on a DataFrame returns a correlation matrix that shows the pairwise correlation between all numeric columns.
- By default, corr() calculates the Pearson correlation, which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
- Pandas corr() also supports other correlation methods, such as Kendall and Spearman, which can be specified using the method parameter.
- You can compute the correlation between specific columns by selecting them before applying corr(), or by indexing into the correlation matrix.
- The corr() method only works with numeric data. In pandas 2.0 and later, numeric_only defaults to False, so pass numeric_only=True (or select the numeric columns first) when the DataFrame also contains non-numeric columns. Earlier versions dropped non-numeric columns by default.
Quick Examples of Correlation of Columns
If you are in a hurry, below are some quick examples of computing the correlation between pandas columns.
# Quick examples of correlation of columns
# Example 1: Correlation between two columns of DataFrame
corr=df['Fee'].corr(df['Discount'])
# Example 2: Correlation between all the numeric columns of DataFrame
df2=df.corr(numeric_only=True)
# Example 3: Convert a column to float64 before computing the correlation
df['Fee']=df['Fee'].astype(np.float64)
Now, let's create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Discount, and Duration.
# Create a pandas DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Discount':[1500,1000,1200,800,1300],
    'Duration':['30days','50days','30days','35days','40days']
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n",df)
DataFrame corr() correlation Syntax
Following is the syntax of the DataFrame.corr() function.
# Syntax of corr() (signature as of pandas 2.x)
DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)
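To make the min_periods parameter more concrete, here is a minimal sketch. The column names a and b and their values are made up for illustration and are not part of the article's dataset. min_periods sets the minimum number of paired, non-missing observations a column pair needs before a correlation is reported.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "b" has one missing value,
# so only 4 of the 5 rows are usable for the (a, b) pair.
df_nan = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, np.nan, 8.0, 10.0],
})

# Default min_periods=1: the 4 complete pairs are enough.
default_r = df_nan.corr().loc["a", "b"]

# min_periods=5: the pair has only 4 complete observations, so the result is NaN.
strict_r = df_nan.corr(min_periods=5).loc["a", "b"]

print(default_r)  # close to 1.0, because b is exactly 2 * a on the complete rows
print(strict_r)   # nan
```

Raising min_periods is one way to avoid reporting a correlation that rests on very few overlapping observations.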
Correlation Between Two Columns of DataFrame
You can get the correlation between two columns of a pandas DataFrame by calling the Series.corr() method on one column and passing the other column as the argument. DataFrame.corr(), by contrast, finds the pairwise correlation of all columns in the DataFrame. For example, let's see what the correlation is between Fee and Discount.
# Correlation between two columns of DataFrame
corr=df['Fee'].corr(df['Discount'])
print("Correlation between two columns:\n",corr)
This returns approximately -0.351123.
We get about -0.35 as the correlation between Fee and Discount. This indicates that the two columns are weakly correlated in a negative direction.
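As a cross-check, the same value can be reproduced from the definition of Pearson's r (the sum of the products of the deviations, divided by the square root of the product of the summed squared deviations). The sketch below uses the article's Fee and Discount values. The variable names are my own.

```python
import numpy as np
import pandas as pd

fee = pd.Series([22000, 25000, 23000, 24000, 26000], name="Fee")
discount = pd.Series([1500, 1000, 1200, 800, 1300], name="Discount")

# Deviations of each value from its column mean
fee_dev = fee - fee.mean()
discount_dev = discount - discount.mean()

# Pearson's r computed from the definition
manual_r = (fee_dev * discount_dev).sum() / np.sqrt(
    (fee_dev ** 2).sum() * (discount_dev ** 2).sum()
)

# The same quantity from pandas
pandas_r = fee.corr(discount)

print(round(float(manual_r), 6))  # -0.351123
print(round(float(pandas_r), 6))  # -0.351123
```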
Correlation Between All the Columns of DataFrame
You can also get the correlation between all the columns of a pandas DataFrame. To do this, apply the corr() function to the entire DataFrame, which returns a DataFrame of pairwise correlation values between all the numeric columns.
Note that by default, the corr() function returns Pearson's correlation.
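# Correlation between all the numeric columns of DataFrame.
# numeric_only=True skips the string columns (needed in pandas 2.0 and later).
df2=df.corr(numeric_only=True)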
print(df2)
Yields below output.
# Output:
               Fee  Discount
Fee       1.000000 -0.351123
Discount -0.351123  1.000000
When applied to an entire DataFrame, the corr() function returns a DataFrame of pairwise correlations between the columns. We can see that there's a weak negative correlation between Fee and Discount. Also, notice that the values on the diagonal are all 1, because each column is perfectly correlated with itself.
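If only one pair is of interest, a single value can be pulled out of the matrix, or the columns can be selected before calling corr(). The following is a minimal sketch of both options using the article's data. It assumes pandas 1.5 or later, where the numeric_only argument is available.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1500, 1000, 1200, 800, 1300],
})

# Option 1: select the columns of interest first, then correlate
pair_matrix = df[["Fee", "Discount"]].corr()

# Option 2: compute the full numeric matrix, then index into it by label
full_matrix = df.corr(numeric_only=True)
fee_vs_discount = float(full_matrix.loc["Fee", "Discount"])

print(pair_matrix.shape)          # (2, 2)
print(round(fee_vs_discount, 6))  # -0.351123
```

Because the matrix is symmetric, full_matrix.loc["Discount", "Fee"] gives the same value.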
Other Example
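In this example, the Fee column is converted to float before the correlation is computed. corr() works on columns with a numeric dtype. If a column holds numbers but is stored under a non-numeric dtype (for example, as strings), convert it to np.float64 first so that it is reliably included in the calculation.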
# Other example: convert the Fee column to float64
df['Fee']=df['Fee'].astype(np.float64)
print(df)
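To show why the dtype matters, here is a small sketch under an assumed scenario that is not in the article's data: the Fee values arrive as text (for instance, from a file with quoted numbers). It also assumes pandas 1.5 or later for the numeric_only argument.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the data in which Fee is stored as strings
df_text = pd.DataFrame({
    "Fee": ["22000", "25000", "23000", "24000", "26000"],
    "Discount": [1500, 1000, 1200, 800, 1300],
})

# Fee is not numeric yet, so numeric_only=True leaves it out
before = df_text.corr(numeric_only=True)

# After the conversion, Fee takes part in the correlation matrix
df_text["Fee"] = df_text["Fee"].astype(np.float64)
after = df_text.corr(numeric_only=True)

print(list(before.columns))  # ['Discount']
print(list(after.columns))   # ['Fee', 'Discount']
```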
Complete Examples of Correlation Between Two Columns
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Discount':[1500,1000,1200,800,1300],
    'Duration':['30days','50days','30days','35days','40days']
}
df = pd.DataFrame(technologies)
print(df)
# Correlation between two columns of DataFrame
corr=df['Fee'].corr(df['Discount'])
print(corr)
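# Correlation between all the numeric columns of DataFrame
# numeric_only=True skips the string columns (needed in pandas 2.0 and later)
df2=df.corr(numeric_only=True)
print(df2)
# Other example: convert Fee to float64, then recompute the matrix
df['Fee']=df['Fee'].astype(np.float64)
df2=df.corr(numeric_only=True)
print(df2)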
Frequently Asked Questions on Pandas Correlation of Columns
To calculate the correlation matrix for all columns in a DataFrame using Pandas, you can use the corr() method.
The values in the correlation matrix range from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation. Values in between represent varying degrees of correlation.
You can calculate the correlation between specific columns in a DataFrame in Pandas. After calculating the correlation matrix using the corr() method, you can extract individual correlation values for specific columns or column pairs.
Visualizing the correlation matrix is a common practice to gain insights into the relationships between different columns in a DataFrame. You can use data visualization libraries like Seaborn and Matplotlib to create a heatmap of the correlation matrix.
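One possible way to draw such a heatmap is sketched below. To keep dependencies small, it uses Matplotlib alone rather than Seaborn, and the colormap, backend, and layout choices are my own assumptions rather than anything prescribed by the article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df_num = pd.DataFrame({
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1500, 1000, 1200, 800, 1300],
})
corr_matrix = df_num.corr()

fig, ax = plt.subplots()
# Fix the color scale to the full range of r so colors are comparable across plots
image = ax.imshow(corr_matrix.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")

# Label both axes with the column names
ax.set_xticks(range(len(corr_matrix.columns)))
ax.set_xticklabels(corr_matrix.columns)
ax.set_yticks(range(len(corr_matrix.index)))
ax.set_yticklabels(corr_matrix.index)

# Write the rounded coefficient inside each cell
for i in range(corr_matrix.shape[0]):
    for j in range(corr_matrix.shape[1]):
        ax.text(j, i, f"{corr_matrix.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Pearson r")
# Call fig.savefig(...) with a path of your choice, or plt.show() in an interactive session
```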
The corr() method supports different correlation methods, such as Pearson (default), Kendall, and Spearman. You can specify the method using the method parameter.
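As a brief illustration of the method parameter, the sketch below computes the Spearman correlation for the article's Fee and Discount columns and checks it against the Pearson correlation of the ranked values, which is how Spearman's coefficient is defined. Kendall is not shown here.

```python
import pandas as pd

df_num = pd.DataFrame({
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1500, 1000, 1200, 800, 1300],
})

# Spearman correlation via the method parameter
spearman = df_num.corr(method="spearman")

# Equivalent by definition: Pearson correlation of the ranks
pearson_on_ranks = df_num.rank().corr(method="pearson")

spearman_r = float(spearman.loc["Fee", "Discount"])
ranked_r = float(pearson_on_ranks.loc["Fee", "Discount"])

print(round(spearman_r, 6))  # -0.3
print(round(ranked_r, 6))    # -0.3
```

Spearman depends only on the ordering of the values, so it is less sensitive to outliers than Pearson.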
Correlation cannot be used to imply causation between variables. Correlation measures the statistical association or relationship between two variables, indicating whether changes in one variable are associated with changes in another. However, correlation does not provide information about the direction of causation or whether one variable causes the other.
Conclusion
In this article, you have learned how to get the correlation between two columns using Series.corr() and between all numeric columns using DataFrame.corr(). The resulting values range from -1 to 1 and can be positive or negative, as the examples show.