The pandas.DataFrame.corr() function computes the pairwise correlation between two or more columns in a DataFrame. Correlation measures the strength and direction of the linear relationship between two quantitative variables. It is denoted by r and takes values between -1 and +1. A positive value of r indicates a positive association, and a negative value of r indicates a negative association.
In this article, I will explain how to get the correlation between two columns with several examples.
1. Quick Examples of Correlation of Columns
If you are in a hurry, below are some quick examples of calculating the correlation between two columns in pandas.
# Quick examples of correlation of columns
# Example 1: Correlation between two columns of DataFrame
corr=df['Fee'].corr(df['Discount'])
# Example 2: Correlation between all the numeric columns of DataFrame
df2=df.corr(numeric_only=True)
# Example 3: Convert a column to float64 before computing correlation
df['Fee']=df['Fee'].astype(np.float64)
Now, let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Discount, and Duration.
# Create a pandas DataFrame
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Discount':[1500,1000,1200,800,1300],
'Duration':['30days','50days','30days','35days','40days']
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n",df)
This prints the DataFrame, which has five rows and four columns.
2. DataFrame corr() Syntax
Following is the syntax of the DataFrame.corr() function.
# Syntax of corr() (the numeric_only parameter is available in pandas 1.5 and later)
DataFrame.corr(method='pearson', min_periods=1, numeric_only=False)
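As a minimal sketch of how min_periods behaves, the snippet below uses a small hypothetical DataFrame (the columns a and b and their values are made up for illustration and are not part of this article's data). min_periods sets the minimum number of rows in which both columns of a pair have a value; pairs with fewer such rows come back as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps: only rows 0, 1 and 3 have both a and b present
demo = pd.DataFrame({
    'a': [1, 2, 3, 4, np.nan],
    'b': [2, 4, np.nan, 8, 10],
})

# Default min_periods=1: the pair (a, b) is computed from its 3 complete rows
loose = demo.corr(method='pearson', min_periods=1)

# Require at least 4 complete rows per pair: (a, b) has only 3, so it becomes NaN
strict = demo.corr(method='pearson', min_periods=4)

print(loose)
print(strict)
```

The diagonal stays at 1.0 in both results because each column has four non-missing values of its own.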
3. Correlation Between Two Columns of DataFrame
You can get the correlation between two columns of a pandas DataFrame by calling Series.corr() on one column and passing the other column as the argument. DataFrame.corr(), by contrast, computes the pairwise correlation of all columns in the DataFrame. For example, let’s see what the correlation between Fee and Discount is.
# Correlation between two columns of DataFrame
corr=df['Fee'].corr(df['Discount'])
print("Correlation between two columns:\n",corr)
This yields approximately -0.35 as the correlation between Fee and Discount, which indicates that the two columns are weakly correlated in a negative direction.
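To see where this number comes from, here is a minimal sketch that recomputes Pearson's r by hand from the Fee and Discount values and compares it with Series.corr(). The variable names (fee_dev, discount_dev, r_manual, r_pandas) are illustrative only.

```python
import numpy as np
import pandas as pd

fee = pd.Series([22000, 25000, 23000, 24000, 26000])
discount = pd.Series([1500, 1000, 1200, 800, 1300])

# Deviations of each value from its column mean
fee_dev = fee - fee.mean()
discount_dev = discount - discount.mean()

# Pearson's r: sum of deviation products divided by the
# square root of the product of the summed squared deviations
r_manual = (fee_dev * discount_dev).sum() / np.sqrt(
    (fee_dev ** 2).sum() * (discount_dev ** 2).sum()
)

# The same value computed by pandas
r_pandas = fee.corr(discount)

print(round(r_manual, 6), round(r_pandas, 6))
```

Both values round to -0.351123, matching the result above.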
4. Correlation Between All the Columns of DataFrame
You can also get the correlation between all the columns of a pandas DataFrame. For this, apply the corr() function to the entire DataFrame, which returns a DataFrame of pairwise correlation values between all the numeric columns.
Note that by default, the corr() function returns Pearson’s correlation.
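# Correlation between all the numeric columns of DataFrame.
# numeric_only=True skips the string columns Courses and Duration (pandas 1.5+).
df2=df.corr(numeric_only=True)
print(df2)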
Yields below output.
# Output:
               Fee  Discount
Fee       1.000000 -0.351123
Discount -0.351123  1.000000
When applied to an entire DataFrame, the corr() function returns a DataFrame of pairwise correlations between the columns. We can see that there’s a weak negative correlation between Fee and Discount. Also, notice that the values on the diagonal are all 1. This is because each column is perfectly correlated with itself.
5. Other Example
In this example, assume Fee is stored in a non-numeric format (for example, as Python objects rather than a NumPy numeric dtype). corr(numeric_only=True) skips such a column, so convert it to np.float64 first.
# Convert Fee to float64 so that corr() treats it as numeric
df['Fee']=df['Fee'].astype(np.float64)
print(df)
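The following is a minimal sketch of that situation, assuming the Fee values arrive as Python Decimal objects. This is a hypothetical setup chosen for illustration; the DataFrame used elsewhere in this article already stores Fee as integers.

```python
from decimal import Decimal

import numpy as np
import pandas as pd

# Hypothetical DataFrame where Fee holds Decimal objects (object dtype)
demo = pd.DataFrame({
    'Fee': [Decimal('22000'), Decimal('25000'), Decimal('23000'),
            Decimal('24000'), Decimal('26000')],
    'Discount': [1500, 1000, 1200, 800, 1300],
})
print(demo.dtypes)

# Fee is not a numeric dtype, so it is left out of the result
before = demo.corr(numeric_only=True)
print(before)

# After converting to float64, Fee takes part in the correlation
demo['Fee'] = demo['Fee'].astype(np.float64)
after = demo.corr(numeric_only=True)
print(after)
```

Before the conversion the matrix contains only Discount; afterwards it contains both Fee and Discount.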
6. Complete Examples of Correlation Between Two Columns
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark","PySpark","Spark","Python","PySpark"],
'Fee' :[22000,25000,23000,24000,26000],
'Discount':[1500,1000,1200,800,1300],
'Duration':['30days','50days','30days','35days','40days']
}
df = pd.DataFrame(technologies)
print(df)
# Correlation between two columns of DataFrame.
corr=df['Fee'].corr(df['Discount'])
print(corr)
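# Correlation between all the numeric columns of DataFrame.
df2=df.corr(numeric_only=True)
print(df2)
# Convert Fee to float64, then recompute the correlation matrix.
df['Fee']=df['Fee'].astype(np.float64)
df2=df.corr(numeric_only=True)
print(df2)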
Frequently Asked Questions on Pandas Correlation of Columns
To calculate the correlation matrix for all columns in a DataFrame using Pandas, you can use the corr() method.
The values in the correlation matrix range from -1 to 1. A value of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no correlation. Values in between represent varying degrees of correlation.
You can calculate the correlation between specific columns in a DataFrame in Pandas. After calculating the correlation matrix using the corr() method, you can extract individual correlation values for specific columns or column pairs.
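As a small sketch of that extraction, the snippet below rebuilds this article's DataFrame and pulls values out of the matrix with .loc. The names corr_matrix, fee_discount, fee_column, and subset are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1500, 1000, 1200, 800, 1300],
    'Duration': ['30days', '50days', '30days', '35days', '40days'],
})

# Full correlation matrix of the numeric columns
corr_matrix = df.corr(numeric_only=True)

# A single value: row label first, then column label
fee_discount = corr_matrix.loc['Fee', 'Discount']

# All correlations involving one column, returned as a Series
fee_column = corr_matrix['Fee']

# Alternatively, select the columns of interest before calling corr()
subset = df[['Fee', 'Discount']].corr()

print(fee_discount)
print(fee_column)
```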
Visualizing the correlation matrix is a common practice to gain insights into the relationships between different columns in a DataFrame. You can use data visualization libraries like Seaborn and Matplotlib to create a heatmap of the correlation matrix.
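One possible way to draw such a heatmap is sketched below using Matplotlib alone, with imshow() standing in for a dedicated heatmap function. The output file name correlation_heatmap.png and the coolwarm color map are arbitrary choices for this sketch; Seaborn's heatmap() function is a common alternative.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1500, 1000, 1200, 800, 1300],
})
corr_matrix = df.corr()

fig, ax = plt.subplots()

# Draw the matrix as colored cells; fix the color scale to the range of r
im = ax.imshow(corr_matrix.to_numpy(), cmap='coolwarm', vmin=-1, vmax=1)

# Label both axes with the column names
ax.set_xticks(range(len(corr_matrix.columns)))
ax.set_xticklabels(corr_matrix.columns)
ax.set_yticks(range(len(corr_matrix.index)))
ax.set_yticklabels(corr_matrix.index)

# Write each correlation value inside its cell
for i in range(corr_matrix.shape[0]):
    for j in range(corr_matrix.shape[1]):
        ax.text(j, i, f"{corr_matrix.iloc[i, j]:.2f}", ha='center', va='center')

fig.colorbar(im, ax=ax, label='Pearson r')
fig.savefig('correlation_heatmap.png')  # arbitrary file name for this sketch
```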
The corr() method supports different correlation methods, such as Pearson (default), Kendall, and Spearman. You can specify the method using the method parameter.
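A short sketch of switching methods on this article's Fee and Discount values is shown below. Kendall is left out of the runnable part because, as far as I know, pandas relies on SciPy for it, so it would add a dependency; it is selected the same way with method='kendall'.

```python
import pandas as pd

df = pd.DataFrame({
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1500, 1000, 1200, 800, 1300],
})

# Pearson (default): linear relationship between the raw values
pearson = df.corr(method='pearson').loc['Fee', 'Discount']

# Spearman: Pearson correlation computed on the ranks of the values
spearman = df.corr(method='spearman').loc['Fee', 'Discount']

print(round(pearson, 6))
print(round(spearman, 6))
```

For this data Pearson gives about -0.351 and Spearman gives -0.3, so both methods agree on a weak negative association.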
Correlation cannot be used to imply causation between variables. Correlation measures the statistical association or relationship between two variables, indicating whether changes in one variable are associated with changes in another. However, correlation does not provide information about the direction of causation or whether one variable causes the other.
Conclusion
In this article, you have learned how to get the correlation between two columns with Series.corr() and between all columns with the DataFrame.corr() method, which returns values between -1 and +1 that indicate a negative or positive association.