In pandas, the describe() method generates descriptive statistics for a DataFrame. It gives a quick overview of each numerical column: count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles. It can also summarize non-numerical data with the count, the number of unique values, the most frequent value (top), and its frequency (freq).
In this article, I will explain the Pandas DataFrame describe() method, covering its syntax, parameters, and usage, and how it returns the summary statistics of the provided Series or DataFrame.
Key Points –
- The describe() method generates descriptive statistics, including count, mean, standard deviation, minimum, quartiles, and maximum.
- By default, it computes these statistics for numerical columns in the DataFrame.
- With include='all', describe() also covers non-numerical columns, reporting the count, the number of unique values, the most frequent value (top), and its frequency (freq).
- The percentiles parameter lets you choose which percentiles to compute.
- The describe() method handles numerical, categorical, datetime, and boolean data, making it a versatile tool for data exploration.
Pandas DataFrame describe() Introduction
Following is the syntax of the Pandas DataFrame describe() method.
# Syntax of Pandas DataFrame describe()
DataFrame.describe(percentiles=None, include=None, exclude=None)
Parameters of the DataFrame describe()
Following are the parameters of the DataFrame describe() method.
- percentiles – A list-like of numbers between 0 and 1 specifying which percentiles to include. The default is [0.25, 0.5, 0.75].
- include – The string 'all' or a list-like of data types to include. The default is None, which limits the result to numeric columns.
- exclude – A list-like of data types to exclude. The default is None.
- datetime_is_numeric – A boolean indicating whether to treat datetime values as numeric. It was available only in pandas 1.1 through 1.5 and was removed in pandas 2.0, where datetime data is always summarized as numeric.
Return Value
It returns a DataFrame of descriptive statistics for the specified columns.
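The return type follows the object you call describe() on. The short sketch below, which reuses the Fee and Discount values from the examples in this article, illustrates that a DataFrame yields a DataFrame while a single column (a Series) yields a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000],
})

frame_stats = df.describe()          # DataFrame in, DataFrame out
series_stats = df['Fee'].describe()  # Series in, Series out

print(type(frame_stats).__name__)
print(type(series_stats).__name__)
```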
Usage of Pandas DataFrame describe() Method
The describe() method in pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. It is especially useful for getting a quick overview of the numerical data in a DataFrame.
To run some examples of the pandas DataFrame describe() method, let’s create a Pandas DataFrame using data from a dictionary.
# Create a pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000]
}
df = pd.DataFrame(technologies)
print("Original DataFrame:\n", df)
Now, let’s generate the summary statistics for the DataFrame by calling the describe() method.
# Get descriptive statistics
df2 = df.describe()
print("Descriptive Statistics:\n", df2)
In the above example, describe() provides summary statistics for each numerical column (Fee and Discount): count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. The non-numeric Courses column is skipped by default.
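Since the result of describe() is itself a DataFrame, individual statistics can be read back with ordinary label-based indexing. The following is a minimal sketch using the same sample data; the variable names mean_fee and max_discount are only illustrative.

```python
import pandas as pd

# Same sample data as the first example in this article
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000],
})

stats = df.describe()

# Rows are statistic names, columns are the described columns
mean_fee = stats.loc['mean', 'Fee']
max_discount = stats.loc['max', 'Discount']

print("Mean fee:", mean_fee)
print("Max discount:", max_discount)
```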
Mixed DataFrame (Numerical and Categorical)
A mixed DataFrame contains both numerical and categorical (non-numeric) columns. To summarize every column of such a DataFrame, pass include='all' to describe().
# Get descriptive statistics for all columns
df2 = df.describe(include='all')
print("Get descriptive statistics for all columns:\n", df2)
# Output:
# Get descriptive statistics for all columns:
#         Courses         Fee     Discount
# count         5      5.0000     5.000000
# unique        5         NaN          NaN
# top     PySpark         NaN          NaN
# freq          1         NaN          NaN
# mean        NaN  25200.0000  1320.000000
# std         NaN   3114.4823   465.832588
# min         NaN  22000.0000   800.000000
# 25%         NaN  23000.0000  1000.000000
# 50%         NaN  25000.0000  1300.000000
# 75%         NaN  26000.0000  1500.000000
# max         NaN  30000.0000  2000.000000
In the above example, describe(include='all') provides summary statistics for all columns, both numerical and categorical. For categorical data it reports the count, the number of unique values, the most frequent value (top), and its frequency (freq). For numerical data it reports the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. Statistics that do not apply to a column’s type are shown as NaN. Because every course here appears exactly once, the value reported as top is an arbitrary pick among the ties.
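The top and freq rows are easier to read when a value actually repeats. Here is a small sketch with a hypothetical list of course names (not part of the article's dataset) in which "Spark" occurs three times.

```python
import pandas as pd

# Hypothetical course enrollments with a repeated value
courses = pd.Series(["Spark", "Hadoop", "Spark", "Python", "Spark"])

summary = courses.describe()
print(summary)

# 'top' is the most frequent value and 'freq' is how often it occurs
top_course = summary['top']
top_count = summary['freq']
```

With a clear winner, top is no longer arbitrary: it is the value with the highest count, and freq is that count.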
Creating a DataFrame with DateTime Data
A DataFrame with datetime data is one that includes columns containing datetime values. You can obtain descriptive statistics for such a DataFrame with describe() as well.
import pandas as pd
import numpy as np
# Generate datetime data
dates = pd.date_range('20240101', periods=5)
data = {
    'Date': dates,
    'Value1': np.random.randn(5),
    'Value2': np.random.randint(1, 100, 5)
}
# Create the DataFrame
df = pd.DataFrame(data)
# Get descriptive statistics
stats = df.describe()
print("Descriptive Statistics:\n", stats)
# Output:
# Descriptive Statistics:
#          Value1     Value2
# count  5.000000   5.000000
# mean  -0.260440  46.400000
# std    1.074673  22.875751
# min   -1.877304  18.000000
# 25%   -0.616400  37.000000
# 50%   -0.165707  40.000000
# 75%    0.448233  59.000000
# max    0.908978  78.000000
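In the above example, describe() provides summary statistics for the numerical columns Value1 and Value2: count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. The output shown omits the Date column, which matches the default behavior of pandas versions before 2.0, where datetime columns were summarized as numeric only with datetime_is_numeric=True. From pandas 2.0 onward, datetime data is treated as numeric, so Date appears in the default output as well. Because Value1 and Value2 are generated randomly, your numbers will differ from those shown.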
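If you want the summary limited to Value1 and Value2 regardless of the pandas version, one option is to exclude the datetime dtype explicitly. The sketch below assumes a seeded NumPy generator (an addition for repeatability, not part of the original example).

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is repeatable (an illustrative addition)
rng = np.random.default_rng(0)

df = pd.DataFrame({
    'Date': pd.date_range('20240101', periods=5),
    'Value1': rng.standard_normal(5),
    'Value2': rng.integers(1, 100, 5),
})

# Leave datetime columns out of the summary explicitly
numeric_stats = df.describe(exclude=['datetime'])
print(numeric_stats)
```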
Describe Excluding NaN Values
The describe() method skips NaN values automatically when it computes each statistic. The following example builds a DataFrame with missing values and describes it.
# Create a pandas DataFrame
import pandas as pd
import numpy as np
technologies = {
    'Courses': ["Spark", np.nan, "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, np.nan, 30000],
    'Discount': [800, 1300, np.nan, 1500, 1000]
}
df = pd.DataFrame(technologies)
# Describe every column except object-dtype ones; NaN values are skipped
df2 = df.describe(exclude=[object])
print("Describe excluding object columns:\n", df2)
# The default describe() also skips NaN values
df2 = df.describe()
print("Describe excluding nan values:\n", df2)
# Output:
# Describe excluding nan values:
#                 Fee     Discount
# count      4.000000     4.000000
# mean   25750.000000  1150.000000
# std     3304.037934   310.912635
# min    22000.000000   800.000000
# 25%    24250.000000   950.000000
# 50%    25500.000000  1150.000000
# 75%    27000.000000  1350.000000
# max    30000.000000  1500.000000
In the above examples, df.describe(exclude=[object]) leaves out the object-dtype Courses column, so only the numeric columns are described; the plain df.describe() call produces the same result here. Use the built-in object type, since the np.object alias found in older code was removed in NumPy 1.24. In both calls, NaN values are automatically excluded from the calculation of count, mean, std, and the other statistics, which is why count is 4 rather than 5.
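To see how the count row relates to missing data, you can compare it with the number of rows. This is a minimal sketch on the numeric columns from the example above; the names non_null and missing are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Fee': [25000, 22000, 26000, np.nan, 30000],
    'Discount': [800, 1300, np.nan, 1500, 1000],
})

stats = df.describe()

# describe() counts only non-NaN entries, matching DataFrame.count()
non_null = df.count()
missing = len(df) - stats.loc['count']

print("Non-NaN counts:\n", non_null)
print("Missing per column:\n", missing)
```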
FAQ on Pandas DataFrame describe() Method
The describe() method in pandas generates descriptive statistics of the DataFrame. By default, it summarizes numerical columns with the count, mean, standard deviation, minimum, and maximum, as well as the 25th, 50th, and 75th percentiles.
The describe() method excludes NaN values when calculating the summary statistics. The count reflects the number of non-NaN entries.
By default, describe() provides summary statistics for numeric columns. However, you can include non-numeric columns by specifying include='all'.
To describe only specific types of columns in a pandas DataFrame, you can use the include and exclude parameters of the describe() method.
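As a rough illustration of include and exclude, the sketch below splits the sample DataFrame into a numeric summary and a non-numeric summary by passing np.number to each parameter in turn.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000],
})

# Keep only numeric columns
numeric_only = df.describe(include=[np.number])

# Drop numeric columns, leaving the non-numeric ones
non_numeric = df.describe(exclude=[np.number])

print(numeric_only)
print(non_numeric)
```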
You can get statistics for specific percentiles using the percentiles parameter of the describe() method. By default, describe() calculates the 25th, 50th, and 75th percentiles. However, you can specify any percentiles you want.
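For instance, here is a small sketch that asks for the 10th and 90th percentiles of the sample Fee and Discount columns. The row labels are built from the requested fractions, so 0.1 becomes '10%' and 0.9 becomes '90%'.

```python
import pandas as pd

df = pd.DataFrame({
    'Fee': [25000, 22000, 26000, 23000, 30000],
    'Discount': [800, 1300, 2000, 1500, 1000],
})

# Request the 10th and 90th percentiles instead of the default quartiles
stats = df.describe(percentiles=[0.1, 0.9])
print(stats)

p10_fee = stats.loc['10%', 'Fee']
p90_fee = stats.loc['90%', 'Fee']
```

The percentile values are linearly interpolated between neighboring data points, which is why they need not coincide with an actual fee in the column.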
Conclusion
In this article, you learned about the Pandas DataFrame describe() method, including its syntax, parameters, and usage. You also saw how it returns a DataFrame containing summary statistics of the data, with various statistical measures based on the column types and specified parameters.
Happy Learning!!