Pandas DataFrame describe() Method

In pandas, the describe() method is used to generate descriptive statistics of a DataFrame. This method provides a quick overview of the main statistics for each column of numerical data, such as count, mean, standard deviation, minimum, maximum, and the values at the 25th, 50th (median), and 75th percentiles. It can also be used for non-numerical data to provide statistics like count, unique values, top values, and frequency.

Pandas DataFrame describe() Introduction

Following is the syntax of the Pandas DataFrame describe() method.


# Syntax of Pandas DataFrame describe()
DataFrame.describe(percentiles=None, include=None, exclude=None)

Parameters of the DataFrame describe()

Following are the parameters of the DataFrame describe() method:

percentiles – A list-like of numbers between 0 and 1 that specifies which percentiles to include. By default, it includes [0.25, 0.5, 0.75].
include – Either 'all' or a list-like of data types to include. The default, None, describes only the numeric columns.
exclude – A list-like of data types to exclude. The default, None, excludes nothing.
datetime_is_numeric – Not part of the current signature. This boolean was accepted by pandas 1.1 through 1.5 to treat datetime values as numeric. It was removed in pandas 2.0, where datetime columns are always summarized that way.
bool_only – Not a parameter of describe(); passing it raises a TypeError. To describe only boolean columns, pass include=['bool'].

Return Value

It returns a DataFrame of descriptive statistics for the specified columns.
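Because the result is an ordinary DataFrame, individual statistics can be read from it with label-based indexing. The following is a minimal sketch using a small illustrative DataFrame; the variable names stats and mean_fee are chosen here for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Fee": [25000, 22000, 26000, 23000, 30000],
    "Discount": [800, 1300, 2000, 1500, 1000],
})

stats = df.describe()

# The result is itself a DataFrame: statistic names form the index,
# and the described columns form the columns.
print(type(stats).__name__)
print(list(stats.index))

# Read a single statistic with label-based indexing.
mean_fee = stats.loc["mean", "Fee"]
print(mean_fee)  # 25200.0
```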

Usage of Pandas DataFrame describe() Method

The describe() method in pandas is used to generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. This method is especially useful for getting a quick overview of numerical data in a DataFrame.

To run some examples of pandas DataFrame describe() function, let’s create a Pandas DataFrame using data from a dictionary.


# Create a pandas DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[25000,22000,26000,23000,30000],
    'Discount':[800,1300,2000,1500,1000]
          }
df = pd.DataFrame(technologies)
print("Original DataFrame:\n",df)


Now, let’s generate the summary statistics for the DataFrame by calling the describe() function.


# Get descriptive statistics
df2 = df.describe()
print("Descriptive Statistics:\n", df2)

In the above example, describe() provides summary statistics including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values for each numerical column.
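When a DataFrame has many columns, the summary can be easier to read with one row per column. The sketch below transposes the result with .T and rounds it for display; this is a presentation choice, not something describe() does on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [25000, 22000, 26000, 23000, 30000],
    "Discount": [800, 1300, 2000, 1500, 1000],
})

# Transpose so each described column becomes a row, then round for display.
summary = df.describe().T.round(2)
print(summary)

# Only the numeric columns are summarized by default; 'Courses' is skipped.
print(list(summary.index))  # ['Fee', 'Discount']
```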

Mixed DataFrame (Numerical and Categorical)

A Mixed DataFrame typically refers to a pandas DataFrame that contains a mix of numerical and categorical (non-numeric) columns.


# Get descriptive statistics for all columns
df2 = df.describe(include='all')
print("Get descriptive statistics for all columns:\n", df2)

# Output:
# Get descriptive statistics for all columns:
#          Courses         Fee     Discount
# count         5      5.0000     5.000000
# unique        5         NaN          NaN
# top     PySpark         NaN          NaN
# freq          1         NaN          NaN
# mean        NaN  25200.0000  1320.000000
# std         NaN   3114.4823   465.832588
# min         NaN  22000.0000   800.000000
# 25%         NaN  23000.0000  1000.000000
# 50%         NaN  25000.0000  1300.000000
# 75%         NaN  26000.0000  1500.000000
# max         NaN  30000.0000  2000.000000

In the above example, describe(include='all') returns descriptive statistics for every column, both numerical and categorical. For categorical data, it reports the count, the number of unique values, the most frequent value (top), and its frequency (freq). For numerical data, it reports count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values. Statistics that do not apply to a column's type appear as NaN.
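If only the categorical summary is of interest, the numeric columns can be left out so that no NaN padding appears. The sketch below assumes a small illustrative DataFrame in which "Spark" is repeated, so that top and freq are unambiguous.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "Pandas"],
    "Fee": [25000, 22000, 26000, 23000, 30000],
})

# Excluding numeric dtypes leaves only the text column to describe.
text_stats = df.describe(exclude=[np.number])
print(text_stats)

# 'Spark' appears twice, so it is the most frequent value.
print(text_stats.loc["top", "Courses"], text_stats.loc["freq", "Courses"])
```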

Creating a DataFrame with DateTime Data

The DataFrame with DateTime Data refers to a pandas DataFrame that includes columns containing datetime values. When working with datetime data in pandas, you can perform various operations and obtain descriptive statistics specific to datetime values.


import pandas as pd
import numpy as np

# Generate datetime data
dates = pd.date_range('20240101', periods=5)
data = {
    'Date': dates,
    'Value1': np.random.randn(5),
    'Value2': np.random.randint(1, 100, 5)
}

# Create the DataFrame
df = pd.DataFrame(data)

# Get descriptive statistics
stats = df.describe()
print("Descriptive Statistics:\n", stats)

# Output:
# Descriptive Statistics:
#           Value1     Value2
# count  5.000000   5.000000
# mean  -0.260440  46.400000
# std    1.074673  22.875751
# min   -1.877304  18.000000
# 25%   -0.616400  37.000000
# 50%   -0.165707  40.000000
# 75%    0.448233  59.000000
# max    0.908978  78.000000

In the above example, describe() returns summary statistics for the numerical columns Value1 and Value2, including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values. The output shown omits the Date column, which is the default behavior of pandas versions before 2.0. In pandas 2.0 and later, datetime columns are treated as numeric, so the Date column is summarized as well. Because the values are randomly generated without a fixed seed, your numbers will differ on each run.
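To summarize only the datetime column, you can select it by dtype. The sketch below uses a small illustrative DataFrame with fixed values; the rows that appear depend on the pandas version (pandas 2.0 and later report count, mean, min, percentiles, and max for datetimes, while older versions report a different set).

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=5),
    "Value": [10, 20, 30, 40, 50],
})

# Select only datetime columns for the summary.
date_stats = df.describe(include=["datetime"])
print(date_stats)
```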

Describe Excluding NaN Values

NaN values are excluded automatically when describe() computes its statistics. The example below builds a DataFrame with missing values and describes its numeric columns.


# Create a pandas DataFrame
import pandas as pd
import numpy as np

technologies= {
    'Courses':["Spark",np.nan,"Hadoop","Python","Pandas"],
    'Fee' :[25000,22000,26000,np.nan,30000],
    'Discount':[800,1300,np.nan,1500,1000]
          }
df = pd.DataFrame(technologies)

# Describe the numeric columns by excluding object (string) columns
df2 = df.describe(exclude=[object])
print("Describe excluding object columns:\n", df2)

# Default describe() also skips NaN values in each column
df2 = df.describe()
print("Describe excluding nan values:\n", df2)

# Output:
# Describe excluding nan values:
#                  Fee     Discount
# count      4.000000     4.000000
# mean   25750.000000  1150.000000
# std     3304.037934   310.912635
# min    22000.000000   800.000000
# 25%    24250.000000   950.000000
# 50%    25500.000000  1150.000000
# 75%    27000.000000  1350.000000
# max    30000.000000  1500.000000

In the above examples, df.describe(exclude=[object]) leaves out the object-dtype Courses column, so only the numeric columns are described. The built-in object is used because the np.object alias was removed in NumPy 1.24. In both calls, NaN values are automatically excluded from the calculation of summary statistics like count, mean, and std, which is why count is 4 rather than 5 for each column.
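One way to confirm how missing values are handled is to compare the count row with DataFrame.count(), which also counts non-missing entries. This is a minimal sketch on the numeric columns only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Fee": [25000, 22000, 26000, np.nan, 30000],
    "Discount": [800, 1300, np.nan, 1500, 1000],
})

stats = df.describe()

# 'count' reports non-missing entries per column, so it matches DataFrame.count().
print(stats.loc["count"])
print(df.count())

# The mean is computed over the four available values, not all five rows.
print(stats.loc["mean", "Fee"])  # 25750.0
```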

FAQ on Pandas DataFrame describe() Method

What does the describe() method do?

The describe() method in pandas generates descriptive statistics of the DataFrame. It provides summary statistics for numerical columns by default, such as count, mean, standard deviation, minimum, and maximum values, as well as the 25th, 50th, and 75th percentiles.

Does describe() handle NaN values?

The describe() method excludes NaN values when calculating the summary statistics. The count reflects the number of non-NaN entries.

Can describe() provide statistics for non-numeric columns?

By default, describe() provides summary statistics for numeric columns only. However, you can include non-numeric columns by specifying include='all'.

How can I describe only specific types of columns?

To describe only specific types of columns in a pandas DataFrame, you can use the include and exclude parameters in the describe() method.
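As a rough illustration, the sketch below selects columns by dtype with include. The Rating column is hypothetical and added only so the DataFrame has both integer and float columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [25000, 22000, 26000],   # integer column
    "Rating": [4.5, 3.8, 4.9],      # float column (hypothetical)
})

# Describe only the float64 columns.
float_stats = df.describe(include=["float64"])

# Describe every numeric column, integer or float.
numeric_stats = df.describe(include=[np.number])

print(list(float_stats.columns))    # ['Rating']
print(list(numeric_stats.columns))  # ['Fee', 'Rating']
```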

Can I get statistics for specific percentiles?

You can get statistics for specific percentiles using the percentiles parameter in the describe() method. By default, describe() calculates the 25th, 50th, and 75th percentiles. However, you can specify any percentiles you want.
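For instance, the 10th and 90th percentiles can be requested as fractions between 0 and 1. The sketch below uses a single illustrative column; the resulting rows are labeled '10%' and '90%'.

```python
import pandas as pd

df = pd.DataFrame({"Fee": [25000, 22000, 26000, 23000, 30000]})

# Request the 10th and 90th percentiles instead of the default quartiles.
stats = df.describe(percentiles=[0.1, 0.9])
print(stats)

# Percentile rows are labeled with their percentage.
print(stats.loc["10%", "Fee"], stats.loc["90%", "Fee"])
```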

Conclusion

In this article, you learned about the Pandas DataFrame describe() function, including its syntax, parameters, and usage. You also saw how it returns a DataFrame containing summary statistics of the data, with various statistical measures based on the column types and specified parameters.

Happy Learning!!

Reference

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html