Calculate Summary Statistics in Pandas

How do you calculate summary statistics for a pandas DataFrame or Series? Pandas provides the describe() function to compute descriptive summary statistics. By default, describe() calculates count, mean, std, min, the 25th, 50th, and 75th percentiles, and max for all numeric columns (features) of the DataFrame.

1. Summary Statistics Functions

The following summary statistics functions are available on both pandas DataFrame and Series.

Number	Summary Function	Description
1	abs()	Absolute Value
2	count()	Count of Non-null Values
3	cumsum()	Cumulative Sum
4	cumprod()	Cumulative Product
5	mean()	Mean of Column Values
6	median()	Median of Values
7	min()	Minimum of Values
8	max()	Maximum of Values
9	mode()	Mode of Values
10	sum()	Sum of Column Values
11	std()	Standard Deviation of Values
12	prod()	Product of Values

Pandas Summary Statistic Functions
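
As a quick illustration of the functions in the table, the sketch below applies several of them to a single Series. It reuses the Marks1 values from the example DataFrame in this article; the variable names marks and deviations are only illustrative.

```python
import pandas as pd

# Marks1 values from the example DataFrame used later in this article.
marks = pd.Series([80, 85, 90, 95, 72, 83, 85, 95, 98], name="Marks1")

print(marks.count())             # number of non-null values
print(marks.sum())               # total of all values
print(marks.mean())              # arithmetic mean
print(marks.median())            # middle value after sorting
print(marks.min(), marks.max())  # smallest and largest values
print(marks.mode().tolist())     # most frequent value(s); ties return several
print(marks.std())               # sample standard deviation (ddof=1 by default)
print(marks.cumsum().tolist())   # running total

# abs() works element-wise: absolute distance of each mark from the mean.
deviations = (marks - marks.mean()).abs()
print(deviations.tolist())
```

Most of these functions reduce a Series to a single value, while abs(), cumsum(), and cumprod() return a result with the same length as the input.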

2. Pandas describe() Syntax & Usage

Following is the syntax of the describe() function to get descriptive summary statistics.


# Syntax of describe function
describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

2.1 Parameters & Return

percentiles: The percentiles to include in the output; each value must be between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include: A list of data types to include in the result (an allow list), or 'all' to include every column. Applies only to DataFrame; it is ignored for Series.
exclude: A list of data types to omit from the result (a deny list). Applies only to DataFrame; it is ignored for Series.
datetime_is_numeric: Whether to treat datetime types as numeric. This affects the statistics calculated for datetime columns. This parameter was removed in pandas 2.0, where datetime columns are always summarized as numeric, so omit it on pandas 2.0 or later.

Returns: Summary statistics of the Series or DataFrame. Calling describe() on a Series returns a Series; calling it on a DataFrame returns a DataFrame.
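
To make the return types concrete, here is a minimal sketch that calls describe() on both a Series and a DataFrame and also passes a custom percentiles list. The three-row DataFrame is a shortened version of the article's data, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Raman", "Chris", "Anna"],
    "Marks1": [80, 85, 90],
})

# Series in, Series out: the index holds the statistic names.
series_summary = df["Marks1"].describe()
print(type(series_summary).__name__)
print(series_summary.index.tolist())

# DataFrame in, DataFrame out: one column per summarized column.
frame_summary = df.describe()
print(type(frame_summary).__name__)
print(frame_summary.columns.tolist())

# Restrict the percentile rows to the median only.
median_only = df["Marks1"].describe(percentiles=[0.5])
print(median_only.index.tolist())
```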

3. Pandas Summary Statistics using describe()

The pandas describe() function calculates descriptive summary statistics while excluding NaN values from the DataFrame or Series. By default, it summarizes only the numeric columns, and it provides the include and exclude options to control which column types appear in the result.

Let’s create a pandas DataFrame from the dict object.


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Student':["Raman","Chris","Anna","Debi","Cheng","Prabha","Srini","Creg","Hong"],
    'Marks1' :[80,85,90,95,72,83,85,95,98],
    'Marks2' :[90,93,99,81,82,95,94,98,88],
    'Marks3' :[76,85,89,82,92,93,96,84,73],
    'Marks4' :[95,98,74,85,97,83,77,75,89],
    'Attendance %'  :[0.98,0.87,0.92,0.98,0.87,0.97,0.99,0.97,0.90]
          })
df = pd.DataFrame(technologies)
print(df)


Now let’s compute the summary statistics by calling the describe() function on the DataFrame.


# Default describe
print(df.describe())

Notice that by default it returns summary statistics only for the numeric columns (the Student column is omitted), and the result contains the aggregations count, mean, std, min, the 25th, 50th, and 75th percentiles, and max.
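
Because the result of describe() is itself a DataFrame, individual statistics can be read back by label. The sketch below rebuilds a narrower version of the DataFrame above (only Student, Marks1, and Attendance %) so that it runs on its own; the variable name summary is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Raman", "Chris", "Anna", "Debi", "Cheng", "Prabha", "Srini", "Creg", "Hong"],
    "Marks1": [80, 85, 90, 95, 72, 83, 85, 95, 98],
    "Attendance %": [0.98, 0.87, 0.92, 0.98, 0.87, 0.97, 0.99, 0.97, 0.90],
})

summary = df.describe()

# Rows are the statistics; columns are the numeric columns only.
print(summary.index.tolist())
print(summary.columns.tolist())

# Read a single statistic by row label and column name.
print(summary.loc["mean", "Marks1"])
print(summary.loc["max", "Attendance %"])

# Transpose to get one row per column, which can be easier to scan.
print(summary.T)
```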

4. Include All Columns in Summary Statistics

Sometimes you may want summary statistics for all columns, including object types. You can achieve this by passing include='all' to the describe() function.

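Note that for object types it additionally reports the rows unique (the number of distinct values), top (the most frequent value), and freq (how often the top value occurs). For numeric columns these three rows are NaN (missing values), and for object columns the numeric rows such as mean and std are NaN.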


# Include All Columns
print(df.describe(include='all'))

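The way include='all' fills some rows and leaves others as NaN can be checked with a small sketch. It uses a shortened three-row version of the DataFrame above, and the variable name summary is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Raman", "Chris", "Anna"],
    "Marks1": [80, 85, 90],
})

summary = df.describe(include="all")

# Text-only rows (unique, top, freq) come first, then the numeric rows.
print(summary.index.tolist())

# Text column: unique and freq are filled, mean is NaN.
print(summary.loc[["count", "unique", "freq", "mean"], "Student"])

# Numeric column: the reverse, mean is filled, unique and freq are NaN.
print(summary.loc[["count", "unique", "freq", "mean"], "Marks1"])
```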

5. Calculate Summary Statistics of Selected Columns

You can also calculate descriptive summary statistics only on selected columns of a pandas DataFrame. For example, the below gets the results only for the Marks1 and Marks2 columns.


# Summary on selected columns
print(df[['Marks1','Marks2']].describe())


6. Calculate For Selected Columns based on Data Type

By using the include param, you can specify the column types you want summary statistics for. The following example calculates the summary statistics only for object-type columns.


# Include Object type
print(df.describe(include=['object']))

I will leave this to you to run and validate the result.
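
If you want something to compare your result against, the following is a minimal sketch on a smaller DataFrame. Two things here are assumptions of the sketch rather than part of the example above: the Student column is created with an explicit object dtype so that it is object-typed whatever string dtype your pandas version infers by default, and a hypothetical second row for Chris is added so that top and freq are meaningful. It also shows the opposite selection, numeric columns only, using numpy.number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # Explicit object dtype (assumption of this sketch, see the text above).
    "Student": pd.Series(["Raman", "Chris", "Anna", "Chris"], dtype="object"),
    "Marks1": [80, 85, 90, 85],
    "Attendance %": [0.98, 0.87, 0.92, 0.87],
})

# Object columns only: count, unique, top, and freq.
object_summary = df.describe(include=["object"])
print(object_summary)

# Numeric columns only, selected through the NumPy number type.
number_summary = df.describe(include=[np.number])
print(number_summary.columns.tolist())
```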

7. Exclude Multiple Columns Based on Data Type

Similarly, by using the exclude param you can specify the column types you want to leave out of the summary statistics. To exclude several types, pass them as a list. The below example omits all columns of type float and object from the summary result.


# Exclude Multiple Columns by Type
print(df.describe(exclude=['float','object']))

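To see which columns survive the exclusion, it can help to print the dtypes first. The sketch below uses a shortened version of the DataFrame; as an assumption of this sketch, the Student column is given an explicit object dtype so the 'object' filter applies regardless of the string dtype your pandas version infers by default.

```python
import pandas as pd

df = pd.DataFrame({
    # Explicit object dtype (assumption of this sketch, see the text above).
    "Student": pd.Series(["Raman", "Chris", "Anna"], dtype="object"),
    "Marks1": [80, 85, 90],
    "Marks2": [90, 93, 99],
    "Attendance %": [0.98, 0.87, 0.92],
})

# Student is object, the marks are integers, Attendance % is float.
print(df.dtypes)

# Dropping float and object columns leaves only the integer columns.
int_only = df.describe(exclude=["float", "object"])
print(int_only.columns.tolist())
```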

8. Calculate Summary Statistics on Custom Percentile

As you may have noticed, all the examples above report the default percentiles [.25, .5, .75], that is, the 25th, 50th, and 75th percentiles. You can customize this by using the percentiles param. The below example returns the descriptive summary statistics of the DataFrame with the 10th, 30th, 50th, and 70th percentiles.


# Custom percentiles (0.5 is listed explicitly so the 50th percentile is always reported)
print(df.describe(percentiles=[0.1, 0.3, 0.5, 0.7]))

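The percentile rows are labeled with the percentage, and each one should agree with what quantile() returns for the same fraction. The sketch below checks this on the Marks1 values; the variable names are only illustrative.

```python
import pandas as pd

# Marks1 values from the example DataFrame.
marks = pd.Series([80, 85, 90, 95, 72, 83, 85, 95, 98], name="Marks1")

summary = marks.describe(percentiles=[0.1, 0.3, 0.5, 0.7])
print(summary.index.tolist())

# Compare each percentile row with Series.quantile() for the same fraction.
for label, q in [("10%", 0.1), ("30%", 0.3), ("50%", 0.5), ("70%", 0.7)]:
    print(label, summary[label], marks.quantile(q))
```

If only one or two percentiles are needed, calling quantile() directly avoids computing the rest of the summary.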

9. Aggregating Statistics Grouped by Category

You will often need summary statistics for each group in your data. You can achieve this by grouping the data and then running the describe() function. To demonstrate, I will use another DataFrame that has columns suitable for a groupby.


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Hadoop","Hadoop","Spark","Python","Spark"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
    'Duration':['30days','50days','55days', '40days','55days','35days','30days','40days','40days'],
    'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200]
          })
df = pd.DataFrame(technologies)
print(df)

# Pandas get statistics using groupby().describe()
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()
print(df2)

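A simpler variation groups by a single column, which makes the result easier to read. The sketch below groups the Fee column by Courses using the same values as the DataFrame above, and then uses agg() as an alternative when only a few statistics are needed. The variable names fee_stats and fee_agg are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Hadoop", "Hadoop", "Spark", "Python", "Spark"],
    "Fee": [22000, 25000, 23000, 24000, 26000, 25000, 25000, 22000, 25000],
})

# One row per course, one column per statistic.
fee_stats = df.groupby("Courses")["Fee"].describe()
print(fee_stats[["count", "mean", "min", "max"]])

# agg() computes only the statistics that are named.
fee_agg = df.groupby("Courses")["Fee"].agg(["count", "mean", "max"])
print(fee_agg)
```

Groups with a single row, such as PySpark here, have a count of 1 and a std of NaN, because the sample standard deviation is undefined for one value.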

10. Complete Example


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Student':["Raman","Chris","Anna","Debi","Cheng","Prabha","Srini","Creg","Hong"],
    'Marks1' :[80,85,90,95,72,83,85,95,98],
    'Marks2' :[90,93,99,81,82,95,94,98,88],
    'Marks3' :[76,85,89,82,92,93,96,84,73],
    'Marks4' :[95,98,74,85,97,83,77,75,89],
    'Attendance %'  :[0.98,0.87,0.92,0.98,0.87,0.97,0.99,0.97,0.90]
          })
df = pd.DataFrame(technologies)
print(df)

# Default summary statistics
print(df.describe())

# Include All Columns
print(df.describe(include='all'))

# Include selected columns
print(df[['Marks1','Marks2']].describe())

# Include Object type
print(df.describe(include=['object']))

# Exclude Multiple Columns by Type
print(df.describe(exclude=['float','object']))

# Custom percentiles (0.5 is listed explicitly so the 50th percentile is always reported)
print(df.describe(percentiles=[0.1, 0.3, 0.5, 0.7]))

Frequently Asked Questions on Calculate Summary Statistics in Pandas

What is the purpose of calculating summary statistics in Pandas?

Summary statistics provide a concise overview of the data, including measures of central tendency, dispersion, and distribution. This helps in understanding the characteristics of the dataset without having to examine each data point individually.

How can I calculate summary statistics for a DataFrame in Pandas?

You can use the describe() method in Pandas DataFrame to generate summary statistics. It provides count, mean, standard deviation, minimum, quartiles, and maximum values for numeric columns.

Can I calculate summary statistics for specific columns only?

You can calculate summary statistics for specific columns by selecting those columns before applying the describe() method. This allows you to focus on the relevant attributes of your dataset.

What if my DataFrame contains non-numeric columns?

By default, describe() provides summary statistics only for numeric columns. However, you can specify the data types to include or exclude using the include and exclude parameters to tailor the summary to your needs.

Are summary statistics affected by missing values in the DataFrame?

Yes, in the sense that missing values are skipped rather than counted. When you calculate summary statistics using methods like describe(), pandas automatically excludes missing values (NaN) from the computation, so each statistic is based only on the available data in that column, and the count row reports the number of non-null values. A column with many missing values is therefore summarized from fewer observations. Compare count against the number of rows, and decide how to handle missing values before relying on the results.
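
The effect is easy to see on a small Series. The values below are hypothetical and chosen only to make the arithmetic obvious; the last step shows how filling NaN with 0 changes the mean, which is one reason the handling choice matters.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with two missing entries.
marks = pd.Series([80, np.nan, 90, np.nan, 100], name="Marks1")

summary = marks.describe()
print(len(marks))        # total rows, including NaN
print(summary["count"])  # non-null values only
print(summary["mean"])   # mean of 80, 90, and 100

# Filling NaN with 0 makes every row count, which pulls the mean down.
filled = marks.fillna(0).describe()
print(filled["count"], filled["mean"])
```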

Can I calculate summary statistics for categorical data?

describe() focuses on numeric data by default, but it also summarizes object columns (count, unique, top, and freq) when you select them with the include parameter. You can also use methods such as value_counts() to get frequency distributions, or write custom functions tailored to your categorical variables.
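
As a minimal sketch of both approaches, the code below takes the Courses values from the groupby example and computes their frequencies, their relative shares, and the describe() summary of the column. The variable names counts and shares are only illustrative.

```python
import pandas as pd

courses = pd.Series(
    ["Spark", "PySpark", "Hadoop", "Python", "Hadoop", "Hadoop", "Spark", "Python", "Spark"],
    name="Courses",
)

# Frequency of each category, most common first.
counts = courses.value_counts()
print(counts)

# Relative frequencies that sum to 1.
shares = courses.value_counts(normalize=True)
print(shares.round(2))

# describe() on a text column reports count, unique, top, and freq.
print(courses.describe())
```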

Conclusion

In this article, you have learned different options to calculate descriptive summary statistics for a pandas DataFrame. By default, the describe() function calculates count, mean, std, min, the 25th, 50th, and 75th percentiles, and max on all numeric columns of the DataFrame. It also provides the include and exclude parameters to select columns based on data types.

References

https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html
