  • Post category:Pandas
  • Post last modified:March 27, 2024
Calculate Summary Statistics in Pandas

How do you calculate summary statistics on a Pandas DataFrame or Series? Pandas provides the describe() function to calculate descriptive summary statistics. By default, describe() calculates count, mean, std, min, the 25th, 50th, and 75th percentiles, and max for all numeric columns of the DataFrame.

Key Points –

  • Pandas provides efficient methods to calculate summary statistics such as mean, median, mode, standard deviation, variance, minimum, maximum, and quantiles for numerical data.
  • The describe() function in Pandas generates a descriptive summary of the data including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.
  • For specific summary statistics, Pandas offers individual functions like mean(), median(), std(), var(), min(), max(), quantile(), and sum() which can be applied to columns or rows of a DataFrame.
  • By default, these functions operate column-wise (axis=0); pass axis=1 to apply them row-wise.
  • Pandas offers a comprehensive suite of functions for calculating summary statistics, facilitating efficient data exploration and analysis within DataFrame structures.
  • Summary statistics computed by Pandas provide essential insights into the central tendency, dispersion, and distribution of numerical data, aiding in informed decision-making and hypothesis testing processes.
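
As a minimal sketch of the individual functions mentioned above, the snippet below builds a small DataFrame and computes a few statistics column-wise and row-wise. The name scores and the values in it are made up for illustration only.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical values for this sketch)
scores = pd.DataFrame({
    'Marks1': [80, 85, 90],
    'Marks2': [90, 93, 99],
})

# Column-wise (default, axis=0): one value per column
col_mean = scores.mean()
col_median = scores.median()

# Row-wise (axis=1): one value per row
row_mean = scores.mean(axis=1)

# quantile() takes a fraction between 0 and 1
q50 = scores['Marks1'].quantile(0.5)

print(col_mean)
print(row_mean)
```

Each call returns a Series when applied to a DataFrame and a single value when applied to one column.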

1. Summary Statistics Functions

The following summary statistics functions are available on Pandas DataFrame and Series.

Number  Summary Function  Description
1       abs()             Absolute value
2       count()           Count of non-null values
3       cumsum()          Cumulative sum
4       cumprod()         Cumulative product
5       mean()            Mean of column values
6       median()          Median of values
7       min()             Minimum of values
8       max()             Maximum of values
9       mode()            Mode of values
10      sum()             Sum of column values
11      std()             Standard deviation of values
12      prod()            Product of values

Table: Pandas summary statistic functions
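
To make the table concrete, here is a short sketch that applies several of these functions to one Series. The Series s and its values are hypothetical; it contains one missing value to show how that is handled.

```python
import pandas as pd

# Illustrative Series with one missing value (None becomes NaN)
s = pd.Series([3, -1, 4, None, 4])

print(s.count())            # counts non-null values only
print(s.abs().tolist())     # element-wise absolute values
print(s.cumsum().tolist())  # running total; the NaN position stays NaN
print(s.mode().tolist())    # most frequent value(s)
print(s.sum(), s.prod())    # NaN is skipped by default
```

Note that abs(), cumsum(), and cumprod() return a result of the same length as the input, while the other functions reduce the values to a single number (mode() may return several values when there are ties).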

2. Pandas describe() Syntax & Usage

Following is the syntax of the describe() function to get descriptive summary statistics.


# Syntax of describe()
# Note: datetime_is_numeric was removed in pandas 2.0; omit it on 2.0 and later
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

2.1 Parameters & Return

  • percentiles: The percentiles to include in the output, the values should be between 0 and 1. By default, it takes values [.25, .5, .75] that return the 25th, 50th, and 75th percentiles.
  • include: Applicable only to DataFrame which provides a white list of data types to include in the result.
  • exclude: Applicable only to DataFrame; a black list of data types to exclude from the result.
  • datetime_is_numeric: Whether to treat datetime types as numeric, which affects the statistics calculated for datetime columns. This parameter was removed in pandas 2.0, where datetime columns are always summarized as numeric.

Returns: Summary statistics of the Series or DataFrame provided. Calling describe() on a DataFrame returns a DataFrame; calling it on a Series returns a Series.
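
The return types described above can be checked with a quick sketch. The DataFrame marks and its values are hypothetical and used only for this illustration.

```python
import pandas as pd

# Illustrative data (hypothetical values)
marks = pd.DataFrame({'Marks1': [80, 85, 90], 'Marks2': [90, 93, 99]})

frame_summary = marks.describe()             # DataFrame in, DataFrame out
series_summary = marks['Marks1'].describe()  # Series in, Series out

print(type(frame_summary).__name__)
print(type(series_summary).__name__)
print(series_summary['mean'])
```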

3. Pandas Summary Statistics using describe()

The Pandas describe() function calculates descriptive summary statistics for a DataFrame or Series, excluding NaN values. For a DataFrame with mixed column types, it summarizes only the numeric columns by default, and it provides the include and exclude parameters to control which column types appear in the result.

Let’s create a pandas DataFrame from the dict object.


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Student':["Raman","Chris","Anna","Debi","Cheng","Prabha","Srini","Creg","Hong"],
    'Marks1' :[80,85,90,95,72,83,85,95,98],
    'Marks2' :[90,93,99,81,82,95,94,98,88],
    'Marks3' :[76,85,89,82,92,93,96,84,73],
    'Marks4' :[95,98,74,85,97,83,77,75,89],
    'Attendance %'  :[0.98,0.87,0.92,0.98,0.87,0.97,0.99,0.97,0.90]
          })
df = pd.DataFrame(technologies)
print(df)

This prints the DataFrame: nine rows, one per student, with the Student, Marks1 to Marks4, and Attendance % columns.

Now let’s calculate the summary statistics by calling the describe() function on the DataFrame.


# Default describe
print(df.describe())

Notice that by default describe() returns the summary statistics for the numeric columns only (Marks1 to Marks4 and Attendance %). The result contains the rows count, mean, std, min, 25%, 50%, 75%, and max.

4. Include All Columns in Summary Statistics

Sometimes you may want to calculate summary statistics for all columns, including object types. You can achieve this by passing include='all' to the describe() function.

Note that for object types the result additionally contains the rows unique, top, and freq. For numeric columns these three rows are NaN (missing values), and for object columns the numeric rows such as mean and std are NaN.


# Include All Columns
print(df.describe(include='all'))

The result now includes the Student column alongside the numeric columns.
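
The NaN pattern can be verified on a smaller example. This is a sketch on a hypothetical DataFrame named courses; the string column is created with an explicit object dtype so that it behaves the same across pandas versions.

```python
import pandas as pd

# Illustrative data: one object column and one numeric column
courses = pd.DataFrame({
    'Course': pd.Series(['Spark', 'Hadoop', 'Spark'], dtype=object),
    'Fee': [22000, 23000, 25000],
})

full = courses.describe(include='all')

print(full.loc['unique', 'Course'])         # number of distinct courses
print(pd.isna(full.loc['unique', 'Fee']))   # unique is NaN for numeric columns
print(pd.isna(full.loc['mean', 'Course']))  # mean is NaN for object columns
```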

5. Calculate Summary Statistics of Selected Columns

You can also calculate descriptive summary statistics only on selected columns of a Pandas DataFrame. For example, the code below selects the Marks1 and Marks2 columns and then calls describe() on them.


# Summary on selected columns
print(df[['Marks1','Marks2']].describe())

The result contains summary statistics for the Marks1 and Marks2 columns only.

6. Calculate For Selected Columns based on Data Type

By using the include param you can specify the column types you want summary statistics for. The following example calculates the summary statistics only for columns of object type.


# Include Object type
print(df.describe(include=['object']))

I will leave this to you to run and validate the result.
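
If you want a quick reference for what to expect, here is a small sketch on a hypothetical DataFrame (not the one used above) where one value repeats, so that top and freq are meaningful. The string column is given an explicit object dtype.

```python
import pandas as pd

# Illustrative data: 'Spark' appears twice
courses = pd.DataFrame({
    'Course': pd.Series(['Spark', 'Hadoop', 'Spark'], dtype=object),
    'Fee': [22000, 23000, 25000],
})

obj_summary = courses.describe(include=['object'])
print(obj_summary)
```

Only the Course column appears in the result, with the rows count, unique, top (the most frequent value), and freq (how often it occurs).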

7. Exclude Multiple Columns Based on Data Type

Similarly, by using the exclude param you can specify the column types you want to leave out of the summary statistics. To exclude several types, pass them as a list. The example below omits all columns of type float and object from the summary result.


# Exclude Multiple Columns by Type
print(df.describe(exclude=['float','object']))

The result contains only the integer columns Marks1 to Marks4; Attendance % (float) and Student (object) are omitted.
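
The same filtering can be sketched on a smaller hypothetical DataFrame named report, alongside include=['number'], which keeps every numeric column regardless of whether it holds integers or floats.

```python
import pandas as pd

# Illustrative data with one object, one integer, and one float column
report = pd.DataFrame({
    'Student': pd.Series(['Raman', 'Chris', 'Anna'], dtype=object),
    'Marks1': [80, 85, 90],
    'Attendance': [0.98, 0.87, 0.92],
})

# Drop float and object columns: only the integer column remains
ints_only = report.describe(exclude=['float', 'object'])

# Keep all numeric columns (integers and floats)
numeric = report.describe(include=['number'])

print(list(ints_only.columns))
print(list(numeric.columns))
```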

8. Calculate Summary Statistics on Custom Percentile

All the examples above use the default percentiles [.25, .5, .75], which return the 25th, 50th, and 75th percentiles. You can customize this by using the percentiles param. The example below requests the 10th, 30th, and 70th percentiles; pandas 1.x and 2.x also add the 50th percentile (the median) to the result automatically.


# Custom percentiles
print(df.describe(percentiles=[0.1, 0.3, 0.7]))

The percentile rows in the result now follow the percentiles argument instead of the default 25%, 50%, and 75%.
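
As a rough check of how these rows are labeled and computed, the sketch below compares one percentile row from describe() with the value returned by quantile(). It reuses the Marks1 values from the example DataFrame; the variable names are chosen for this illustration.

```python
import pandas as pd

marks = pd.DataFrame({'Marks1': [80, 85, 90, 95, 72, 83, 85, 95, 98]})

summary = marks.describe(percentiles=[0.1, 0.3, 0.7])

# Rows are labeled as percentages, e.g. '10%'
p10_from_describe = summary.loc['10%', 'Marks1']
p10_from_quantile = marks['Marks1'].quantile(0.1)

print(summary.index.tolist())
print(p10_from_describe, p10_from_quantile)
```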

9. Aggregating Statistics Grouped by Category

Often you also need summary statistics for each group in the data. You can achieve this by grouping the data and then running the describe() function. To demonstrate this, I will use another DataFrame that has columns suitable for a groupby.


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Hadoop","Hadoop","Spark","Python","Spark"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
    'Duration':['30days','50days','55days', '40days','55days','35days','30days','40days','40days'],
    'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200]
          })
df = pd.DataFrame(technologies)
print(df)

# Pandas get statistics using groupby().describe()
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()
print(df2)

The result has one row per (Courses, Duration) group, with the describe() statistics of the Discount column as its columns.
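
When you need only a few statistics per group, one alternative is groupby().agg() with a list of function names. This is a sketch on a reduced, hypothetical version of the data above, not a replacement for the describe() approach.

```python
import pandas as pd

# Illustrative subset of the course data
sales = pd.DataFrame({
    'Courses': ['Spark', 'PySpark', 'Hadoop', 'Spark', 'Hadoop'],
    'Discount': [1000, 2300, 1000, 1400, 2500],
})

# One row per course, one column per requested statistic
stats = sales.groupby('Courses')['Discount'].agg(['count', 'mean', 'min', 'max'])
print(stats)
```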

10. Complete Example


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Student':["Raman","Chris","Anna","Debi","Cheng","Prabha","Srini","Creg","Hong"],
    'Marks1' :[80,85,90,95,72,83,85,95,98],
    'Marks2' :[90,93,99,81,82,95,94,98,88],
    'Marks3' :[76,85,89,82,92,93,96,84,73],
    'Marks4' :[95,98,74,85,97,83,77,75,89],
    'Attendance %'  :[0.98,0.87,0.92,0.98,0.87,0.97,0.99,0.97,0.90]
          })
df = pd.DataFrame(technologies)
print(df)

# Default summary statistics
print(df.describe())

# Include All Columns
print(df.describe(include='all'))

# Include selected columns
print(df[['Marks1','Marks2']].describe())

# Include Object type
print(df.describe(include=['object']))

# Exclude Multiple Columns by Type
print(df.describe(exclude=['float','object']))

# Custom percentiles
print(df.describe(percentiles=[0.1, 0.3, 0.7]))

Frequently Asked Questions on Calculate Summary Statistics in Pandas

What is the purpose of calculating summary statistics in Pandas?

Summary statistics provide a concise overview of the data, including measures of central tendency, dispersion, and distribution. This helps in understanding the characteristics of the dataset without having to examine each data point individually.

How can I calculate summary statistics for a DataFrame in Pandas?

You can use the describe() method in Pandas DataFrame to generate summary statistics. It provides count, mean, standard deviation, minimum, quartiles, and maximum values for numeric columns.

Can I calculate summary statistics for specific columns only?

You can calculate summary statistics for specific columns by selecting those columns before applying the describe() method. This allows you to focus on the relevant attributes of your dataset.

What if my DataFrame contains non-numeric columns?

By default, describe() provides summary statistics only for numeric columns. However, you can specify the data types to include or exclude using the include and exclude parameters to tailor the summary to your needs.

Are summary statistics affected by missing values in the DataFrame?

Missing values are left out of the calculation rather than counted. When you calculate summary statistics with methods like describe(), Pandas automatically excludes missing values (NaN), so each statistic is based only on the available data in that column, and count reports the number of non-null values. Because a column with many missing values is summarized from fewer observations, check the count row and handle missing values appropriately before relying on the results.
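
A minimal sketch of this behavior, using a hypothetical three-element Series with one NaN:

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

summary = s.describe()
print(summary['count'])  # number of non-null values
print(summary['mean'])   # mean of the non-null values

# With skipna=False the NaN propagates into the result
strict_mean = s.mean(skipna=False)
print(strict_mean)
```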

Can I calculate summary statistics for categorical data?

Yes. On object or categorical columns, describe() returns count, unique, top, and freq instead of the numeric statistics. You can also use methods such as value_counts() to get frequency distributions, or write custom functions tailored to your categorical variables.
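
For example, the following sketch summarizes a hypothetical Series of course names with both value_counts() and describe():

```python
import pandas as pd

# Illustrative categorical-like data
courses = pd.Series(['Spark', 'Hadoop', 'Spark', 'Python'])

counts = courses.value_counts()                # absolute frequencies
shares = courses.value_counts(normalize=True)  # relative frequencies
cat_summary = courses.describe()               # count, unique, top, freq

print(counts)
print(shares)
print(cat_summary)
```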

Conclusion

In this article, you have learned different options to calculate descriptive summary statistics in a Pandas DataFrame. By default, the describe() function calculates count, mean, std, min, the 25th, 50th, and 75th percentiles, and max for all numeric columns of the DataFrame. It also provides the include and exclude parameters to select columns based on data types.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.