Calculate Summary Statistics in Pandas

  • Post category:Pandas / Python
  • Post last modified:October 30, 2022

How do you compute summary statistics on a Pandas DataFrame or Series? Pandas provides the describe() function to calculate descriptive summary statistics. By default, describe() calculates count, mean, std, min, the 25th, 50th, and 75th percentiles, and max for all numeric columns of the DataFrame.

1. Summary Statistics Functions

The following summary statistics functions are available on both Pandas DataFrame and Series.

No.  Summary Function  Description
1    abs()             Absolute value
2    count()           Count of non-null values
3    cumsum()          Cumulative sum
4    cumprod()         Cumulative product
5    mean()            Mean of values
6    median()          Median of values
7    min()             Minimum of values
8    max()             Maximum of values
9    mode()            Mode of values
10   sum()             Sum of values
11   std()             Standard deviation of values
12   prod()            Product of values

Pandas Summary Statistic Functions
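As a quick illustration of the functions in the table, here is a minimal sketch that calls a few of them on a small Series and DataFrame. The data and variable names (marks, col_means) are made up for this example.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
marks = pd.Series([80, 85, 90, 95, 72], name="Marks1")

print(marks.count())            # number of non-null values
print(marks.sum())              # total of all values
print(marks.mean())             # arithmetic mean
print(marks.median())           # middle value
print(marks.min(), marks.max()) # smallest and largest value
print(marks.cumsum().tolist())  # running total

# On a DataFrame the same methods are applied column by column.
df = pd.DataFrame({"Marks1": [80, 85, 90], "Marks2": [90, 93, 99]})
col_means = df.mean()
print(col_means)
```

Each call on the Series returns a single value (or a Series for the cumulative functions), while the DataFrame call returns one value per column.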

2. Pandas describe() Syntax & Usage

Following is the syntax of the describe() function to get descriptive summary statistics.


# Syntax of describe function
describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

2.1 Parameters & Return

  • percentiles: The percentiles to include in the output; each value should be between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
  • include: A list of data types to include in the result. Applies only to DataFrame; it is ignored for Series.
  • exclude: A list of data types to exclude from the result. Applies only to DataFrame; it is ignored for Series.
  • datetime_is_numeric: Whether to treat datetime types as numeric, which affects the statistics calculated for those columns. This parameter exists in pandas 1.x and was removed in pandas 2.0, where datetime columns are always summarized as numeric.

Returns: Summary statistics of the given Series or DataFrame. Calling describe() on a Series returns a Series; calling it on a DataFrame returns a DataFrame.
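The return type can be checked directly. The following is a small sketch with made-up data; the names frame_summary and series_summary are only for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"Marks1": [80, 85, 90, 95], "Marks2": [90, 93, 99, 81]})

frame_summary = df.describe()             # DataFrame in, DataFrame out
series_summary = df["Marks1"].describe()  # Series in, Series out

print(type(frame_summary).__name__)
print(type(series_summary).__name__)

# For numeric data, the statistics become the index labels of the result.
print(list(series_summary.index))
```

Because the result is an ordinary DataFrame or Series, you can index, slice, or export it like any other pandas object.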

3. Pandas Summary Statistics using describe()

The Pandas describe() function calculates descriptive summary statistics of a DataFrame or Series while excluding NaN values. For a DataFrame with mixed column types, it summarizes only the numeric columns by default, and it provides options to include or exclude columns by data type.

Let’s create a pandas DataFrame from the dict object.


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Student':["Raman","Chris","Anna","Debi","Cheng","Prabha","Srini","Creg","Hong"],
    'Marks1' :[80,85,90,95,72,83,85,95,98],
    'Marks2' :[90,93,99,81,82,95,94,98,88],
    'Marks3' :[76,85,89,82,92,93,96,84,73],
    'Marks4' :[95,98,74,85,97,83,77,75,89],
    'Attendance %'  :[0.98,0.87,0.92,0.98,0.87,0.97,0.99,0.97,0.90]
          })
df = pd.DataFrame(technologies)
print(df)

Yields below output.

[Figure: pandas summary statistics]

Now let’s compute the summary statistics by calling the describe() function on the DataFrame.


# Default describe
print(df.describe())

Yields below output. Notice that by default it gets the summary statistics for all numeric columns and the result contains aggregations like count, mean, std, min, different percentiles, and max.

[Figure: pandas describe summary]
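To see where these numbers come from, the sketch below compares a few rows of the describe() result with the equivalent individual method calls. It uses only the Marks1 values from the example above; the variable name summary is an assumption for this illustration.

```python
import pandas as pd

# Marks1 values from the article's example DataFrame.
df = pd.DataFrame({"Marks1": [80, 85, 90, 95, 72, 83, 85, 95, 98]})
summary = df.describe()

# Each row of the summary matches the corresponding Series method.
print(summary.loc["mean", "Marks1"], df["Marks1"].mean())
print(summary.loc["std", "Marks1"], df["Marks1"].std())
print(summary.loc["50%", "Marks1"], df["Marks1"].median())
print(summary.loc["25%", "Marks1"], df["Marks1"].quantile(0.25))
```

Note that the std row matches Series.std(), which computes the sample standard deviation.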

4. Include All Columns in Summary Statistics

Sometimes you may want to calculate summary statistics for all columns, including object types. You can achieve this by passing include='all' to the describe() function.

Note that for object columns it additionally reports the unique, top, and freq rows. For numeric columns these rows contain NaN (missing values), and likewise the numeric rows such as mean and std contain NaN for object columns.


# Include All Columns
print(df.describe(include='all'))

Yields below output.

[Figure: pandas summary statistics]
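The NaN pattern described above can be verified with a small sketch. The three-row DataFrame here is a shortened, hypothetical version of the example data.

```python
import pandas as pd

# Shortened sample data for illustration only.
df = pd.DataFrame({
    "Student": ["Raman", "Chris", "Anna"],
    "Marks1": [80, 85, 90],
})

full = df.describe(include="all")
print(list(full.index))

# Rows that do not apply to a column's type are filled with NaN.
print(pd.isna(full.loc["unique", "Marks1"]))  # unique is not reported for numeric columns
print(pd.isna(full.loc["mean", "Student"]))   # mean is not reported for text columns
```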

5. Calculate Summary Statistics of Selected Columns

You can also calculate descriptive summary statistics only on selected columns of a Pandas DataFrame. For example, the below selects the Marks1 and Marks2 columns and summarizes only those.


# Summary on selected columns
print(df[['Marks1','Marks2']].describe())

This yields a summary table that contains only the Marks1 and Marks2 columns.
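Since the result is a regular DataFrame, you can also pick out individual statistics from it. The following sketch uses a shortened, hypothetical version of the example data and selects two rows of the summary with .loc.

```python
import pandas as pd

# Shortened sample data for illustration only.
df = pd.DataFrame({
    "Marks1": [80, 85, 90, 95],
    "Marks2": [90, 93, 99, 81],
    "Marks3": [76, 85, 89, 82],
})

# Summarize only two of the three columns.
subset = df[["Marks1", "Marks2"]].describe()
print(list(subset.columns))

# Keep only the rows of interest from the summary.
print(subset.loc[["mean", "max"]])
```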

6. Calculate For Selected Columns based on Data Type

By using the include param you can specify the column types for which you want summary statistics. The following example calculates the summary statistics for object-type columns only.


#Include Object type
print(df.describe(include=['object']))

I will leave this to you to run and validate the result.
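For reference, here is a minimal sketch of what an object-only summary looks like. It uses made-up course data, and the text column is created with an explicit object dtype (an assumption made here so the example does not depend on how your pandas version infers string columns).

```python
import pandas as pd

# Hypothetical sample data; the text column is given an explicit object dtype.
df = pd.DataFrame({
    "Courses": pd.Series(["Spark", "PySpark", "Hadoop", "Spark", "Spark"], dtype="object"),
    "Fee": [22000, 25000, 23000, 25000, 25000],
})

obj_summary = df.describe(include=["object"])
print(obj_summary)
```

The result has the rows count, unique, top (the most frequent value), and freq (how often that value occurs), and it omits the numeric Fee column.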

7. Exclude Multiple Columns Based on Data Type

Similarly, by using the exclude param you can specify the column types you want to leave out of the summary statistics. To exclude several types, pass them as a list. The below example omits all columns of type float and object from the summary result.


# Exclude Multiple Columns by Type
print(df.describe(exclude=['float','object']))

Yields below output.

[Figure: pandas summary dataframe]
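A quick way to confirm which columns survive the exclusion is to inspect the columns of the result. This sketch uses a shortened, hypothetical version of the example data, with the text column created as object dtype explicitly (an assumption for this illustration).

```python
import pandas as pd

# Shortened sample data; the text column is given an explicit object dtype.
df = pd.DataFrame({
    "Student": pd.Series(["Raman", "Chris", "Anna"], dtype="object"),
    "Marks1": [80, 85, 90],
    "Attendance %": [0.98, 0.87, 0.92],
})

remaining = df.describe(exclude=["float", "object"])

# Only the integer column is left in the summary.
print(list(remaining.columns))
```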

8. Calculate Summary Statistics on Custom Percentile

All the examples above use the default percentiles [.25, .5, .75], which return the 25th, 50th, and 75th percentiles. You can customize this with the percentiles param. The below example returns the descriptive summary statistics of a Pandas DataFrame with the 10th, 30th, 50th, and 70th percentiles.


# Custom percentiles (0.5 listed explicitly to match the 50th percentile described above)
print(df.describe(percentiles=[0.1, 0.3, 0.5, 0.7]))

This yields a summary with 10%, 30%, 50%, and 70% rows in place of the default 25%, 50%, and 75% rows.
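The sketch below shows how the percentile rows are labelled and how they relate to quantile(). It uses only the Marks1 values from the example and lists 0.5 explicitly; the variable name custom is an assumption for this illustration.

```python
import pandas as pd

# Marks1 values from the article's example DataFrame.
df = pd.DataFrame({"Marks1": [80, 85, 90, 95, 72, 83, 85, 95, 98]})

custom = df.describe(percentiles=[0.1, 0.3, 0.5, 0.7])

# Percentile rows are labelled with the percentage.
print(list(custom.index))

# Each percentile row equals the corresponding quantile of the column.
print(custom.loc["10%", "Marks1"], df["Marks1"].quantile(0.1))
print(custom.loc["70%", "Marks1"], df["Marks1"].quantile(0.7))
```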

9. Aggregating Statistics Grouped by Category

You often also need summary statistics for each group of data. You can achieve this by grouping the data and then calling the describe() function. To demonstrate, I will use another DataFrame that has columns suitable for a groupby.


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Hadoop","Hadoop","Spark","Python","Spark"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
    'Duration':['30days','50days','55days', '40days','55days','35days','30days','40days','40days'],
    'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200]
          })
df = pd.DataFrame(technologies)
print(df)

# Pandas Get Statistics Using groupby().describe()
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()
print(df2)

Yields below output.

[Figure: groupby statistics]
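As a simpler variation, the following sketch groups by a single column and then keeps only a few of the statistics. It reuses the Courses and Discount values from the example; the names per_course and compact are assumptions for this illustration, and agg() is shown as one possible alternative when you only need a handful of statistics.

```python
import pandas as pd

# Courses and Discount values from the article's example DataFrame.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Hadoop",
                "Hadoop", "Spark", "Python", "Spark"],
    "Discount": [1000, 2300, 1000, 1200, 2500, 1200, 1400, 1000, 1200],
})

# One row per course, one column per statistic.
per_course = df.groupby("Courses")["Discount"].describe()
print(per_course[["count", "mean", "max"]])

# Alternative: compute only the statistics you need.
compact = df.groupby("Courses")["Discount"].agg(["count", "mean", "max"])
print(compact)
```

A group with a single row, such as PySpark here, has a NaN standard deviation because the sample standard deviation is undefined for one value.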

10. Complete Example


# Create a DataFrame.
import pandas as pd
technologies   = ({
    'Student':["Raman","Chris","Anna","Debi","Cheng","Prabha","Srini","Creg","Hong"],
    'Marks1' :[80,85,90,95,72,83,85,95,98],
    'Marks2' :[90,93,99,81,82,95,94,98,88],
    'Marks3' :[76,85,89,82,92,93,96,84,73],
    'Marks4' :[95,98,74,85,97,83,77,75,89],
    'Attendance %'  :[0.98,0.87,0.92,0.98,0.87,0.97,0.99,0.97,0.90]
          })
df = pd.DataFrame(technologies)
print(df)

# Default summary statistics
print(df.describe())

# Include All Columns
print(df.describe(include='all'))

# Include selected columns
print(df[['Marks1','Marks2']].describe())

# Include Object type
print(df.describe(include=['object']))

# Exclude Multiple Columns by Type
print(df.describe(exclude=['float','object']))

# Custom percentiles (0.5 listed explicitly to include the 50th percentile)
print(df.describe(percentiles=[0.1, 0.3, 0.5, 0.7]))

Conclusion

In this article, you have learned different options for calculating descriptive summary statistics on a Pandas DataFrame. By default, the describe() function calculates count, mean, std, min, the 25th, 50th, and 75th percentiles, and max for all numeric columns of the DataFrame. It also provides the include and exclude params to select columns by data type.

NNK