How do you compute summary statistics on a Pandas DataFrame or Series? Pandas provides the describe() function to calculate descriptive summary statistics. By default, describe() calculates count, mean, std, min, the 25th, 50th, and 75th percentiles, and max for all numeric features or columns of the DataFrame.
Key Points –
- Pandas provides efficient methods to calculate summary statistics such as mean, median, mode, standard deviation, variance, minimum, maximum, and quantiles for numerical data.
- The describe() function in Pandas generates a descriptive summary of the data including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.
- For specific summary statistics, Pandas offers individual functions like mean(), median(), std(), var(), min(), max(), quantile(), and sum(), which can be applied to columns or rows of a DataFrame.
- By default, these functions operate column-wise, but with appropriate arguments, they can be applied row-wise or along specific axes.
- Pandas offers a comprehensive suite of functions for calculating summary statistics, facilitating efficient data exploration and analysis within DataFrame structures.
- Summary statistics computed by Pandas provide essential insights into the central tendency, dispersion, and distribution of numerical data, aiding in informed decision-making and hypothesis testing processes.
Table of contents
- 1. Summary Statistics Functions
- 2. Pandas describe() Syntax & Usage
- 3. Pandas Summary Statistics using describe()
- 4. Include All Columns in Summary Statistics
- 5. Calculate Summary Statistics of Selected Columns
- 6. Calculate For Selected Columns based on Data Type
- 7. Exclude Multiple Columns Based on Data Type
- 8. Calculate Summary Statistics on Custom Percentile
- 9. Aggregating Statistics Grouped by Category
- 10. Complete Example
1. Summary Statistics Functions
Following are different summary statistics functions provided in Pandas DataFrame and Series.
Number | Summary Function | Description |
---|---|---|
1 | abs() | Absolute Value |
2 | count() | Count of Non-null Values |
3 | cumsum() | Cumulative Sum |
4 | cumprod() | Cumulative Product |
5 | mean() | Mean of Column Values |
6 | median() | Median of Values |
7 | min() | Minimum of Values |
8 | max() | Maximum of Values |
9 | mode() | Mode of Values |
10 | sum() | Sum of Column Values |
11 | std() | Standard Deviation of Values |
12 | prod() | Product of Values |
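As a quick illustration of a few of these functions, here is a minimal sketch. The `scores` DataFrame and its values are made up for this example only; the `axis=1` argument switches a function from column-wise to row-wise operation.

```python
import pandas as pd

# Hypothetical data, used only for this illustration
scores = pd.DataFrame({
    "Marks1": [80, 85, 90],
    "Marks2": [90, 93, 99],
})

# Column-wise (the default, axis=0): one value per column
col_means = scores.mean()
col_max = scores.max()

# Row-wise (axis=1): one value per row
row_sums = scores.sum(axis=1)

# Cumulative functions return a result with the same length as the input
running_total = scores["Marks1"].cumsum()

print(col_means)
print(row_sums)
print(running_total)
```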
2. Pandas describe() Syntax & Usage
Following is the syntax of the describe() function to get descriptive summary statistics.
# Syntax of describe function
describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
2.1 Parameters & Return
percentiles
: The percentiles to include in the output; the values should be between 0 and 1. By default, it takes the values [.25, .5, .75], which return the 25th, 50th, and 75th percentiles.
include
: Applicable only to DataFrame. A list of data types to include in the result.
exclude
: Applicable only to DataFrame. A list of data types to exclude from the result.
datetime_is_numeric
: Whether to treat DateTime types as numeric. This affects the statistics calculated for the column. This parameter was removed in pandas 2.0, where datetime columns are always summarized as numeric data, so omit it on recent versions.
Returns: Summary statistics of the Series or DataFrame provided. Called on a DataFrame it returns a DataFrame; called on a Series it returns a Series.
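The return type can be checked directly. The sketch below uses a small made-up DataFrame (`df_demo` is a name chosen for this illustration) and calls describe() once on the whole frame and once on a single column.

```python
import pandas as pd

# Hypothetical data, used only for this illustration
df_demo = pd.DataFrame({
    "Marks1": [80, 85, 90, 95],
    "Marks2": [90, 93, 99, 81],
})

frame_summary = df_demo.describe()             # DataFrame in, DataFrame out
series_summary = df_demo["Marks1"].describe()  # Series in, Series out

print(type(frame_summary).__name__)
print(type(series_summary).__name__)

# The statistic names become the index of the result
print(list(series_summary.index))
```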
3. Pandas Summary Statistics using describe()
The Pandas describe() function calculates the descriptive summary statistics of values, excluding NaN values, for a DataFrame or Series. By default, it summarizes only the numeric columns of a DataFrame, and it provides options to include or exclude columns, such as object types, in the summary results.
Let’s create a pandas DataFrame from the dict object.
# Create a DataFrame.
import pandas as pd
technologies = ({
'Student':["Raman","Chris","Anna","Debi","Cheng","Prabha","Srini","Creg","Hong"],
'Marks1' :[80,85,90,95,72,83,85,95,98],
'Marks2' :[90,93,99,81,82,95,94,98,88],
'Marks3' :[76,85,89,82,92,93,96,84,73],
'Marks4' :[95,98,74,85,97,83,77,75,89],
'Attendance %' :[0.98,0.87,0.92,0.98,0.87,0.97,0.99,0.97,0.90]
})
df = pd.DataFrame(technologies)
print(df)
Yields below output.
Now let’s perform the Pandas summary statistics by calling describe()
function on DataFrame.
# Default describe
print(df.describe())
Yields below output. Notice that by default it gets the summary statistics for all numeric columns and the result contains aggregations like count, mean, std, min, different percentiles, and max.
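To see how NaN values are excluded, consider this small sketch. The `marks` Series is made up for the illustration; it holds four entries, one of which is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
marks = pd.Series([80, 90, np.nan, 100], name="Marks")

summary = marks.describe()
print(summary)

# count reflects only the three non-missing entries,
# and mean/std/min/max are computed from those three values
print(summary["count"], summary["mean"])
```

Comparing `count` with `len(marks)` is a simple way to spot how many values were skipped.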
4. Include All Columns in Summary Statistics
Sometimes you may want to calculate summary statistics for all columns/features, including object types. You can achieve this by passing include='all' to the describe() function.
Note that for object types the result additionally contains the unique, top, and freq rows. For numeric columns, these rows are represented as NaN (missing values), and the numeric statistics are NaN for object columns.
# Include All Columns
print(df.describe(include='all'))
Yields below output.
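The extra rows are easier to see on a tiny frame. The following sketch assumes a made-up `mixed` DataFrame with one text column and one numeric column.

```python
import pandas as pd

# Hypothetical data: "Raman" appears twice in the text column
mixed = pd.DataFrame({
    "Student": ["Raman", "Chris", "Raman"],
    "Marks1": [80, 85, 90],
})

full = mixed.describe(include="all")
print(full)

# Text column: number of distinct values, most frequent value, its frequency
print(full.loc["unique", "Student"], full.loc["top", "Student"], full.loc["freq", "Student"])

# Rows that do not apply to a column's type are filled with NaN
print(full.loc["unique", "Marks1"], full.loc["mean", "Student"])
```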
5. Calculate Summary Statistics of Selected Columns
You can also calculate descriptive summary statistics only on selected columns of a Pandas DataFrame. For example, the following gets the results only for the Marks1 and Marks2 columns.
# Summary on selected columns
print(df[['Marks1','Marks2']].describe())
Yields below output.
6. Calculate For Selected Columns based on Data Type
By using the include param you can specify the column types you want summary statistics for. The following example calculates summary statistics for object-type columns only.
# Include Object type
print(df.describe(include=['object']))
I will leave this to you to run and validate the result.
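The same include param also works for numeric types. The sketch below uses a made-up `students` DataFrame and selects first all numeric columns via `np.number`, then only the float columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a text, an integer, and a float column
students = pd.DataFrame({
    "Student": ["Raman", "Chris", "Anna"],
    "Marks1": [80, 85, 90],
    "Attendance %": [0.98, 0.87, 0.92],
})

# All numeric columns (integers and floats)
numeric_only = students.describe(include=[np.number])

# Only float columns
float_only = students.describe(include=["float"])

print(numeric_only)
print(float_only)
```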
7. Exclude Multiple Columns Based on Data Type
Also, by using the exclude param you can specify the column types you want to exclude from the pandas summary statistics. To exclude several types, pass them as a list. The example below omits all float and object columns from the summary result.
# Exclude Multiple Columns by Type
print(df.describe(exclude=['float','object']))
Yields below output.
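One way to think about exclude is as a type filter applied before the summary. The sketch below, on a made-up `marks` DataFrame, suggests that excluding float columns gives the same result as dropping them with select_dtypes() first and then calling describe().

```python
import pandas as pd

# Hypothetical data with one integer and one float column
marks = pd.DataFrame({
    "Marks1": [80, 85, 90],
    "Attendance %": [0.98, 0.87, 0.92],
})

# Exclude float columns directly in describe()
no_floats = marks.describe(exclude=["float"])

# Filter by dtype first, then describe what remains
same_result = marks.select_dtypes(exclude=["float"]).describe()

print(no_floats)
print(no_floats.equals(same_result))
```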
8. Calculate Summary Statistics on Custom Percentile
All the examples above report the default percentiles [.25, .5, .75], that is, the 25th, 50th, and 75th percentiles. You can customize this by using the percentiles param. The example below requests the 10th, 30th, and 70th percentiles. Depending on the pandas version, the 50th percentile (median) may also be added automatically even though it is not listed; include 0.5 in the list if you rely on it.
# Custom percentiles
print(df.describe(percentiles=[0.1, 0.3, 0.7]))
Yields below output.
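Each requested percentile appears as a row labeled with its percentage. The sketch below uses the Marks1 values from the example DataFrame (sorted, in a Series named `marks` for this illustration) and lists 0.5 explicitly, so the median row is present regardless of how a given pandas version treats an omitted median.

```python
import pandas as pd

# Marks1 values from the article's example, sorted for readability
marks = pd.Series([72, 80, 83, 85, 85, 90, 95, 95, 98], name="Marks1")

# Listing 0.5 explicitly guarantees the "50%" row
summary = marks.describe(percentiles=[0.1, 0.3, 0.5, 0.7])

print(list(summary.index))
print(summary["50%"])
```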
9. Aggregating Statistics Grouped by Category
Often you also need to calculate summary statistics for each group in the data. You can achieve this by grouping the data and then running the describe() function. To explain this, I will use another DataFrame on which I can perform a groupby.
# Create a DataFrame.
import pandas as pd
technologies = ({
'Courses':["Spark","PySpark","Hadoop","Python","Hadoop","Hadoop","Spark","Python","Spark"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,25000],
'Duration':['30days','50days','55days', '40days','55days','35days','30days','40days','40days'],
'Discount':[1000,2300,1000,1200,2500,1200,1400,1000,1200]
})
df = pd.DataFrame(technologies)
print(df)
# Pandas get statistics using groupby().describe()
df2=df.groupby(['Courses', 'Duration'])['Discount'].describe()
print(df2)
Yields below output.
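When only a few statistics per group are needed, agg() with a list of function names is a lighter alternative to describe(). This is a sketch on a reduced, made-up version of the courses data, not a replacement for the example above.

```python
import pandas as pd

# Hypothetical, reduced version of the courses data
courses = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Spark"],
    "Discount": [1000, 2300, 1000, 1400],
})

# Full descriptive summary per course
per_course = courses.groupby("Courses")["Discount"].describe()

# Only the selected statistics per course
selected = courses.groupby("Courses")["Discount"].agg(["count", "mean", "max"])

print(per_course)
print(selected)
```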
10. Complete Example
# Create a DataFrame.
import pandas as pd
technologies = ({
'Student':["Raman","Chris","Anna","Debi","Cheng","Prabha","Srini","Creg","Hong"],
'Marks1' :[80,85,90,95,72,83,85,95,98],
'Marks2' :[90,93,99,81,82,95,94,98,88],
'Marks3' :[76,85,89,82,92,93,96,84,73],
'Marks4' :[95,98,74,85,97,83,77,75,89],
'Attendance %' :[0.98,0.87,0.92,0.98,0.87,0.97,0.99,0.97,0.90]
})
df = pd.DataFrame(technologies)
print(df)
# Default summary statistics
print(df.describe())
# Include All Columns
print(df.describe(include='all'))
# Include selected columns
print(df[['Marks1','Marks2']].describe())
# Include Object type
print(df.describe(include=['object']))
# Exclude Multiple Columns by Type
print(df.describe(exclude=['float','object']))
# Custom percentiles
print(df.describe(percentiles=[0.1, 0.3, 0.7]))
Frequently Asked Questions on Calculate Summary Statistics in Pandas
Summary statistics provide a concise overview of the data, including measures of central tendency, dispersion, and distribution. This helps in understanding the characteristics of the dataset without having to examine each data point individually.
You can use the describe()
method in Pandas DataFrame to generate summary statistics. It provides count, mean, standard deviation, minimum, quartiles, and maximum values for numeric columns.
You can calculate summary statistics for specific columns by selecting those columns before applying the describe()
method. This allows you to focus on the relevant attributes of your dataset.
By default, describe()
provides summary statistics only for numeric columns. However, you can specify the data types to include or exclude using the include
and exclude
parameters to tailor the summary to your needs.
Methods like describe() automatically exclude missing values (NaN) from the computation, so each statistic is based only on the values actually present in a column. As a result, the count row can differ between columns, and a column with many missing values is summarized from fewer observations. Check the count, and handle missing values appropriately before calculating summary statistics when this matters for the reliability of your analysis.
While describe()
primarily focuses on numerical data, you can still generate summary statistics for categorical data by using methods such as value_counts()
to get frequency distributions or by creating custom functions tailored to your categorical variables.
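As a minimal sketch of the value_counts() approach, the example below uses the course names from the groupby example as a standalone Series; the variable names are chosen for this illustration.

```python
import pandas as pd

# Course names taken from the groupby example
courses = pd.Series(
    ["Spark", "PySpark", "Hadoop", "Python", "Hadoop",
     "Hadoop", "Spark", "Python", "Spark"],
    name="Courses",
)

# Absolute frequency of each category
counts = courses.value_counts()

# Relative frequency (shares that sum to 1)
shares = courses.value_counts(normalize=True)

print(counts)
print(shares)
```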
Conclusion
In this article, you have learned different options to calculate descriptive summary statistics in a Pandas DataFrame. By default, the describe() function calculates count, mean, std, min, the 25th, 50th, and 75th percentiles, and max on all numeric features or columns of the DataFrame. It also provides the include and exclude params to select columns based on data types.