In Pandas, the std() method calculates the standard deviation of the values in a DataFrame or a Series. The standard deviation measures the spread of data points relative to the mean and is useful for understanding the variability in your data.
In this article, I will explain the Pandas DataFrame std() method, covering its syntax, parameters, and usage, and show how to return the sample standard deviation along the specified axis. By default, the standard deviation is normalized by N-1, where N is the number of non-missing values.
Key Points –
- The std() method computes the standard deviation of values along the specified axis of a DataFrame, which measures the dispersion or variability of the data.
- By default, std() calculates the standard deviation for each column (axis=0), but it can also compute row-wise standard deviation by setting axis=1.
- Excludes NaN (missing) values by default, but this behavior can be modified using the skipna parameter.
- The ddof parameter adjusts the divisor in the calculation, allowing for flexibility in statistical analysis.
- The std() method can be restricted to numeric data with numeric_only=True, making it useful when working with DataFrames containing mixed data types.
Syntax of Pandas DataFrame std() Method
Following is the syntax of the std() method.
# Syntax of DataFrame std() method (pandas 2.x)
DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)
Parameters of the DataFrame std()
Following are the parameters of the DataFrame std() method.
axis – {0 or 'index', 1 or 'columns'}, default 0. Axis for the function to be applied on.
- 0 or 'index': apply the function to each column.
- 1 or 'columns': apply the function to each row.
skipna – bool, default True
- Exclude NA/null values. If True, it skips NA/null values during the calculation.
level – int or level name, default None (pandas versions before 2.0 only)
- If the axis is a MultiIndex (hierarchical), it calculates the standard deviation along a particular level, collapsing into a Series. This parameter was removed in pandas 2.0; group by the index level instead.
ddof – int, default 1
- Delta Degrees of Freedom. The divisor used in the calculation is N - ddof, where N is the number of elements.
numeric_only – bool, default False (None in pandas versions before 2.0)
- Include only float, int, and boolean data. In older versions, None meant that pandas would try to use all data.
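As a rough sketch of how a per-level standard deviation can be computed without the level parameter, the example below groups on an index level and then calls std(). The MultiIndex, its level names ("group", "item"), and the variable names are made up for illustration and are not part of the original article.

```python
import pandas as pd

# Hypothetical MultiIndex: two groups ("x", "y") with two rows each
index = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1), ("y", 2)],
    names=["group", "item"],
)
df_multi = pd.DataFrame(
    {"A": [15, 38, 12, 24], "B": [52, 31, 49, 11]},
    index=index,
)

# Sample standard deviation (ddof=1) within each value of the "group" level
level_std = df_multi.groupby(level="group").std()
print(level_std)
```

The result is a DataFrame with one row per group and one column per original column.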
Return Value
It returns the standard deviation of the values over the requested axis.
Usage of Pandas DataFrame std() Method
The std() method in Pandas computes the standard deviation of the values along the specified axis of a DataFrame. Standard deviation quantifies the level of variation or dispersion within a set of values.
To run some examples of pandas DataFrame std() method, let’s create a Pandas DataFrame using data from a dictionary.
import pandas as pd
# Creating a sample DataFrame
data = {
'A': [15, 38, 12, 24],
'B': [52, 31, 49, 11],
'C': [13, 22, 36, 18]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n",df)
Yields below output.
# Output:
# Original DataFrame:
#      A   B   C
# 0  15  52  13
# 1  38  31  22
# 2  12  49  36
# 3  24  11  18
Standard Deviation for Each Column
You can compute the standard deviation for each column in the DataFrame by using the std() method with its default settings (axis=0). This calculates the standard deviation of the values in each column, ignoring any missing values.
# Calculate the standard deviation for each column
df2 = df.std()
print("Standard Deviation for each column:\n", df2)
Here,
- The standard deviation for column A is approximately 11.67.
- The standard deviation for column B is approximately 18.93.
- The standard deviation for column C is approximately 9.88.
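To make the N-1 normalization concrete, here is a minimal sketch that recomputes the column-wise sample standard deviation by hand and compares it with std(). It rebuilds the same sample DataFrame; the names n, deviations, and manual_std are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18]
})

n = len(df)                      # number of rows (no missing values here)
deviations = df - df.mean()      # distance of each value from its column mean

# Sum of squared deviations divided by N-1, then the square root
manual_std = ((deviations ** 2).sum() / (n - 1)) ** 0.5

print("Manual calculation:\n", manual_std)
print("DataFrame.std():\n", df.std())
```

Both prints should show the same values for each column, up to floating-point rounding.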
Standard Deviation for Each Row
Alternatively, to calculate the standard deviation for each row in a Pandas DataFrame, you can use the std() method with the parameter axis=1. This computes the standard deviation across the values within each row, ignoring any missing values by default.
# Standard deviation for each row (axis=1)
df2 = df.std(axis=1)
print("Standard deviation for each row:\n", df2)
# Output:
# Standard deviation for each row:
# 0 21.962089
# 1 8.020806
# 2 18.770544
# 3 6.506407
# dtype: float64
Here,
- Row 0 – Standard deviation is approximately 21.96.
- Row 1 – Standard deviation is approximately 8.02.
- Row 2 – Standard deviation is approximately 18.77.
- Row 3 – Standard deviation is approximately 6.51.
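The axis can also be given by name. The sketch below checks that axis='columns' matches axis=1 and attaches the row-wise result as an extra column. The column name row_std and the variable names are assumptions chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18]
})

by_number = df.std(axis=1)          # row-wise, axis given as an integer
by_name = df.std(axis='columns')    # row-wise, axis given by name

# Both calls describe the same operation
print(by_number.equals(by_name))

# Attach the row-wise result as a new column without modifying df
with_std = df.assign(row_std=by_number)
print(with_std)
```

Using assign() returns a new DataFrame, so the original df keeps only its three numeric columns.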
Using Delta Degrees of Freedom (ddof=0)
To calculate the population standard deviation, set the ddof parameter of the std() method to 0. By default, ddof is set to 1, which gives the sample standard deviation. Setting ddof=0 divides by N instead of N-1, which is appropriate when your data represents the entire population rather than a sample.
# Standard deviation with ddof=0 (population standard deviation)
df2 = df.std(ddof=0)
print("Population standard deviation for each column:\n", df2)
# Output:
# Population standard deviation for each column:
# A 10.108783
# B 16.391690
# C 8.554969
# dtype: float64
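One place where the ddof default matters is when comparing Pandas results with NumPy, since numpy.std() uses ddof=0 unless told otherwise. The following sketch uses the values of column A as a plain list; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [15, 38, 12, 24]
s = pd.Series(values)

# Defaults differ: pandas uses ddof=1, NumPy uses ddof=0
print("pandas default (ddof=1):", s.std())
print("NumPy default (ddof=0): ", np.std(values))

# Passing ddof explicitly makes the two libraries agree
print("pandas with ddof=0:", s.std(ddof=0))
print("NumPy with ddof=1: ", np.std(values, ddof=1))
```

Setting ddof explicitly in both libraries avoids surprises when results are compared across them.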
Standard Deviation for Numeric Data Only
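Similarly, when dealing with a DataFrame that contains a mix of numeric and non-numeric data, you can use the numeric_only=True parameter with the std() method to calculate the standard deviation only for numeric columns. This ensures that non-numeric data is excluded from the calculation. In pandas 2.0 and later, calling std() on such a DataFrame without numeric_only=True raises a TypeError instead of silently dropping the non-numeric columns.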
import pandas as pd
# Creating a sample DataFrame with mixed data types
data = {
'A': [15, 38, 12, 24],
'B': [52, 31, 49, 11],
'C': [13, 22, 36, 18],
'D': ['C++', 'Java', 'Pandas', 'Hadoop']
}
df = pd.DataFrame(data)
# Standard deviation for numeric columns only
df2 = df.std(numeric_only=True)
print("Standard deviation for numeric columns only:\n", df2)
# Output:
# Standard deviation for numeric columns only:
# A 11.672618
# B 18.927493
# C 9.878428
# dtype: float64
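An alternative that makes the column selection explicit is to filter the numeric columns first with select_dtypes() and then call std(). This is only a sketch of an equivalent approach; the names numeric_df, explicit, and via_flag are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18],
    'D': ['C++', 'Java', 'Pandas', 'Hadoop']
})

# Keep only the numeric columns, then compute the standard deviation
numeric_df = df.select_dtypes(include='number')
explicit = numeric_df.std()

# Same calculation using the numeric_only flag
via_flag = df.std(numeric_only=True)

print("select_dtypes + std():\n", explicit)
print("std(numeric_only=True):\n", via_flag)
```

Both approaches skip column D; the select_dtypes() form is handy when the filtered DataFrame is reused for other statistics.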
Standard Deviation with Missing Values
When dealing with missing values (NaN) in a DataFrame, the std() method by default excludes these values from the calculation. However, you can control this behavior using the skipna parameter.
Default Behavior, skipna=True (Ignoring Missing Values)
By default, the std() method in Pandas ignores missing values (NaN) when calculating the standard deviation. This means that missing values are excluded from the computation, and the standard deviation is calculated based on the available data in each column.
import pandas as pd
# Creating a sample DataFrame with missing values
data = {
'A': [15, 38, 12, None],
'B': [52, None, 49, 11],
'C': [13, 22, None, 18]
}
df = pd.DataFrame(data)
# Standard deviation with missing values skipped (default behavior)
df2 = df.std()
print("Standard deviation with missing values skipped:\n", df2)
# Passing skipna=True explicitly gives the same result
df2 = df.std(skipna=True)
print("Standard deviation with missing values skipped:\n", df2)
# Output:
# Standard deviation with missing values skipped:
# A 14.224392
# B 22.854613
# C 4.509250
# dtype: float64
Including Missing Values (skipna=False)
If you set skipna=False and the DataFrame contains any NaN values, the result will be NaN for every column that has a missing value. This is because NaN propagates through the calculation, leaving the standard deviation undefined for those columns.
# Standard deviation including missing values
df2 = df.std(skipna=False)
print("Standard deviation with missing values included:\n", df2)
# Output:
# Standard deviation with missing values included:
# A NaN
# B NaN
# C NaN
# dtype: float64
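When missing values are skipped, the divisor is based on the number of non-missing values in each column rather than on the number of rows. The sketch below illustrates this with count(), using the same DataFrame with missing values; valid_counts and manual_std are names introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, None],
    'B': [52, None, 49, 11],
    'C': [13, 22, None, 18]
})

# Number of non-missing values per column (3 each, although df has 4 rows)
valid_counts = df.count()
print("Non-missing values per column:\n", valid_counts)

# Manual calculation: mean() and sum() skip NaN by default,
# and the divisor is (non-missing count - 1) for each column
deviations = df - df.mean()
manual_std = ((deviations ** 2).sum() / (valid_counts - 1)) ** 0.5

print("Manual calculation:\n", manual_std)
print("DataFrame.std():\n", df.std())
```

Checking count() before interpreting the result is a simple way to see how many observations each column's standard deviation is actually based on.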
FAQ on Pandas DataFrame std() Method
The std() method calculates the standard deviation of the values in each column (or row) of the DataFrame. By default, it computes the standard deviation for columns and excludes missing (NaN) values.
By default, the std() method calculates the standard deviation along the columns (axis=0), skips missing values (skipna=True), and uses sample standard deviation (ddof=1).
For population standard deviation, set ddof=0, which adjusts the divisor to be the total number of data points.
By default, missing (NaN) values are ignored when calculating the standard deviation. If you want to include missing values in the calculation (which will result in NaN for those columns), set skipna=False.
The result is typically a Pandas Series containing the standard deviation for each column (or row). If calculating along specific levels or axes, the result type may vary depending on the structure of the DataFrame.
Conclusion
In summary, the Pandas std() method is an effective tool for calculating standard deviation, providing insights into data variability across rows and columns in a DataFrame. By default, it calculates the standard deviation for columns but can be configured to operate on rows using axis=1. Additionally, it handles missing values and offers flexibility with parameters such as skipna, ddof, and numeric_only.
Happy Learning!!