Pandas DataFrame std() Method

Last modified: September 6, 2024

In Pandas, the std() method is used to calculate the standard deviation of the values in a DataFrame or a Series. The standard deviation measures the spread of data points relative to the mean and is useful for understanding the variability in your data.

In this article, I will explain the Pandas DataFrame std() method, covering its syntax, parameters, and usage, and show how to return the sample standard deviation along a specified axis. By default, the result is normalized by N-1, where N is the number of values.
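To make the N-1 normalization concrete, here is a minimal sketch that recomputes the sample standard deviation by hand and compares it with std(). The values and variable names are illustrative only.

```python
import math
import pandas as pd

# Illustrative values (the same numbers as column A in the examples that follow)
values = [15, 38, 12, 24]

n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean
squared_dev = sum((x - mean) ** 2 for x in values)

# Sample standard deviation: divide by N - 1 (the ddof=1 default)
manual_std = math.sqrt(squared_dev / (n - 1))

# pandas result with default settings
pandas_std = pd.Series(values).std()

print(manual_std)
print(pandas_std)
```

Both numbers should agree up to floating-point rounding.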

Key Points –

  • The std() method computes the standard deviation of values along the specified axis of a DataFrame, which measures the dispersion or variability of the data.
  • By default, std() calculates the standard deviation for each column (axis=0), but it can also compute row-wise standard deviation by setting axis=1.
  • Excludes NaN (missing) values by default, but this behavior can be modified using the skipna parameter.
  • The ddof parameter adjusts the divisor in the calculation, allowing for flexibility in statistical analysis.
  • The std() method can be applied to numeric-only data if numeric_only=True, making it useful when working with DataFrames containing mixed data types.

Syntax of Pandas DataFrame std() Method

Let’s look at the syntax of the std() method.


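# Syntax of DataFrame std() method (pandas 2.x)
DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)

# Older signature (pandas 1.x), which still accepted level
DataFrame.std(axis=None, skipna=True, level=None, ddof=1, numeric_only=None)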

Parameters of the DataFrame std()

Following are the parameters of the DataFrame std() method.

  • axis – {0 or ‘index’, 1 or ‘columns’}, default 0. Axis for the function to be applied on.
    • 0 or 'index': apply the function to each column.
    • 1 or 'columns': apply the function to each row.
  • skipna – bool, default True
    • Exclude NA/null values when computing the result.
  • level – int or level name, default None
    • For a MultiIndex (hierarchical) axis, computes the standard deviation along a particular level, collapsing into a Series. This parameter was deprecated in pandas 1.3 and removed in pandas 2.0; use groupby(level=...).std() instead.
  • ddof – int, default 1
    • Delta Degrees of Freedom. The divisor used in the calculation is N - ddof, where N is the number of elements.
  • numeric_only – bool, default False in pandas 2.0 and later (None in earlier versions)
    • Include only float, int, and boolean columns. In earlier versions, None attempted to use all columns and then fell back to numeric data only.
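In pandas 2.0 and later, where the level parameter is not available, a per-level standard deviation can be sketched with groupby(level=...). The index labels, the column name score, and the values below are made up for illustration.

```python
import pandas as pd

# Hypothetical MultiIndex data: two groups with two observations each
index = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1), ("y", 2)],
    names=["group", "obs"],
)
df_multi = pd.DataFrame({"score": [10.0, 14.0, 20.0, 26.0]}, index=index)

# Standard deviation within each value of the "group" index level
std_by_group = df_multi.groupby(level="group").std()
print(std_by_group)
```

The result has one row per value of the chosen level, which mirrors what the old level argument was used for.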

Return Value

It returns the standard deviation of the values over the requested axis.

Usage of Pandas DataFrame std() Method

The std() method in Pandas is used to compute the standard deviation of the values along the specified axis of a DataFrame. Standard deviation quantifies the level of variation or dispersion within a set of values.

To run some examples of the DataFrame std() method, let’s create a Pandas DataFrame from a dictionary.


import pandas as pd

# Creating a sample DataFrame
data = {
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n",df)

This creates a DataFrame with four rows and three integer columns, A, B, and C.

Standard Deviation for Each Column

You can compute the standard deviation for each column in the DataFrame by using the std() method with its default settings (axis=0). This calculates the standard deviation of the values in each column, ignoring any missing values.


# Calculate the standard deviation for each column
df2 = df.std()
print("Standard Deviation for each column:\n", df2)

Here,

  • The standard deviation for column A is approximately 11.67.
  • The standard deviation for column B is approximately 18.93.
  • The standard deviation for column C is approximately 9.88.

Standard Deviation for Each Row

Alternatively, to calculate the standard deviation for each row in a Pandas DataFrame, you can use the std() method with the parameter axis=1. This computes the standard deviation across the values within each row, ignoring any missing values by default.


# Standard deviation for each row (axis=1)
df2 = df.std(axis=1)
print("Standard deviation for each row:\n", df2)

# Output:
# Standard deviation for each row:
# 0    21.962089
# 1     8.020806
# 2    18.770544
# 3     6.506407
# dtype: float64

Here,

  • Row 0 – Standard deviation is approximately 21.96.
  • Row 1 – Standard deviation is approximately 8.02.
  • Row 2 – Standard deviation is approximately 18.77.
  • Row 3 – Standard deviation is approximately 6.51.
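As a small sketch of how the row-wise result can be used, the snippet below checks that axis=1 matches a column-wise std() on the transposed DataFrame and then looks up the row with the largest spread. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18]
})

# Row-wise standard deviation
row_std = df.std(axis=1)

# Equivalent: transpose, then take the default column-wise std
row_std_via_transpose = df.T.std()

# Index label of the row with the largest spread
most_variable_row = row_std.idxmax()
print(most_variable_row)
```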

Using Delta Degrees of Freedom (ddof=0)

To calculate the population standard deviation, set the ddof parameter of the std() method to 0. By default, ddof is 1, which divides by N-1 and gives the sample standard deviation. With ddof=0 the divisor is N, which is appropriate when your data represents the entire population rather than a sample.


# Standard deviation with ddof=0 (population standard deviation)
df2 = df.std(ddof=0)
print("Population standard deviation for each column:\n", df2)

# Output:
# Population standard deviation for each column:
# A    10.108783
# B    16.391690
# C     8.554969
# dtype: float64
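The two conventions differ only in the divisor, so one can be rescaled into the other. The sketch below illustrates that relationship and also shows that NumPy's np.std uses ddof=0 by default, unlike pandas. Converting with to_numpy() is a choice made here so that NumPy's own default is the one being applied.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18]
})

n = len(df)
sample_std = df.std()            # pandas default, ddof=1
population_std = df.std(ddof=0)  # divisor is N

# population std = sample std * sqrt((N - 1) / N)
rescaled = sample_std * np.sqrt((n - 1) / n)

# np.std on a plain array defaults to ddof=0
numpy_std = np.std(df.to_numpy(), axis=0)

print(rescaled)
print(numpy_std)
```

This difference in defaults is a common reason pandas and NumPy results disagree on the same data.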

Standard Deviation for Numeric Data Only

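Similarly, when dealing with a DataFrame that contains a mix of numeric and non-numeric data, you can use the numeric_only=True parameter with the std() method to calculate the standard deviation only for numeric columns. This ensures that non-numeric data is excluded from the calculation. In pandas 2.0 and later, numeric_only defaults to False, so calling std() on such a DataFrame without this parameter typically raises a TypeError.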


import pandas as pd

# Creating a sample DataFrame with mixed data types
data = {
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18],
    'D': ['C++', 'Java', 'Pandas', 'Hadoop']
}

df = pd.DataFrame(data)

# Standard deviation for numeric columns only
df2 = df.std(numeric_only=True)
print("Standard deviation for numeric columns only:\n", df2)

# Output:
# Standard deviation for numeric columns only:
# A    11.672618
# B    18.927493
# C     9.878428
# dtype: float64
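An alternative sketch is to filter the columns yourself with select_dtypes() before calling std(). For this data it gives the same result as numeric_only=True; note that include="number" does not select boolean columns, whereas numeric_only=True does.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18],
    'D': ['C++', 'Java', 'Pandas', 'Hadoop']
})

# Keep only numeric columns, then compute the standard deviation
numeric_std = df.select_dtypes(include="number").std()

# Same columns and values as letting std() do the filtering
builtin_std = df.std(numeric_only=True)

print(numeric_std)
```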

Standard Deviation with Missing Values

Finally, when dealing with missing values (NaN) in a DataFrame, the std() method by default excludes these values from the calculation. However, you can control this behavior using the skipna parameter.

Default Behavior, skipna=True (Ignoring Missing Values)

By default, the std() method in Pandas ignores missing values (NaN) when calculating the standard deviation. This means that missing values are excluded from the computation, and the standard deviation is calculated based on the available data in each column.


import pandas as pd

# Creating a sample DataFrame with missing values
data = {
    'A': [15, 38, 12, None],
    'B': [52, None, 49, 11],
    'C': [13, 22, None, 18]
}

df = pd.DataFrame(data)

# Standard deviation with missing values skipped (default behavior)
df2 = df.std()
print("Standard deviation with missing values skipped:\n", df2)

# Same result with skipna=True passed explicitly
df2 = df.std(skipna=True)
print("Standard deviation with skipna=True:\n", df2)

# Output:
# Standard deviation with missing values skipped:
# A    14.224392
# B    22.854613
# C     4.509250
# dtype: float64
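When values are skipped, N is the number of non-missing entries in each column, not the number of rows. The sketch below uses count() to show that N and confirms that dropping NaN first gives the same answer; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, None],
    'B': [52, None, 49, 11],
    'C': [13, 22, None, 18]
})

# Number of non-missing values per column; this is the N used by std()
n_valid = df.count()
print(n_valid)

# Dropping NaN from a column first matches the default skipna=True result
a_dropped = df['A'].dropna().std()
a_skipped = df['A'].std()
print(a_dropped, a_skipped)
```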

Including Missing Values (skipna=False)

If you set skipna=False, the result is NaN for every column that contains at least one missing value, because NaN propagates through the arithmetic used to compute the mean and the squared deviations. Columns without missing values are unaffected.


# Standard deviation including missing values
df2 = df.std(skipna=False)
print("Standard deviation with missing values included:\n", df2)

# Output:
# Standard deviation with missing values included:
# A   NaN
# B   NaN
# C   NaN
# dtype: float64
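In the example above every column has a missing value, so every result is NaN. The following sketch uses a hypothetical DataFrame in which only column B is incomplete, to show that complete columns still get a numeric result.

```python
import pandas as pd

# Hypothetical data: only column 'B' has a missing value
df_partial = pd.DataFrame({
    'A': [15, 38, 12, 24],
    'B': [52, None, 49, 11]
})

# NaN is returned only for columns that contain missing values
result = df_partial.std(skipna=False)

# Flag the columns that contain at least one NaN
cols_with_nan = df_partial.isna().any()

print(result)
print(cols_with_nan)
```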

FAQ on Pandas DataFrame std() Method

What does the std() method do in a Pandas DataFrame?

The std() method calculates the standard deviation of the values in each column (or row) of the DataFrame. By default, it computes the standard deviation for columns and excludes missing (NaN) values.

What is the default behavior of the std() method?

By default, the std() method calculates the standard deviation along the columns (axis=0), skips missing values (skipna=True), and uses sample standard deviation (ddof=1).

How can I calculate the population standard deviation?

For population standard deviation, set ddof=0, which adjusts the divisor to be the total number of data points.

What happens if my DataFrame contains missing values?

By default, missing (NaN) values are ignored when calculating the standard deviation. If you want to include missing values in the calculation (which will result in NaN for those columns), set skipna=False.

What is the result type of the std() method?

The result is a Pandas Series containing the standard deviation for each column (axis=0) or each row (axis=1), indexed by the column or row labels respectively. Calling std() on a single Series returns a scalar.
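A quick way to confirm the result type is to inspect the returned objects directly, as in this minimal sketch (variable names are illustrative).

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, 38, 12, 24],
    'B': [52, 31, 49, 11],
    'C': [13, 22, 36, 18]
})

col_std = df.std()         # Series indexed by column labels
row_std = df.std(axis=1)   # Series indexed by row labels
single = df['A'].std()     # scalar for a single column

print(type(col_std).__name__, list(col_std.index))
print(type(row_std).__name__, list(row_std.index))
print(type(single).__name__)
```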

Conclusion

In summary, the Pandas std() method is an effective tool for calculating standard deviation, providing insights into data variability across rows and columns in a DataFrame. By default, it calculates the standard deviation for columns but can be configured to operate on rows using axis=1. Additionally, it handles missing values and offers flexibility with parameters such as skipna, ddof, and numeric_only.

Happy Learning!!
