Post last modified: March 27, 2024

To normalize the columns of a pandas DataFrame, we first need to understand a few concepts.

Data Normalization: Data normalization is a common practice in machine learning that consists of transforming numeric columns to a standard scale. Some feature values can be many times larger than others; features with larger values will dominate the learning process, so we need to normalize the values before running any machine learning algorithm.

Key Points –

  • Normalizing columns in a Pandas DataFrame involves scaling the values within each column to a common range, facilitating fair comparisons and analysis.
  • Normalizing columns, such as through mean normalization in Pandas, involves centering the data around a statistical measure (mean), facilitating analyses that require a zero-centered distribution.
  • Normalization helps prevent bias in analyses where the magnitude of values in different columns might disproportionately influence results, ensuring a more balanced evaluation.
  • In machine learning, normalizing features can enhance the performance of models, especially those sensitive to the scale of input variables.
  • Applying normalization techniques preserves the integrity of the original data while providing a standardized representation for analytical purposes.
  • Normalizing columns can contribute to improved convergence in optimization algorithms and numerical procedures, enhancing the stability and efficiency of data processing tasks.

1. Quick Examples of Normalizing Columns of a DataFrame

If you’re in a hurry, below are quick examples of normalizing the columns of a pandas DataFrame.


# Quick examples of normalize columns

# Pandas Normalize Using Mean Normalization.
normalized_df=(df-df.mean())/df.std()

# Alternate method to normalize using Mean Normalization.
normalized_df=df.apply(lambda x: (x-x.mean())/ x.std(), axis=0)

# Normalize using Min/Max Normalization.
normalized_df=(df-df.min())/(df.max()-df.min())

# Using Sklearn & MinMaxScaler.
import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
normalized_df= pd.DataFrame(x_scaled)

# Using Sklearn & StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df.iloc[:,0:]=scaler.fit_transform(df.iloc[:,0:].to_numpy())
print(df)

# Simple transform acting on the columns.
df2=df.apply(lambda x: x/x.max(), axis=0)

# Example for a column with only positive entries.
df["Fee"] = df["Fee"] / df["Fee"].max()

# Min/max normalization of a single column (handles negative entries too).
df["Fee"] = (df["Fee"]-df["Fee"].min()) / (df["Fee"].max()-df["Fee"].min())

# Normalize columns using .astype() method.
df2 = df/df.max().astype(np.float64)

# With negative numbers, scale by the largest absolute value (preserves sign).
df2 = df/df.abs().max().astype(np.float64)

# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)

Now, let’s create a pandas DataFrame, execute these examples, and validate the results. Our DataFrame contains the columns Fee and Discount.


import pandas as pd
technologies = {"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]}
df = pd.DataFrame(technologies)
print(df)

Yields below Output:


# Output:
    Fee  Discount
0  1000       400
1  2000       500
2  3000       600

2. Pandas Normalize Using Mean Normalization

To normalize all columns of a pandas DataFrame, we simply subtract the mean and divide by the standard deviation. Note that DataFrame.std() defaults to the sample standard deviation (ddof=1), so this example gives unbiased estimates.


# Pandas Normalize Using Mean Normalization.
normalized_df=(df-df.mean())/df.std()
print(normalized_df)

Yields below Output:


# Output:
   Fee  Discount
0 -1.0      -1.0
1  0.0       0.0
2  1.0       1.0

Alternatively, you can get the same result using DataFrame.apply() with a lambda.


# Alternate method to normalize using Mean Normalization.
normalized_df=df.apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(normalized_df)

Yields the same output as above.
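The “unbiased” wording above refers to the divisor used for the standard deviation: pandas defaults to the sample standard deviation (ddof=1, dividing by n-1), while sklearn’s StandardScaler, covered later, divides by the population standard deviation (ddof=0). A small sketch of the difference on the Fee column:

```python
import pandas as pd

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})

# Pandas default: sample standard deviation (divides by n-1).
sample_std = df["Fee"].std()
# Population standard deviation (divides by n), as StandardScaler uses.
population_std = df["Fee"].std(ddof=0)

print(sample_std)       # 1000.0
print(population_std)   # ≈ 816.4966
```

This is why the pandas mean-normalization result and the StandardScaler result later in this article differ by a constant factor.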

3. Pandas Normalize Using Min/Max Normalization

Alternatively, you can normalize columns using min/max normalization. The formula (df - df.min()) / (df.max() - df.min()) is applied column-wise, scaling the values in each column to the range [0, 1]. The resulting normalized_df DataFrame contains the min-max normalized values.


# Normalize using Min/Max Normalization.
normalized_df=(df-df.min())/(df.max()-df.min())
print(normalized_df)

Yields below Output:


# Output:
   Fee  Discount
0  0.0       0.0
1  0.5       0.5
2  1.0       1.0
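Min/max normalization is easy to invert: multiply by the original range and add the minimum back. A quick sketch, reusing the same DataFrame (this inverse step is an addition, not part of the original example):

```python
import pandas as pd

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
normalized_df = (df - df.min()) / (df.max() - df.min())

# Invert: multiply by the original range, then add the minimum back.
restored_df = normalized_df * (df.max() - df.min()) + df.min()
print(restored_df)
```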

4. Using Sklearn & MinMaxScaler

In the examples above, we used only pandas to normalize columns; now let’s use the Sklearn package to do the same. Sklearn provides several normalization methods, and MinMaxScaler is one of them. MinMaxScaler subtracts the minimum value of the feature and then divides by the range (the difference between the original maximum and the original minimum).

Scaling: Scaling means changing the range of a feature’s values; the shape of the distribution doesn’t change, just as a scale model of a building keeps the same proportions as the original (here the range is set to 0 to 1).


# Using Sklearn & MinMaxScaler.
import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
normalized_df= pd.DataFrame(x_scaled)

As shown above, you can achieve the same output using the min/max formula without the Sklearn package.
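One detail worth noting: pd.DataFrame(x_scaled) produces integer column labels (0, 1, …) because fit_transform returns a bare NumPy array. If you want to keep the original column names and row index, you can pass them explicitly (a minor variation on the example above, not required by sklearn):

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)

# Preserve the original column labels and row index.
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
print(normalized_df)
```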

5. Using Sklearn & StandardScaler

Unlike the pandas examples above, the Sklearn package gives biased estimates when normalizing the DataFrame, because it divides by the population standard deviation (ddof=0).

StandardScaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance, i.e. dividing all the values by the standard deviation. Because each column is scaled by its own factor, StandardScaler can distort the relative distances between values of different features.


# Using Sklearn & StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df.iloc[:,0:]=scaler.fit_transform(df.iloc[:,0:].to_numpy())
print(df)

Yields below Output:


# Output:
        Fee  Discount
0 -1.224745 -1.224745
1  0.000000  0.000000
2  1.224745  1.224745
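You can sanity-check this result: after StandardScaler, each column should have mean 0 and population standard deviation (ddof=0) equal to 1. A quick verification sketch, added here for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
scaled = StandardScaler().fit_transform(df)
result = pd.DataFrame(scaled, columns=df.columns)

# Each column now has mean ~0 and population std (ddof=0) ~1.
print(result.mean().round(6))
print(result.std(ddof=0).round(6))
```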

6. Simple Transform Acting on the Columns

Let’s see a simple transformation acting on the columns to normalize columns of pandas DataFrame. For example-


# Simple transform acting on the columns.
df= pd.DataFrame(technologies)
df2=df.apply(lambda x: x/x.max(), axis=0)
print(df2)

# Output:
#         Fee  Discount
# 0  0.333333  0.666667
# 1  0.666667  0.833333
# 2  1.000000  1.000000

7. Other Examples

Example-1: Column with Only Positive Values

Another simple approach to normalizing a column, if it contains only positive values, is to divide by its maximum. Note that this only works for column data in the range [0, n]. For example:


# Example for a column with only positive entries.
df["Fee"] = df["Fee"] / df["Fee"].max()
print(df)

# Output:
#         Fee  Discount
# 0  0.333333       400
# 1  0.666667       500
# 2  1.000000       600

NOTE: If a column has a negative entry, this code does NOT normalize to the [-1, 1] range. The basic way to normalize a column to [0, 1] even when it has negative values is min/max normalization:


# Min/max normalization of a single column (handles negative entries too).
df["Fee"] = (df["Fee"]-df["Fee"].min()) / (df["Fee"].max()-df["Fee"].min())
print(df)

# Output:
#    Fee  Discount
# 0  0.0       400
# 1  0.5       500
# 2  1.0       600
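To see the note in action, here is a hypothetical column containing a negative entry (data invented for illustration); min/max normalization still maps it into [0, 1], with the most negative value landing at 0:

```python
import pandas as pd

# Hypothetical data with a negative entry.
df = pd.DataFrame({"Fee": [-1000, 2000, 3000]})
df["Fee"] = (df["Fee"] - df["Fee"].min()) / (df["Fee"].max() - df["Fee"].min())
print(df["Fee"].tolist())  # [0.0, 0.75, 1.0]
```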

Example-2: Using .astype() Method

Another simple way to normalize columns of a pandas DataFrame is with DataFrame.astype(). The astype() function casts a pandas object to a specified dtype.


# Normalize columns using the .astype() method.
import numpy as np

df2 = df/df.max().astype(np.float64)
print(df2)

# Output:
#         Fee  Discount
# 0  0.333333  0.666667
# 1  0.666667  0.833333
# 2  1.000000  1.000000

NOTE: If your DataFrame contains negative numbers and you want to preserve their sign while scaling, divide each column by its largest absolute value; the result then lies in [-1, 1].


# Scale each column by its largest absolute value (preserves sign).
import numpy as np

df2 = df/df.abs().max().astype(np.float64)
print(df2)

# Output:
#         Fee  Discount
# 0  0.333333  0.666667
# 1  0.666667  0.833333
# 2  1.000000  1.000000
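With negative values present, dividing each column by its maximum absolute value keeps the sign and maps the column into [-1, 1]. A short sketch with hypothetical data (invented for illustration):

```python
import pandas as pd

# Hypothetical data containing negative entries.
df = pd.DataFrame({"Fee": [-1000, 2000, 3000], "Discount": [400, -500, 600]})

# Divide each column by its maximum absolute value; signs are preserved.
df2 = df / df.abs().max()
print(df2)
```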

Example-3: Dividing by the Column-Wise Maximum

If you want a simpler, shorter expression, you can normalize all columns at once by dividing the DataFrame by its column-wise maximum.


# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)
print(df_normalized)

# Output:
#         Fee  Discount
# 0  0.333333  0.666667
# 1  0.666667  0.833333
# 2  1.000000  1.000000

8. Complete Examples to Normalize Columns of Pandas DataFrame


import pandas as pd
import numpy as np

technologies = {"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]}
df = pd.DataFrame(technologies)
print(df)

# Pandas Normalize Using Mean Normalization.
normalized_df=(df-df.mean())/df.std()

# Alternate method to normalize using Mean Normalization.
normalized_df=df.apply(lambda x: (x-x.mean())/ x.std(), axis=0)

# Normalize using Min/Max Normalization.
normalized_df=(df-df.min())/(df.max()-df.min())

# Using Sklearn & MinMaxScaler.
import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
normalized_df= pd.DataFrame(x_scaled)

# Using Sklearn & StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df.iloc[:,0:]=scaler.fit_transform(df.iloc[:,0:].to_numpy())
print(df)

# Simple transform acting on the columns.
df2=df.apply(lambda x: x/x.max(), axis=0)

# Example for a column with only positive entries.
df["Fee"] = df["Fee"] / df["Fee"].max()

# Min/max normalization of a single column (handles negative entries too).
df["Fee"] = (df["Fee"]-df["Fee"].min()) / (df["Fee"].max()-df["Fee"].min())

# Normalize columns using .astype() method.
df2 = df/df.max().astype(np.float64)

# Scale each column by its largest absolute value (preserves sign).
df2 = df/df.abs().max().astype(np.float64)

# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)

Frequently Asked Questions on Pandas Normalize Columns of DataFrame

What is normalization in the context of a Pandas DataFrame?

Normalization in the context of a Pandas DataFrame involves scaling the values of its columns to a specific range or standardizing them to a common scale. This process is essential for fair comparisons and analysis, particularly when columns have different scales.

What are common normalization methods used in Pandas?

Two common normalization methods are:
Min-Max Normalization: Scales values to a specific range, typically [0, 1].
Mean Normalization: Centers the data by subtracting the mean from each value and scales it based on the standard deviation.

How can I perform Min-Max normalization in Pandas?

You can use the formula (df - df.min()) / (df.max() - df.min()), where df is your DataFrame, to scale values to the range [0, 1].

What impact does normalization have on data analysis?

Normalization ensures that the scale of values doesn’t unduly influence the outcomes of analyses. It can lead to more accurate and fair comparisons, prevent certain algorithms from being dominated by high-magnitude features, and improve the stability and efficiency of optimization algorithms.

Are there different normalization methods for Pandas DataFrames?

In addition to Min-Max and Mean normalization, there are other methods like z-score normalization and custom scaling methods that you can apply based on the nature of your data and analysis requirements.

Can I use the apply function for normalization in Pandas?

The apply function can be used for normalization by applying a custom function or a lambda function along a specified axis (typically axis=0 for column-wise operations). This provides a concise and efficient way to normalize columns.

Conclusion

You have learned about data normalization and how to normalize pandas columns using mean normalization, min/max normalization, Sklearn MinMaxScaler, Sklearn StandardScaler, a simple transform acting on the columns, and several other simple examples.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium.