To normalize the columns of a pandas DataFrame, it helps to understand a few concepts first.
Data Normalization: Data normalization is a common practice in machine learning that transforms numeric columns to a standard scale. Feature values often differ from one another by several orders of magnitude, and features with larger values can dominate the learning process, so we normalize the values before running scale-sensitive machine learning algorithms.
Key Points –
- Normalizing columns in a Pandas DataFrame involves scaling the values within each column to a common range, facilitating fair comparisons and analysis.
- Mean normalization centers each column on its mean and scales it by its standard deviation, which suits analyses that expect zero-centered data.
- Normalization helps prevent bias in analyses where the magnitude of values in different columns might disproportionately influence results, ensuring a more balanced evaluation.
- In machine learning, normalizing features can enhance the performance of models, especially those sensitive to the scale of input variables.
- The techniques shown here are linear transformations of each column, so they preserve the ordering and relative spacing of the original values while providing a standardized representation for analysis.
- Normalizing columns can contribute to improved convergence in optimization algorithms and numerical procedures, enhancing the stability and efficiency of data processing tasks.
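As a rough illustration of why scale matters, the sketch below compares the Euclidean distance between two rows before and after min-max scaling. The Rating column, its values, and the euclidean helper are assumptions made up for this example; they are not part of the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Fee" is in the thousands, "Rating" is between 0 and 5.
df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Rating": [4.5, 1.0, 4.4]})

def euclidean(frame, i, j):
    """Euclidean distance between rows i and j of a DataFrame."""
    diff = frame.iloc[i] - frame.iloc[j]
    return float(np.sqrt((diff ** 2).sum()))

# On the raw data the distance is driven almost entirely by Fee.
raw_distance = euclidean(df, 0, 1)

# After min-max scaling both columns contribute on the same [0, 1] scale.
scaled = (df - df.min()) / (df.max() - df.min())
scaled_distance = euclidean(scaled, 0, 1)

print(raw_distance)
print(scaled_distance)
```

On the raw data the large Rating difference between rows 0 and 1 barely changes the distance, while after scaling it accounts for most of it.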
1. Quick Examples of Normalizing Columns of a DataFrame
If you're in a hurry, below are quick examples of normalizing the columns of a pandas DataFrame.
# Quick examples of normalizing columns (df is a numeric DataFrame)
import numpy as np
import pandas as pd
# Pandas normalize using mean normalization.
normalized_df = (df - df.mean()) / df.std()
# Alternate method to normalize using mean normalization.
normalized_df = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
# Normalize using min/max normalization.
normalized_df = (df - df.min()) / (df.max() - df.min())
# Using Sklearn MinMaxScaler.
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df)  # returns a NumPy array
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
# Using Sklearn StandardScaler, without overwriting df.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
normalized_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(normalized_df)
# Simple transform acting on the columns.
df2 = df.apply(lambda x: x / x.max(), axis=0)
# Example 1: scale a single non-negative column by its maximum.
df2 = df.copy()
df2["Fee"] = df2["Fee"] / df2["Fee"].max()
# Min-max scaling of one column, which also handles negative entries.
df2 = df.copy()
df2["Fee"] = (df2["Fee"] - df2["Fee"].min()) / (df2["Fee"].max() - df2["Fee"].min())
# Normalize columns using the .astype() method.
df2 = df / df.max().astype(np.float64)
# Keep the sign of negative numbers: divide by the maximum absolute value.
df2 = df / df.abs().max().astype(np.float64)
# Another simple way to normalize columns of a pandas DataFrame.
df_normalized = df / df.max(axis=0)
Now, let's create a pandas DataFrame, run these examples, and validate the results. Our DataFrame contains the columns Fee and Discount.
import pandas as pd
technologies = {"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]}
df = pd.DataFrame(technologies)
print(df)
Yields below Output:
# Output:
Fee Discount
0 1000 400
1 2000 500
2 3000 600
2. Pandas Normalize Using Mean Normalization
To normalize all columns of a pandas DataFrame, subtract each column's mean and divide by its standard deviation. By default, pandas std() returns the sample standard deviation (ddof=1), which is based on the unbiased estimate of the variance.
# Pandas Normalize Using Mean Normalization.
normalized_df=(df-df.mean())/df.std()
print(normalized_df)
Yields below Output:
# Output:
Fee Discount
0 -1.0 -1.0
1 0.0 0.0
2 1.0 1.0
Alternatively, you can get the same result using DataFrame.apply() and a lambda.
# Alternate method to normalize using Mean Normalization.
normalized_df=df.apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(normalized_df)
Yields the same output as above.
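A quick way to convince yourself that the two forms are equivalent is to compare them directly and check the resulting column statistics. This is only a sanity-check sketch on the article's sample data; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})

# Both forms subtract the column mean and divide by the sample
# standard deviation (pandas uses ddof=1 by default).
vectorized = (df - df.mean()) / df.std()
with_apply = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

same_result = bool(np.allclose(vectorized, with_apply))
column_means = vectorized.mean()  # expected to be (close to) 0 per column
column_stds = vectorized.std()    # expected to be (close to) 1 per column

print(same_result)
print(column_means)
print(column_stds)
```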
2. Pandas Normalize Using Min/Max Normalization
Alternatively, you can normalize columns using min/max normalization. The formula (df - df.min()) / (df.max() - df.min()) is applied column-wise and scales the values in each column to the range [0, 1]. The resulting normalized_df DataFrame contains the min-max normalized values.
# Normalize using Min/Max Normalization.
normalized_df=(df-df.min())/(df.max()-df.min())
print(normalized_df)
Yields below Output:
# Output:
Fee Discount
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
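The same formula extends to an arbitrary target range, and it breaks down when a column is constant because the range in the denominator becomes zero. The sketch below shows one possible way to handle both cases. The helper min_max_scale, its low and high parameters, and the Flat column are hypothetical names introduced for this example, not pandas functionality.

```python
import pandas as pd

def min_max_scale(frame, low=0.0, high=1.0):
    """Scale each column of `frame` linearly into [low, high].

    Hypothetical helper, not part of pandas. Constant columns have a
    zero range, so they are mapped to `low` instead of producing NaN.
    """
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Replace a zero range with 1 so constant columns divide cleanly.
    safe_range = col_range.where(col_range != 0, 1)
    unit = (frame - col_min) / safe_range
    return unit * (high - low) + low

# "Flat" is a made-up constant column used to exercise the edge case.
df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Flat": [7, 7, 7]})
scaled = min_max_scale(df, low=-1.0, high=1.0)
print(scaled)
```

Mapping constant columns to the lower bound is a design choice made for this sketch; depending on the analysis you may prefer to drop such columns instead.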
3. Using Sklearn & MinMaxScaler
The examples above used only pandas to normalize columns; now let's use the Sklearn (scikit-learn) package to do the same. Sklearn provides several scalers, and MinMaxScaler is one of them. MinMaxScaler subtracts the minimum value of the feature and then divides by the range (the difference between the original maximum and the original minimum).
Scaling: Scaling means changing the range of a feature's values without changing the shape of its distribution, just as a scale model of a building has the same proportions as the original. Here the target range is 0 to 1.
# Using Sklearn MinMaxScaler.
import pandas as pd
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df)  # returns a NumPy array
# Rebuild a DataFrame and keep the original column names and index.
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
print(normalized_df)
As shown in the previous section, you can achieve the same output with the pandas min/max formula, without the Sklearn package.
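The equivalence can be checked numerically. The sketch below assumes scikit-learn is installed and compares the pandas formula with MinMaxScaler on the article's sample data; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})

# Pure pandas min-max normalization.
pandas_scaled = (df - df.min()) / (df.max() - df.min())

# MinMaxScaler returns a NumPy array, so wrap it back into a DataFrame.
sklearn_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df), columns=df.columns, index=df.index
)

methods_agree = bool(np.allclose(pandas_scaled, sklearn_scaled))
print(methods_agree)
```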
4. Using Sklearn & StandardScaler
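Unlike the pandas examples above, Sklearn's StandardScaler divides by the population standard deviation (ddof=0), whereas pandas std() defaults to the sample standard deviation (ddof=1). The results therefore differ slightly from the mean normalization example.
StandardScaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance, that is, dividing all the values by the standard deviation. Each column is scaled by its own standard deviation, so distances between rows measured across several features change, while the relative spacing of values within a single column is preserved.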
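# Using Sklearn StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Build a new DataFrame so the original integer columns in df stay untouched.
normalized_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(normalized_df)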
Yields below Output:
# Output:
Fee Discount
0 -1.224745 -1.224745
1 0.000000 0.000000
2 1.224745 1.224745
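The difference between these values and the earlier -1.0, 0.0, 1.0 result comes only from the ddof setting of the standard deviation. As a sketch that needs pandas alone, passing ddof=0 to std() reproduces the numbers above; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})

# Sample standard deviation (ddof=1), the pandas default.
sample_z = (df - df.mean()) / df.std()

# Population standard deviation (ddof=0), as used by StandardScaler.
population_z = (df - df.mean()) / df.std(ddof=0)

print(sample_z)
print(population_z)
```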
5. Simple Transform Acting on the Columns
Let's see a simple transformation acting on the columns: dividing each column by its maximum value. For example:
# Simple transform acting on the columns.
df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
df2 = df.apply(lambda x: x / x.max(), axis=0)
print(df2)
# Output:
# Fee Discount
# 0 0.333333 0.666667
# 1 0.666667 0.833333
# 2 1.000000 1.000000
6. Other Examples-1
Another simple approach to normalizing a column that has only non-negative values is to divide it by its maximum. Note that this maps to [0, 1] only for column data in the range [0, n]. For example:
# Example 1: scale a single non-negative column by its maximum.
df2 = df.copy()  # work on a copy so df keeps its original values
df2["Fee"] = df2["Fee"] / df2["Fee"].max()
print(df2)
# Output:
# Fee Discount
# 0 0.333333 400
# 1 0.666667 500
# 2 1.000000 600
NOTE: If a column has a negative entry, dividing by the maximum does NOT normalize it to the [0, 1] range. The basic way to normalize to [0, 1] when negative values are present is min-max scaling of that column.
# Min-max scaling of one column, which also handles negative entries.
df2 = df.copy()  # work on a copy so df keeps its original values
df2["Fee"] = (df2["Fee"] - df2["Fee"].min()) / (df2["Fee"].max() - df2["Fee"].min())
print(df2)
# Output:
# Fee Discount
# 0 0.0 400
# 1 0.5 500
# 2 1.0 600
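To make the note concrete, the sketch below applies both formulas to a column that contains a negative value. The Balance column and its values are made up for this example.

```python
import pandas as pd

# Hypothetical column with a negative entry.
s = pd.Series([-100, 0, 200], name="Balance")

# Dividing by the maximum leaves the negative value outside [0, 1].
divided_by_max = s / s.max()

# Min-max scaling shifts by the minimum first, so the result is in [0, 1].
min_max = (s - s.min()) / (s.max() - s.min())

print(divided_by_max.tolist())
print(min_max.tolist())
```

Here divided_by_max still contains -0.5, while min_max starts at 0.0 and ends at 1.0.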
Example-2: Using .astype() Method
Another simple way to normalize the columns of a pandas DataFrame uses DataFrame.astype(). The astype() function casts a pandas object to a specified dtype; here it converts the column maxima to float64 before the division.
# Normalize columns using .astype() method.
import numpy as np
df2 = df / df.max().astype(np.float64)
print(df2)
# Output:
# Fee Discount
# 0 0.333333 0.666667
# 1 0.666667 0.833333
# 2 1.000000 1.000000
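NOTE: If your DataFrame contains negative numbers and you want to keep their sign rather than shift them into [0, 1], divide each column by its maximum absolute value. This scales every column into [-1, 1]; astype() again casts the divisor to float64.
# Keep the sign of negative numbers: divide by the maximum absolute value.
import numpy as np
df2 = df / df.abs().max().astype(np.float64)
print(df2)
# Output (df has no negative values, so this matches the previous result):
# Fee Discount
# 0 0.333333 0.666667
# 1 0.666667 0.833333
# 2 1.000000 1.000000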
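The sample DataFrame has no negative values, so the effect of scaling by the maximum absolute value is easier to see on data that does. The sketch below uses a made-up DataFrame, df_neg, whose values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with negative entries in both columns.
df_neg = pd.DataFrame({"Fee": [-3000, 1000, 2000], "Discount": [400, -500, 250]})

# Divide each column by its largest absolute value; signs are preserved
# and every column ends up within [-1, 1].
max_abs_scaled = df_neg / df_neg.abs().max()
print(max_abs_scaled)
```

The entry with the largest magnitude in each column becomes -1.0 or 1.0, and all other entries keep their sign and relative size.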
Example-3:
The same max scaling can be written more directly: df.max(axis=0) returns the maximum of each column, and the division is applied column by column.
# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)
print(df_normalized)
# Output:
# Fee Discount
# 0 0.333333 0.666667
# 1 0.666667 0.833333
# 2 1.000000 1.000000
7. Complete Examples to Normalize Columns of Pandas DataFrame
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
technologies = {"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]}
df = pd.DataFrame(technologies)
print(df)
# Pandas normalize using mean normalization.
normalized_df = (df - df.mean()) / df.std()
# Alternate method to normalize using mean normalization.
normalized_df = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
# Normalize using min/max normalization.
normalized_df = (df - df.min()) / (df.max() - df.min())
# Using Sklearn MinMaxScaler (fit_transform returns a NumPy array).
x_scaled = MinMaxScaler().fit_transform(df)
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
# Using Sklearn StandardScaler, without overwriting df.
x_scaled = StandardScaler().fit_transform(df)
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
print(normalized_df)
# Simple transform acting on the columns.
df2 = df.apply(lambda x: x / x.max(), axis=0)
# Example 1: scale a single non-negative column by its maximum.
df2 = df.copy()
df2["Fee"] = df2["Fee"] / df2["Fee"].max()
# Min-max scaling of one column, which also handles negative entries.
df2 = df.copy()
df2["Fee"] = (df2["Fee"] - df2["Fee"].min()) / (df2["Fee"].max() - df2["Fee"].min())
# Normalize columns using the .astype() method.
df2 = df / df.max().astype(np.float64)
# Keep the sign of negative numbers: divide by the maximum absolute value.
df2 = df / df.abs().max().astype(np.float64)
# Another simple way to normalize columns of a pandas DataFrame.
df_normalized = df / df.max(axis=0)
Frequently Asked Questions on Pandas Normalize Columns of DataFrame
Normalization in the context of a Pandas DataFrame involves scaling the values of its columns to a specific range or standardizing them to a common scale. This process is essential for fair comparisons and analysis, particularly when columns have different scales.
Two common normalization methods are:
Min-Max Normalization: Scales values to a specific range, typically [0, 1].
Mean Normalization: Centers the data by subtracting the mean from each value and scales it by the standard deviation. This is also known as z-score standardization.
You can use the formula (df - df.min()) / (df.max() - df.min()), where df is your DataFrame, to scale values to the range [0, 1].
Normalization ensures that the scale of values doesn’t unduly influence the outcomes of analyses. It can lead to more accurate and fair comparisons, prevent certain algorithms from being dominated by high-magnitude features, and improve the stability and efficiency of optimization algorithms.
In addition to min-max and mean (z-score) normalization, there are other methods, such as scaling by the maximum absolute value and custom scaling functions, that you can apply based on the nature of your data and analysis requirements.
The apply function can be used for normalization by applying a custom function or a lambda function along a specified axis (typically axis=0 for column-wise operations). This is a concise way to normalize columns, although the equivalent vectorized expressions on the whole DataFrame are usually faster.
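As a sketch of that pattern, the example below applies a named function column-wise to the numeric columns only and leaves a text column untouched. The z_score function and the Course column are hypothetical additions for illustration.

```python
import pandas as pd

def z_score(column):
    """Standardize one column with its mean and sample standard deviation."""
    return (column - column.mean()) / column.std()

# Hypothetical frame with a non-numeric column that must be left alone.
df = pd.DataFrame({
    "Course": ["Spark", "Pandas", "Python"],
    "Fee": [1000, 2000, 3000],
    "Discount": [400, 500, 600],
})

# Select the numeric columns and normalize only those.
numeric_cols = df.select_dtypes(include="number").columns
normalized = df.copy()
normalized[numeric_cols] = df[numeric_cols].apply(z_score, axis=0)
print(normalized)
```

Selecting columns with select_dtypes avoids errors that arise when arithmetic is attempted on string columns.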
Conclusion
You have learned about data normalization and how to normalize pandas DataFrame columns using mean normalization, min/max normalization, Sklearn MinMaxScaler, Sklearn StandardScaler, a simple transform acting on the columns, and other simple examples.