Before normalizing the columns of a pandas DataFrame, it helps to understand what normalization means.
Data Normalization: Data normalization is a common practice in machine learning that transforms numeric columns to a standard scale. Feature values can differ from one another by orders of magnitude, and features with larger values tend to dominate the learning process in many algorithms, so we normalize the values before training.
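To make the "larger values dominate" point concrete, here is a minimal sketch. The `Rating` column and the `distance` helper are hypothetical and exist only for this illustration; Euclidean distance is used as a stand-in for any scale-sensitive algorithm.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Fee" is in the thousands, "Rating" lies between 1 and 5.
sample = pd.DataFrame({"Fee": [1000, 2000, 3000], "Rating": [1.0, 5.0, 1.5]})

def distance(frame, i, j):
    """Euclidean distance between rows i and j of a DataFrame."""
    diff = (frame.iloc[i] - frame.iloc[j]).to_numpy()
    return float(np.linalg.norm(diff))

# On the raw data the distance is almost entirely the Fee difference.
raw_distance = distance(sample, 0, 1)

# After min/max normalization both columns are on the same [0, 1] scale.
normalized = (sample - sample.min()) / (sample.max() - sample.min())
normalized_distance = distance(normalized, 0, 1)

print(raw_distance)
print(normalized_distance)
```

On the raw data the large Rating gap between the first two rows barely registers next to the Fee gap; after normalization it contributes at least as much.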
1. Quick Examples of Normalize Columns of Pandas DataFrame
If you’re in a hurry, below are some quick examples. They assume a DataFrame named df with numeric columns already exists.
# Quick examples of normalizing DataFrame columns.
# Pandas Normalize Using Mean Normalization.
normalized_df=(df-df.mean())/df.std()
# Alternate method to normalize using Mean Normalization.
normalized_df=df.apply(lambda x: (x-x.mean())/ x.std(), axis=0)
# Normalize using Min/Max Normalization.
normalized_df=(df-df.min())/(df.max()-df.min())
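# Using sklearn MinMaxScaler.
import pandas as pd
from sklearn import preprocessing
x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# Pass columns and index so the result keeps the original labels.
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
# Using sklearn StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Build a new DataFrame instead of assigning floats into the integer columns of df.
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled_df)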
# Simple transform acting on the columns.
df2=df.apply(lambda x: x/x.max(), axis=0)
# Example 1: scale a single column with non-negative entries by its maximum.
df["Fee"] = df["Fee"] / df["Fee"].max()
# Min/max normalization of a single column; works with negative entries too.
df["Fee"] = (df["Fee"]-df["Fee"].min()) / (df["Fee"].max()-df["Fee"].min())
# Normalize columns using .astype() method.
import numpy as np
df2 = df/df.max().astype(np.float64)
# With negative numbers, divide by the largest absolute value of each column.
df2 = df/df.abs().max()
# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)
Now, let’s create a pandas DataFrame, run these examples, and validate the results. Our DataFrame contains the columns Fee and Discount.
import pandas as pd
technologies = {"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]}
df = pd.DataFrame(technologies)
print(df)
Yields below Output:
# Output:
Fee Discount
0 1000 400
1 2000 500
2 3000 600
2. Pandas Normalize Using Mean Normalization
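To normalize all columns of a pandas DataFrame, subtract each column's mean and divide by its standard deviation. This is also known as standardization or z-score normalization. By default, pandas' std() computes the sample standard deviation (ddof=1), which is based on the unbiased estimate of the variance.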
# Pandas Normalize Using Mean Normalization.
normalized_df=(df-df.mean())/df.std()
print(normalized_df)
Yields below Output:
# Output:
Fee Discount
0 -1.0 -1.0
1 0.0 0.0
2 1.0 1.0
Alternatively, you can get the same result using DataFrame.apply() with a lambda.
# Alternate method to normalize using Mean Normalization.
normalized_df=df.apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(normalized_df)
Yields same output as above.
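One way to sanity-check the result is to confirm that every standardized column has mean 0 and sample standard deviation 1. The sketch below rebuilds the same Fee/Discount data under the assumed names `frame` and `standardized` so that it runs on its own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
standardized = (frame - frame.mean()) / frame.std()

# Each column should now have mean 0 and sample standard deviation 1
# (up to floating-point rounding).
means_are_zero = np.allclose(standardized.mean(), 0.0)
stds_are_one = np.allclose(standardized.std(), 1.0)

print(means_are_zero, stds_are_one)
```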
2. Pandas Normalize Using Min/Max Normalization
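Alternatively, you can normalize columns using min/max normalization, which rescales each column to the [0, 1] range by subtracting the column minimum and dividing by the column range (maximum minus minimum).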
# Normalize using Min/Max Normalization.
normalized_df=(df-df.min())/(df.max()-df.min())
print(normalized_df)
Yields below Output:
# Output:
Fee Discount
0 0.0 0.0
1 0.5 0.5
2 1.0 1.0
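The min/max expression divides by zero when a column holds a single repeated value, which produces NaN. A minimal sketch of one way to guard against that follows; the helper name `min_max_normalize`, the `Flat` column, and the choice to map constant columns to 0.0 are assumptions made for this illustration, not part of pandas.

```python
import pandas as pd

def min_max_normalize(frame: pd.DataFrame) -> pd.DataFrame:
    """Min/max normalize every column; constant columns become 0.0 (assumed convention)."""
    col_range = frame.max() - frame.min()
    # Replace a zero range with 1 so the division is defined;
    # the numerator is 0 for those columns anyway.
    safe_range = col_range.replace(0, 1)
    return (frame - frame.min()) / safe_range

# Hypothetical data with one constant column.
sample = pd.DataFrame({"Fee": [1000, 2000, 3000], "Flat": [7, 7, 7]})
result = min_max_normalize(sample)
print(result)
```

Whether a constant column should become 0.0, 0.5, or be dropped entirely depends on the downstream model, so treat the convention here as a placeholder.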
3. Using Sklearn & MinMaxScaler
In the examples above we used only pandas to normalize columns; now let’s use the sklearn (scikit-learn) package to do the same. sklearn provides several scalers, and MinMaxScaler is one of them. MinMaxScaler subtracts the minimum value of the feature and then divides by the range (the difference between the original maximum and the original minimum).
Scaling: Scaling means changing the range of a feature’s values without changing the shape of its distribution, just as a scale model of a building has the same proportions as the original. By default, MinMaxScaler scales each feature to the range 0 to 1.
# Using sklearn MinMaxScaler.
import pandas as pd
from sklearn import preprocessing
x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# Pass columns and index so the result keeps the original labels.
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
As shown earlier, you can achieve the same result with the pandas min/max expression, without the sklearn package.
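The sketch below checks that equivalence and shows one reason to reach for the scaler anyway: once fitted, it can apply the same minimum and range to rows it has not seen. The names `frame` and `new_rows` and the values in `new_rows` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

frame = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})

# Fit the scaler and wrap the result so the column names and index are kept.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(frame), columns=frame.columns, index=frame.index)

# The plain pandas expression should give the same numbers.
pandas_scaled = (frame - frame.min()) / (frame.max() - frame.min())
matches = np.allclose(scaled, pandas_scaled)

# Reuse the fitted minimum and range on hypothetical new rows.
new_rows = pd.DataFrame({"Fee": [1500], "Discount": [550]})
new_scaled = scaler.transform(new_rows)

print(matches)
print(new_scaled)
```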
4. Using Sklearn & StandardScaler
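Unlike the pandas examples above, sklearn's StandardScaler divides by the population standard deviation (ddof=0), a biased estimate, so its results differ slightly from those computed with df.std(), which uses ddof=1 by default.
StandardScaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance, that is, dividing all the values by the standard deviation. Because each column is rescaled by its own standard deviation, the relative distances between values of different features change, while the relative spacing of values within a single feature is preserved.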
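# Using sklearn StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Build a new DataFrame instead of assigning floats into the integer columns of df.
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled_df)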
Yields below Output:
# Output:
Fee Discount
0 -1.224745 -1.224745
1 0.000000 0.000000
2 1.224745 1.224745
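The gap between these values (about ±1.22) and the pandas result (±1.0) comes from the ddof setting of the standard deviation. The following sketch, using the assumed names `frame`, `pandas_population`, and `pandas_sample`, compares StandardScaler against both pandas variants.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

frame = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})

sklearn_scaled = StandardScaler().fit_transform(frame)

# Population standard deviation (ddof=0), as StandardScaler uses.
pandas_population = (frame - frame.mean()) / frame.std(ddof=0)
# Sample standard deviation (ddof=1), the pandas default.
pandas_sample = (frame - frame.mean()) / frame.std(ddof=1)

matches_population = np.allclose(sklearn_scaled, pandas_population)
matches_sample = np.allclose(sklearn_scaled, pandas_sample)

print(matches_population, matches_sample)
```

With only three rows the difference is noticeable; it shrinks as the number of rows grows.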
5. Simple Transform Acting on the Columns
Let’s see a simple transformation that normalizes the columns of a pandas DataFrame by dividing each column by its maximum value. For example:
# Simple transform acting on the columns.
df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
df2=df.apply(lambda x: x/x.max(), axis=0)
print(df2)
# Output:
# Fee Discount
# 0 0.333333 0.666667
# 1 0.666667 0.833333
# 2 1.000000 1.000000
6. Other Examples-1
Another simple approach to normalizing a column that contains only non-negative values is to divide it by its maximum. Note that this maps the values into [0, 1] only for column data in the range [0, n]. For example:
# Example 1: column with non-negative entries.
df["Fee"] = df["Fee"] / df["Fee"].max()
print(df)
# Output:
# Fee Discount
# 0 0.333333 400
# 1 0.666667 500
# 2 1.000000 600
NOTE: If a column has a negative entry, this code does NOT normalize it to the [-1, 1] range. The basic way to normalize to [0, 1] when negative values are present is min/max normalization:
# Min/max normalization of a single column; works with negative entries too.
# Fee was already scaled above, so this maps it to 0.0, 0.5, 1.0.
df["Fee"] = (df["Fee"]-df["Fee"].min()) / (df["Fee"].max()-df["Fee"].min())
print(df)
# Output:
# Fee Discount
# 0 0.0 400
# 1 0.5 500
# 2 1.0 600
Example-2: Using .astype() Method
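Another simple way to normalize the columns of a pandas DataFrame uses DataFrame.astype(). The astype() function casts a pandas object to a specified dtype; here it casts the column maxima to float64 before the division.
# Normalize columns using .astype() method.
import numpy as np
# Start again from the original Fee and Discount values.
df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
df2 = df/df.max().astype(np.float64)
print(df2)
# Output:
# Fee Discount
# 0 0.333333 0.666667
# 1 0.666667 0.833333
# 2 1.000000 1.000000
NOTE: If your DataFrame contains negative numbers, divide by the largest absolute value of each column instead, so that the results fall in the [-1, 1] range.
# Divide by the largest absolute value of each column.
df2 = df/df.abs().max()
print(df2)
# Output (same as above, because this DataFrame has no negative values):
# Fee Discount
# 0 0.333333 0.666667
# 1 0.666667 0.833333
# 2 1.000000 1.000000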
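Since the Fee and Discount data has no negative values, the effect of the sign is easier to see on a column that does. The sketch below uses a hypothetical `Change` column and compares dividing by the maximum, dividing by the largest absolute value, and min/max normalization.

```python
import pandas as pd

# Hypothetical data: "Change" contains a negative entry.
signed = pd.DataFrame({"Change": [-200, 50, 100], "Fee": [1000, 2000, 3000]})

# Dividing by the maximum can leave values outside [-1, 1].
by_max = signed / signed.max()

# Dividing by the largest absolute value keeps everything within [-1, 1].
by_abs_max = signed / signed.abs().max()

# Min/max normalization maps every column onto [0, 1], negatives included.
min_max = (signed - signed.min()) / (signed.max() - signed.min())

print(by_max["Change"].tolist())      # [-2.0, 0.5, 1.0]
print(by_abs_max["Change"].tolist())  # [-1.0, 0.25, 0.5]
print(min_max["Change"].min(), min_max["Change"].max())  # 0.0 1.0
```

Dividing by the absolute maximum preserves the sign of each value, while min/max normalization shifts the data so the sign is lost; which one is appropriate depends on whether zero is meaningful for the feature.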
Example-3:
Here is another, more concise way to write the same column-wise scaling by the maximum.
# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)
print(df_normalized)
# Output:
# Fee Discount
# 0 0.333333 0.666667
# 1 0.666667 0.833333
# 2 1.000000 1.000000
6. Complete Examples to Normalize Columns of Pandas DataFrame
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
technologies = {"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]}
df = pd.DataFrame(technologies)
print(df)
# Pandas Normalize Using Mean Normalization (sample standard deviation, ddof=1).
normalized_df=(df-df.mean())/df.std()
# Alternate method to normalize using Mean Normalization.
normalized_df=df.apply(lambda x: (x-x.mean())/ x.std(), axis=0)
# Normalize using Min/Max Normalization.
normalized_df=(df-df.min())/(df.max()-df.min())
# Using sklearn MinMaxScaler.
x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# Pass columns and index so the result keeps the original labels.
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
# Using sklearn StandardScaler (population standard deviation, ddof=0).
scaler = StandardScaler()
# Build a new DataFrame so that df keeps its original values for the examples below.
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled_df)
# Simple transform acting on the columns.
df2=df.apply(lambda x: x/x.max(), axis=0)
# Work on a copy for the single-column examples so df stays unchanged.
df3 = df.copy()
# Example 1: scale a single column with non-negative entries by its maximum.
df3["Fee"] = df3["Fee"] / df3["Fee"].max()
# Min/max normalization of a single column; works with negative entries too.
df3["Fee"] = (df3["Fee"]-df3["Fee"].min()) / (df3["Fee"].max()-df3["Fee"].min())
# Normalize columns using .astype() method.
df2 = df/df.max().astype(np.float64)
# With negative numbers, divide by the largest absolute value of each column.
df2 = df/df.abs().max()
# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)
7. Conclusion
You have learned what data normalization is and how to normalize the columns of a pandas DataFrame using mean normalization, min/max normalization, sklearn's MinMaxScaler and StandardScaler, a simple column-wise transform, and a few other simple approaches.