Pandas Normalize Columns of DataFrame

To normalize the columns of a pandas DataFrame, it helps to first understand what data normalization means.

Data Normalization: Data normalization is a common practice in machine learning that transforms numeric columns to a common scale. Features often differ from one another by orders of magnitude, and the features with larger values can dominate the learning process in scale-sensitive algorithms. For this reason we normalize the values before running many machine learning algorithms.
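
To make the point about dominant features concrete, here is a minimal sketch. The Fee and Rating values and the distance() helper are hypothetical and exist only for this illustration. It compares Euclidean distances between rows before and after min/max scaling.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Fee" is in the thousands, "Rating" lies between 1 and 5.
data = pd.DataFrame({"Fee": [1000.0, 1010.0, 3000.0], "Rating": [1.0, 5.0, 1.2]})

def distance(frame, i, j):
    """Euclidean distance between rows i and j of a DataFrame (illustrative helper)."""
    return float(np.linalg.norm(frame.iloc[i].to_numpy() - frame.iloc[j].to_numpy()))

# On raw values the Fee difference swamps the Rating difference,
# so row 0 looks far closer to row 1 than to row 2.
raw_01 = distance(data, 0, 1)
raw_02 = distance(data, 0, 2)

# After min/max scaling both columns lie in [0, 1] and contribute comparably.
scaled = (data - data.min()) / (data.max() - data.min())
scaled_01 = distance(scaled, 0, 1)
scaled_02 = distance(scaled, 0, 2)

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

On the raw data the two distances differ by more than a factor of 100; after scaling they are nearly equal, because the large Rating gap between rows 0 and 1 is no longer hidden by the Fee column.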

1. Quick Examples of Normalizing Columns of Pandas DataFrame

If you’re in a hurry, below are the quick examples.


# Below are quick examples.
import numpy as np
import pandas as pd

# Normalize using mean normalization (z-score).
normalized_df = (df - df.mean()) / df.std()

# Alternate method to normalize using mean normalization.
normalized_df = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

# Normalize using min/max normalization.
normalized_df = (df - df.min()) / (df.max() - df.min())

# Using Sklearn MinMaxScaler.
from sklearn import preprocessing

x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)

# Using Sklearn StandardScaler (returns a new DataFrame, df is unchanged).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled_df)

# Simple transform acting on the columns.
df2 = df.apply(lambda x: x / x.max(), axis=0)

# Example 1: divide a non-negative column by its maximum.
df2 = df.copy()
df2["Fee"] = df2["Fee"] / df2["Fee"].max()

# Min/max normalize a single column; this also works with negative entries.
df2 = df.copy()
df2["Fee"] = (df2["Fee"] - df2["Fee"].min()) / (df2["Fee"].max() - df2["Fee"].min())

# Normalize columns using .astype() method.
df2 = df / df.max().astype(np.float64)

# Scale by the maximum absolute value when negative numbers are present.
df2 = df / df.abs().max().astype(np.float64)

# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)

Now, let’s create a pandas DataFrame, execute these examples, and validate the results. Our DataFrame contains the columns Fee and Discount.


import pandas as pd
# Keep the raw data in a dict so a fresh DataFrame can be rebuilt from it later.
technologies = {"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]}
df = pd.DataFrame(technologies)
print(df)

Yields below Output:


    Fee  Discount
0  1000       400
1  2000       500
2  3000       600

2. Pandas Normalize Using Mean Normalization

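To normalize all columns of a pandas DataFrame, we subtract the column mean and divide by the column standard deviation. This is also known as standardization or z-score normalization. By default, pandas std() uses the sample standard deviation (ddof=1), which is the unbiased estimate.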


# Pandas Normalize Using Mean Normalization.
normalized_df=(df-df.mean())/df.std()
print(normalized_df)

Yields below Output:


   Fee  Discount
0 -1.0      -1.0
1  0.0       0.0
2  1.0       1.0

Alternatively, you can get the same result using DataFrame.apply() with a lambda.


# Alternate method to normalize using Mean Normalization.
normalized_df=df.apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(normalized_df)

Yields the same output as above.
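
As a quick check of the claim above, the following sketch rebuilds the same DataFrame, confirms that each normalized column has mean 0 and sample standard deviation 1, and confirms that both forms agree. The variable names col_means, col_stds, and same are only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
normalized_df = (df - df.mean()) / df.std()

# Each column should now have mean 0 and sample standard deviation 1.
col_means = normalized_df.mean()
col_stds = normalized_df.std()  # ddof=1, the pandas default
print(col_means)
print(col_stds)

# The apply/lambda form should give the same values.
via_apply = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
same = np.allclose(normalized_df.to_numpy(), via_apply.to_numpy())
print(same)
```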

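3. Pandas Normalize Using Min/Max Normalization

Alternatively, you can normalize columns using min/max normalization, which rescales each column to the [0, 1] range by subtracting the column minimum and dividing by the column range (maximum minus minimum).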


# Normalize using Min/Max Normalization.
normalized_df=(df-df.min())/(df.max()-df.min())
print(normalized_df)

Yields below Output:


   Fee  Discount
0  0.0       0.0
1  0.5       0.5
2  1.0       1.0
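
Two edge cases are worth seeing with this formula: negative values, which still map into [0, 1], and a constant column, whose range is zero. The sketch below uses hypothetical Balance and Flag columns. Mapping a constant column to 0.0 is only one possible convention, chosen here as an assumption.

```python
import pandas as pd

# Hypothetical data: "Balance" has negative values, "Flag" is constant.
sample = pd.DataFrame({"Balance": [-50.0, 0.0, 150.0], "Flag": [7.0, 7.0, 7.0]})

col_range = sample.max() - sample.min()
scaled = (sample - sample.min()) / col_range
# Balance becomes 0.0, 0.25, 1.0; Flag becomes NaN because its range is 0 (0/0).

# One possible convention (an assumption): treat a zero range as 1,
# so constant columns map to 0.0 instead of NaN.
safe_range = col_range.replace(0, 1)
scaled_safe = (sample - sample.min()) / safe_range
print(scaled_safe)
```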

4. Using Sklearn & MinMaxScaler

In the examples above we used only pandas to normalize columns; now let’s use the Sklearn (scikit-learn) package to do the same. Sklearn provides several scalers, and MinMaxScaler is one of them. MinMaxScaler subtracts the minimum value of the feature and then divides by the range (the difference between the original maximum and the original minimum).

Scaling: Scaling means changing the range of a feature’s values without changing the shape of its distribution, much as a scale model of a building keeps the same proportions as the original. By default, MinMaxScaler scales each feature to the range 0 to 1.


# Using Sklearn MinMaxScaler.
import pandas as pd
from sklearn import preprocessing

x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# fit_transform returns a NumPy array, so restore the column names and index.
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)

As explained above, you can achieve the same values with the pandas min/max expression, without the Sklearn package.
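
The equivalence can be checked directly. This is a minimal sketch that assumes scikit-learn is installed; the names sklearn_df, pandas_df, and matches are only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})

# fit_transform returns a NumPy array, so the labels are restored explicitly.
scaled_array = MinMaxScaler().fit_transform(df)
sklearn_df = pd.DataFrame(scaled_array, columns=df.columns, index=df.index)

# The pure pandas min/max expression from the previous section.
pandas_df = (df - df.min()) / (df.max() - df.min())

matches = np.allclose(sklearn_df.to_numpy(), pandas_df.to_numpy())
print(matches)
```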

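5. Using Sklearn & StandardScaler

Unlike the pandas examples above, StandardScaler divides by the population standard deviation (ddof=0), which is a biased estimate, whereas pandas std() uses the sample standard deviation (ddof=1) by default. This is why the values below are ±1.224745 rather than ±1.0.

StandardScaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance, that is, dividing all the values by the standard deviation. Because this is a linear transformation of each column, it changes the location and scale of the values but not the shape of the column's distribution.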


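# Using Sklearn StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Build a new DataFrame instead of assigning floats into the integer columns of df.
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled_df)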

Yields below Output:


        Fee  Discount
0 -1.224745 -1.224745
1  0.000000  0.000000
2  1.224745  1.224745
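
The difference from the earlier pandas result comes down to the ddof argument of std(). The sketch below reproduces both sets of values in pandas alone, without Sklearn; sample_z and population_z are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})

# ddof=1 is the pandas default (sample standard deviation).
sample_z = (df - df.mean()) / df.std()

# ddof=0 is the population standard deviation, which StandardScaler uses.
population_z = (df - df.mean()) / df.std(ddof=0)

print(sample_z)
print(population_z)
```

With three rows the two results differ by a factor of sqrt(3/2), which is about 1.224745. The gap shrinks as the number of rows grows.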

6. Simple Transform Acting on the Columns

Let’s see a simple transformation that normalizes the columns of a pandas DataFrame by dividing each column by its maximum. For example:


# Simple transform acting on the columns.
df= pd.DataFrame(technologies)
df2=df.apply(lambda x: x/x.max(), axis=0)
print(df2)

# Output:
        Fee  Discount
0  0.333333  0.666667
1  0.666667  0.833333
2  1.000000  1.000000

7. Other Examples

Example-1: Dividing a Column by Its Maximum

Another simple approach to normalizing a column applies when it contains only non-negative values. Note that this maps values into [0, 1] only for column data that ranges over [0, n]. For example:


# Example 1: column with only non-negative entries.
df2 = df.copy()
df2["Fee"] = df2["Fee"] / df2["Fee"].max()
print(df2)

# Output:
        Fee  Discount
0  0.333333       400
1  0.666667       500
2  1.000000       600

NOTE: If a column has a negative entry, dividing by the maximum does NOT normalize it to the [0, 1] or [-1, 1] range. The basic way to normalize to [0, 1], including columns with negative values, is min/max normalization.


# Min/max normalize a single column; this also works with negative entries.
df2 = df.copy()
df2["Fee"] = (df2["Fee"]-df2["Fee"].min()) / (df2["Fee"].max()-df2["Fee"].min())
print(df2)

# Output:
   Fee  Discount
0  0.0       400
1  0.5       500
2  1.0       600

Example-2: Using .astype() Method

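Another simple way to normalize columns of a pandas DataFrame is to divide by the column maxima after casting them with astype(). The astype() function casts a pandas object to a specified dtype; here it converts the maxima to float64 before the division. The cast is optional, because dividing with / already returns floats.


# Normalize columns using .astype() method.
import numpy as np
df2 = df/df.max().astype(np.float64)
print(df2)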

# OutPut:
        Fee  Discount
0  0.333333  0.666667
1  0.666667  0.833333
2  1.000000  1.000000

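NOTE: If your DataFrame contains negative numbers, divide by the maximum absolute value of each column instead. This keeps the sign of every value and scales each column into the [-1, 1] range. For our all-positive DataFrame it gives the same result as above.


# Scale by the maximum absolute value when negative numbers are present.
df2 = df/df.abs().max().astype(np.float64)
print(df2)

# Output:
        Fee  Discount
0  0.333333  0.666667
1  0.666667  0.833333
2  1.000000  1.000000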
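
To see why negative values need the maximum absolute value, here is a short sketch on hypothetical data in which the largest magnitude in the Change column is negative. The column names and values are assumptions made only for this illustration.

```python
import pandas as pd

# Hypothetical data with a large negative value in "Change".
sample = pd.DataFrame({"Change": [-200.0, 50.0, 100.0], "Fee": [1000.0, 2000.0, 4000.0]})

# Dividing by the plain maximum leaves -2.0 in "Change", outside [-1, 1].
by_max = sample / sample.max()

# Dividing by the maximum absolute value keeps every value inside [-1, 1].
by_max_abs = sample / sample.abs().max()

print(by_max)
print(by_max_abs)
```

For the all-positive Fee column both expressions give the same values, so the second form is a safe default when the sign of the data is not known in advance.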

Example-3:

The same result can be written more simply by dividing the DataFrame by its column-wise maximum. No explicit cast is needed.


# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)
print(df_normalized)

# Output:
        Fee  Discount
0  0.333333  0.666667
1  0.666667  0.833333
2  1.000000  1.000000

8. Complete Examples to Normalize Columns of Pandas DataFrame


import numpy as np
import pandas as pd
technologies = {"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]}
df = pd.DataFrame(technologies)
print(df)

# Normalize using mean normalization (z-score).
normalized_df = (df - df.mean()) / df.std()

# Alternate method to normalize using mean normalization.
normalized_df = df.apply(lambda x: (x - x.mean()) / x.std(), axis=0)

# Normalize using min/max normalization.
normalized_df = (df - df.min()) / (df.max() - df.min())

# Using Sklearn MinMaxScaler.
from sklearn import preprocessing

x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
normalized_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)

# Using Sklearn StandardScaler (returns a new DataFrame, df is unchanged).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
print(scaled_df)

# Simple transform acting on the columns.
df2 = df.apply(lambda x: x / x.max(), axis=0)

# Example 1: divide a non-negative column by its maximum.
df2 = df.copy()
df2["Fee"] = df2["Fee"] / df2["Fee"].max()

# Min/max normalize a single column; this also works with negative entries.
df2 = df.copy()
df2["Fee"] = (df2["Fee"] - df2["Fee"].min()) / (df2["Fee"].max() - df2["Fee"].min())

# Normalize columns using .astype() method.
df2 = df / df.max().astype(np.float64)

# Scale by the maximum absolute value when negative numbers are present.
df2 = df / df.abs().max().astype(np.float64)

# Another simple way to normalize columns of pandas DataFrame.
df_normalized = df / df.max(axis=0)
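
If you use several of these approaches in one project, they can be collected behind a single function. The sketch below is one possible arrangement and not part of pandas; the function name normalize_columns and its method labels ("zscore", "minmax", "maxabs") are hypothetical.

```python
import pandas as pd

def normalize_columns(frame, method="minmax"):
    """Return a normalized copy of frame (hypothetical helper, not a pandas API)."""
    if method == "zscore":
        # Subtract the mean, divide by the sample standard deviation (ddof=1).
        return (frame - frame.mean()) / frame.std()
    if method == "minmax":
        # Rescale each column to [0, 1].
        return (frame - frame.min()) / (frame.max() - frame.min())
    if method == "maxabs":
        # Rescale each column to [-1, 1] while keeping signs.
        return frame / frame.abs().max()
    raise ValueError(f"unknown method: {method!r}")

df = pd.DataFrame({"Fee": [1000, 2000, 3000], "Discount": [400, 500, 600]})
minmax_df = normalize_columns(df, "minmax")
zscore_df = normalize_columns(df, "zscore")
maxabs_df = normalize_columns(df, "maxabs")
print(minmax_df)
```

Each branch returns a new DataFrame and leaves the input untouched, which avoids the surprises that come from overwriting df between examples.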

Conclusion

You have learned about data normalization and how to normalize pandas DataFrame columns using mean normalization, min/max normalization, Sklearn MinMaxScaler, Sklearn StandardScaler, a simple transform acting on the columns, and other simple examples.
