How do you handle missing data in a Pandas DataFrame? Almost every dataset contains outliers, unstructured values, and missing data, and handling missing values in a DataFrame is a common struggle. This guide walks through the main techniques for handling missing data in a DataFrame using the Pandas library.
Table of contents
- 1. What is Missing Data?
- 2. Example of Missing Data in a Pandas DataFrame
- 3. Why Should You Handle Missing Data in DataFrame?
- 4. How to Find Missing Data in a DataFrame?
- 5. Different Methods to Handle Missing Data In a DataFrame
- 6. Replace Missing Data in DataFrame
- 7. Summary And Conclusion
1. What is Missing Data?
In data science, the Pandas DataFrame is one of the most popular data structures for storing tabular data in rows and columns, much like an Excel spreadsheet or an SQL table. A DataFrame can hold almost any type of data; missing data refers to the values that are unavailable.
2. Example of Missing Data in a Pandas DataFrame
“Missing data in a DataFrame” simply means the values that are unavailable in a Pandas DataFrame. By default, Pandas represents these values with NaN ("Not a Number", which comes from NumPy). For example, a DataFrame with two missing values shows NaN in those two cells.
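As a minimal sketch of what this looks like, the snippet below builds a small DataFrame with exactly two missing values. The column names and values are made up for illustration and are not taken from the dataset used later in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values and the "City" column are made up.
df = pd.DataFrame({
    "City": ["Oslo", "Lima", "Pune"],
    "Devices": [120.0, np.nan, 95.0],  # missing value given as np.nan
    "Tests": [300.0, 410.0, None],     # None in a numeric column is stored as NaN
})

# Both missing cells are displayed as NaN
print(df)
```

Printing the DataFrame shows NaN in the second row of 'Devices' and in the third row of 'Tests'.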
3. Why Should You Handle Missing Data in DataFrame?
In exploratory data analysis, one of the most important steps is data preprocessing, where much of the work is handling missing data. Before drawing insights from data, you need a clean dataset, free of outliers and missing values.
You need to handle missing data in a Pandas DataFrame because:
- Missing values in a DataFrame can distort the insights you draw from the data
- Training a machine learning model usually requires a clean dataset
- A DataFrame with missing values is harder to process, visualize, and feed into a data pipeline
So before missing values cause real trouble, find them in your DataFrame and deal with them. Now that you know the downsides of missing data, let's look at how to handle it.
4. How to Find Missing Data in a DataFrame?
Here comes the first practical question: how do you find the missing values in your Pandas DataFrame? The examples below show several ways to do it.
To keep things simple, this article uses one sample dataset throughout, the Ookla Dataset for SpeedTest, which you can find on Kaggle. The file I use contains 200 rows and 5 columns and includes NaN values.
To find missing data in a DataFrame, use the following methods:
4.1 Example 1: Find Rows Having NaN Values
import pandas as pd

df = pd.read_csv('data.csv')

# Find the rows that have at least one NaN value
rows_having_nan_values = df[df.isnull().any(axis=1)]
print(rows_having_nan_values)
This prints only the rows that contain at least one NaN value.
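Since the result depends on the CSV file, here is a small self-contained sketch of the same technique on made-up data, so the selected rows can be checked by eye. The values are hypothetical; only the column names are borrowed from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 each contain one missing value
df = pd.DataFrame({
    "Devices": [120.0, np.nan, 95.0, 80.0],
    "Tests": [300.0, 410.0, np.nan, 150.0],
})

# True for every row that has at least one NaN
mask = df.isnull().any(axis=1)

# Boolean indexing keeps only the rows where the mask is True
rows_with_nan = df[mask]
print(rows_with_nan)
```

Here `axis=1` makes `any()` look across the columns of each row, which is why the mask has one entry per row.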
4.2 Example 2: Find Columns Having NaN Values
import pandas as pd

df = pd.read_csv('data.csv')

# Find the columns that have at least one NaN value
col_having_nan_values = df.loc[:, df.isnull().any()].columns
print(col_having_nan_values)

# Output:
# Index(['Number of Records', 'Devices', 'Tests'], dtype='object')
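A slightly shorter variant indexes `df.columns` directly with the boolean result of `isna().any()`. The sketch below uses made-up data and also counts the missing values per column, which is often the next thing you want to know.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Country' is complete, the other two columns are not
df = pd.DataFrame({
    "Country": ["NO", "PE", "IN"],
    "Devices": [120.0, np.nan, 95.0],
    "Tests": [300.0, 410.0, np.nan],
})

# any() without an axis argument checks each column
cols_with_nan = df.columns[df.isna().any()].tolist()
print(cols_with_nan)

# Number of missing values in each column
nan_count_per_col = df.isna().sum()
print(nan_count_per_col)
```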
4.3 Example 3: Find Percentage of Missing Data in Column
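Here, DataFrame.isna() is used to check whether the DataFrame has NA values. It returns True for every missing cell, so summing the result per column and dividing by the number of rows gives the percentage of missing values in each column.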
import pandas as pd

df = pd.read_csv('data.csv')

# Percentage of NaN values in each column
percent_nan_in_each_col = df.isna().sum() / len(df) * 100
print(percent_nan_in_each_col)
This prints the percentage of missing values for each column.
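Because True counts as 1 and False as 0, the mean of the `isna()` mask is already the fraction of missing values, so the same percentages can be computed a little more compactly. This is a sketch on made-up data with four rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1 of 4 'Devices' values and 2 of 4 'Tests' values are missing
df = pd.DataFrame({
    "Country": ["NO", "PE", "IN", "BR"],
    "Devices": [120.0, np.nan, 95.0, 80.0],
    "Tests": [300.0, np.nan, np.nan, 150.0],
})

# Mean of a boolean mask = fraction of True values per column
percent_nan = df.isna().mean() * 100
print(percent_nan)
```

With this data the result is 0% for 'Country', 25% for 'Devices', and 50% for 'Tests'.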
4.4 Example 4: Find Number of NaN Values in Each Row w.r.t Column
import pandas as pd

df = pd.read_csv('data.csv')

# Count the NaN values in each row (across all columns)
no_nan_in_each_row = df.isna().sum(axis=1)
print(no_nan_in_each_row)
This prints the number of NaN values found in each row.
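The per-row count is useful as a filter. The sketch below, on made-up data, counts the NaN values per row and then selects the rows that are missing at least two values; the cutoff of two is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is missing everything, rows 1 and 3 miss one value each
df = pd.DataFrame({
    "Devices": [120.0, np.nan, np.nan, 80.0],
    "Tests": [300.0, 410.0, np.nan, 150.0],
    "Rank Download": [1.0, 2.0, np.nan, np.nan],
})

# Number of NaN values in each row
nan_per_row = df.isna().sum(axis=1)
print(nan_per_row)

# Rows missing at least two values (arbitrary cutoff for this sketch)
mostly_empty_rows = df[nan_per_row >= 2]
print(mostly_empty_rows)
```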
5. Different Methods to Handle Missing Data In a DataFrame
Depending on the data you are working with, you can use either of the following techniques to handle missing data in a DataFrame. Review both and apply the one that best suits your needs.
The two main ways to handle missing data in a DataFrame are:
- Remove rows or columns from the DataFrame that have missing data
- Replace the missing data with another value
Which of the two should you apply? It depends on your DataFrame and on which data points are missing. The following sections cover both approaches in turn.
5.1 Remove Rows or Columns Having Missing Data
This is the most common and easiest way to fix missing data in a DataFrame. You have probably seen many examples that simply remove the rows or columns containing missing data.
We can find the rows or columns that have missing data and drop them using Pandas functions. Let's look at some examples.
For these examples, we will use the same DataFrame, this time assuming it has only a few missing data points rather than many.
In the first example, we remove the rows that have missing values. You may wonder why we remove rows rather than columns; that question is answered at the end of this section.
5.1.1 Removing Rows Having Missing Data
In Pandas, we can use the function df.dropna() to remove all rows that have missing data. See the following Python example that shows how to remove rows with missing values.
import pandas as pd

df = pd.read_csv('data.csv')

# Option 1: remove every row that has at least one missing value
rows_without_nan = df.dropna()

# Option 2: remove only the rows in which all values are missing
rows_not_all_nan = df.dropna(how='all')
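To make the difference between the default behaviour and `how='all'` concrete, here is a sketch on made-up data in which one row is partly missing and one row is entirely missing. Note that `dropna()` returns a new DataFrame unless `inplace=True` is passed, so the original stays intact.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is partly missing, row 2 is entirely missing
df = pd.DataFrame({
    "Devices": [120.0, np.nan, np.nan, 80.0],
    "Tests": [300.0, 410.0, np.nan, 150.0],
})

# Default (how='any'): drop a row if any of its values is missing
any_dropped = df.dropna()

# how='all': drop a row only if every value in it is missing
all_dropped = df.dropna(how="all")

print(any_dropped)
print(all_dropped)
```

With this data, the default keeps rows 0 and 3, while `how='all'` also keeps the partly filled row 1.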
5.1.2 Removing Columns Having Missing Data
Just like rows, columns that have missing data can be removed from the DataFrame. The same built-in Pandas function, df.dropna(), is used with the extra “axis” parameter.
The following Python example uses Pandas to remove columns that have missing data. With the default settings, it removes every column that contains at least one missing value.
import pandas as pd

df = pd.read_csv('data.csv')

# Option 1: remove every column that has at least one missing value
cols_without_nan = df.dropna(axis='columns')

# Option 2: remove only the columns in which all values are missing
cols_not_all_nan = df.dropna(axis='columns', how='all')
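The same comparison for columns, sketched on made-up data with one complete column, one partly missing column, and one empty column. The column names A, B, and C are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'A' is complete, 'B' has one NaN, 'C' is entirely NaN
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [1.0, np.nan, 3.0],
    "C": [np.nan, np.nan, np.nan],
})

# Drop every column that contains at least one missing value
complete_cols = df.dropna(axis="columns")

# Drop only the columns that contain nothing but missing values
non_empty_cols = df.dropna(axis="columns", how="all")

print(complete_cols.columns.tolist())
print(non_empty_cols.columns.tolist())
```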
5.1.3 The Confusion: Removing Rows VS Removing Columns
A valid question here is whether to remove rows or columns to clean the dataset. As a rule of thumb, remove rows when only a small number of rows have missing data, and remove a column when a large share of its values are missing. Most of the time, the choice also depends on your data analysis or machine learning algorithm.
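One way to turn this rule of thumb into code is to drop the columns whose share of missing values exceeds a cutoff and then drop the few remaining incomplete rows. The sketch below uses made-up data, and the 50% cutoff (`max_missing_fraction`) is an assumption chosen for illustration, not a recommended value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Sparse' is 80% missing, 'Devices' is 20% missing
df = pd.DataFrame({
    "Sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "Devices": [1.0, 2.0, np.nan, 4.0, 5.0],
    "Tests": [10.0, 20.0, 30.0, 40.0, 50.0],
})

max_missing_fraction = 0.5  # assumed cutoff; tune it for your own data

# Fraction of missing values in each column
missing_fraction = df.isna().mean()

# Columns that are mostly empty are removed as a whole
cols_to_drop = missing_fraction[missing_fraction > max_missing_fraction].index.tolist()

# Remove those columns first, then the few rows that are still incomplete
cleaned = df.drop(columns=cols_to_drop).dropna()
print(cleaned)
```

Dropping the sparse column first matters: calling `dropna()` on the original data would have removed four of the five rows.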
6. Replace Missing Data in DataFrame
If removing rows or columns would discard too much information, you can replace the missing data in the Pandas DataFrame instead. This approach takes a bit more work, but it is more powerful because it keeps the rest of the data. There are many ways to replace missing data in a DataFrame.
To replace missing data in a DataFrame you can use the following different methods:
- Replace missing data with fixed values in DataFrame
- Replace missing data with Mean value
- Replace missing data with Median value
These are some of the most common and powerful methods that are used in data analysis to handle missing data in DataFrame. We will discuss each one with an example.
6.1 Replace Missing Data with Fixed Values in DataFrame
We can impute the missing values in a DataFrame with a fixed value. The fixed value can be an integer or any other type of data, depending on the nature of your dataset. For example, if you are dealing with gender data, you can replace all the missing values with a word such as “unknown”, “Male”, or “Female”.
The best way to understand this method is through an example. In the following example, I impute all missing values in the 'Devices' column with a single random number, generated using Python's random module.
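import random as rd

import pandas as pd

df = pd.read_csv('data.csv')

# Replace all NaN values in the 'Devices' column with one random number
random_no = rd.randint(0, 100)
# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent Pandas versions
df['Devices'] = df['Devices'].fillna(random_no)
print(df)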
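When different columns need different fixed values, `fillna()` also accepts a dictionary that maps column names to replacement values. The sketch below uses made-up data; the 'Gender' column and the choice of 0 for 'Devices' are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "Devices": [3.0, np.nan, 5.0],
})

# One fixed replacement value per column
filled = df.fillna({"Gender": "unknown", "Devices": 0})
print(filled)
```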
6.2 Replace Missing Data with Mean Value
In some cases, you can compute the mean of all data points in a column and replace the missing data with that mean (average) value. Again, whether this is appropriate depends strongly on the dataset you are working on.
The mean is a reasonable replacement when the data distribution is symmetric. Otherwise, you can choose among the three statistics: mean, median, or mode.
The following Python example handles missing data with the mean value: it uses the average of all data points in a column to replace the missing data in that column.
import pandas as pd

df = pd.read_csv('data.csv')

# Replace all NaN values in the 'Rank Download' column with the column mean
mean_value = df['Rank Download'].mean()
df['Rank Download'] = df['Rank Download'].fillna(mean_value)
print(df)
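If several numeric columns should each be filled with their own mean, one possible approach is to pass the Series of column means to `fillna()`, which matches the values to columns by name. This is a sketch on made-up data, not the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each numeric column
df = pd.DataFrame({
    "Country": ["NO", "PE", "IN", "BR"],
    "Rank Download": [10.0, np.nan, 20.0, 30.0],
    "Tests": [100.0, 200.0, np.nan, 300.0],
})

# Mean of every numeric column; mean() skips NaN values by default
column_means = df.mean(numeric_only=True)

# Each numeric column is filled with its own mean; 'Country' is left untouched
filled = df.fillna(column_means)
print(filled)
```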
6.3 Replace Missing Data with Median Value
The median is the middle value of a sorted set of data. We can also use the median to fill in missing data in a Pandas DataFrame. Pandas has a built-in function for finding the median.
Consider filling missing data with the median if the data contains sudden changes or outliers, because the median is less affected by extreme values than the mean. In our case, the median is a good choice because our data is fairly inconsistent.
The following Python example first finds the median of a DataFrame column and then fills all missing values in that column with the median.
import pandas as pd

df = pd.read_csv('data.csv')

# Replace all NaN values in the 'Rank Download' column with the column median
median_value = df['Rank Download'].median()
df['Rank Download'] = df['Rank Download'].fillna(median_value)
print(df)
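To see why the median can be the safer choice, the sketch below compares the two statistics on a made-up column that contains one extreme value. The numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one outlier (1000) and one missing value
ranks = pd.Series([10.0, 12.0, np.nan, 11.0, 1000.0], name="Rank Download")

# Both statistics ignore the NaN when they are computed
mean_value = ranks.mean()      # pulled far upwards by the outlier
median_value = ranks.median()  # stays close to the typical values

print(mean_value, median_value)

# Fill the gap with the median so the imputed value looks like its neighbours
filled_with_median = ranks.fillna(median_value)
print(filled_with_median)
```

Here the mean is 258.25 while the median is 11.5, so filling with the mean would insert a value unlike any of the typical observations.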
7. Summary And Conclusion
You now have a clear picture of the complete procedure for handling missing data in a Pandas DataFrame, and you can decide which method to use for your own data. Feel free to comment with your queries; we would love to answer them.