pandas.DataFrame.dropna() drops rows or columns that contain missing values. pandas treats None, np.nan, and pd.NaT as missing. Cleaning up missing data is an important step before any processing: you identify the rows with None/NaN values and then either drop them or replace them. The dropna() method handles the dropping part.
None/NaN values are one of the major problems in data analysis. Before processing, you need to either remove the rows that have NaN values or replace NaN with an empty string for string columns and with zero for numeric columns.
In this article, I will explain how to remove rows and columns with NaN values by using the pandas dropna() method, including how to remove only the rows and columns in which every value is NaN, with several examples.
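The two options mentioned above, dropping versus replacing, can be compared side by side. The following is a minimal sketch on a hypothetical two-column sample (df_demo is not the article's DataFrame); it uses fillna() with a per-column dictionary for the replacement case.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column sample, not the article's DataFrame
df_demo = pd.DataFrame({
    "Courses": ["Spark", None, "Hadoop"],
    "Fee": [20000.0, 25000.0, np.nan],
})

# Option 1: replace missing strings with "" and missing numbers with 0
filled = df_demo.fillna({"Courses": "", "Fee": 0})
print(filled)

# Option 2: drop every row that has at least one missing value
dropped = df_demo.dropna()
print(dropped)
```

Replacing keeps all three rows, while dropping keeps only the first one, so the choice depends on whether the incomplete rows still carry useful information.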
Key Points –
- pandas.DataFrame.dropna() is used to drop rows or columns with NaN/None values from a DataFrame.
- numpy.nan is Not a Number (NaN), which is of the Python built-in numeric type float (floating point).
- None is of NoneType and it is an object in Python.
- Set axis=1 to drop columns containing NaN values instead of rows.
- Use how='all' to remove rows or columns only if every entry is NaN.
- Specify thresh to keep rows or columns that meet a minimum count of non-NaN values.
- Apply dropna() conditionally by specifying columns in subset where non-NaN values are required.
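As a quick check of which values count as missing, the sketch below builds a small hypothetical DataFrame (df_demo and its column names are made up for illustration) that contains np.nan, None, and pd.NaT, counts the missing entries per column with isna(), and then drops the affected row.

```python
import numpy as np
import pandas as pd

print(type(np.nan))  # float
print(type(None))    # NoneType

# Hypothetical sample with one missing value of each kind
df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, "z"],
    "c": [pd.Timestamp("2024-01-01"), pd.NaT, pd.Timestamp("2024-01-03")],
})

# isna() flags np.nan, None, and pd.NaT alike
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# dropna() removes the row in which those values appear
cleaned = df_demo.dropna()
print(cleaned)
```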
Quick Examples of DataFrame dropna()
Below are some quick examples of pandas.DataFrame.dropna() that drop rows and columns with missing values.
# Quick examples of DataFrame dropna()
# Drop rows where all values are NaN
df=df.dropna(how='all')
# Drop columns where all values are NaN
df=df.dropna(how='all',axis=1)
# Default: drop rows that contain any NaN value
df2=df.dropna()
# Drop all columns that contain any NaN value
df2=df.dropna(axis=1)
# Drop rows that have NaN values in the selected columns
df2=df.dropna(subset=['Courses','Duration'])
# With threshold,
# Keep only the rows with at least 2 non-NA values
df2=df.dropna(thresh=2)
Pandas dropna() Syntax
Below is the syntax of the pandas.DataFrame.dropna() method.
# Pandas.DataFrame.dropna() syntax
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Now, let’s create a DataFrame with a few rows and columns and run some examples to learn how to use dropna(). Our DataFrame contains the column names Courses, Fee, Duration, and Discount, plus one unnamed column and one unnamed row that hold only NaN values.
# Create DataFrame
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas",np.nan],
'Fee' :[20000,25000,26000,23093,24000,np.nan],
'Duration':['30day','40days','35days','45days',np.nan,np.nan],
'Discount':[1000,np.nan,1200,2500,pd.NaT,np.nan],
'':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
}
index_labels=['r1','r2','r3','r4','r5','']
df = pd.DataFrame(technologies,index=index_labels)
print(df)
# Output:
# Courses Fee Duration Discount
# r1 Spark 20000.0 30day 1000 NaN
# r2 PySpark 25000.0 40days NaN NaN
# r3 Hadoop 26000.0 35days 1200 NaN
# r4 Python 23093.0 45days 2500 NaN
# r5 pandas 24000.0 NaN NaT NaN
# NaN NaN NaN NaN NaN
Drop Rows with All NaN Values
By using the pandas.DataFrame.dropna() method you can drop rows and columns with NaN (Not a Number) and None values from a DataFrame. Note that by default it returns a copy of the DataFrame after removing the rows. If you want to remove them from the existing DataFrame, use inplace=True.
Use the how param to specify when a row should be removed. By default how='any', which removes a row when NaN/None is present in any column. With how='all', a row is removed only when every column in it is NaN/None. Refer to pandas drop rows with NaN for more examples.
# Drop rows where all values are NaN
df = df.dropna(how='all')
print(df)
# Outputs:
# Courses Fee Duration Discount
# r1 Spark 20000.0 30day 1000 NaN
# r2 PySpark 25000.0 40days NaN NaN
# r3 Hadoop 26000.0 35days 1200 NaN
# r4 Python 23093.0 45days 2500 NaN
# r5 pandas 24000.0 NaN NaT NaN
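To make the difference between how='any' and how='all' concrete, here is a minimal sketch on a hypothetical three-row frame (df_demo and its columns a and b are illustrative, not part of the article's data).

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 0 is complete, row 1 is partly missing,
# row 2 is entirely missing
df_demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

# how='any' (the default) drops rows 1 and 2
any_result = df_demo.dropna(how="any")
print(any_result)

# how='all' drops only row 2, where every value is NaN
all_result = df_demo.dropna(how="all")
print(all_result)
```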
Drop Columns with All NaN Values
Similarly, to drop columns in which all values are NaN, combine how='all' with the axis=1 param. I have also covered more examples on pandas drop columns with NaN.
# Drop columns where all values are NaN
df = df.dropna(how='all',axis=1)
print(df)
# Output:
# Courses Fee Duration Discount
# r1 Spark 20000.0 30day 1000
# r2 PySpark 25000.0 40days NaN
# r3 Hadoop 26000.0 35days 1200
# r4 Python 23093.0 45days 2500
# r5 pandas 24000.0 NaN NaT
Drop Rows & Columns that Contain NaN
To drop rows that contain NaN values, just use the default dropna() method without any params.
# Drop rows that contain NaN values
df2=df.dropna()
print(df2)
# Output:
# Courses Fee Duration Discount
# r1 Spark 20000.0 30day 1000
# r3 Hadoop 26000.0 35days 1200
# r4 Python 23093.0 45days 2500
Now let’s see how to delete columns that contain NaN values by passing axis=1.
# Drop columns that contain NaN values
df2=df.dropna(axis=1)
print(df2)
# Output:
# Courses Fee
# r1 Spark 20000.0
# r2 PySpark 25000.0
# r3 Hadoop 26000.0
# r4 Python 23093.0
# r5 pandas 24000.0
Use the inplace=True param to perform the operation on the existing DataFrame object, for example df.dropna(inplace=True).
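One detail worth seeing in code is what each form returns. The sketch below, on a hypothetical two-row frame (df_demo is illustrative), shows that the default call leaves the original untouched and returns a new DataFrame, while inplace=True modifies the original and returns None.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one incomplete row
df_demo = pd.DataFrame({
    "Courses": ["Spark", None],
    "Fee": [20000.0, np.nan],
})

# Default: a new DataFrame is returned, df_demo keeps both rows
copy_result = df_demo.dropna()
rows_before = len(df_demo)

# inplace=True: df_demo itself is modified and the call returns None
inplace_result = df_demo.dropna(inplace=True)
rows_after = len(df_demo)

print(rows_before, rows_after, inplace_result)
```

Because the in-place call returns None, avoid writing df = df.dropna(inplace=True), which would replace the DataFrame with None.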
Execute pandas dropna() on Specific Selected Columns
If you want to run pandas dropna() on a specific column or on selected columns, use the subset param with the column names as a list.
# Drop rows that have NaN values in the selected columns
df2=df.dropna(subset=['Courses','Duration'])
print(df2)
# Outputs:
# Courses Fee Duration Discount
# r1 Spark 20000.0 30day 1000
# r2 PySpark 25000.0 40days NaN
# r3 Hadoop 26000.0 35days 1200
# r4 Python 23093.0 45days 2500
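The subset param can also be combined with how. The following sketch uses a hypothetical three-row frame (df_demo is illustrative) to show that subset with the default how='any' drops a row when any of the listed columns is missing, while how='all' drops it only when all of the listed columns are missing.

```python
import pandas as pd

# Hypothetical sample: Fee is always present
df_demo = pd.DataFrame({
    "Courses": ["Spark", None, None],
    "Duration": ["30days", "40days", None],
    "Fee": [20000, 25000, 26000],
})

# Default how='any': rows 1 and 2 are dropped
any_in_subset = df_demo.dropna(subset=["Courses", "Duration"])
print(any_in_subset)

# how='all': only row 2 is dropped, since both listed columns are missing there
all_in_subset = df_demo.dropna(subset=["Courses", "Duration"], how="all")
print(all_in_subset)
```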
Drop NaN Values with Threshold
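dropna() also supports the thresh param, which sets the minimum number of non-NA values a row (or a column, with axis=1) must have in order to be kept. For example, thresh=2 keeps only the rows with at least 2 non-NA values.
# With threshold,
# Keep only the rows with at least 2 non-NA values.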
df2=df.dropna(thresh=2)
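To see how different thresholds behave, the sketch below rebuilds the article's sample data as it looks after the all-NaN row and column have been removed. As a simplifying assumption, np.nan stands in for pd.NaT in the Discount column; both count as missing.

```python
import numpy as np
import pandas as pd

# Article's sample data without the all-NaN row and column
# (assumption: np.nan used in place of pd.NaT)
df_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas"],
    "Fee": [20000, 25000, 26000, 23093, 24000],
    "Duration": ["30day", "40days", "35days", "45days", np.nan],
    "Discount": [1000, np.nan, 1200, 2500, np.nan],
}, index=["r1", "r2", "r3", "r4", "r5"])

# r5 has exactly 2 non-NA values, so thresh=2 keeps every row
at_least_2 = df_demo.dropna(thresh=2)

# thresh=3 drops r5
at_least_3 = df_demo.dropna(thresh=3)

# With axis=1 the count applies to columns: Discount has only
# 3 non-NA values, so thresh=4 drops it
cols_at_least_4 = df_demo.dropna(thresh=4, axis=1)

print(at_least_2.index.tolist())
print(at_least_3.index.tolist())
print(cols_at_least_4.columns.tolist())
```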
Complete Example of pandas dropna()
# Complete Example of pandas dropna()
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas",np.nan],
'Fee' :[20000,25000,26000,23093,24000,np.nan],
'Duration':['30day','40days','35days','45days',np.nan,np.nan],
'Discount':[1000,np.nan,1200,2500,pd.NaT,np.nan],
'':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
}
index_labels=['r1','r2','r3','r4','r5','']
df = pd.DataFrame(technologies,index=index_labels)
print(df)
# Drop rows where all values are NaN
df=df.dropna(how='all')
print(df)
# Drop columns where all values are NaN
df=df.dropna(how='all',axis=1)
print(df)
# Default: drop rows that contain any NaN value
df2=df.dropna()
print(df2)
# Drop all columns that contain any NaN value
df2=df.dropna(axis=1)
print(df2)
# Drop rows that have NaN values in the selected columns
df2=df.dropna(subset=['Courses','Duration'])
print(df2)
# With threshold,
# Keep only the rows with at least 2 non-NA values.
df2=df.dropna(thresh=2)
print(df2)
Conclusion
In this article, you have learned how to drop rows and columns with missing values by using the pandas dropna() method. pandas treats None, np.nan, and pd.NaT as missing values. You have also seen how the axis, how, subset, thresh, and inplace params control what gets dropped. When dropping is not appropriate, the alternative is to replace NaN with an empty string for string columns and with zero for numeric columns.