pandas dropna() Usage & Examples

pandas.DataFrame.dropna() removes rows or columns that contain missing values from a DataFrame. pandas treats np.nan, None, and pd.NaT as missing values. Cleaning the data is an important step before processing it, and part of that cleaning is identifying the rows with missing values and dropping them. The dropna() method comes in handy for this.

None/NaN values are one of the major problems in data analysis. Hence, before processing the data, you need to either remove the rows that have NaN values or replace NaN with an empty string for string columns and with zero for numeric columns.
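
As a point of comparison for the replacement route mentioned above, here is a minimal sketch that uses fillna() with a per-column mapping. The small `sample` frame and its values are made up for illustration and are not the article's DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: one complete row and one fully missing row
sample = pd.DataFrame({
    "Courses": ["Spark", np.nan],
    "Fee": [20000.0, np.nan],
})

# Replace NaN with an empty string in the string column
# and with zero in the numeric column
filled = sample.fillna({"Courses": "", "Fee": 0})

print(filled)
```

Unlike dropna(), this keeps every row, so it suits cases where the remaining values in a row are still useful.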

In this article, I will explain how to remove rows and columns with NaN values by using the pandas dropna() method, including how to remove the rows and columns in which all values are NaN, along with several more examples.

pandas dropna() Key Points

  • pandas.DataFrame.dropna() is used to drop rows or columns with NaN/None values from a DataFrame.
  • numpy.nan is Not a Number (NaN), which is of the Python built-in numeric type float (floating point).
  • None is of NoneType and it is an object in Python.
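
The key points above can be checked with a short sketch. The `values` Series is an illustrative assumption; it uses the object dtype so that the float, None, and NaT entries are stored as given.

```python
import numpy as np
import pandas as pd

# np.nan is a float, while None is the single instance of NoneType
print(type(np.nan))
print(type(None))

# pandas flags NaN, None, and NaT alike as missing values
values = pd.Series([1.0, np.nan, None, pd.NaT], dtype="object")
missing_mask = values.isna()

print(missing_mask.tolist())  # [False, True, True, True]
```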

1. Quick Examples of pandas dropna() of DataFrame

Below are some quick examples of pandas.DataFrame.dropna() that drop rows and columns with missing values.


# Drop rows that have all NaN values
df=df.dropna(how='all')

# Drop columns that have all NaN values
df=df.dropna(how='all',axis=1)

# By default, drop rows that contain NaN values
df2=df.dropna()

# Drop all columns with NaN values
df2=df.dropna(axis=1)

# Drop rows that have NaN values in selected columns
df2=df.dropna(subset=['Courses','Duration'])

# With threshold,
# keep only the rows with at least 2 non-NA values.
df2=df.dropna(thresh=2)

2. pandas dropna() Syntax

Below is the syntax of the pandas.DataFrame.dropna() method.


# pandas.DataFrame.dropna() syntax
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Now, let’s create a DataFrame with a few rows and columns and run some examples to learn how to use dropna(). Our DataFrame contains the columns Courses, Fee, Duration, and Discount, plus an unnamed column that holds only NaN values.


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas",np.nan],
    'Fee' :[20000,25000,26000,23093,24000,np.nan],
    'Duration':['30day','40days','35days','45days',np.nan,np.nan],
    'Discount':[1000,np.nan,1200,2500,pd.NaT,np.nan],
    '':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
              }
index_labels=['r1','r2','r3','r4','r5','']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

# Outputs
#    Courses      Fee Duration Discount    
#r1    Spark  20000.0    30day     1000 NaN
#r2  PySpark  25000.0   40days      NaN NaN
#r3   Hadoop  26000.0   35days     1200 NaN
#r4   Python  23093.0   45days     2500 NaN
#r5   pandas  24000.0      NaN      NaT NaN
#        NaN      NaN      NaN      NaN NaN
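
Before dropping anything, it can help to see where the missing values are. The sketch below counts them per column and per row with isna(). It uses a smaller, made-up `sample` frame rather than the DataFrame above so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a mix of complete and missing cells
sample = pd.DataFrame({
    "Courses": ["Spark", "pandas", np.nan],
    "Fee": [20000.0, 24000.0, np.nan],
    "Duration": ["30days", np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of missing values in each row
missing_per_row = sample.isna().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```

The per-row counts show which rows how='any', how='all', or a thresh value would remove.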

3. Drop Rows with All NaN Values

By using the pandas.DataFrame.dropna() method you can drop rows and columns with NaN (Not a Number) and None values from a DataFrame. Note that by default it returns a copy of the DataFrame after removing the rows. If you want to modify the existing DataFrame instead, use inplace=True.

Use the how param to specify when a row is removed. By default how='any', which removes a row when NaN/None is present in any column. With how='all', a row is removed only when all of its values are missing. Refer to pandas drop rows with NaN for more examples.


# Drop rows that have all NaN values
df = df.dropna(how='all')
print(df)

# Outputs
#    Courses      Fee Duration Discount    
#r1    Spark  20000.0    30day     1000 NaN
#r2  PySpark  25000.0   40days      NaN NaN
#r3   Hadoop  26000.0   35days     1200 NaN
#r4   Python  23093.0   45days     2500 NaN
#r5   pandas  24000.0      NaN      NaT NaN
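
To make the difference between the two how values concrete, here is a minimal sketch on a made-up three-row `sample` frame (not the article's DataFrame). It also shows that the original object is left unchanged, since dropna() returns a new DataFrame by default.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is partly missing, row 2 is fully missing
sample = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [2.0, 5.0, np.nan],
})

# how='all' removes only the fully missing row
all_missing_dropped = sample.dropna(how="all")

# how='any' (the default) removes every row with at least one NaN
any_missing_dropped = sample.dropna(how="any")

print(all_missing_dropped.index.tolist())  # [0, 1]
print(any_missing_dropped.index.tolist())  # [0]
print(len(sample))                         # 3, the original is untouched
```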

4. Drop Columns with All NaN Values

Similarly, to drop columns in which all values are NaN, use the axis=1 param together with how='all'. I have also covered more examples on pandas drop columns with NaN.


# Drop columns that have all NaN values
df = df.dropna(how='all',axis=1)
print(df)

# Outputs
#    Courses      Fee Duration Discount
#r1    Spark  20000.0    30day     1000
#r2  PySpark  25000.0   40days      NaN
#r3   Hadoop  26000.0   35days     1200
#r4   Python  23093.0   45days     2500
#r5   pandas  24000.0      NaN      NaT

5. Drop Rows & Columns that Contain NaN

To drop rows that contain NaN values, just use the default dropna() method without any params.


# Drop rows that contain NaN values
df2=df.dropna()
print(df2)

# Output
#   Courses      Fee Duration Discount
#r1   Spark  20000.0    30day     1000
#r3  Hadoop  26000.0   35days     1200
#r4  Python  23093.0   45days     2500

Now let’s see how to delete columns that contain NaN values by passing axis=1.


# Drop columns that contain NaN values
df2=df.dropna(axis=1)
print(df2)

# Outputs
#    Courses      Fee
#r1    Spark  20000.0
#r2  PySpark  25000.0
#r3   Hadoop  26000.0
#r4   Python  23093.0
#r5   pandas  24000.0

Use the inplace=True param to perform the operation on the existing DataFrame object. For example, df.dropna(inplace=True).
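
The following sketch shows the in-place form on a made-up `sample` frame. The call is made for its side effect on `sample`, so its return value should not be assigned back to the variable.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has a missing value in column A
sample = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})

# Modify `sample` itself instead of creating a new DataFrame
sample.dropna(inplace=True)

print(sample.shape)  # (1, 2)
```

If you prefer to keep the original data around, the assignment form df2 = df.dropna() used elsewhere in this article is the safer choice.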

6. Execute pandas dropna() on Specific Selected Columns

In case you want to execute pandas dropna() on a specific column or selected columns, use the subset param with the column names as a list.


# Drop rows that have NaN values in selected columns
df2=df.dropna(subset=['Courses','Duration'])
print(df2)

# Outputs
#    Courses      Fee Duration Discount
#r1    Spark  20000.0    30day     1000
#r2  PySpark  25000.0   40days      NaN
#r3   Hadoop  26000.0   35days     1200
#r4   Python  23093.0   45days     2500
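
The subset param can also be combined with how. As a sketch on a made-up `sample` frame, the first call drops a row when either listed column is missing, while the second drops a row only when both listed columns are missing. Missing values in columns outside subset, such as Discount here, are ignored in both cases.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 lacks Duration, row 2 lacks Courses and Duration
sample = pd.DataFrame({
    "Courses": ["Spark", "pandas", np.nan],
    "Duration": ["30days", np.nan, np.nan],
    "Discount": [np.nan, 1000.0, 1200.0],
})

# Drop rows where Courses OR Duration is missing (how='any' is the default)
either_missing = sample.dropna(subset=["Courses", "Duration"])

# Drop rows only where Courses AND Duration are both missing
both_missing = sample.dropna(subset=["Courses", "Duration"], how="all")

print(either_missing.index.tolist())  # [0]
print(both_missing.index.tolist())    # [0, 1]
```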

7. Drop NaN Values with Threshold

dropna() also supports the thresh param. You can use it to keep only the rows that have at least a given number of non-NA values; for example, thresh=2 keeps the rows with at least 2 non-NA values.


# With threshold, 
# Keep only the rows with at least 2 non-NA values.
df2=df.dropna(thresh=2)
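
Since the example above prints no output, here is a small self-contained sketch of how thresh behaves along both axes. The `sample` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Non-NA counts per row: 3, 1, 0. Non-NA counts per column: A=2, B=1, C=1
sample = pd.DataFrame({
    "A": [1.0, 1.0, np.nan],
    "B": [2.0, np.nan, np.nan],
    "C": [3.0, np.nan, np.nan],
})

# Keep only the rows with at least 2 non-NA values
rows_kept = sample.dropna(thresh=2)

# Keep only the columns with at least 2 non-NA values
cols_kept = sample.dropna(thresh=2, axis=1)

print(rows_kept.index.tolist())    # [0]
print(cols_kept.columns.tolist())  # ['A']
```

Note that thresh counts the values that are present, not the ones that are missing, so a higher thresh is stricter.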

8. Complete Example of pandas dropna()


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas",np.nan],
    'Fee' :[20000,25000,26000,23093,24000,np.nan],
    'Duration':['30day','40days','35days','45days',np.nan,np.nan],
    'Discount':[1000,np.nan,1200,2500,pd.NaT,np.nan],
    '':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
              }
index_labels=['r1','r2','r3','r4','r5','']
df = pd.DataFrame(technologies,index=index_labels)
print(df)

# Drop rows that have all NaN values
df=df.dropna(how='all')
print(df)

# Drop columns that have all NaN values
df=df.dropna(how='all',axis=1)
print(df)

# By default, drop rows that contain NaN values
df2=df.dropna()
print(df2)

# Drop all columns with NaN values
df2=df.dropna(axis=1)
print(df2)

# Drop rows that have NaN values in selected columns
df2=df.dropna(subset=['Courses','Duration'])
print(df2)

# With threshold,
# keep only the rows with at least 2 non-NA values.
df2=df.dropna(thresh=2)
print(df2)

Conclusion

In this article, you have learned how to drop rows and columns with missing values by using the pandas dropna() method, where np.nan, None, and pd.NaT are considered missing values. You have also seen how the how, axis, subset, thresh, and inplace params control what gets dropped. When dropping data is not appropriate, the alternative is to replace NaN with an empty string for string columns and with zero for numeric columns.
