PySpark – Find Count of null, None, NaN Values

In PySpark DataFrame you can calculate the count of Null, None, NaN or Empty/Blank values in a column by using isNull() of Column class & SQL functions isnan() count() and when(). In this article, I will explain how to get the count of Null, None, NaN, empty or blank values from all or multiple selected columns of PySpark DataFrame.

Note: In Python None is equal to null value, son on PySpark DataFrame None values are shown as null.

First let’s create a DataFrame with some Null, None, NaN & Empty/Blank values.


import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [
    ("James","CA",np.NaN), ("Julia","",None),
    ("Ram",None,200.0), ("Ramya","NULL",np.NAN)
]
df =spark.createDataFrame(data,["name","state","number"])
df.show()

+-----+-----+------+
| name|state|number|
+-----+-----+------+
|James|   CA|   NaN|
|Julia|     |  null|
|  Ram| null| 200.0|
|Ramya| NULL|   NaN|
+-----+-----+------+

Find Count of Null, None, NaN of All DataFrame Columns

df.columns returns all DataFrame columns as a list, will loop through the list, and check each column has Null or NaN values. In the below snippet isnan() is a SQL function that is used to check for NAN values and isNull() is a Column class function that is used to check for Null values.


# Find Count of Null, None, NaN of All DataFrame Columns
from pyspark.sql.functions import col,isnan, when, count
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]
   ).show()

+----+-----+------+
|name|state|number|
+----+-----+------+
|   0|    1|     3|
+----+-----+------+

Note: This example doesn’t count columns containing NULL string literal values, I will cover this in the next section so keep reading.

To find count for selected columns in a list use list of column names instead of df.columns. The below example yields the same output as above.


#Find count for selected columns
from pyspark.sql.functions import col,isnan,when,count
df_Columns=["name","state","number"]
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_Columns]
   ).show()

Find Count of None, NULL & Empty String Literal Values

In case if you have “Null”, “None”, and empty string literal values, use contains() of PySpark Column class


# Find count for empty, None, Null, Nan with string literals.
from pyspark.sql.functions import col,isnan,when,count
df2 = df.select([count(when(col(c).contains('None') | \
                            col(c).contains('NULL') | \
                            (col(c) == '' ) | \
                            col(c).isNull() | \
                            isnan(c), c 
                           )).alias(c)
                    for c in df.columns])
df2.show()

+----+-----+------+
|name|state|number|
+----+-----+------+
|   0|    3|     3|
+----+-----+------+

Conclusion

In this PySpark article you have learned how to find a count of Null, None, NaN and Empty/Blank values from all DataFrame columns & selected columns by using count() and when() SQL functions.

Happy Learning !!

References

pyspark find count null nan none

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing PySpark – Find Count of null, None, NaN Values