PySpark – Find Count of null, None, NaN Values

In a PySpark DataFrame you can calculate the count of null, None, NaN, or empty/blank values in a column by using isNull() of the Column class and the SQL functions isnan(), count(), and when(). In this article, I will explain how to get the count of null, None, NaN, empty, or blank values from all columns or from selected columns of a PySpark DataFrame.


Note: In Python, None is the equivalent of the SQL null value, so in a PySpark DataFrame None values are shown as null.

First, let’s create a DataFrame with some null, None, NaN & empty/blank values.


# Create a SparkSession and a DataFrame with null, NaN, and empty values
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [
    ("James", "CA", np.nan), ("Julia", "", None),
    ("Ram", None, 200.0), ("Ramya", "NULL", np.nan)
]
df = spark.createDataFrame(data, ["name", "state", "number"])
df.show()

+-----+-----+------+
| name|state|number|
+-----+-----+------+
|James|   CA|   NaN|
|Julia|     |  null|
|  Ram| null| 200.0|
|Ramya| NULL|   NaN|
+-----+-----+------+
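
Note that Spark treats null and NaN as two distinct values, which is why the count expression in the next section checks for both. Here is a quick sanity check on the number column of the DataFrame above (a minimal sketch; the expected counts follow from the data shown):

# null and NaN are distinct: isNull() matches only null, isnan() only NaN
from pyspark.sql.functions import col, isnan
print(df.filter(col("number").isNull()).count())  # 1 (Julia's None)
print(df.filter(isnan(col("number"))).count())    # 2 (James's and Ramya's NaN)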

Find Count of Null, None, NaN in All DataFrame Columns

df.columns returns all DataFrame column names as a list; we loop through this list and check whether each column contains null or NaN values. In the below snippet, isnan() is a SQL function that checks for NaN values, and isNull() is a Column class function that checks for null values.


# Find count of null, None, NaN values across all DataFrame columns
from pyspark.sql.functions import col, isnan, when, count

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
           for c in df.columns]).show()

+----+-----+------+
|name|state|number|
+----+-----+------+
|   0|    1|     3|
+----+-----+------+

Note: This example doesn’t count the string literal values “NULL” and “None”; I will cover those in the next section.

To find the count for selected columns only, use a list of column names instead of df.columns. The below example yields the same output as above.


# Find count for selected columns only
from pyspark.sql.functions import col, isnan, when, count

df_Columns = ["name", "state", "number"]
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
           for c in df_Columns]).show()
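
If you need these counts on the driver as plain Python values rather than a displayed DataFrame, you can collect the single result row and convert it to a dict. Below is a minimal sketch reusing the same expression; null_counts is an illustrative variable name, not from the original example:

# Collect the one-row result into a Python dict of {column: count}
from pyspark.sql.functions import col, isnan, when, count
null_counts = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
                         for c in df_Columns]).collect()[0].asDict()
print(null_counts)  # {'name': 0, 'state': 1, 'number': 3}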

Find Count of None, NULL & Empty String Literal Values

In case you also have “None” and “NULL” string literal values as well as empty strings, use contains() of the PySpark Column class to match them.


# Find count for empty string, None, null, NaN, and the 'None'/'NULL' string literals
from pyspark.sql.functions import col,isnan,when,count
df2 = df.select([count(when(col(c).contains('None') | \
                            col(c).contains('NULL') | \
                            (col(c) == '' ) | \
                            col(c).isNull() | \
                            isnan(c), c 
                           )).alias(c)
                    for c in df.columns])
df2.show()

+----+-----+------+
|name|state|number|
+----+-----+------+
|   0|    3|     3|
+----+-----+------+
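
If you prefer SQL, the same counts can be expressed with a CASE expression. Here is a sketch for the state column, using exact matches with IN instead of contains() and assuming an illustrative temporary view named people:

# Equivalent Spark SQL for the state column
df.createOrReplaceTempView("people")
spark.sql("""
  SELECT SUM(CASE WHEN state IS NULL OR state = ''
                  OR state IN ('None', 'NULL') THEN 1 ELSE 0 END) AS state
  FROM people
""").show()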

Conclusion

In this PySpark article, you have learned how to find the count of null, None, NaN, and empty/blank values from all DataFrame columns as well as from selected columns, using the count() and when() SQL functions together with isnan() and isNull().

Happy Learning !!

