In PySpark, you can calculate the count of Null, None, NaN, or Empty/Blank values in a DataFrame column by using isNull() of the Column class together with the SQL functions isnan(), count(), and when(). In this article, I will explain how to get the count of Null, None, NaN, empty, or blank values from all or selected columns of a PySpark DataFrame.
Note: In Python, None is equivalent to a null value, so None values in a PySpark DataFrame are shown as null.
First, let's create a DataFrame with some Null, None, NaN & Empty/Blank values.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Sample data with NaN, None, empty string, and the literal "NULL"
data = [
    ("James", "CA", np.nan), ("Julia", "", None),
    ("Ram", None, 200.0), ("Ramya", "NULL", np.nan)
]
df = spark.createDataFrame(data, ["name", "state", "number"])
df.show()
+-----+-----+------+
| name|state|number|
+-----+-----+------+
|James| CA| NaN|
|Julia| | null|
| Ram| null| 200.0|
|Ramya| NULL| NaN|
+-----+-----+------+
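Before counting, note that Spark treats NaN and null as distinct: isnan() matches only NaN values in numeric columns, while isNull() matches only null. A quick check against the DataFrame above illustrates the difference.

from pyspark.sql.functions import col, isnan

# isnan() matches only NaN values (rows James and Ramya)
df.filter(isnan("number")).show()

# isNull() matches only null values (row Julia)
df.filter(col("number").isNull()).show()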
Find Count of Null, None, NaN of All DataFrame Columns
df.columns returns all DataFrame column names as a list; we loop through this list and check each column for Null or NaN values. In the snippet below, isnan() is a SQL function used to check for NaN values, and isNull() is a Column class function used to check for null values.
# Find count of Null, None, NaN of all DataFrame columns
from pyspark.sql.functions import col, isnan, when, count

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
           for c in df.columns]).show()
+----+-----+------+
|name|state|number|
+----+-----+------+
| 0| 1| 3|
+----+-----+------+
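If you need these counts programmatically rather than printed, you can collect the same result into a Python dict. Below is a minimal sketch based on the example above (null_counts is just an illustrative name).

from pyspark.sql.functions import col, isnan, when, count

# Collect the per-column counts into a dictionary: {column: count}
null_counts = df.select(
    [count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]
).collect()[0].asDict()

print(null_counts)  # {'name': 0, 'state': 1, 'number': 3}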
Note: This example doesn't count string literal values such as "NULL" and "None"; I will cover those in the next section.
To find the count for a subset of columns, use a list of column names instead of df.columns. The below example yields the same output as above.
# Find count for selected columns
from pyspark.sql.functions import col, isnan, when, count

df_columns = ["name", "state", "number"]
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c)
           for c in df_columns]).show()
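For a single column, a plain filter() followed by count() is often simpler than the list comprehension. A minimal equivalent for the state column:

from pyspark.sql.functions import col

# Count null values in a single column
state_null_count = df.filter(col("state").isNull()).count()
print(state_null_count)  # 1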
Find Count of None, NULL & Empty String Literal Values
If your columns also hold the string literals "None" or "NULL", or empty strings, use contains() of the PySpark Column class to match them as well.
# Find count for empty, None, Null, NaN and string literal values
from pyspark.sql.functions import col, isnan, when, count

df2 = df.select([count(when(col(c).contains('None') |
                            col(c).contains('NULL') |
                            (col(c) == '') |
                            col(c).isNull() |
                            isnan(c), c)).alias(c)
                 for c in df.columns])
df2.show()
+----+-----+------+
|name|state|number|
+----+-----+------+
| 0| 3| 3|
+----+-----+------+
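In practice, string placeholders rarely arrive in one consistent casing, so you may want to normalize with trim() and upper() before comparing. The sketch below is one hedged variant (trim() on the numeric column relies on Spark's default implicit string cast); on this data it yields the same counts.

from pyspark.sql.functions import col, isnan, when, count, upper, trim

# Normalize case/whitespace so 'null', ' NULL', 'none' etc. all match
df3 = df.select([count(when(upper(trim(col(c))).isin('NONE', 'NULL', '') |
                            col(c).isNull() |
                            isnan(c), c)).alias(c)
                 for c in df.columns])
df3.show()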
Conclusion
In this PySpark article, you have learned how to find the count of Null, None, NaN, and Empty/Blank values in all DataFrame columns, as well as in selected columns, by using the isNull(), isnan(), when(), and count() functions.
Happy Learning !!
Related Articles
- PySpark isNull() & isNotNull()
- PySpark Count of Non null, nan Values in DataFrame
- PySpark Replace Empty Value With None/null on DataFrame
- PySpark count() – Different Methods Explained
- PySpark fillna() & fill() – Replace NULL/None Values
- PySpark How to Filter Rows with NULL Values
- PySpark Drop Rows with NULL or None Values
- Dynamic way of doing ETL through Pyspark
- PySpark Get Number of Rows and Columns