PySpark How to Filter Rows with NULL Values

While working on PySpark SQL DataFrame we often need to filter rows with NULL/None values on columns, you can do this by checking IS NULL or IS NOT NULL conditions.

1. Filter Rows with NULL Values in DataFrame

In PySpark, using filter() or where() functions of DataFrame we can filter rows with NULL values by checking isNULL() of PySpark Column class.


# Filtering  NULL rows 
df.filter("state is NULL").show()
df.filter(df.state.isNull()).show()
df.filter(col("state").isNull()).show()

The above statements return all rows that have null values on the state column and the result is returned as the new DataFrame. All the above examples return the same output.

Note: The filter() transformation does not actually remove rows from the current Dataframe due to its immutable nature. It just reports on the rows that are null. Unless you make an assignment, your statements have not mutated the data set at all.


# Output
+-----+-----+------+
|name |state|gender|
+-----+-----+------+
|James|null |M     |
|Julia|null |null  |
+-----+-----+------+

2. Filter Rows with NULL on Multiple Columns

Let’s see how to filter rows with NULL values on multiple columns in DataFrame. In order to do so you can use either AND or && operators.


# Filtering on multiple Columns
df.filter("state IS NULL AND gender IS NULL").show()
df.filter(df.state.isNull() & df.gender.isNull()).show()

Yields below output.


# Output
+-----+-----+------+
|name |state|gender|
+-----+-----+------+
|Julia|null |null  |
+-----+-----+------+

3. Filter Rows with IS NOT NULL or isNotNull

isNotNull() is used to filter rows that are NOT NULL in DataFrame columns.


# Imports
from pyspark.sql.functions import col
df.filter("state IS NOT NULL").show()
df.filter("NOT state IS NULL").show()
df.filter(df.state.isNotNull()).show()
df.filter(col("state").isNotNull()).show()

Yields below output.


# Output
+----+-----+------+
|name|state|gender|
+----+-----+------+
|Anna|NY   |F     |
+----+-----+------+

Alternatively, you can also write the same using df.na.drop()


# Filter using na.drop()
df.na.drop(subset=["state"]).show()

4. PySpark SQL Filter Rows with NULL Values

If you are familiar with PySpark SQL, you can check IS NULL and IS NOT NULL to filter the rows from DataFrame.


# Filtering by spark.sql
df.createOrReplaceTempView("DATA")
spark.sql("SELECT * FROM DATA where STATE IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NULL AND GENDER IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NOT NULL").show()

5. Complete Example

Below is a complete Scala example of how to filter rows with null values on selected columns.


# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark: SparkSession = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

data = [
    ("James",None,"M"),
    ("Anna","NY","F"),
    ("Julia",None,None)
]

columns = ["name","state","gender"]
df =spark.createDataFrame(data,columns)

df.printSchema()
df.show()

df.filter("state is NULL").show()
df.filter(df.state.isNull()).show()
df.filter(col("state").isNull()).show()

df.filter("state IS NULL AND gender IS NULL").show()
df.filter(df.state.isNull() & df.gender.isNull()).show()

df.filter("state is not NULL").show()
df.filter("NOT state is NULL").show()
df.filter(df.state.isNotNull()).show()
df.filter(col("state").isNotNull()).show()
df.na.drop(subset=["state"]).show()

df.createOrReplaceTempView("DATA")
spark.sql("SELECT * FROM DATA where STATE IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NULL AND GENDER IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NOT NULL").show()

Frequently Asked Questions on how to filter rows with null

How do I filter rows with null values in a PySpark DataFrame?

We can filter rows with null values in a PySpark DataFrame using the filter method and the isnull() function.
For example: df.filter(df["ColumnName"].isNull()).

How can I filter rows with null values for multiple columns?

To filter rows with null values for multiple columns, you can use the | (OR) operator within the filter method. For example: df.filter(df["Column1"].isNull() | df["Column2"].isNull()).

Can we use the isNotNull() function to filter rows with non-null values?

We can use the isNotNull() function in combination with the filter method to filter rows with non-null values.
For example: df.filter(df["ColumnName"].isNotNull()).

How do I filter rows with null values in all columns?

We can filter rows with null values in all columns by using the na attribute of the DataFrame.
For example: df.na.drop() will remove all rows with any null values.

Conclusion

In this PySpark article, you have learned how to filter rows with NULL values from DataFrame/Dataset using isNull() and isNotNull() (NOT NULL). These come in handy when you need to clean up the DataFrame rows before processing.

Thanks for reading. If you recognize my effort or like articles here please do comment or provide any suggestions for improvements in the comments sections!

Happy Learning !!

References

https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html

This Post Has 3 Comments

NNK September 9, 2021

Hi Michael, That’s right it doesn’t remove rows instead it just filters. I have updated it.
Thanks for pointing it out.
Michael September 9, 2021

In reference to the section: “These removes all rows with null values on state column and returns the new DataFrame. All above examples returns the same output.”

But the query does not REMOVE anything – it just reports on the rows that are null. Unless you make an assignment, your statements have not mutated the data set at all.
Anonymous February 23, 2021

how to get all the columns with null value, need to put all column separately

1. Filter Rows with NULL Values in DataFrame

2. Filter Rows with NULL on Multiple Columns

3. Filter Rows with IS NOT NULL or isNotNull

4. PySpark SQL Filter Rows with NULL Values

5. Complete Example

Frequently Asked Questions on how to filter rows with null

Conclusion

References

Naveen Nelamali

Leave a Reply

This Post Has 3 Comments

1. Filter Rows with NULL Values in DataFrame

2. Filter Rows with NULL on Multiple Columns

3. Filter Rows with IS NOT NULL or isNotNull

4. PySpark SQL Filter Rows with NULL Values

5. Complete Example

Frequently Asked Questions on how to filter rows with null

Conclusion

Related Articles

References

Naveen Nelamali

Leave a Reply

This Post Has 3 Comments