While working with PySpark SQL DataFrames, we often need to filter rows with NULL/None values in columns. You can do this by checking IS NULL or IS NOT NULL conditions.
In many cases, NULL values in columns should be handled before you perform any operations on them, as operations involving NULL produce unexpected results.
Note: PySpark doesn’t support column === null (that is Scala syntax); using it raises an error. Comparing a column with == None doesn’t raise an error, but it matches nothing, because any comparison with NULL evaluates to NULL (a quick demonstration follows the example DataFrame below).
It’s essential to handle null values gracefully as a first step before processing. It also matters when writing DataFrames to files: it is good practice to store files without NULL values, either by dropping rows with NULL values or by replacing NULLs with an empty string.
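For example, once you have a DataFrame df, either clean-up can be applied just before the write (a minimal sketch; the output paths are hypothetical):
# Option 1: drop rows that contain any NULL value (hypothetical output path)
df.na.drop().write.mode("overwrite").csv("/tmp/no_null_rows")
# Option 2: replace NULLs in string columns with an empty string (hypothetical output path)
df.na.fill("").write.mode("overwrite").csv("/tmp/filled_rows")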
- Filter Rows with NULL Values in DataFrame
- Filter Rows with NULL on Multiple Columns
- Filter Rows with IS NOT NULL or isNotNull
- PySpark SQL Filter Rows with NULL Values
Before we start, let’s create a DataFrame with rows containing NULL values.
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()
data = [
    ("James", None, "M"),
    ("Anna", "NY", "F"),
    ("Julia", None, None)
]
columns = ["name", "state", "gender"]
df = spark.createDataFrame(data, columns)
df.show()
This produces the below output. As you can see, the state and gender columns contain NULL values.
# Output
+-----+-----+------+
| name|state|gender|
+-----+-----+------+
|James| null| M|
| Anna| NY| F|
|Julia| null| null|
+-----+-----+------+
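As a quick check of the note above, comparing the column to None with == runs without an error, but the comparison evaluates to NULL for every row, so the filter matches nothing:
# Any comparison with NULL evaluates to NULL, so this returns an empty DataFrame
df.filter(df.state == None).show()

# Use isNull() to actually match NULL values
df.filter(df.state.isNull()).show()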
Now, let’s see how to filter rows with null values on DataFrame.
1. Filter Rows with NULL Values in DataFrame
In PySpark, using the filter() or where() function of DataFrame, we can filter rows with NULL values by checking isNull() of the PySpark Column class.
# Filtering NULL rows
df.filter("state is NULL").show()
df.filter(df.state.isNull()).show()
df.filter(col("state").isNull()).show()
The preceding examples return all rows containing null values in the “state” column as a new DataFrame. The output is the same for all three variants shown above.
Note: The filter() transformation doesn’t remove rows from the existing DataFrame, because DataFrames are immutable; it returns a new DataFrame containing only the matching rows. Unless you assign the result, the original dataset is not altered in any way.
# Output
+-----+-----+------+
|name |state|gender|
+-----+-----+------+
|James|null |M |
|Julia|null |null |
+-----+-----+------+
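To keep the filtered result, assign the returned DataFrame to a new variable, for example:
# filter() returns a new DataFrame; assign it to keep the filtered rows
df_state_null = df.filter(df.state.isNull())
df_state_null.show()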
2. Filter Rows with NULL on Multiple Columns
Filtering rows with NULL values on multiple columns involves applying the filter() transformation with multiple conditions, combined with logical operators: & (AND) and | (OR) in the DataFrame API, or AND and OR in SQL-style expressions. This allows you to select rows where one or more columns have NULL values.
# Filtering on multiple Columns
df.filter("state IS NULL AND gender IS NULL").show()
df.filter(df.state.isNull() & df.gender.isNull()).show()
This results in the below output.
# Output
+-----+-----+------+
|name |state|gender|
+-----+-----+------+
|Julia|null |null |
+-----+-----+------+
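Similarly, to match rows where either column is NULL, combine the conditions with | (OR), or use OR in the SQL-style expression:
# Rows where state OR gender is NULL
df.filter(df.state.isNull() | df.gender.isNull()).show()
df.filter("state IS NULL OR gender IS NULL").show()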
3. Filter Rows with IS NOT NULL or isNotNull
isNotNull() is used to filter rows whose values are NOT NULL in DataFrame columns.
# Imports
from pyspark.sql.functions import col
df.filter("state IS NOT NULL").show()
df.filter("NOT state IS NULL").show()
df.filter(df.state.isNotNull()).show()
df.filter(col("state").isNotNull()).show()
This yields the below output.
# Output
+----+-----+------+
|name|state|gender|
+----+-----+------+
|Anna|NY |F |
+----+-----+------+
Alternatively, you can write the same filter using df.na.drop():
# Filter using na.drop()
df.na.drop(subset=["state"]).show()
4. PySpark SQL Filter Rows with NULL Values
If you are familiar with SQL, you can register the DataFrame as a temporary view and use IS NULL and IS NOT NULL conditions to filter its rows.
# Filtering by spark.sql
df.createOrReplaceTempView("DATA")
spark.sql("SELECT * FROM DATA where STATE IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NULL AND GENDER IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NOT NULL").show()
5. Complete Example
Below is a complete PySpark example of how to filter rows with null values on selected columns.
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark: SparkSession = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()
data = [
    ("James", None, "M"),
    ("Anna", "NY", "F"),
    ("Julia", None, None)
]
columns = ["name", "state", "gender"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()
df.filter("state is NULL").show()
df.filter(df.state.isNull()).show()
df.filter(col("state").isNull()).show()
df.filter("state IS NULL AND gender IS NULL").show()
df.filter(df.state.isNull() & df.gender.isNull()).show()
df.filter("state is not NULL").show()
df.filter("NOT state is NULL").show()
df.filter(df.state.isNotNull()).show()
df.filter(col("state").isNotNull()).show()
df.na.drop(subset=["state"]).show()
df.createOrReplaceTempView("DATA")
spark.sql("SELECT * FROM DATA where STATE IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NULL AND GENDER IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NOT NULL").show()
Frequently Asked Questions on How to Filter Rows with NULL Values
How do I filter rows with null values in a PySpark DataFrame?
We can filter rows with null values using the filter() method and the isNull() function of the Column class. For example: df.filter(df["ColumnName"].isNull()).
How do I filter rows with null values on multiple columns?
To filter rows with null values in any of several columns, use the | (OR) operator within the filter() method. For example: df.filter(df["Column1"].isNull() | df["Column2"].isNull()).
How do I use the isNotNull() function to filter rows with non-null values?
We can use the isNotNull() function in combination with the filter() method to keep only rows with non-null values. For example: df.filter(df["ColumnName"].isNotNull()).
How do I filter out rows with null values in any column?
We can drop such rows using the na attribute of the DataFrame. For example: df.na.drop() removes all rows that contain a null value in any column (pass how="all" to drop only rows where every column is null).
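Related: to find rows where any column is NULL without listing each column by hand, you can build the condition programmatically; a minimal sketch using functools.reduce:
# Imports
from functools import reduce
from pyspark.sql.functions import col

# OR together an "IS NULL" check for every column of the DataFrame
any_null = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])
df.filter(any_null).show()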
Conclusion
In this PySpark article, you have learned how to filter rows with NULL values from DataFrame using isNull() and isNotNull() (NOT NULL). These come in handy when you need to clean up the DataFrame rows before processing.
Happy Learning !!
Related Articles
- Dynamic way of doing ETL through Pyspark
- PySpark Get Number of Rows and Columns
- PySpark – Find Count of null, None, NaN Values
- PySpark fillna() & fill() – Replace NULL/None Values
- PySpark isNull() & isNotNull()
- PySpark Count of Non null, nan Values in DataFrame
- PySpark Replace Empty Value With None/null on DataFrame
- PySpark Drop Rows with NULL or None Values