While working with Spark DataFrames, we often need to filter rows with NULL values in DataFrame columns; you can do this by checking IS NULL
or IS NOT NULL
conditions.
In many cases, NULL values in columns need to be handled before you perform any operations on those columns, as operations on NULL values produce unexpected results. Hence, it is always a good practice to clean up the data before processing it.
Note: Spark doesn’t support column === null
as a NULL check; the comparison evaluates to NULL for every row, so the filter matches nothing. Use isNull (or the null-safe operator <=>) instead.
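If you do need an equality comparison that treats NULL as a regular value, Spark's Column API provides the null-safe equality operator <=> (method form eqNullSafe). A minimal sketch, using the df created below:

```scala
// Null-safe equality: NULL <=> NULL evaluates to true,
// whereas === with NULL evaluates to NULL and matches nothing.
df.filter(col("state") <=> null).show(false)

// Equivalent method form
df.filter(col("state").eqNullSafe(null)).show(false)
```

Both filters return the rows where state is NULL, matching the isNull examples later in this article.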
We need to gracefully handle NULL values as the first step before processing. Also, while writing a DataFrame to files, it’s a good practice to store them without NULL values, either by dropping rows with NULL values or by replacing NULL values with an empty string.
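For the second option, replacing NULL values with an empty string before writing, you can use df.na.fill(). A short sketch (assuming a DataFrame df whose state and gender columns are strings, like the one created below):

```scala
// Replace NULLs with empty strings in the listed columns
val cleaned = df.na.fill("", Seq("state", "gender"))

// Or fill every string-typed column at once
val cleanedAll = df.na.fill("")
```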
- Filter Rows with NULL Values in DataFrame
- Filter Rows with NULL on Multiple Columns
- Filter Rows with IS NOT NULL or isNotNull
Before we start, let’s create a DataFrame with rows containing NULL values.
val data = Seq(
("James",null,"M"),
("Anna","NY","F"),
("Julia",null,null)
)
import spark.implicits._
val columns = Seq("name","state","gender")
val df = data.toDF(columns:_*)
df.show()
This yields the below output. As you can see, the state
and gender
columns contain NULL values.
+-----+-----+------+
| name|state|gender|
+-----+-----+------+
|James| null| M|
| Anna| NY| F|
|Julia| null| null|
+-----+-----+------+
Now, let’s see how to filter rows with null values on DataFrame.
Filter Rows with NULL Values in DataFrame
In Spark, using the filter()
or where()
functions of DataFrame, we can filter rows with NULL values by checking IS NULL
(in an SQL expression) or isNull
(the Column function).
df.filter("state is NULL").show(false)
df.filter(df("state").isNull).show(false)
df.filter(col("state").isNull).show(false) // Requires import org.apache.spark.sql.functions.col
These keep only the rows with NULL values in the state
column and return a new DataFrame. All of the above examples return the same output.
+-----+-----+------+
|name |state|gender|
+-----+-----+------+
|James|null |M |
|Julia|null |null |
+-----+-----+------+
Alternatively, to get the opposite result (dropping the rows with NULL values in the state column), you can use <a href="https://sparkbyexamples.com/spark/spark-dataframe-drop-rows-with-null-values/">df.na.drop()</a>
df.na.drop(Seq("state")).show(false)
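na.drop() also accepts a how argument that controls when a row is dropped. Note that all of these drop the NULL rows, the inverse of the isNull filters above; a quick sketch:

```scala
// Drop rows where any column is NULL (the default behavior)
df.na.drop().show(false)
df.na.drop("any").show(false)

// Drop rows only when all columns are NULL
df.na.drop("all").show(false)

// Restrict the NULL check to specific columns
df.na.drop("any", Seq("state", "gender")).show(false)
```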
Filter Rows with NULL on Multiple Columns
Let’s see how to filter rows with NULL values on multiple columns in a DataFrame. To do so, you can use either the AND (SQL) or && (Column) operator.
df.filter("state is NULL AND gender is NULL").show(false)
df.filter(df("state").isNull && df("gender").isNull).show(false)
df.filter(col("state").isNull && col("gender").isNull).show(false) // Requires import of the col function
Yields below output.
+-----+-----+------+
|name |state|gender|
+-----+-----+------+
|Julia|null |null |
+-----+-----+------+
Filter Rows with IS NOT NULL or isNotNull
IS NOT NULL
or isNotNull
is used to filter rows that are NOT NULL in Spark DataFrame columns.
df.filter("state is not NULL").show(false)
df.filter("NOT state is NULL").show(false)
df.filter(df("state").isNotNull).show(false)
df.filter(col("state").isNotNull).show(false) // Requires import of the col function
Yields below output.
+----+-----+------+
|name|state|gender|
+----+-----+------+
|Anna|NY |F |
+----+-----+------+
Spark SQL Filter Rows with NULL Values
If you are familiar with Spark SQL, you can use IS NULL and IS NOT NULL to filter rows from the DataFrame.
df.createOrReplaceTempView("DATA")
spark.sql("SELECT * FROM DATA where STATE IS NULL").show(false)
spark.sql("SELECT * FROM DATA where STATE IS NULL AND GENDER IS NULL").show(false)
spark.sql("SELECT * FROM DATA where STATE IS NOT NULL").show(false)
Complete Example
Below is a complete Scala example of how to filter rows with null values on selected columns.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FilterNullRowsExample extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")

  val data = Seq(
    ("James", null, "M"),
    ("Anna", "NY", "F"),
    ("Julia", null, null)
  )
  import spark.implicits._
  val columns = Seq("name", "state", "gender")
  val df = data.toDF(columns: _*)
  df.printSchema()
  df.show()

  df.filter("state is NULL").show(false)
  df.filter(df("state").isNull).show(false)
  df.filter(col("state").isNull).show(false)

  df.filter("state is not NULL").show(false)
  df.filter("NOT state is NULL").show(false)
  df.filter(df("state").isNotNull).show(false)
  df.filter(col("state").isNotNull).show(false)

  df.filter("state is NULL AND gender is NULL").show(false)
  df.filter(df("state").isNull && df("gender").isNull).show(false)
}
Conclusion
In this article, you have learned how to filter rows with NULL values from DataFrame/Dataset using IS NULL/isNull and IS NOT NULL/isNotNull. These come in handy when you need to clean up the DataFrame rows before processing.
Thanks for reading. If you like the articles here, please leave a comment or share any suggestions for improvements in the comments section!
Happy Learning !!