PySpark isNull() & isNotNull()

The pyspark.sql.Column.isNull() function is used to check whether the current expression or a column value is NULL/None; it returns True when it is. The pyspark.sql.Column.isNotNull() function is used to check whether the current expression or a column value is NOT NULL; it returns True when the column holds a value.

While working with PySpark DataFrames, we often need to check whether a condition expression or column value is NULL or NOT NULL, and these functions come in handy. This article will also help you understand the difference between PySpark isNull() and isNotNull().

First, let’s create a DataFrame from a list.


# Import
from pyspark.sql import SparkSession

# Create SparkSession 
spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com").getOrCreate()

# Create DataFrame
data = [
    ("James",None,"M"),
    ("Anna","NY","F"),
    ("Julia",None,None)
  ]

columns = ["name","state","gender"]
df = spark.createDataFrame(data,columns)
df.show()

Yields below output. Notice that None in the above example is represented as null in the DataFrame result.

+-----+-----+------+
| name|state|gender|
+-----+-----+------+
|James| null|     M|
| Anna|   NY|     F|
|Julia| null|  null|
+-----+-----+------+

1. PySpark isNull()

The PySpark isNull() method returns True if the current expression is NULL/None. The isNull() method is defined on the Column class, while isnull() (with a lowercase n) is available in PySpark SQL functions.

  • pyspark.sql.Column.isNull()
  • pyspark.sql.functions.isnull()

1.1. Syntax of isNull()

The following is the syntax of isNull().


# Syntax of isNull()
Column.isNull()

# Syntax of isnull()
pyspark.sql.functions.isnull(col)

1.2. PySpark Column.isNull() Usage with Examples

To select rows that have a null value on a selected column, use filter() with the isNull() method of the PySpark Column class.

Note: The filter() transformation does not remove rows from the current DataFrame, because DataFrames are immutable; it returns a new DataFrame containing only the rows that match the condition. Unless you make an assignment, your statements have not changed the original data set at all.
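For example (a minimal sketch using the df created above), assign the result of filter() to a new variable if you want to keep it:

# filter() returns a new DataFrame; assign it to keep the result
df_state_nulls = df.filter(df.state.isNull())  # df itself is unchanged
df_state_nulls.show()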

The below statements return all rows that have null values on the state column, and the result is returned as a new DataFrame. All the below examples return the same output.


# Using Column.isNull()
from pyspark.sql.functions import col
df.filter(df.state.isNull()).show()
df.filter("state is NULL").show()
df.filter(col("state").isNull()).show()

Yields below output.

+-----+-----+------+
| name|state|gender|
+-----+-----+------+
|James| null|     M|
|Julia| null|  null|
+-----+-----+------+

1.3 Using Multiple Conditions

Let’s see how to select rows with NULL values on multiple columns of a DataFrame. To do so, use AND in a SQL expression string or the & operator on Column conditions.


df.filter("state IS NULL AND gender IS NULL").show()
df.filter(df.state.isNull() & df.gender.isNull()).show()

Yields below output.

+-----+-----+------+
| name|state|gender|
+-----+-----+------+
|Julia| null|  null|
+-----+-----+------+

1.4 PySpark SQL Function isnull()

pyspark.sql.functions.isnull() is another function that can be used to check whether a column value is null. To use it, first import it with from pyspark.sql.functions import isnull.


# functions.isnull()
from pyspark.sql.functions import isnull
df.select(isnull(df.state)).show()
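Since isnull() returns a boolean Column, you can also pass it to filter() or alias the result for readability. A small sketch (the alias state_is_null is just illustrative):

# Using isnull() inside filter() and with an alias
from pyspark.sql.functions import isnull
df.filter(isnull(df.state)).show()
df.select(df.name, isnull(df.state).alias("state_is_null")).show()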

2. PySpark isNotNull()

pyspark.sql.Column.isNotNull() – the PySpark isNotNull() method returns True if the current expression is NOT NULL/None. This method is only present on the Column class; there is no isnotnull() equivalent in pyspark.sql.functions.
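If you prefer the functions API anyway, one workaround (a sketch, not an equivalent function) is to negate isnull() with the ~ operator:

# Negating isnull() with ~ as a functions-style alternative to isNotNull()
from pyspark.sql.functions import isnull
df.filter(~isnull(df.state)).show()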

2.1 Syntax of isNotNull()

The following is the syntax of Column.isNotNull().


# Syntax of isNotNull()
Column.isNotNull()

2.2 PySpark isNotNull() Usage with Example

The below example uses the PySpark isNotNull() method from the Column class to check whether a column has a NOT NULL value.


# isNotNull() Examples
from pyspark.sql.functions import col
df.filter(df.state.isNotNull()).show()
df.filter(col("state").isNotNull()).show()

df.filter("state IS NOT NULL").show()
df.filter("NOT state IS NULL").show()

Yields below output.

+----+-----+------+
|name|state|gender|
+----+-----+------+
|Anna|   NY|     F|
+----+-----+------+
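As with isNull() in section 1.3, you can combine NOT NULL checks on multiple columns. A minimal sketch:

# Rows where both state and gender are NOT NULL
df.filter(df.state.isNotNull() & df.gender.isNotNull()).show()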

3. PySpark SQL Query

PySpark SQL does not let you call the Column methods isNull() and isNotNull() inside a query; instead, you check for NULL and NOT NULL with the standard SQL predicates.


df.createOrReplaceTempView("DATA")
spark.sql("SELECT * FROM DATA where STATE IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NULL AND GENDER IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NOT NULL").show()

Complete Example

Following is a complete example of using the PySpark isNull() and isNotNull() functions.


# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com").getOrCreate()

# Create DataFrame
data = [
    ("James",None,"M"),
    ("Anna","NY","F"),
    ("Julia",None,None)
  ]

columns = ["name","state","gender"]
df = spark.createDataFrame(data,columns)
df.show()

# Using isNull()
df.filter("state is NULL").show()
df.filter(df.state.isNull()).show()

from pyspark.sql.functions import col
df.filter(col("state").isNull()).show()

df.filter("state IS NULL AND gender IS NULL").show()
df.filter(df.state.isNull() & df.gender.isNull()).show()

from pyspark.sql.functions import isnull
df.select(isnull(df.state)).show()

# Using isNotNull()
from pyspark.sql.functions import col
df.filter("state IS NOT NULL").show()
df.filter("NOT state IS NULL").show()
df.filter(df.state.isNotNull()).show()
df.filter(col("state").isNotNull()).show()

# Alternatively, drop rows that have NULL on the state column
df.na.drop(subset=["state"]).show()

# Using PySpark SQL
df.createOrReplaceTempView("DATA")
spark.sql("SELECT * FROM DATA where STATE IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NULL AND GENDER IS NULL").show()
spark.sql("SELECT * FROM DATA where STATE IS NOT NULL").show()

Conclusion

In this PySpark article, you have learned how to check whether a column has a value or not by using the isNull() and isNotNull() functions, and you have also learned how to use pyspark.sql.functions.isnull().
