PySpark NOT isin() or IS NOT IN Operator

  • Post author:
  • Post category:PySpark
  • Post last modified:December 20, 2022
  • Reading time:6 mins read

PySpark IS NOT IN condition is used to exclude the defined multiple values in a where() or filter() function condition. In other words, it is used to check/filter if the DataFrame values do not exist/contains in the list of values. isin() is a function of Column class which returns a boolean value True if the value of the expression is contained by the evaluated values of the arguments.

The NOT IN condition (sometimes called the NOT Operator) is used to negate a condition of isin() result.

1. Quick Examples of Using NOT IN

Following are quick examples of how to use the NOT IN operator to filter rows from DataFrame.


# PySpark not isin()
listValues = ["Java","Scala"]
df.filter(~df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(~col("languages").isin(listValues)).show()

# Using NOT IN operator
df.filter("languages in ('Java','Scala')" ).show()

# PySpark SQL NOT IN
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "language NOT IN ('Java','Scala')").show()

2. Syntax of NOT IN

Following is the syntax of how to use NOT IN


# Syntax of NOT isin()
~ Column.isin(*cols)

Here, *cols is python syntax for expanding an array to dump its elements into the function parameters one at a time in order. Let’s create a DataFrame and run the above examples.


#Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") 
    .appName("SparkByExamples.com").getOrCreate()

# Create DataFrame    
data = [("James","Java"),("Michael","Spark"),
        ("Robert","Python")]
columns = ["name","languages"]
df = spark.createDataFrame(data,columns)
df.show()

3. PySpark NOT IN Example

PySpark DataFrame API doesn’t have a function notin() to check value does not exist in a list of values however, you can use NOT operator(~) in conjunction with isin() function to negate the result. Let’s see with an example, below example filter the rows languages column value not present in ‘Java‘ & ‘Scala‘.


# PySpark NOT isin()
listValues = ["Java","Scala"]
df.filter(~df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(~col("languages").isin(listValues)).show()

# Using == operator
df.filter(dataframe.column.isin(listValues) == False)

Yields below output.

pyspark not in

4. Using NOT IN Operator with Expression

Let’s see how to use NOT IN operator in the PySpark expression condition.


# Using NOT IN operator
df.filter("languages NOT IN ('Java','Scala')" ).show()

5. PySpark SQL NOT IN Operator

In PySpark SQL, you can use NOT IN operator to check values not exists in a list of values, it is usually used with the WHERE clause. In order to use SQL, make sure you create a temporary view using createOrReplaceTempView().


# PySpark SQL NOT IN
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages NOT IN ('Java','Scala')").show()

These results same output as above.

6. Conclusion

PySpark IS NOT IN is used to filter rows that are not present or exist in a list/array of values. This is similar to SQL NOT IN operator.

Happy Learning !!

Reference

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply

This Post Has One Comment

  1. Harlan

    If one of the element is None, the ~ isin filter will not include that row.