You are currently viewing PySpark NOT isin() or IS NOT IN Operator

The NOT isin() operation in PySpark is used to filter rows in a DataFrame where the column’s value is not present in a specified list of values. This is accomplished by using the isin() function combined with the negation operator ~.

Advertisements

The resulting boolean column indicates True for rows where the value is absent from the list, effectively excluding those values from the DataFrame. isin() is a function of Column class which returns a boolean value True if the value of the expression is satisfied

The NOT IN condition (sometimes called the NOT Operator) is used to negate a condition of isin() result.

1. NOT IN Examples

Quick examples of using NOT IN operator to filter rows from DataFrame.


# PySpark not isin()
listValues = ["Java","Scala"]
df.filter(~df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(~col("languages").isin(listValues)).show()

# Using NOT IN operator
df.filter("languages in ('Java','Scala')" ).show()

# PySpark SQL NOT IN
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "language NOT IN ('Java','Scala')").show()

2. Syntax of NOT IN

Usage of NOT IN


# Syntax of NOT isin()
~ Column.isin(*cols)

Here, *cols is python syntax for expanding an array to dump its elements into the function parameters one at a time in order. Let’s create a DataFrame and run the above examples.


#Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") 
    .appName("SparkByExamples.com").getOrCreate()

# Create DataFrame    
data = [("James","Java"),("Michael","Spark"),
        ("Robert","Python")]
columns = ["name","languages"]
df = spark.createDataFrame(data,columns)
df.show()

3. PySpark NOT IN Example

PySpark DataFrame API doesn’t have a function notin() to check value does not exist in a list of values however, you can use NOT operator(~) in conjunction with isin() function to negate the result. Let’s see with an example, below example filter the rows languages column value not present in ‘Java‘ & ‘Scala‘.


# PySpark NOT isin()
listValues = ["Java","Scala"]
df.filter(~df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(~col("languages").isin(listValues)).show()

# Using == operator
df.filter(dataframe.column.isin(listValues) == False)

Yields below output.

pyspark not in

4. Using NOT IN Operator with Expression

Let’s see how to use NOT IN operator in the PySpark expression condition.


# Using NOT IN operator
df.filter("languages NOT IN ('Java','Scala')" ).show()

5. PySpark SQL NOT IN Operator

In PySpark SQL, you can use NOT IN operator to check values not exists in a list of values, it is usually used with the WHERE clause. In order to use SQL, make sure you define a temporary view or table using createOrReplaceTempView().


# PySpark SQL NOT IN
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages NOT IN ('Java','Scala')").show()

These return the same output as above.

6. Conclusion

In PySpark, to filter rows where a column’s value is not in a specified list of values, you can use the negation of the isin() function. This can be achieved by combining isin() with the ~ operator. This operation returns a boolean column that is True for rows where the column’s value does not match any value in the list.

Happy Learning !!

Reference

This Post Has One Comment

  1. Harlan

    If one of the element is None, the ~ isin filter will not include that row.

Comments are closed.