Site icon Spark By {Examples}

PySpark NOT isin() or IS NOT IN Operator

pyspark not in

PySpark IS NOT IN condition is used to exclude the defined multiple values in a where() or filter() function condition. In other words, it is used to check/filter if the DataFrame values do not exist/contains in the list of values. isin() is a function of Column class which returns a boolean value True if the value of the expression is contained by the evaluated values of the arguments.

The NOT IN condition (sometimes called the NOT Operator) is used to negate a condition of isin() result.

1. Quick Examples of Using NOT IN

Following are quick examples of how to use the NOT IN operator to filter rows from DataFrame.


# PySpark not isin()
listValues = ["Java","Scala"]
df.filter(~df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(~col("languages").isin(listValues)).show()

# Using NOT IN operator
df.filter("languages in ('Java','Scala')" ).show()

# PySpark SQL NOT IN
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "language NOT IN ('Java','Scala')").show()

2. Syntax of NOT IN

Following is the syntax of how to use NOT IN


# Syntax of NOT isin()
~ Column.isin(*cols)

Here, *cols is python syntax for expanding an array to dump its elements into the function parameters one at a time in order. Let’s create a DataFrame and run the above examples.


#Import
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.master("local[1]") 
    .appName("SparkByExamples.com").getOrCreate()

# Create DataFrame    
data = [("James","Java"),("Michael","Spark"),
        ("Robert","Python")]
columns = ["name","languages"]
df = spark.createDataFrame(data,columns)
df.show()

3. PySpark NOT IN Example

PySpark DataFrame API doesn’t have a function notin() to check value does not exist in a list of values however, you can use NOT operator(~) in conjunction with isin() function to negate the result. Let’s see with an example, below example filter the rows languages column value not present in ‘Java‘ & ‘Scala‘.


# PySpark NOT isin()
listValues = ["Java","Scala"]
df.filter(~df.languages.isin(listValues)).show()

from pyspark.sql.functions import col
df.filter(~col("languages").isin(listValues)).show()

# Using == operator
df.filter(dataframe.column.isin(listValues) == False)

Yields below output.

pyspark not in

4. Using NOT IN Operator with Expression

Let’s see how to use NOT IN operator in the PySpark expression condition.


# Using NOT IN operator
df.filter("languages NOT IN ('Java','Scala')" ).show()

5. PySpark SQL NOT IN Operator

In PySpark SQL, you can use NOT IN operator to check values not exists in a list of values, it is usually used with the WHERE clause. In order to use SQL, make sure you create a temporary view using createOrReplaceTempView().


# PySpark SQL NOT IN
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "languages NOT IN ('Java','Scala')").show()

These results same output as above.

6. Conclusion

PySpark IS NOT IN is used to filter rows that are not present or exist in a list/array of values. This is similar to SQL NOT IN operator.

Happy Learning !!

Reference

Exit mobile version