The NOT isin()
operation in PySpark is used to filter rows in a DataFrame where the column’s value is not present in a specified list of values. This is accomplished by using the isin()
function combined with the negation operator ~
.
The resulting boolean column indicates True
for rows where the value is absent from the list, effectively excluding those values from the DataFrame. isin() is a function of Column class which returns a boolean value True if the value of the expression is satisfied
The NOT IN condition (sometimes called the NOT Operator) is used to negate a condition of isin() result.
1. NOT IN Examples
Quick examples of using NOT IN operator to filter rows from DataFrame.
# PySpark not isin()
listValues = ["Java","Scala"]
df.filter(~df.languages.isin(listValues)).show()
from pyspark.sql.functions import col
df.filter(~col("languages").isin(listValues)).show()
# Using NOT IN operator
df.filter("languages in ('Java','Scala')" ).show()
# PySpark SQL NOT IN
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
"language NOT IN ('Java','Scala')").show()
2. Syntax of NOT IN
Usage of NOT IN
# Syntax of NOT isin()
~ Column.isin(*cols)
Here, *cols is python syntax for expanding an array to dump its elements into the function parameters one at a time in order. Let’s create a DataFrame and run the above examples.
#Import
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.master("local[1]")
.appName("SparkByExamples.com").getOrCreate()
# Create DataFrame
data = [("James","Java"),("Michael","Spark"),
("Robert","Python")]
columns = ["name","languages"]
df = spark.createDataFrame(data,columns)
df.show()
3. PySpark NOT IN Example
PySpark DataFrame API doesn’t have a function notin() to check value does not exist in a list of values however, you can use NOT operator(~) in conjunction with isin() function to negate the result. Let’s see with an example, below example filter the rows languages
column value not present in ‘Java
‘ & ‘Scala
‘.
# PySpark NOT isin()
listValues = ["Java","Scala"]
df.filter(~df.languages.isin(listValues)).show()
from pyspark.sql.functions import col
df.filter(~col("languages").isin(listValues)).show()
# Using == operator
df.filter(dataframe.column.isin(listValues) == False)
Yields below output.
4. Using NOT IN Operator with Expression
Let’s see how to use NOT IN
operator in the PySpark expression condition.
# Using NOT IN operator
df.filter("languages NOT IN ('Java','Scala')" ).show()
5. PySpark SQL NOT IN Operator
In PySpark SQL, you can use NOT IN
operator to check values not exists in a list of values, it is usually used with the WHERE clause. In order to use SQL, make sure you define a temporary view or table using createOrReplaceTempView().
# PySpark SQL NOT IN
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
"languages NOT IN ('Java','Scala')").show()
These return the same output as above.
6. Conclusion
In PySpark, to filter rows where a column’s value is not in a specified list of values, you can use the negation of the isin()
function. This can be achieved by combining isin()
with the ~
operator. This operation returns a boolean column that is True
for rows where the column’s value does not match any value in the list.
Happy Learning !!
Related Articles
- PySpark How to Filter Rows with NULL Values
- PySpark Where Filter Function | Multiple Conditions
- PySpark isin() Function
- PySpark isNull() & isNotNull()
- PySpark Drop Rows with NULL or None Values
- PySpark withColumn() Usage with Examples
- PySpark split() Column into Multiple Columns
- PySpark Column Class | Operators & Functions
- PySpark Convert String to Array Column
If one of the element is None, the ~ isin filter will not include that row.