Spark isin() & IS NOT IN Operator Example

Question: In Spark & PySpark how to use isin() & IS NOT IN operators that are similar to IN & NOT IN functions available in SQL that check DataFrame column value exists/contains in a list of string values, when I tried to use isin(list_param) from the Column class, I am getting an error java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon

spark isin contains string

Solution: Using isin() & NOT isin() Operator

In Spark use isin() function of Column class to check if a column value of DataFrame exists/contains in a list of string values. Let’s see with an example. Below example filter the rows language column value present in ‘Java‘ & ‘Scala‘.


val data = Seq(("James","Java"),("Michael","Spark"),("Robert","Python"))
import spark.implicits._
val df = data.toDF("name","language")

//Using isin() function to check value in list of values
val listValues = Seq("Java","Scala")
df.filter(df("language").isin(listValues:_*)).show()

//+-----+--------+
//| name|language|
//+-----+--------+
//|James|    Java|
//+-----+--------+

How to check NOT isin() in Spark?

Spark DataFrame API doesn’t have a function to check value not exists in a list of values however you can use NOT operator(!) in conjunction with isin() function to negate the result.


//USE NOT operator '!' to check value NOT IN list of values
df.filter(!df("language").isin(listValues:_*)).show()

//+-------+--------+
//|   name|language|
//+-------+--------+
//|Michael|   Spark|
//| Robert|  Python|
//+-------+--------+

Alternatively, you can also use IN & NOT IN operators


df.filter("language IN ('Java','Scala')").show()
df.filter("language NOT IN ('Java','Scala')").show()

PySpark Example of using isin() & NOT isin() Operators

In PySpark also use isin() function of PySpark Column Type to check the value of a DataFrame column present/exists in or not in the list of values.

Use NOT operator (~) to negate the result of the isin() function in PySpark.


data = [("James","Java"),("Michael","Spark"),("Robert","Python")]
df = spark.createDataFrame(data=data,schema=["name","language"])

#Using isin() function to check value IN list of values
listValues = ["Java","Scala"]
df.filter(df.language.isin(listValues)).show()

#USE NOT operator '~' to check value NOT IN list of values
df.filter(~df.language.isin(listValues)).show()

#Also, can be used IN & NOT IN
df.filter("language IN ('Java','Scala')").show()
df.filter("language NOT IN ('Java','Scala')").show()

These PySpark examples results in same output as above.

Spark SQL Using IN and NOT IN Operators

In Spark SQL, isin() function doesn’t work instead you should use IN and NOT IN operators to check values present and not present in a list of values. In order to use SQL, make sure you create a temporary view using createOrReplaceTempView().


// Spark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "language IN ('Java','Scala')").show()

//Spark SQL NOT IN - Check value not in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB where " +
    "language NOT IN ('Java','Scala')").show()

These results same output as above.

Conclusion

In Spark & PySpark isin() function is used to check if the DataFrame column value exists in a list/array of values. To use IS NOT IN, use the NOT operator to negate the result of the isin() function.

Happy Learning !!

Related Articles

Reference

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply