Spark isin() & IS NOT IN Operator Example

Question: In Spark how to use isin() & IS NOT IN operators that are similar to IN & NOT IN functions available in SQL that check DataFrame column value exists/contains in a list of string values, when I tried to use isin(list_param) from the Column class, I am getting an error java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon

spark isin contains string

1. Solution: Using isin() & NOT isin() Operator

In Spark use isin() function of Column class to check if a column value of DataFrame exists/contains in a list of string values. Let’s see with an example. Below example filter the rows language column value present in ‘Java‘ & ‘Scala‘.


val data = Seq(("James","Java"),("Michael","Spark"),("Robert","Python"))
import spark.implicits._
val df = data.toDF("name","language")

// Using isin() function to check value in list of values
val listValues = Seq("Java","Scala")
df.filter(df("language").isin(listValues:_*)).show()

// +-----+--------+
// | name|language|
// +-----+--------+
// |James|    Java|
// +-----+--------+

2. How to check NOT isin() in Spark?

Spark DataFrame API doesn’t have a function to check value not exists in a list of values however you can use NOT operator(!) in conjunction with isin() function to negate the result.


// USE NOT operator '!' to check value NOT IN list of values
df.filter(!df("language").isin(listValues:_*)).show()

// +-------+--------+
// |   name|language|
// +-------+--------+
// |Michael|   Spark|
// | Robert|  Python|
// +-------+--------+

Alternatively, you can also use IN & NOT IN operators


df.filter("language IN ('Java','Scala')").show()
df.filter("language NOT IN ('Java','Scala')").show()

3. Spark SQL Using IN and NOT IN Operators

In Spark SQL, isin() function doesn’t work instead you should use IN and NOT IN operators to check values present and not present in a list of values. In order to use SQL, make sure you create a temporary view using createOrReplaceTempView().


// Spark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
    "language IN ('Java','Scala')").show()

// Spark SQL NOT IN - Check value not in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB where " +
    "language NOT IN ('Java','Scala')").show()

These results same output as above.

Conclusion

In Spark isin() function is used to check if the DataFrame column value exists in a list/array of values. To use IS NOT IN, use the NOT operator to negate the result of the isin() function.

Happy Learning !!

Reference

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

Leave a Reply