Question: In Spark how to use isin() & IS NOT IN operators that are similar to IN & NOT IN functions available in SQL that check DataFrame column value exists/contains in a list of string values, when I tried to use isin(list_param) from the Column class, I am getting an error java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon
1. Solution: Using isin() & NOT isin() Operator
In Spark use isin()
function of Column class to check if a column value of DataFrame exists/contains in a list of string values. Let’s see with an example. Below example filter the rows language
column value present in ‘Java
‘ & ‘Scala
‘.
val data = Seq(("James","Java"),("Michael","Spark"),("Robert","Python"))
import spark.implicits._
val df = data.toDF("name","language")
// Using isin() function to check value in list of values
val listValues = Seq("Java","Scala")
df.filter(df("language").isin(listValues:_*)).show()
// +-----+--------+
// | name|language|
// +-----+--------+
// |James| Java|
// +-----+--------+
2. How to check NOT isin() in Spark?
Spark DataFrame API doesn’t have a function to check value not exists in a list of values however you can use NOT operator(!) in conjunction with isin() function to negate the result.
// USE NOT operator '!' to check value NOT IN list of values
df.filter(!df("language").isin(listValues:_*)).show()
// +-------+--------+
// | name|language|
// +-------+--------+
// |Michael| Spark|
// | Robert| Python|
// +-------+--------+
Alternatively, you can also use IN & NOT IN operators
df.filter("language IN ('Java','Scala')").show()
df.filter("language NOT IN ('Java','Scala')").show()
3. Spark SQL Using IN and NOT IN Operators
In Spark SQL, isin() function doesn’t work instead you should use IN
and NOT IN
operators to check values present and not present in a list of values. In order to use SQL, make sure you create a temporary view using createOrReplaceTempView()
.
// Spark SQL IN - check value in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB WHERE " +
"language IN ('Java','Scala')").show()
// Spark SQL NOT IN - Check value not in a list of values
df.createOrReplaceTempView("TAB")
spark.sql("SELECT * FROM TAB where " +
"language NOT IN ('Java','Scala')").show()
These results same output as above.
Conclusion
In Spark isin() function is used to check if the DataFrame column value exists in a list/array of values. To use IS NOT IN, use the NOT operator to negate the result of the isin() function.
Happy Learning !!
Related Articles
- Spark How to filter using contains(), like() Examples
- Spark array_contains() example
- Apache Spark Interview Questions
- Spark – Working with collect_list() and collect_set() functions
- Spark DataFrame count
- Spark groupByKey()
- Spark JDBC Parallel Read