Spark array_contains()
is a Spark SQL array function used to check whether an element value is present in an array type (ArrayType) column of a DataFrame. You can use array_contains() either to derive a new boolean column or to filter the DataFrame.
In this article, I will explain both scenarios.
array_contains() works as follows:
- Checks whether a value is present in an array (ArrayType) column and returns one of the values below (a quick check illustrating all three results follows this list).
- true – when the value is present in the array.
- false – when the value is not present.
- null – when the array itself is null.
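As a quick illustration of these three outcomes, here is a minimal sketch, assuming a SparkSession named spark is already available (for example, from spark-shell); it evaluates array_contains() directly with Spark SQL.
// Assumes an existing SparkSession named `spark` (e.g., from spark-shell)
spark.sql("SELECT array_contains(array('Java','Scala'), 'Java')").show()       // true  – value present
spark.sql("SELECT array_contains(array('Java','Scala'), 'C++')").show()        // false – value not present
spark.sql("SELECT array_contains(cast(null as array<string>), 'Java')").show() // null  – array is null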
In order to explain how it works, first let’s create a DataFrame.
// Imports
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Create SparkSession (skip this if you are in spark-shell, where `spark` already exists)
val spark = SparkSession.builder()
  .appName("SparkByExamples.com")
  .master("local[1]")
  .getOrCreate()

// Sample data with an ArrayType column; the last row has a null array
val data = Seq(
  Row("James,,Smith", List("Java", "Scala", "C++"), "CA"),
  Row("Michael,Rose,", List("Spark", "Java", "C++"), "NJ"),
  Row("Robert,,Williams", null, "NV")
)

// Schema: name, languagesAtSchool (array of strings), currentState
val schema = new StructType()
  .add("name", StringType)
  .add("languagesAtSchool", ArrayType(StringType))
  .add("currentState", StringType)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data), schema)
df.printSchema()
df.show(false)
1. Using it on withColumn()
First, let’s use this function on withColumn() to derive a new boolean column.
import org.apache.spark.sql.functions.{array_contains, col}

// Using it on withColumn()
val df2 = df.withColumn("Java Present",
  array_contains(col("languagesAtSchool"), "Java"))
df2.show(false)
This creates a new boolean column “Java Present” on the new DataFrame.
// Output:
+----------------+------------------+------------+------------+
|name |languagesAtSchool |currentState|Java Present|
+----------------+------------------+------------+------------+
|James,,Smith |[Java, Scala, C++]|CA |true |
|Michael,Rose, |[Spark, Java, C++]|NJ |true |
|Robert,,Williams|null |NV |null |
+----------------+------------------+------------+------------+
Alternatively, you can also use this function with select(), which produces the same result, as shown in the sketch below.
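Here is a minimal sketch of the select() version; the column list and the alias “Java Present” are my own choices for illustration, not part of the original example.
// Equivalent select() version deriving the same boolean column
val df2Select = df.select(col("name"), col("languagesAtSchool"), col("currentState"),
  array_contains(col("languagesAtSchool"), "Java").alias("Java Present"))
df2Select.show(false)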
2. Using it on where filter()
We can also use array_contains() to filter rows of the DataFrame.
// Using it on where filter()
val df3 = df.where(array_contains(col("languagesAtSchool"), "Java"))
df3.show(false)
This yields the below output. It returns only the rows that have “Java” present in the “languagesAtSchool” array column.
// Output:
+-------------+------------------+------------+
|name |languagesAtSchool |currentState|
+-------------+------------------+------------+
|James,,Smith |[Java, Scala, C++]|CA |
|Michael,Rose,|[Spark, Java, C++]|NJ |
+-------------+------------------+------------+
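Because array_contains() is also a Spark SQL function, the same filter can be expressed as a SQL query. The sketch below assumes a temporary view name of my own choosing (“PERSON”).
// Same filter via Spark SQL; the view name PERSON is arbitrary
df.createOrReplaceTempView("PERSON")
spark.sql("SELECT * FROM PERSON WHERE array_contains(languagesAtSchool, 'Java')").show(false)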
3. Complete Example of array_contains() function
Below is a complete example of using the Spark SQL array_contains() function on a DataFrame.
import org.apache.spark.sql.functions.{array_contains, col}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object ArrayContainsExample extends App {

  // Create SparkSession
  val spark = SparkSession.builder().appName("SparkByExamples.com")
    .master("local[1]")
    .getOrCreate()

  // Sample data with an ArrayType column; the last row has a null array
  val data = Seq(
    Row("James,,Smith", List("Java", "Scala", "C++"), "CA"),
    Row("Michael,Rose,", List("Spark", "Java", "C++"), "NJ"),
    Row("Robert,,Williams", null, "NV")
  )

  val schema = new StructType()
    .add("name", StringType)
    .add("languagesAtSchool", ArrayType(StringType))
    .add("currentState", StringType)

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(data), schema)
  df.printSchema()
  df.show(false)

  // Derive a new boolean column using array_contains()
  val df2 = df.withColumn("Java Present",
    array_contains(col("languagesAtSchool"), "Java"))
  df2.show(false)

  // Filter rows where the array contains "Java"
  val df3 = df.where(array_contains(col("languagesAtSchool"), "Java"))
  df3.show(false)
}
Happy Learning !!