
Spark array_contains() is an SQL array function used to check whether an element is present in an array type (ArrayType) column of a DataFrame. You can use array_contains() either to derive a new boolean column or to filter the DataFrame.

In this article, I will explain both of these scenarios.

array_contains() works as follows:

  • Checks whether a value is present in an array (ArrayType) column and returns one of the values below (see the quick sketch after this list).
    • true – when the value is present in the array.
    • false – when the value is not present.
    • null – when the array itself is null.
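
A quick way to see all three outcomes is to call array_contains() on literal arrays through Spark SQL. This is a minimal sketch; it assumes a SparkSession named spark is already in scope.


// Quick sketch of the three return values
  spark.sql("SELECT array_contains(array('Java','Scala'), 'Java')").show()   // true
  spark.sql("SELECT array_contains(array('Java','Scala'), 'C++')").show()    // false
  spark.sql("SELECT array_contains(cast(null as array<string>), 'Java')").show()  // null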

In order to explain how it works, first let’s create a DataFrame.


  // Sample data; note the third row has a null array column
  val data = Seq(
    Row("James,,Smith", List("Java","Scala","C++"), "CA"),
    Row("Michael,Rose,", List("Spark","Java","C++"), "NJ"),
    Row("Robert,,Williams", null, "NV")
  )

  // Schema with an ArrayType(StringType) column
  val schema = new StructType()
    .add("name", StringType)
    .add("languagesAtSchool", ArrayType(StringType))
    .add("currentState", StringType)

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(data), schema)
  df.printSchema()
  df.show(false)
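
For reference, df.show(false) on this data prints the rows below (the same three rows that appear in the outputs later in this article):


// Output:
+----------------+------------------+------------+
|name            |languagesAtSchool |currentState|
+----------------+------------------+------------+
|James,,Smith    |[Java, Scala, C++]|CA          |
|Michael,Rose,   |[Spark, Java, C++]|NJ          |
|Robert,,Williams|null              |NV          |
+----------------+------------------+------------+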

1. Using it on withColumn()

First, let’s use this function on withColumn() to derive a new boolean column.


// Using it on withColumn()
  val df2 = df.withColumn("Java Present",
    array_contains(col("languagesAtSchool"), "Java"))
  df2.show(false)

This creates a new boolean column “Java Present” on the new DataFrame df2.


// Output:
+----------------+------------------+------------+------------+
|name            |languagesAtSchool |currentState|Java Present|
+----------------+------------------+------------+------------+
|James,,Smith    |[Java, Scala, C++]|CA          |true        |
|Michael,Rose,   |[Spark, Java, C++]|NJ          |true        |
|Robert,,Williams|null              |NV          |null        |
+----------------+------------------+------------+------------+

Alternatively, you can use this function inside select() to produce the same result, as sketched below.
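
Here is a minimal sketch of the select() variant; the name df2Select is hypothetical, introduced just for this example.


// Using it on select(); equivalent to the withColumn() version above
  val df2Select = df.select(col("*"),
    array_contains(col("languagesAtSchool"), "Java").as("Java Present"))
  df2Select.show(false)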

2. Using it on where filter()

We can also use array_contains() to filter rows of the DataFrame.


// Using it on where filter()
  val df3 = df.where(array_contains(col("languagesAtSchool"), "Java"))
  df3.show(false)

This yields the below output. It returns only the rows that have “Java” in the “languagesAtSchool” array column.


// Output:
+-------------+------------------+------------+
|name         |languagesAtSchool |currentState|
+-------------+------------------+------------+
|James,,Smith |[Java, Scala, C++]|CA          |
|Michael,Rose,|[Spark, Java, C++]|NJ          |
+-------------+------------------+------------+
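
The same filter can also be expressed in Spark SQL after registering the DataFrame as a temporary view. This is a sketch under that assumption; the view name persons is hypothetical.


// Equivalent filter in Spark SQL; "persons" is a hypothetical view name
  df.createOrReplaceTempView("persons")
  spark.sql("SELECT * FROM persons WHERE array_contains(languagesAtSchool, 'Java')")
    .show(false)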

3. Complete Example of array_contains() function

Below is a complete example of using the Spark SQL function array_contains() on a DataFrame.


import org.apache.spark.sql.functions.{array_contains, col}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object ArrayContainsExample extends App {

  val spark = SparkSession.builder().appName("SparkByExamples.com")
    .master("local[1]")
    .getOrCreate()

  // Sample data; the third row has a null array column
  val data = Seq(
    Row("James,,Smith", List("Java","Scala","C++"), "CA"),
    Row("Michael,Rose,", List("Spark","Java","C++"), "NJ"),
    Row("Robert,,Williams", null, "NV")
  )

  val schema = new StructType()
    .add("name", StringType)
    .add("languagesAtSchool", ArrayType(StringType))
    .add("currentState", StringType)

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(data), schema)
  df.printSchema()
  df.show(false)

  // Derive a new boolean column from the array column
  val df2 = df.withColumn("Java Present",
    array_contains(col("languagesAtSchool"), "Java"))
  df2.show(false)

  // Filter rows where the array contains "Java"
  val df3 = df.where(array_contains(col("languagesAtSchool"), "Java"))
  df3.show(false)
}

Happy Learning !!
