
Spark array_contains() is a Spark SQL array function used to check whether an element value is present in an array type (ArrayType) column of a DataFrame. You can use the array_contains() function either to derive a new boolean column or to filter the DataFrame.

In this article, I will explain both of these scenarios.

array_contains() works as follows:

  • It checks whether a value is present in an array (ArrayType) column and returns one of the values below (a quick illustration follows this list).
    • true – when the value is present in the array.
    • false – when the value is not present.
    • null – when the array itself is null.
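For instance, here is a minimal sketch of the three cases, assuming an active SparkSession named spark; the column aliases are just illustrative.


import org.apache.spark.sql.functions.{array, array_contains, lit}

spark.range(1).select(
  array_contains(array(lit("Java"), lit("Scala")), "Java").as("found"),       // true
  array_contains(array(lit("Java"), lit("Scala")), "Python").as("not_found"), // false
  array_contains(lit(null).cast("array<string>"), "Java").as("null_array")    // null
).show()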

In order to explain how it works, first let’s create a DataFrame.


// Imports
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Sample data with an array column; the last row has a null array
val data = Seq(
  Row("James,,Smith", List("Java", "Scala", "C++"), "CA"),
  Row("Michael,Rose,", List("Spark", "Java", "C++"), "NJ"),
  Row("Robert,,Williams", null, "NV")
)

// Schema with an ArrayType(StringType) column
val schema = new StructType()
  .add("name", StringType)
  .add("languagesAtSchool", ArrayType(StringType))
  .add("currentState", StringType)

// Assumes an existing SparkSession named `spark` (see the complete example below)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data), schema)
df.printSchema()
df.show(false)

1. Using it on withColumn()

First, let’s use this function on withColumn() to derive a new boolean column.


import org.apache.spark.sql.functions.{array_contains, col}

// Using it on withColumn() to derive a new boolean column
val df2 = df.withColumn("Java Present",
  array_contains(col("languagesAtSchool"), "Java"))
df2.show(false)

This creates a new boolean column “Java Present” on the new DataFrame.


// Output:
+----------------+------------------+------------+------------+
|name            |languagesAtSchool |currentState|Java Present|
+----------------+------------------+------------+------------+
|James,,Smith    |[Java, Scala, C++]|CA          |true        |
|Michael,Rose,   |[Spark, Java, C++]|NJ          |true        |
|Robert,,Williams|null              |NV          |null        |
+----------------+------------------+------------+------------+

Alternatively, you can also use this function with select(), which yields the same result.
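
Here is a minimal sketch of the select() form; the df2Select name is just illustrative, and it produces the same “Java Present” column as the withColumn() version above.


import org.apache.spark.sql.functions.{array_contains, col}

// Select all existing columns plus the derived boolean column
val df2Select = df.select(col("*"),
  array_contains(col("languagesAtSchool"), "Java").as("Java Present"))
df2Select.show(false)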

2. Using it on where filter()

We can also use array_contains() to filter rows of the DataFrame.


// Using it on where() filter
val df3 = df.where(array_contains(col("languagesAtSchool"), "Java"))
df3.show(false)

This yields the output below. It returns only the rows that have “Java” present in the “languagesAtSchool” array column. Note that the row with a null array is excluded as well, since the predicate evaluates to null rather than true.


// Output:
+-------------+------------------+------------+
|name         |languagesAtSchool |currentState|
+-------------+------------------+------------+
|James,,Smith |[Java, Scala, C++]|CA          |
|Michael,Rose,|[Spark, Java, C++]|NJ          |
+-------------+------------------+------------+
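
As a small variation (not shown in the original output above), the condition can be negated to keep only the rows where “Java” is absent; the df4 name is just illustrative.


import org.apache.spark.sql.functions.{array_contains, col, not}

// Keep rows whose array does not contain "Java"; the null-array row is still
// dropped, since the negated predicate also evaluates to null rather than true
val df4 = df.where(not(array_contains(col("languagesAtSchool"), "Java")))
df4.show(false)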

3. Complete Example of array_contains() function

Below is a complete example of the Spark SQL function array_contains() usage on a DataFrame.


import org.apache.spark.sql.functions.{array_contains, col}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object ArrayContainsExample extends App {

  // Create a SparkSession
  val spark = SparkSession.builder().appName("SparkByExamples.com")
    .master("local[1]")
    .getOrCreate()

  // Sample data; the last row has a null array
  val data = Seq(
    Row("James,,Smith", List("Java", "Scala", "C++"), "CA"),
    Row("Michael,Rose,", List("Spark", "Java", "C++"), "NJ"),
    Row("Robert,,Williams", null, "NV")
  )

  val schema = new StructType()
    .add("name", StringType)
    .add("languagesAtSchool", ArrayType(StringType))
    .add("currentState", StringType)

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(data), schema)
  df.printSchema()
  df.show(false)

  // Derive a new boolean column with withColumn()
  val df2 = df.withColumn("Java Present",
    array_contains(col("languagesAtSchool"), "Java"))
  df2.show(false)

  // Filter rows with where()
  val df3 = df.where(array_contains(col("languagesAtSchool"), "Java"))
  df3.show(false)
}
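
Since array_contains() is also available in Spark SQL syntax, the same check can be expressed with a SQL query. A minimal sketch, assuming the DataFrame above is registered under the illustrative view name "students":


// Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("students")
spark.sql(
  "SELECT name, array_contains(languagesAtSchool, 'Java') AS java_present FROM students"
).show(false)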

Happy Learning !!