The Spark where()
function is used to filter rows from a DataFrame or Dataset based on a given condition or SQL expression. In this tutorial, you will learn how to apply single and multiple conditions on DataFrame columns using the where()
function, with Scala examples.
Spark DataFrame where() Syntaxes
1) where(condition: Column): Dataset[T]
2) where(conditionExpr: String): Dataset[T] //using SQL expression
The first signature takes a Column condition; column names can be referenced using $"colname"
, col("colname")
, 'colname
or df("colname")
. The second signature takes a SQL expression string to filter rows.
Note that where() is an alias for filter(), and filter() additionally accepts two function-based variants: a Scala function or a FilterFunction
instance is applied to each row, and rows for which it returns true are kept.
3) filter(func: T => Boolean): Dataset[T]
4) filter(func: FilterFunction[T]): Dataset[T]
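For completeness, here is a minimal sketch of the function-based filter() variants. It assumes a hypothetical Person case class and an active SparkSession named spark with its implicits imported; neither is part of this tutorial's example data.
import org.apache.spark.api.java.function.FilterFunction
import spark.implicits._ // for .toDS()

//Hypothetical type, for illustration only
case class Person(name: String, state: String)
val ds = Seq(Person("James", "OH"), Person("Anna", "NY")).toDS()

//Scala function applied to each row; rows returning true are kept
ds.filter(p => p.state == "OH").show()

//FilterFunction variant, typically used from Java
ds.filter(new FilterFunction[Person] {
  override def call(p: Person): Boolean = p.state == "OH"
}).show()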
Before we start with the examples, let's first create a DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

val arrayStructureData = Seq(
  Row(Row("James","","Smith"),List("Java","Scala","C++"),"OH","M"),
  Row(Row("Anna","Rose",""),List("Spark","Java","C++"),"NY","F"),
  Row(Row("Julia","","Williams"),List("CSharp","VB"),"OH","F"),
  Row(Row("Maria","Anne","Jones"),List("CSharp","VB"),"NY","M"),
  Row(Row("Jen","Mary","Brown"),List("CSharp","VB"),"NY","M"),
  Row(Row("Mike","Mary","Williams"),List("Python","VB"),"OH","M")
)

val arrayStructureSchema = new StructType()
  .add("name", new StructType()
    .add("firstname", StringType)
    .add("middlename", StringType)
    .add("lastname", StringType))
  .add("languages", ArrayType(StringType))
  .add("state", StringType)
  .add("gender", StringType)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
df.printSchema()
df.show()
This yields the schema and DataFrame shown below.
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- languages: array (nullable = true)
| |-- element: string (containsNull = true)
|-- state: string (nullable = true)
|-- gender: string (nullable = true)
+--------------------+------------------+-----+------+
| name| languages|state|gender|
+--------------------+------------------+-----+------+
| [James, , Smith]|[Java, Scala, C++]| OH| M|
| [Anna, Rose, ]|[Spark, Java, C++]| NY| F|
| [Julia, , Williams]| [CSharp, VB]| OH| F|
|[Maria, Anne, Jones]| [CSharp, VB]| NY| M|
| [Jen, Mary, Brown]| [CSharp, VB]| NY| M|
|[Mike, Mary, Will...| [Python, VB]| OH| M|
+--------------------+------------------+-----+------+
DataFrame where() with Column condition
Use a Column with a condition to filter rows from a DataFrame. This lets you express complex conditions by referring to column names with col("colname")
, $"colname"
or dfObject("colname")
, and it is the most common approach when working with DataFrames. Use "===" for equality comparison.
df.where(df("state") === "OH")
.show(false)
This yields the DataFrame shown below.
+----------------------+------------------+-----+------+
|name |languages |state|gender|
+----------------------+------------------+-----+------+
|[James, , Smith] |[Java, Scala, C++]|OH |M |
|[Julia, , Williams] |[CSharp, VB] |OH |F |
|[Mike, Mary, Williams]|[Python, VB] |OH |M |
+----------------------+------------------+-----+------+
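The same condition can also be expressed with the other column-reference styles mentioned above. The $"colname" and 'colname syntaxes assume import spark.implicits._ is in scope.
import org.apache.spark.sql.functions.col
import spark.implicits._

df.where(col("state") === "OH").show(false)
df.where($"state" === "OH").show(false)
df.where('state === "OH").show(false)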
DataFrame where() with SQL Expression
If you are coming from a SQL background, you can use that knowledge in Spark to filter DataFrame rows with SQL expressions.
df.where("gender == 'M'")
.show(false)
This yields the DataFrame shown below.
+----------------------+------------------+-----+------+
|name |languages |state|gender|
+----------------------+------------------+-----+------+
|[James, , Smith] |[Java, Scala, C++]|OH |M |
|[Maria, Anne, Jones] |[CSharp, VB] |NY |M |
|[Jen, Mary, Brown] |[CSharp, VB] |NY |M |
|[Mike, Mary, Williams]|[Python, VB] |OH |M |
+----------------------+------------------+-----+------+
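Since the argument is a regular Spark SQL expression, operators such as AND, IN, and LIKE work as well; for example:
df.where("state = 'OH' AND gender = 'M'").show(false)
df.where("state IN ('OH','NY')").show(false)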
Filtering with multiple conditions
To filter rows of a DataFrame based on multiple conditions, you can use either a Column with a condition or a SQL expression. Below is a simple example; you can extend it with AND (&&), OR (||), and NOT (!) conditional expressions as needed, as shown after the example.
//multiple condition
df.where(df("state") === "OH" && df("gender") === "M")
.show(false)
This yields the DataFrame shown below.
+----------------------+------------------+-----+------+
|name |languages |state|gender|
+----------------------+------------------+-----+------+
|[James, , Smith] |[Java, Scala, C++]|OH |M |
|[Mike, Mary, Williams]|[Python, VB] |OH |M |
+----------------------+------------------+-----+------+
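For reference, OR (||) and NOT (!) follow the same pattern:
//OR condition
df.where(df("state") === "OH" || df("gender") === "F")
  .show(false)
//NOT condition
df.where(!(df("state") === "OH"))
  .show(false)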
Filtering on an Array column
To filter rows based on a value present in an array collection column, you can use the first syntax. The example below uses the array_contains()
SQL function, which returns true if the array contains the specified value and false otherwise.
df.where(array_contains(df("languages"),"Java"))
.show(false)
This yields the DataFrame shown below.
+----------------+------------------+-----+------+
|name |languages |state|gender|
+----------------+------------------+-----+------+
|[James, , Smith]|[Java, Scala, C++]|OH |M |
|[Anna, Rose, ] |[Spark, Java, C++]|NY |F |
+----------------+------------------+-----+------+
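array_contains() is also available as a SQL function, so the same filter can be written with the SQL expression syntax:
df.where("array_contains(languages, 'Java')").show(false)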
Filtering on Nested Struct columns
If your DataFrame contains nested struct columns, you can use any of the above syntaxes to filter rows on a nested column.
//Struct condition
df.where(df("name.lastname") === "Williams")
.show(false)
This yields the DataFrame shown below.
+----------------------+------------+-----+------+
|name |languages |state|gender|
+----------------------+------------+-----+------+
|[Julia, , Williams] |[CSharp, VB]|OH |F |
|[Mike, Mary, Williams]|[Python, VB]|OH |M |
+----------------------+------------+-----+------+
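The equivalent SQL expression uses the same dot notation to reference the nested field:
df.where("name.lastname = 'Williams'").show(false)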
Complete source code of the Spark where() examples
package com.sparkbyexamples.spark.dataframe

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.sql.functions.array_contains

object FilterExample extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")

  val arrayStructureData = Seq(
    Row(Row("James","","Smith"),List("Java","Scala","C++"),"OH","M"),
    Row(Row("Anna","Rose",""),List("Spark","Java","C++"),"NY","F"),
    Row(Row("Julia","","Williams"),List("CSharp","VB"),"OH","F"),
    Row(Row("Maria","Anne","Jones"),List("CSharp","VB"),"NY","M"),
    Row(Row("Jen","Mary","Brown"),List("CSharp","VB"),"NY","M"),
    Row(Row("Mike","Mary","Williams"),List("Python","VB"),"OH","M")
  )

  val arrayStructureSchema = new StructType()
    .add("name", new StructType()
      .add("firstname", StringType)
      .add("middlename", StringType)
      .add("lastname", StringType))
    .add("languages", ArrayType(StringType))
    .add("state", StringType)
    .add("gender", StringType)

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
  df.printSchema()
  df.show()

  //Condition
  df.where(df("state") === "OH")
    .show(false)

  //SQL Expression
  df.where("gender == 'M'")
    .show(false)

  //multiple condition
  df.where(df("state") === "OH" && df("gender") === "M")
    .show(false)

  //Array condition
  df.where(array_contains(df("languages"),"Java"))
    .show(false)

  //Struct condition
  df.where(df("name.lastname") === "Williams")
    .show(false)
}
The examples explained here are also available in the GitHub project for reference.
Conclusion
In this tutorial, I've explained how to filter rows from a Spark DataFrame based on single or multiple conditions and SQL expressions using the where()
function, and how to filter rows by providing conditions on array and struct columns, with Scala examples.
Alternatively, you can also use the filter()
function to filter rows of a DataFrame; where() is simply an alias for it.
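For example, the first filter above could equivalently be written as:
df.filter(df("state") === "OH").show(false)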
Thanks for reading. If you like it, please do share the article using the social links below; any comments or suggestions are welcome in the comments section!
Happy Learning !!