
In Spark, both the filter() and where() functions are used to filter data based on certain conditions. They can be used interchangeably and essentially perform the same operation. In this article, we shall discuss the filter() vs where() functions in Spark in detail and compare the two.

1. Filter() Function

The filter() function is a transformation operation that takes a Boolean expression or a function as an input and applies it to each element in the RDD (Resilient Distributed Datasets) or DataFrame, retaining only the elements that satisfy the condition.

For example, if you have an RDD containing integers, and you want to filter out only the even numbers, you can use the filter function as follows:



// Filter() Example
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6))
val filteredRdd = rdd.filter(x => x % 2 == 0)
val evenNum = filteredRdd.collect

// Output
evenNum: Array[Int] = Array(2, 4, 6)

In the above code, x => x % 2 == 0 is the filtering condition that checks if a number is even or not. The resulting filteredRdd will contain only the even numbers from the original RDD.
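Although this example uses an RDD, filter() is also available on DataFrames, where it takes a column expression instead of a Scala function. A minimal sketch, assuming a local SparkSession (the column name n and the sample data are illustrations):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FilterSketch").master("local[*]").getOrCreate()
import spark.implicits._ // enables toDF() on Seq and the $"colName" syntax

// filter() on a DataFrame takes a column expression, not a Scala function
val df = Seq(1, 2, 3, 4, 5, 6).toDF("n")
val evenDf = df.filter($"n" % 2 === 0)
evenDf.show() // shows the rows 2, 4 and 6
```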

2. Where() Function

The where() function takes a Boolean expression or a column expression as input and applies it to each row of the DataFrame, retaining only the rows that satisfy the condition.

For example, if you have a DataFrame containing a column called “age” and you want to filter out all the rows where the age is greater than 30, you can use the where function as follows:



// where() example
import spark.implicits._ // enables the $"colName" column syntax

val df = spark.read.option("header", "true").csv("path/to/data.csv")
val filteredDf = df.where($"age" > 30)

In the above code, df.where($"age" > 30) is the filtering condition that checks whether the age column value is greater than 30. The resulting filteredDf will contain only the rows where the age is greater than 30.
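Note that where() accepts the condition in several equivalent forms: a Column built with the $ syntax or with col(), or a SQL-style string expression. A small sketch (the sample data and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("WhereSketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", 25), ("Bob", 35), ("Charlie", 45)).toDF("name", "age")

// Three equivalent ways of writing the same predicate
val a = df.where($"age" > 30)      // $-interpolator column syntax
val b = df.where(col("age") > 30)  // col() from org.apache.spark.sql.functions
val c = df.where("age > 30")       // SQL-style string expression
```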

3. Filter() vs Where()

In Spark Scala, the filter function is used to filter data in RDDs, and where is used in DataFrames. While they perform the same operation, there are a few differences between them.

| Context | filter() | where() |
|---|---|---|
| Usage | filter is used in RDDs to filter elements that satisfy a Boolean expression or a function. | where is used in DataFrames to filter rows that satisfy a Boolean expression or a column expression. |
| Syntax | rdd.filter(func), where func is a function that takes an element as input and returns a Boolean value. | dataFrame.where(condition), where condition is a Boolean expression or a column expression that filters the rows based on a condition. |
| Type | filter is a transformation operation that returns a new RDD with the filtered elements. | where is a transformation operation that returns a new DataFrame with the filtered rows. |
| Usage with columns | In RDDs, the filter function operates on individual elements of the RDD and cannot directly filter columns of data. | In DataFrames, the where function can operate on individual columns of the DataFrame and return a new DataFrame with the filtered rows. |

Filter vs Where

filter and where can be used interchangeably to filter data in Spark Scala, but they differ in their usage, syntax, type, and behavior with columns. filter is used with RDDs, where is used with DataFrames, and both perform the same filtering operation.
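In fact, on a DataFrame the two method names are direct aliases of each other, so either can be called with the same arguments and produces the same result. A sketch illustrating the equivalence (the sample data is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AliasSketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", 25), ("Bob", 35)).toDF("name", "age")

// where() is an alias of filter() on DataFrames: both accept the same
// arguments and return the same rows, so the choice is purely stylistic.
val viaFilter = df.filter($"age" > 30)
val viaWhere  = df.where($"age" > 30)
```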

4. Examples of where() vs filter()

Here’s an example that compares the use of filter in RDDs and where in DataFrames to achieve the same result:

Suppose you have a dataset of customers with their age and income, and you want to filter out only the customers whose age is greater than 30 and income is greater than 50000. Here’s how you can achieve this using both filter in RDDs and where in DataFrames:

4.1 Using filter() in RDDs:



// Using filter()
val rdd = spark.sparkContext.parallelize(Seq(
  ("Alice", 25, 40000),
  ("Bob", 35, 60000),
  ("Charlie", 45, 80000),
  ("Dave", 55, 100000)
))

val filteredRdd = rdd.filter(x => x._2 > 30 && x._3 > 50000)
filteredRdd.collect

// Output
res1: Array[(String, Int, Int)] = Array((Bob,35,60000), (Charlie,45,80000), (Dave,55,100000))

In the above code, we are using the filter function to filter out only the customers whose age is greater than 30 and whose income is greater than 50000. The resulting filteredRdd contains only the matching tuples.
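Positional tuple accessors such as x._2 and x._3 can be hard to read; the same RDD filter can also be written with a pattern match that names the fields. This is a purely stylistic alternative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddFilterSketch").master("local[*]").getOrCreate()

val rdd = spark.sparkContext.parallelize(Seq(
  ("Alice", 25, 40000), ("Bob", 35, 60000),
  ("Charlie", 45, 80000), ("Dave", 55, 100000)
))

// Same predicate as x => x._2 > 30 && x._3 > 50000,
// but with named fields instead of positional accessors
val filteredRdd = rdd.filter { case (_, age, income) => age > 30 && income > 50000 }
```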

4.2 Using where() in DataFrames:


// Using where()
import spark.implicits._ // enables the $"colName" column syntax
val df = spark.createDataFrame(Seq(
  ("Alice", 25, 40000),
  ("Bob", 35, 60000),
  ("Charlie", 45, 80000),
  ("Dave", 55, 100000)
)).toDF("name", "age", "income")

val filteredDf = df.where($"age" > 30 && $"income" > 50000)
filteredDf.show()

// Output
+-------+---+------+
|   name|age|income|
+-------+---+------+
|    Bob| 35| 60000|
|Charlie| 45| 80000|
|   Dave| 55|100000|
+-------+---+------+

In the above code, we are using the where function to filter out only the customers whose age is greater than 30 and whose income is greater than 50000. The resulting filteredDf contains only the filtered rows.

Both of the above code snippets select the same records:



// Output

(Bob,35,60000)
(Charlie,45,80000)
(Dave,55,100000)

However, the syntax and semantics of filter in RDDs and where in DataFrames are different, as explained earlier. filter is used to filter individual elements of an RDD, whereas where is used to filter rows of a DataFrame based on a condition.

5. Conclusion

To summarize, both filter() and where() are used for filtering data in Spark Scala, but they have different syntax and semantics depending on whether you are using them with RDDs or DataFrames.

In general, when working with structured data, it is better to use DataFrames instead of RDDs because DataFrames offer better performance optimizations, such as query optimization and predicate pushdown. However, if you are working with unstructured data, such as text files or binary files, RDDs may be a better option.
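The optimization difference mentioned above can be observed with explain(): a DataFrame condition appears as a Filter node in the Catalyst plan, which the optimizer can rearrange or push toward the data source, while an RDD filter closure is opaque to Spark. A sketch (the exact plan text varies by Spark version, and the sample data is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ExplainSketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", 25), ("Bob", 35)).toDF("name", "age")

// The predicate shows up as a Filter node in the optimized plan,
// so Catalyst can inspect it and push it down where possible.
df.where($"age" > 30).explain()

// An equivalent RDD filter is just a Scala closure; Spark cannot
// look inside it to optimize.
val filtered = df.rdd.filter(row => row.getAs[Int]("age") > 30)
```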

rimmalapudi

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.