Filter Spark DataFrame Based on Date

How do you filter a Spark DataFrame based on a date? By using the filter() function, you can easily filter a DataFrame on a date column. In this article, I will explain how to filter based on a date with various examples.


1. Filter Syntax

The filter() function selects a subset of rows from a DataFrame or Dataset based on a condition. In Scala, you call filter as a method directly on the DataFrame or Dataset.

The basic syntax of the filter operation in Scala is as follows:


// Filter Syntax
val filteredDF = originalDF.filter(condition)

where originalDF is the DataFrame you want to filter, condition is a Boolean expression that specifies the filtering condition, and filteredDF is the resulting DataFrame that contains only the rows that satisfy the filtering condition.
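If you don't have a CSV file handy, you can build a small DataFrame to run the examples in the next section against. This is a minimal sketch; the SparkSession setup, column names, and values are illustrative only.


// Hypothetical sample data for trying out the date filters below
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DateFilterExamples")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("1", "2022-01-01"),
  ("2", "2022-01-15"),
  ("3", "2022-03-10")
).toDF("id", "date_col") // date_col holds strings in yyyy-MM-dd format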

2. Examples of Date-Based Filtering

You can use Spark SQL functions to filter a DataFrame based on date columns. Here are some examples of how to filter a DataFrame based on different date conditions:

2.1 Filter based on a specific date


// Filter based on a specific date
import org.apache.spark.sql.functions._
import spark.implicits._ // required for the $"colName" syntax

val df = spark.read.format("csv").option("header", "true").load("path/to/file.csv")
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === lit("2022-01-01"))

In this example, we filter the DataFrame df based on the date_col column, which is in the “yyyy-MM-dd” format. We use the to_date function to convert the column to a date type and compare it to a specific date wrapped in the lit function.
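If date_col has already been cast to DateType (for example, via an explicit schema), the to_date conversion is unnecessary. A short sketch under that assumption:


// Filter a column that is already DateType; no to_date conversion needed
val filteredDf2 = df.filter($"date_col" === lit("2022-01-01").cast("date"))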

2.2 Filter based on a date range


// Filter based on a date range
val startDate = "2022-01-01"
val endDate = "2022-01-31"
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd").between(startDate, endDate))

Here, we are filtering the DataFrame df on the date_col column between two dates, startDate and endDate. We use the to_date function to convert the column to a date type and the between function, which is inclusive on both ends, to specify the date range.
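The same range filter can also be written with explicit comparison operators; like between, both bounds remain inclusive. A sketch of the equivalent form:


// Equivalent range filter using >= and <= instead of between
val filteredDf2 = df.filter(
  to_date($"date_col", "yyyy-MM-dd") >= lit(startDate) &&
  to_date($"date_col", "yyyy-MM-dd") <= lit(endDate)
)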

2.3 Filter based on the current date


// Filter based on the current date
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === current_date())

This example filters the DataFrame df based on the date_col column, which is in the “yyyy-MM-dd” format, and compares it to the current date using the current_date function.
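A closely related pattern is filtering relative to today rather than exactly on it. For example, to keep only yesterday's rows, you can combine current_date with date_sub; a sketch:


// Filter rows dated exactly one day before the current date
val yesterdayDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === date_sub(current_date(), 1))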

2.4 Filter based on a date difference


// Filter based on a date difference
val filteredDf = df.filter(datediff(to_date($"date_col", "yyyy-MM-dd"), current_date()) > 30)

In this example, we filter the DataFrame df based on the date_col column, which is in the “yyyy-MM-dd” format, using the datediff function. Since datediff(end, start) returns end minus start in days, this condition keeps the rows where date_col is more than 30 days after the current date.
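If you instead want the more common case of keeping rows from the last 30 days, swap the argument order so the difference is measured from date_col up to today. A sketch of that variant:


// Keep rows whose date_col falls within the last 30 days (today inclusive)
val last30DaysDf = df.filter(
  datediff(current_date(), to_date($"date_col", "yyyy-MM-dd")).between(0, 30)
)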

3. Conclusion

In conclusion, filtering a Spark DataFrame based on date in Scala can be done based on a specific date, a date range, the current date, or a date difference by using the appropriate function with the desired condition.

The to_date function converts a column to a date type so that it can be compared with other dates or used with date functions.

It’s important to ensure that the date column is in a format that can be parsed by the to_date function; otherwise, to_date returns null and the affected rows are silently excluded from the filtered results.
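Because to_date returns null on parse failure, unparseable rows quietly disappear from equality and range filters. A quick sketch for surfacing such rows explicitly:


// Find rows whose date_col could not be parsed with the yyyy-MM-dd format
val badDatesDf = df.filter(to_date($"date_col", "yyyy-MM-dd").isNull)

Inspecting or cleaning these rows before applying date filters keeps the results predictable.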
