How to Filter Spark DataFrame Based on Date? By using the filter() function, you can easily filter a DataFrame based on a date column. In this article, I will explain how to filter by date with various examples.
1. Filter Syntax
The filter() function can be used to select a subset of rows from a DataFrame or Dataset based on a condition. In Scala, the basic syntax of the filter operation is as follows:
// Filter Syntax
val filteredDF = originalDF.filter(condition)
where originalDF is the DataFrame you want to filter, condition is a Boolean expression that specifies the filtering condition, and filteredDF is the resulting DataFrame that contains only the rows that satisfy the condition.
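To make this concrete, below is a minimal, self-contained sketch; the data and column names are made up for illustration, and an active SparkSession named spark is assumed.
// Minimal filter() example on a small in-memory DataFrame
import spark.implicits._
val originalDF = Seq(("a", 50), ("b", 150), ("c", 200)).toDF("id", "amount")
// Keep only the rows where amount is greater than 100
val filteredDF = originalDF.filter($"amount" > 100)
filteredDF.show() // rows ("b", 150) and ("c", 200) remain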
2. Examples of Date-Based Filtering
You can use Spark SQL functions to filter a DataFrame based on date columns. Here are some examples of how to filter a DataFrame based on different date conditions:
2.1 Filter based on a specific date
// Filter based on a specific date
import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"colName" column syntax
val df = spark.read.format("csv").option("header", "true").load("path/to/file.csv")
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === lit("2022-01-01"))
In this example, we filter the DataFrame df on the date_col column, which is in the “yyyy-MM-dd” format. We use the to_date function to convert the column to a date type and compare it to a specific date using the lit function.
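If you prefer SQL-style syntax, the same condition can also be passed to filter() as an expression string. This is an equivalent sketch, assuming the same df and date_col as above:
// Same filter expressed as a SQL string
val filteredDf2 = df.filter("to_date(date_col, 'yyyy-MM-dd') = '2022-01-01'")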
2.2 Filter based on a date range
// Filter based on a date range
val startDate = "2022-01-01"
val endDate = "2022-01-31"
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd").between(startDate, endDate))
Here, we filter the DataFrame df on the date_col column between two dates, startDate and endDate. We use the to_date function to convert the column to a date type and the between function to specify the date range.
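Note that between is inclusive of both bounds. If you need an exclusive upper bound, you can express the same range with explicit comparisons, as in this sketch (the end date “2022-02-01” is chosen here purely for illustration):
// Same range with explicit comparisons: >= is inclusive, < is exclusive
val dateCol = to_date($"date_col", "yyyy-MM-dd")
val filteredDf3 = df.filter(dateCol >= lit(startDate) && dateCol < lit("2022-02-01"))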
2.3 Filter based on the current date
// Filter based on the current date
val filteredDf = df.filter(to_date($"date_col", "yyyy-MM-dd") === current_date())
This example filters the DataFrame df on the date_col column, which is in the “yyyy-MM-dd” format, and compares it to the current date using the current_date function.
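Relative dates work the same way. For example, to keep only yesterday’s rows, you can combine current_date with date_sub, as in this sketch:
// Filter rows whose date is exactly yesterday
val filteredDf4 = df.filter(to_date($"date_col", "yyyy-MM-dd") === date_sub(current_date(), 1))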
2.4 Filter based on a date difference
// Filter based on a date difference
val filteredDf = df.filter(datediff(to_date($"date_col", "yyyy-MM-dd"), current_date()) > 30)
In this example, we filter the DataFrame df on the date_col column, which is in the “yyyy-MM-dd” format, using the datediff function. We keep only the rows where date_col is more than 30 days after the current date.
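Since datediff(end, start) returns end minus start in days, the example above keeps future dates. To keep rows older than 30 days instead, swap the arguments, as in this sketch:
// Filter rows whose date is more than 30 days in the past
val filteredDf5 = df.filter(datediff(current_date(), to_date($"date_col", "yyyy-MM-dd")) > 30)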
3. Conclusion
In conclusion, a Spark DataFrame can be filtered by date in Scala based on a specific date, a date range, the current date, or a date difference, by using the appropriate function with the desired condition.
The to_date function converts a column to a date type so that it can be compared with other dates or used with date functions.
It’s important to ensure that the date column is in a format that can be parsed by the to_date function, otherwise, the filtering may not work as expected.
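One quick way to check is to look for rows where to_date returns null, which (under Spark’s default, non-ANSI parsing behavior) indicates values that did not match the pattern. A sketch:
// Rows where date_col is present but could not be parsed with the pattern
val badDates = df.filter($"date_col".isNotNull && to_date($"date_col", "yyyy-MM-dd").isNull)
badDates.show()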