Spark SQL provides datediff() function to get the difference between two timestamps/dates. In this article, Let us see a Spark SQL Dataframe example of how to calculate a Datediff between two dates in seconds, minutes, hours, days, and months using Scala language and functions like datediff()
, unix_timestamp(), to_timestamp(), months_between()
.
Table of contents
Before going into the topic, let us create a sample Spark SQL DataFrame holding the date related data for our demo purpose.
// Create dataframe
val data = Seq(("2022-11-01"),("2022-12-01"),("2022-12-02"),("2022-12-03"))
val df = data.toDF("date").select(
col("date"),
current_date().as("current_date"),
)
df.show()
//Output
+----------+------------+
| date|current_date|
+----------+------------+
|2022-11-01| 2022-12-07|
|2022-12-01| 2022-12-07|
|2022-12-02| 2022-12-07|
|2022-12-03| 2022-12-07|
+----------+------------+
Function current_date() is used to return the current date at the start of query evaluation. The code snippets used in this article work both in your local workspace and in Databricks.
1. Spark SQL datadiff() – Date Difference in Days.
The Spark SQL datediff() function is used to get the date difference between two dates in terms of DAYS. This function takes the end date as the first argument and the start date as the second argument and returns the number of days in between them.
# datediff() syntax
datediff(endDate, startDate)
Let’s use the above DataFrame and run with an example.
Example:
# datediff() example
import org.apache.spark.sql.functions._
df.withColumn("diff_in_days", datediff(col("current_date"), col("date")))
.show()
//Output
+----------+------------+------------+
| date|current_date|diff_in_days|
+----------+------------+------------+
|2022-11-01| 2022-12-07| 36|
|2022-12-01| 2022-12-07| 6|
|2022-12-02| 2022-12-07| 5|
|2022-12-03| 2022-12-07| 4|
+----------+------------+------------+
Note: When using Spark datediff() for date difference, we should make sure to specify the greater or max date as first (endDate) followed by the lesser or minimum date (startDate). If not you will end up with a negative difference as below. Let’s see with an example.
import org.apache.spark.sql.functions._
df.withColumn("diff_in_days", datediff(col("date"),col("current_date")))
.show()
//Output
+----------+------------+------------+
| date|current_date|diff_in_days|
+----------+------------+------------+
|2022-11-01| 2022-12-07| -36|
|2022-12-01| 2022-12-07| -6|
|2022-12-02| 2022-12-07| -5|
|2022-12-03| 2022-12-07| -4|
+----------+------------+------------+
2. Date Difference in Months.
Spark SQL provides the months_between() function to calculate the Datediff between the dates the StartDate and EndDate in terms of Months
Syntax: months_between(timestamp1, timestamp2)
Example:
import org.apache.spark.sql.functions._
df.withColumn("monthsInbetween", months_between(col("current_date"), col("date")))
.show()
//output
+----------+------------+---------------+
| date|current_date|monthsInbetween|
+----------+------------+---------------+
|2022-11-01| 2022-12-07| 1.19354839|
|2022-12-01| 2022-12-07| 0.19354839|
|2022-12-02| 2022-12-07| 0.16129032|
|2022-12-03| 2022-12-07| 0.12903226|
+----------+------------+---------------+
Note: Spark SQL months_between() provides the difference between the dates as the number of months between the two timestamps based on 31 days in a month.
3. Date Difference in Seconds.
3.1 Using unix_timestamp()
We use the unix_timestamp() function in Spark SQL to convert Date/Datetime into seconds and then calculate the difference between dates in terms of seconds.
Syntax:unix_timestamp(timestamp, TimestampFormat)
Note: The UNIX timestamp function converts the timestamp into the number of seconds since the first of January 1970.
Example:
import org.apache.spark.sql.functions._
df.withColumn("unix_timestamp_current_date", unix_timestamp(col("current_date")))
.withColumn("unix_timestamp_date", unix_timestamp(col("date").cast("Date")))
.withColumn("seconds_between", unix_timestamp(col("current_date")) - unix_timestamp(col("date").cast("Date")))
.show()
//output
+----------+------------+---------------------------+-------------------+---------------+
| date|current_date|unix_timestamp_current_date|unix_timestamp_date|seconds_between|
+----------+------------+---------------------------+-------------------+---------------+
|2022-11-01| 2022-12-07| 1670371200| 1667260800| 3110400|
|2022-12-01| 2022-12-07| 1670371200| 1669852800| 518400|
|2022-12-02| 2022-12-07| 1670371200| 1669939200| 432000|
|2022-12-03| 2022-12-07| 1670371200| 1670025600| 345600|
+----------+------------+---------------------------+-------------------+---------------+
3.2. Using to_timestamp()
We can also get the difference between the dates in terms of seconds using to_timestamp()
function.
Syntax: to_timestamp(timestamp, format])
Examples:
import org.apache.spark.sql.functions._
df.withColumn("seconds_between_to_timestamp", (to_timestamp(col("current_date")) - to_timestamp(col("date"))).cast("bigInt"))
.show()
//output
+----------+------------+----------------------------+
| date|current_date|seconds_between_to_timestamp|
+----------+------------+----------------------------+
|2022-11-01| 2022-12-07| 3110400|
|2022-12-01| 2022-12-07| 518400|
|2022-12-02| 2022-12-07| 432000|
|2022-12-03| 2022-12-07| 345600|
+----------+------------+----------------------------+
4. Date Difference in Minutes.
Here we use the same Spark SQL unix_timestamp to calculate the difference in seconds and then convert the respective difference into MINUTES.
import org.apache.spark.sql.functions._
df.withColumn("unix_timestamp_current_date", unix_timestamp(col("current_date")))
.withColumn("unix_timestamp_date", unix_timestamp(col("date").cast("Date")))
.withColumn("seconds_between", unix_timestamp(col("current_date")) - unix_timestamp(col("date").cast("Date")))
.withColumn("minutes_between", col("seconds_between").cast("bigint")/60)
.select("current_date", "Date", "minutes_between")
.show()
//output
+------------+----------+---------------+
|current_date| Date|minutes_between|
+------------+----------+---------------+
| 2022-12-07|2022-11-01| 51840.0|
| 2022-12-07|2022-12-01| 8640.0|
| 2022-12-07|2022-12-02| 7200.0|
| 2022-12-07|2022-12-03| 5760.0|
+------------+----------+---------------+
5. Date Difference in Hours.
Here we use the same Spark SQL unix_timestamp() to calculate the difference in minutes and then convert the respective difference into HOURS.
import org.apache.spark.sql.functions._
df.withColumn("unix_timestamp_current_date", unix_timestamp(col("current_date")))
.withColumn("unix_timestamp_date", unix_timestamp(col("date").cast("Date")))
.withColumn("seconds_between", unix_timestamp(col("current_date")) - unix_timestamp(col("date").cast("Date")))
.withColumn("minutes_between", col("seconds_between").cast("bigint")/60)
.withColumn("hours_between", col("minutes_between").cast("bigint")/60)
.select("current_date", "Date", "hours_between")
.show()
//output
+------------+----------+-------------+
|current_date| Date|hours_between|
+------------+----------+-------------+
| 2022-12-07|2022-11-01| 864.0|
| 2022-12-07|2022-12-01| 144.0|
| 2022-12-07|2022-12-02| 120.0|
| 2022-12-07|2022-12-03| 96.0|
+------------+----------+-------------+
Conclusion
In this article, you have learned Spark SQL datediff() and many other functions to calculate date differences. Here in this article, we have explained the most used functions to calculate the difference in terms of Months, Days, Seconds, Minutes, and Hours.
Related Articles
- Spark to_date() – Convert String to Date format
- Spark SQL Date and Timestamp Functions
- Spark date_format() – Convert Date to String format
- Spark convert Unix timestamp (seconds) to Date
- Spark SQL – Add Day, Month, and Year to Date
- Spark Epoch time to timestamp and Date
- Spark Most Used JSON Functions with Examples
- Cannot call methods on a stopped SparkContext in Spark
- Tune Spark Executor Number, Cores, and Memory
- Calculate difference between two dates in days, months and years