Spark SQL datediff()

Spark & PySpark SQL provides datediff() function to get the difference between two dates. In this article, Let us see a Spark SQL Dataframe example of how to calculate a Datediff between two dates in seconds, minutes, hours, days, and months using Scala language and functions like datediff()unix_timestamp(), to_timestamp(), months_between().

Before going into the topic, let us create a sample Spark SQL DataFrame holding the date related data for our demo purpose.


// Create dataframe
val data = Seq(("2022-11-01"),("2022-12-01"),("2022-12-02"),("2022-12-03"))
val df = data.toDF("date").select(
      col("date"),
      current_date().as("current_date"),
    )
df.show()

//Output
+----------+------------+
|      date|current_date|
+----------+------------+
|2022-11-01|  2022-12-07|
|2022-12-01|  2022-12-07|
|2022-12-02|  2022-12-07|
|2022-12-03|  2022-12-07|
+----------+------------+

Function current_date() is used to return the current date at the start of query evaluation. The code snippets used in this article work both in your local workspace and in Databricks.

1. Spark SQL datadiff() – Date Difference in Days.

The Spark SQL datediff() function is used to get the date difference between two dates in terms of DAYS. This function takes the end date as the first argument and the start date as the second argument and returns the number of days in between them.


# datediff() syntax
datediff(endDate, startDate)

Let’s use the above DataFrame and run with an example.

Example:


# datediff() example
import org.apache.spark.sql.functions._
df.withColumn("diff_in_days", datediff(col("current_date"), col("date")))
  .show()

//Output
+----------+------------+------------+
|      date|current_date|diff_in_days|
+----------+------------+------------+
|2022-11-01|  2022-12-07|          36|
|2022-12-01|  2022-12-07|           6|
|2022-12-02|  2022-12-07|           5|
|2022-12-03|  2022-12-07|           4|
+----------+------------+------------+

Note: When using Spark datediff() for date difference, we should make sure to specify the greater or max date as first (endDate) followed by the lesser or minimum date (startDate). If not you will end up with a negative difference as below. Let’s see with an example.


import org.apache.spark.sql.functions._
df.withColumn("diff_in_days", datediff(col("date"),col("current_date")))
  .show()

//Output
+----------+------------+------------+
|      date|current_date|diff_in_days|
+----------+------------+------------+
|2022-11-01|  2022-12-07|         -36|
|2022-12-01|  2022-12-07|          -6|
|2022-12-02|  2022-12-07|          -5|
|2022-12-03|  2022-12-07|          -4|
+----------+------------+------------+

2. Date Difference in Months.

Spark SQL provides the months_between() function to calculate the Datediff between the dates the StartDate and EndDate in terms of Months

Syntax: months_between(timestamp1, timestamp2)

Example:


import org.apache.spark.sql.functions._

df.withColumn("monthsInbetween", months_between(col("current_date"), col("date")))
  .show()
//output
+----------+------------+---------------+
|      date|current_date|monthsInbetween|
+----------+------------+---------------+
|2022-11-01|  2022-12-07|     1.19354839|
|2022-12-01|  2022-12-07|     0.19354839|
|2022-12-02|  2022-12-07|     0.16129032|
|2022-12-03|  2022-12-07|     0.12903226|
+----------+------------+---------------+

Note: Spark SQL months_between() provides the difference between the dates as the number of months between the two timestamps based on 31 days in a month.

3. Date Difference in Seconds.

3.1 Using unix_timestamp()

We use the unix_timestamp() function in Spark SQL to convert Date/Datetime into seconds and then calculate the difference between dates in terms of seconds.

Syntax:unix_timestamp(timestamp, TimestampFormat)

Note: The UNIX timestamp function converts the timestamp into the number of seconds since the first of January 1970.

Example:


import org.apache.spark.sql.functions._

df.withColumn("unix_timestamp_current_date", unix_timestamp(col("current_date")))
  .withColumn("unix_timestamp_date", unix_timestamp(col("date").cast("Date")))
  .withColumn("seconds_between", unix_timestamp(col("current_date")) - unix_timestamp(col("date").cast("Date")))
  .show()
//output
+----------+------------+---------------------------+-------------------+---------------+
|      date|current_date|unix_timestamp_current_date|unix_timestamp_date|seconds_between|
+----------+------------+---------------------------+-------------------+---------------+
|2022-11-01|  2022-12-07|                 1670371200|         1667260800|        3110400|
|2022-12-01|  2022-12-07|                 1670371200|         1669852800|         518400|
|2022-12-02|  2022-12-07|                 1670371200|         1669939200|         432000|
|2022-12-03|  2022-12-07|                 1670371200|         1670025600|         345600|
+----------+------------+---------------------------+-------------------+---------------+

3.2. Using to_timestamp()

We can also get the difference between the dates in terms of seconds using to_timestamp() function.

Syntax: to_timestamp(timestamp, format])

Examples:


import org.apache.spark.sql.functions._

df.withColumn("seconds_between_to_timestamp", (to_timestamp(col("current_date")) - to_timestamp(col("date"))).cast("bigInt"))
  .show()
//output
+----------+------------+----------------------------+
|      date|current_date|seconds_between_to_timestamp|
+----------+------------+----------------------------+
|2022-11-01|  2022-12-07|                     3110400|
|2022-12-01|  2022-12-07|                      518400|
|2022-12-02|  2022-12-07|                      432000|
|2022-12-03|  2022-12-07|                      345600|
+----------+------------+----------------------------+

4. Date Difference in Minutes.

Here we use the same Spark SQL unix_timestamp to calculate the difference in seconds and then convert the respective difference into MINUTES.


import org.apache.spark.sql.functions._

df.withColumn("unix_timestamp_current_date", unix_timestamp(col("current_date")))
  .withColumn("unix_timestamp_date", unix_timestamp(col("date").cast("Date")))
  .withColumn("seconds_between", unix_timestamp(col("current_date")) - unix_timestamp(col("date").cast("Date")))
  .withColumn("minutes_between", col("seconds_between").cast("bigint")/60)
  .select("current_date", "Date", "minutes_between")
  .show()

//output
+------------+----------+---------------+
|current_date|      Date|minutes_between|
+------------+----------+---------------+
|  2022-12-07|2022-11-01|        51840.0|
|  2022-12-07|2022-12-01|         8640.0|
|  2022-12-07|2022-12-02|         7200.0|
|  2022-12-07|2022-12-03|         5760.0|
+------------+----------+---------------+

5. Date Difference in Hours.

Here we use the same Spark SQL unix_timestamp() to calculate the difference in minutes and then convert the respective difference into HOURS.


import org.apache.spark.sql.functions._

df.withColumn("unix_timestamp_current_date", unix_timestamp(col("current_date")))
  .withColumn("unix_timestamp_date", unix_timestamp(col("date").cast("Date")))
  .withColumn("seconds_between", unix_timestamp(col("current_date")) - unix_timestamp(col("date").cast("Date")))
  .withColumn("minutes_between", col("seconds_between").cast("bigint")/60)
  .withColumn("hours_between", col("minutes_between").cast("bigint")/60)
  .select("current_date", "Date", "hours_between")
  .show()

//output
+------------+----------+-------------+
|current_date|      Date|hours_between|
+------------+----------+-------------+
|  2022-12-07|2022-11-01|        864.0|
|  2022-12-07|2022-12-01|        144.0|
|  2022-12-07|2022-12-02|        120.0|
|  2022-12-07|2022-12-03|         96.0|
+------------+----------+-------------+

Conclusion

In this article, you have learned Spark SQL datediff() and many other functions to calculate date differences. Here in this article, we have explained the most used functions to calculate the difference in terms of Months, Days, Seconds, Minutes, and Hours.

Leave a Reply