Spark Timestamp Difference in seconds, minutes and hours

Problem: How do you calculate the timestamp difference in seconds, minutes and hours between two columns of a Spark DataFrame?

Solution: Spark doesn’t have a built-in function to calculate a timestamp difference, so we need to compute it ourselves in whatever time unit we want. Below I’ve explained several examples using Scala.

Refer to Spark SQL Date and Timestamp Functions for all Date & Time functions.

Spark Timestamp difference – When the time is in a string column

Timestamp difference in Spark can be calculated by casting the timestamp columns to LongType, which yields the time in epoch seconds: subtracting the two long values gives the difference in seconds, dividing that by 60 gives the difference in minutes, and dividing it by 3600 gives the difference in hours.

In this first example, we have a DataFrame with timestamps in a StringType column. First, we convert it to TimestampType (format 'yyyy-MM-dd HH:mm:ss.SSS') and then calculate the difference between the timestamp column and the current timestamp.


import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
import spark.implicits._
  
//Difference between two timestamps
val data = Seq(("2019-07-01 12:01:19.000"),
    ("2019-06-24 12:01:19.000"),
    ("2019-11-16 16:44:55.406"),
    ("2019-11-16 16:50:59.406"))
val df=data.toDF("input_timestamp")

df.withColumn("input_timestamp",
        to_timestamp(col("input_timestamp")))
  .withColumn("current_timestamp",
        current_timestamp().as("current_timestamp"))
  .withColumn("DiffInSeconds",current_timestamp().cast(LongType) - 
        col("input_timestamp").cast(LongType))
  .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60))
  .withColumn("DiffInHours",round(col("DiffInSeconds")/3600))
  .withColumn("DiffInDays",round(col("DiffInSeconds")/(24*3600)))
  .show(false)

Since there is no built-in time-diff function, we cast the timestamp column to a long value, which gives the time in epoch seconds, and then divide by the appropriate factor to get the unit we want. Note that the day divisor must be parenthesized as (24*3600); writing /24*3600 would divide by 24 and then multiply by 3600.

In the above example, I have used withColumn() to change the data type of the ‘input_timestamp’ column from string to TimestampType using the to_timestamp() function, and used the same withColumn() to add the new columns ‘DiffInSeconds’, ‘DiffInMinutes’, ‘DiffInHours’ and ‘DiffInDays’ after calculating the differences.


+-----------------------+-----------------------+-------------+-------------+-----------+----------+
|input_timestamp        |current_timestamp      |DiffInSeconds|DiffInMinutes|DiffInHours|DiffInDays|
+-----------------------+-----------------------+-------------+-------------+-----------+----------+
|2019-07-01 12:01:19    |2019-11-16 19:52:46.223|11955087     |199251.0     |3321.0     |138.0     |
|2019-06-24 12:01:19    |2019-11-16 19:52:46.223|12559887     |209331.0     |3489.0     |145.0     |
|2019-11-16 16:44:55.406|2019-11-16 19:52:46.223|11271        |188.0        |3.0        |0.0       |
|2019-11-16 16:50:59.406|2019-11-16 19:52:46.223|10907        |182.0        |3.0        |0.0       |
+-----------------------+-----------------------+-------------+-------------+-----------+----------+
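The unit conversions are plain arithmetic on epoch seconds, so they can be sanity-checked outside Spark. Below is a minimal sketch in plain Scala (no Spark required), reusing the DiffInSeconds value from the first row of the output; the last line shows why the day divisor needs parentheses:

```scala
// DiffInSeconds taken from the first row of the output above
val diffInSeconds = 11955087L

val diffInMinutes = math.round(diffInSeconds / 60.0)          // 199251
val diffInHours   = math.round(diffInSeconds / 3600.0)        // 3321
val diffInDays    = math.round(diffInSeconds / (24.0 * 3600)) // 138

// Pitfall: without parentheses the expression divides by 24 and then
// multiplies by 3600, producing a huge number instead of a day count
val notDays = diffInSeconds / 24.0 * 3600
```

The same parenthesization rule applies inside the Spark column expression, since `/` and `*` have equal precedence in Scala.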

Time Difference – When Only Time Is Present, Without a Date

Let’s see another example of the difference between two timestamps when only the time 'HH:mm:ss.SSS' is present in a column, without a date. When no date is present along with the time, we still need to convert the string to a timestamp using the to_timestamp() function before applying any SQL date functions; Spark anchors the parsed time to the epoch date 1970-01-01.
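What to_timestamp does with a time-only pattern can be mirrored in plain Scala with java.time, whose pattern letters match the ones used here: the parsed time is anchored to the epoch date 1970-01-01 and the two epoch-second values are subtracted. A small sketch of the equivalent arithmetic (not Spark code):

```scala
import java.time.{LocalDate, LocalTime, ZoneOffset}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("HH:mm:ss.SSS")

// A time-only string is anchored to the epoch date 1970-01-01,
// just as in the Spark output shown below
def toEpochSeconds(time: String): Long =
  LocalTime.parse(time, fmt).atDate(LocalDate.EPOCH).toEpochSecond(ZoneOffset.UTC)

// Same subtraction order as the DataFrame example: from - to,
// so a later "to" time yields a negative difference
val diff = toEpochSeconds("12:01:19.000") - toEpochSeconds("13:01:19.000") // -3600
```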


import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
//Difference between two timestamps when input has just timestamp
val data1 = Seq(("12:01:19.000","13:01:19.000"),
    ("12:01:19.000","12:02:19.000"),
    ("16:44:55.406","17:44:55.406"),
    ("16:50:59.406","16:44:59.406"))
val df1=data1.toDF("from_timestamp","to_timestamp")

df1.withColumn("from_timestamp",
      to_timestamp(col("from_timestamp"),"HH:mm:ss.SSS"))
   .withColumn("to_timestamp",
      to_timestamp(col("to_timestamp"),"HH:mm:ss.SSS"))
   //Subtracting in from - to order, so a later to_timestamp yields a negative difference
   .withColumn("DiffInSeconds",col("from_timestamp").cast(LongType) - 
      col("to_timestamp").cast(LongType))
   .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60))
   .withColumn("DiffInHours",round(col("DiffInSeconds")/3600))
   .show(false)

This yields the output below:


+-------------------+-------------------+-------------+-------------+-----------+
|from_timestamp     |to_timestamp       |DiffInSeconds|DiffInMinutes|DiffInHours|
+-------------------+-------------------+-------------+-------------+-----------+
|1970-01-01 12:01:19|1970-01-01 13:01:19|-3600        |-60.0        |-1.0       |
|1970-01-01 12:01:19|1970-01-01 12:02:19|-60          |-1.0         |0.0        |
|1970-01-01 16:44:55|1970-01-01 17:44:55|-3600        |-60.0        |-1.0       |
|1970-01-01 16:50:59|1970-01-01 16:44:59|360          |6.0          |0.0        |
+-------------------+-------------------+-------------+-------------+-----------+

When Date & Time are not in Spark timestamp format

Let’s see another example of the difference between two timestamps when both date and time are present, but the dates are not in the Spark TimestampType format 'yyyy-MM-dd HH:mm:ss.SSS'. When dates are not in this format, Spark date functions return null. Hence, first convert the input dates to TimestampType using the to_timestamp() function with a matching pattern.
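The pattern string given to to_timestamp uses the same pattern letters as java.time’s DateTimeFormatter (at least for the letters used here), so a format can be verified outside Spark before running a job. A minimal sketch, assuming the MM-dd-yyyy layout of the sample data below:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Same pattern that is passed to to_timestamp in the Spark example
val fmt = DateTimeFormatter.ofPattern("MM-dd-yyyy HH:mm:ss.SSS")

val parsed = LocalDateTime.parse("07-01-2019 12:01:19.406", fmt)
// java.time throws on a mismatch, whereas Spark's to_timestamp returns null
println(parsed) // 2019-07-01T12:01:19.406
```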


import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
val data2= Seq(("07-01-2019 12:01:19.406"),
    ("06-24-2019 12:01:19.406"),
    ("11-16-2019 16:44:55.406"),
    ("11-16-2019 16:50:59.406"))
val dfDate =data2.toDF("input_timestamp")

dfDate.withColumn("input_timestamp",
          to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH:mm:ss.SSS"))
    .withColumn("current_timestamp",
          current_timestamp().as("current_timestamp"))
    .withColumn("DiffInSeconds",
          current_timestamp().cast(LongType) - col("input_timestamp").cast(LongType))
    .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60))
    .withColumn("DiffInHours",round(col("DiffInSeconds")/3600))
    .withColumn("DiffInDays",round(col("DiffInSeconds")/(24*3600)))
    .show(false)

Output:


+-------------------+-----------------------+-------------+-------------+-----------+----------+
|input_timestamp    |current_timestamp      |DiffInSeconds|DiffInMinutes|DiffInHours|DiffInDays|
+-------------------+-----------------------+-------------+-------------+-----------+----------+
|2019-07-01 12:01:19|2019-11-16 19:52:47.246|11955088     |199251.0     |3321.0     |138.0     |
|2019-06-24 12:01:19|2019-11-16 19:52:47.246|12559888     |209331.0     |3489.0     |145.0     |
|2019-11-16 16:44:55|2019-11-16 19:52:47.246|11272        |188.0        |3.0        |0.0       |
|2019-11-16 16:50:59|2019-11-16 19:52:47.246|10908        |182.0        |3.0        |0.0       |
+-------------------+-----------------------+-------------+-------------+-----------+----------+

Complete Code:


package com.sparkbyexamples.spark.dataframe.functions.datetime

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.{DataFrame, SparkSession}

object TimestampDiff extends App {

  val spark:SparkSession = SparkSession.builder()
    .master("local")
    .appName("SparkByExamples.com")
    .getOrCreate()
  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  //Difference between two timestamps
  val df = Seq(("2019-07-01 12:01:19.000"),
    ("2019-06-24 12:01:19.000"),
    ("2019-11-16 16:44:55.406"),
    ("2019-11-16 16:50:59.406")).toDF("input_timestamp")

  df.withColumn("input_timestamp",
        to_timestamp(col("input_timestamp")))
      .withColumn("current_timestamp",
        current_timestamp().as("current_timestamp"))
      .withColumn("DiffInSeconds",
        current_timestamp().cast(LongType) - col("input_timestamp").cast(LongType))
      .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60))
      .withColumn("DiffInHours",round(col("DiffInSeconds")/3600))
      .withColumn("DiffInDays",round(col("DiffInSeconds")/(24*3600)))
      .show(false)

  //Difference between two timestamps when input has just timestamp
  val df1 = Seq(("12:01:19.000","13:01:19.000"),
    ("12:01:19.000","12:02:19.000"),
    ("16:44:55.406","17:44:55.406"),
    ("16:50:59.406","16:44:59.406"))
    .toDF("from_timestamp","to_timestamp")

  df1.withColumn("from_timestamp",
    to_timestamp(col("from_timestamp"),"HH:mm:ss.SSS"))
    .withColumn("to_timestamp",
      to_timestamp(col("to_timestamp"),"HH:mm:ss.SSS"))
    .withColumn("DiffInSeconds",
      col("from_timestamp").cast(LongType) - col("to_timestamp").cast(LongType))
    .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60))
    .withColumn("DiffInHours",round(col("DiffInSeconds")/3600))
    .show(false)

  //Difference between two timestamps when dates are not in the Spark TimestampType format 'yyyy-MM-dd HH:mm:ss.SSS'.
  //Note that when dates are not in this format, Spark date functions return null
  //Hence, first convert the input dates to TimestampType using the to_timestamp function
  val dfDate = Seq(("07-01-2019 12:01:19.406"),
    ("06-24-2019 12:01:19.406"),
    ("11-16-2019 16:44:55.406"),
    ("11-16-2019 16:50:59.406")).toDF("input_timestamp")

  dfDate.withColumn("input_timestamp",
          to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH:mm:ss.SSS"))
    .withColumn("current_timestamp",
          current_timestamp().as("current_timestamp"))
    .withColumn("DiffInSeconds",
          current_timestamp().cast(LongType) - col("input_timestamp").cast(LongType))
    .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60))
    .withColumn("DiffInHours",round(col("DiffInSeconds")/3600))
    .withColumn("DiffInDays",round(col("DiffInSeconds")/(24*3600)))
    .show(false)
}

Conclusion:

In this tutorial, you have learned how to calculate the difference between two timestamps in seconds, minutes and hours using the Scala language and Spark SQL time functions. You can find more information about these functions in the Spark SQL Date and Timestamp Functions reference mentioned above.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
