• Post author:
  • Post category:PySpark
  • Post last modified:March 27, 2024
  • Reading time:11 mins read
You are currently viewing PySpark Timestamp Difference (seconds, minutes, hours)

Problem: In PySpark, how to calculate the time/timestamp difference in seconds, minutes, and hours on the DataFrame column?

Advertisements

Solution: PySpark doesn’t have a function to calculate timestamp difference hence we need to calculate to get the difference time unit we want. Below I’ve explained several examples using Pyspark code snippets.

PySpark Timestamp Difference – Date & Time in String Format

Timestamp difference in PySpark can be calculated by using 1) unix_timestamp() to get the Time in seconds and subtract with other time to get the seconds 2) Cast TimestampType column to LongType and subtract two long values to get the difference in seconds, divide it by 60 to get the minute difference and finally divide it by 3600 to get the difference in hours.

Time Difference in Seconds

Using Cast Example: Below I have a DataFrame with a timestamp in string format, let’s convert it to TimestampType (yyyy-MM-dd HH:mm:ss.SSS) and then calculate the difference between two TimestampType columns.


dates = [("1","2019-07-01 12:01:19.111"),
    ("2","2019-06-24 12:01:19.222"),
    ("3","2019-11-16 16:44:55.406"),
    ("4","2019-11-16 16:50:59.406")
    ]

df = spark.createDataFrame(data=dates, schema=["id","input_timestamp"])

from pyspark.sql.functions import *

#Calculate Time difference in Seconds
df2=df.withColumn('from_timestamp',to_timestamp(col('from_timestamp')))\
  .withColumn('end_timestamp', current_timestamp())\
  .withColumn('DiffInSeconds',col("end_timestamp").cast("long") - col('from_timestamp').cast("long"))
df2.show(truncate=False)

Since there is no time diff function, we cast the timestamp column to a long value which gets time in seconds, and then divides it by a value to get appropriate units.

In the above example, I have used withColumn() to change the data type of from_timestamp column by casting from string to TimestampType and using the same withColumn() we added a new column DiffInSeconds after calculating the difference.


+---+-----------------------+-----------------------+-------------+
|id |from_timestamp         |end_timestamp          |DiffInSeconds|
+---+-----------------------+-----------------------+-------------+
|1  |2019-07-01 12:01:19.111|2021-02-26 23:02:49.635|52401690     |
|2  |2019-06-24 12:01:19.222|2021-02-26 23:02:49.635|53006490     |
|3  |2019-11-16 16:44:55.406|2021-02-26 23:02:49.635|40457874     |
|4  |2019-11-16 16:50:59.406|2021-02-26 23:02:49.635|40457510     |
+---+-----------------------+-----------------------+-------------+

Using unix_timestamp() Example: Alternatively, you can get the difference in seconds by using PySpark unix_timestamp() function


#Getting Time Difference using unix_timestamp()
from pyspark.sql.functions import *
df.withColumn('from_timestamp',to_timestamp(col('from_timestamp')))\
  .withColumn('end_timestamp', current_timestamp())\
  .withColumn('DiffInSeconds',unix_timestamp("end_timestamp") - unix_timestamp('from_timestamp')) \
  .show(truncate=False)

Time Difference in Minutes

Here, I am using the DataFrame created above to derive the Time difference in Minutes from DiffInSeconds column. You will get minutes by dividing seconds by 60.


df2.withColumn('DiffInMinutes',round(col('DiffInSeconds')/60))\
  .show(truncate=False)

#Result
+---+-----------------------+-----------------------+-------------+-------------+
|id |from_timestamp         |end_timestamp          |DiffInSeconds|DiffInMinutes|
+---+-----------------------+-----------------------+-------------+-------------+
|1  |2019-07-01 12:01:19.111|2021-02-26 23:02:53.663|52401694     |873362.0     |
|2  |2019-06-24 12:01:19.222|2021-02-26 23:02:53.663|53006494     |883442.0     |
|3  |2019-11-16 16:44:55.406|2021-02-26 23:02:53.663|40457878     |674298.0     |
|4  |2019-11-16 16:50:59.406|2021-02-26 23:02:53.663|40457514     |674292.0     |
+---+-----------------------+-----------------------+-------------+-------------+

Time Difference in Hours

Here also, I am using the DataFrame created above to derive the Time difference in Hours from DiffInSeconds column. You will get hours by dividing seconds by 3600.


df2.withColumn('DiffInHours',round(col('DiffInSeconds')/3600))\
  .show(truncate=False)

#Result
+---+-----------------------+-----------------------+-------------+-----------+
|id |from_timestamp         |end_timestamp          |DiffInSeconds|DiffInHours|
+---+-----------------------+-----------------------+-------------+-----------+
|1  |2019-07-01 12:01:19.111|2021-02-26 23:02:56.163|52401697     |14556.0    |
|2  |2019-06-24 12:01:19.222|2021-02-26 23:02:56.163|53006497     |14724.0    |
|3  |2019-11-16 16:44:55.406|2021-02-26 23:02:56.163|40457881     |11238.0    |
|4  |2019-11-16 16:50:59.406|2021-02-26 23:02:56.163|40457517     |11238.0    |
+---+-----------------------+-----------------------+-------------+-----------+

Time Difference – When Just Time Present

Let’s see another example of the difference between two timestamps when just time present in a column 'HH:mm:ss.SSS' without a date. when the date does not present along with time then we need to convert date string to timestamp first using a function to_timestamp() before applying any SQL date functions.


#Difference between two timestamps when input has just timestamp
data= [("12:01:19.000","13:01:19.000"),
    ("12:01:19.000","12:02:19.000"),
    ("16:44:55.406","17:44:55.406"),
    ("16:50:59.406","16:44:59.406")]
df3 = spark.createDataFrame(data=data, schema=["from_timestamp","to_timestamp"])

df3.withColumn("from_timestamp",to_timestamp(col("from_timestamp"),"HH:mm:ss.SSS")) \
   .withColumn("to_timestamp",to_timestamp(col("to_timestamp"),"HH:mm:ss.SSS")) \
   .withColumn("DiffInSeconds", col("from_timestamp").cast("long") - col("to_timestamp").cast("long")) \
   .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
   .withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
   .show(truncate=False)

Yields below output:


+-------------------+-------------------+-------------+-------------+-----------+
|from_timestamp     |to_timestamp       |DiffInSeconds|DiffInMinutes|DiffInHours|
+-------------------+-------------------+-------------+-------------+-----------+
|1970-01-01 12:01:19|1970-01-01 13:01:19|-3600        |-60.0        |-1.0       |
|1970-01-01 12:01:19|1970-01-01 12:02:19|-60          |-1.0         |0.0        |
|1970-01-01 16:44:55|1970-01-01 17:44:55|-3600        |-60.0        |-1.0       |
|1970-01-01 16:50:59|1970-01-01 16:44:59|360          |6.0          |0.0        |
+-------------------+-------------------+-------------+-------------+-----------+

Date & Time is not in PySpark Default Format

Let’s see another example of the difference between two timestamps when both dates & times present but dates are not in PySpark TimestampType format 'yyyy-MM-dd HH:mm:ss.SSS'. when dates are not in Spark TimestampType format, all Spark functions return null. Hence, first, convert the input dates to Spark TimestampType using to_timestamp() function.


df3 = spark.createDataFrame(
        data=[("1","07-01-2019 12:01:19.406")], 
        schema=["id","input_timestamp"]
        )
df3.withColumn("input_timestamp",to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH:mm:ss.SSS")) \
    .withColumn("current_timestamp",current_timestamp().alias("current_timestamp")) \
    .withColumn("DiffInSeconds",current_timestamp().cast("long") - col("input_timestamp").cast("long")) \
    .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
    .withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
    .withColumn("DiffInDays",round(col("DiffInSeconds")/(24*3600))) \
    .show(truncate=False)

Output:


+---+-----------------------+-----------------------+-------------+-------------+-----------+------------+
|id |input_timestamp        |current_timestamp      |DiffInSeconds|DiffInMinutes|DiffInHours|DiffInDays  |
+---+-----------------------+-----------------------+-------------+-------------+-----------+------------+
|1  |2019-07-01 12:01:19.406|2021-02-26 23:25:32.878|52403053     |873384.0     |14556.0    |7.86045795E9|
+---+-----------------------+-----------------------+-------------+-------------+-----------+------------+

SQL Example

Below is a PySpark SQL Example to get the time difference in seconds, minutes, and hours. When using unix_timestamp() with time in the String column make sure you are not using a fraction of second (SSS). By using this you get an error.


spark.sql("select unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19') DiffInSeconds")
spark.sql("select (unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19'))/60 DiffInMinutes")
spark.sql("select (unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19'))/3600 DiffInHours")

Complete Code:


from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()


dates = [("1","2019-07-01 12:01:19.111"),
    ("2","2019-06-24 12:01:19.222"),
    ("3","2019-11-16 16:44:55.406"),
    ("4","2019-11-16 16:50:59.406")
    ]

df = spark.createDataFrame(data=dates, schema=["id","from_timestamp"])

from pyspark.sql.functions import *
df2=df.withColumn('from_timestamp',to_timestamp(col('from_timestamp')))\
  .withColumn('end_timestamp', current_timestamp())\
  .withColumn('DiffInSeconds',col("end_timestamp").cast("long") - col('from_timestamp').cast("long"))
df2.show(truncate=False)

df2.withColumn('DiffInMinutes',round(col('DiffInSeconds')/60))\
  .show(truncate=False)
  
df2.withColumn('DiffInHours',round(col('DiffInSeconds')/3600))\
  .show(truncate=False)
  
#Difference between two timestamps when input has just timestamp

data= [("12:01:19.000","13:01:19.000"),
    ("12:01:19.000","12:02:19.000"),
    ("16:44:55.406","17:44:55.406"),
    ("16:50:59.406","16:44:59.406")]
df3 = spark.createDataFrame(data=data, schema=["from_timestamp","to_timestamp"])

df3.withColumn("from_timestamp",to_timestamp(col("from_timestamp"),"HH:mm:ss.SSS")) \
   .withColumn("to_timestamp",to_timestamp(col("to_timestamp"),"HH:mm:ss.SSS")) \
   .withColumn("DiffInSeconds", col("from_timestamp").cast("long") - col("to_timestamp").cast("long")) \
   .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
   .withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
   .show(truncate=False)
   
#Date & Time is not in PySpark Default Format

df3 = spark.createDataFrame(
        data=[("1","07-01-2019 12:01:19.406")], 
        schema=["id","input_timestamp"]
        )
df3.withColumn("input_timestamp",to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH:mm:ss.SSS")) \
    .withColumn("current_timestamp",current_timestamp().alias("current_timestamp")) \
    .withColumn("DiffInSeconds",current_timestamp().cast("long") - col("input_timestamp").cast("long")) \
    .withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
    .withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
    .withColumn("DiffInDays",round(col("DiffInSeconds")/(24*3600))) \
    .show(truncate=False)

Conclusion:

In this tutorial, you have learned how to calculate seconds, minutes, and hours in between two timestamps and PySpark SQL Time functions. You can find more information about these functions at the following blog

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium