Problem: In PySpark, how to calculate the time/timestamp difference in seconds, minutes, and hours on the DataFrame column?
Solution: PySpark doesn’t have a built-in function to calculate the difference between two timestamps, so we compute it ourselves for whichever time unit we want. Below I explain several approaches with PySpark code snippets.
PySpark Timestamp Difference – Date & Time in String Format
Timestamp difference in PySpark can be calculated in two ways: 1) use unix_timestamp() to convert each time to seconds and subtract one from the other, or 2) cast the TimestampType columns to LongType and subtract the two long values to get the difference in seconds. Divide the seconds by 60 to get the difference in minutes, or by 3600 to get the difference in hours.
Time Difference in Seconds
Using Cast Example: Below I have a DataFrame with a timestamp in string format. Let’s convert it to TimestampType (yyyy-MM-dd HH:mm:ss.SSS) and then calculate the difference between two TimestampType columns.
dates = [("1","2019-07-01 12:01:19.111"),
("2","2019-06-24 12:01:19.222"),
("3","2019-11-16 16:44:55.406"),
("4","2019-11-16 16:50:59.406")
]
df = spark.createDataFrame(data=dates, schema=["id","from_timestamp"])
from pyspark.sql.functions import *
#Calculate Time difference in Seconds
df2=df.withColumn('from_timestamp',to_timestamp(col('from_timestamp')))\
.withColumn('end_timestamp', current_timestamp())\
.withColumn('DiffInSeconds',col("end_timestamp").cast("long") - col('from_timestamp').cast("long"))
df2.show(truncate=False)
Since there is no time-diff function, we cast the timestamp column to a long value, which gives the time in seconds, and then divide it by the appropriate value to get the unit we want. In the example above, I used withColumn() to change the data type of the from_timestamp column by casting it from string to TimestampType, and with another withColumn() added a new column, DiffInSeconds, holding the calculated difference.
+---+-----------------------+-----------------------+-------------+
|id |from_timestamp |end_timestamp |DiffInSeconds|
+---+-----------------------+-----------------------+-------------+
|1 |2019-07-01 12:01:19.111|2021-02-26 23:02:49.635|52401690 |
|2 |2019-06-24 12:01:19.222|2021-02-26 23:02:49.635|53006490 |
|3 |2019-11-16 16:44:55.406|2021-02-26 23:02:49.635|40457874 |
|4 |2019-11-16 16:50:59.406|2021-02-26 23:02:49.635|40457510 |
+---+-----------------------+-----------------------+-------------+
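As a plain-Python sanity check (no Spark required), the cast-to-long approach is just epoch-second subtraction. The sketch below reproduces the same arithmetic with the standard library, using two of the sample timestamps from the data above:

```python
from datetime import datetime

# Two of the sample timestamps from the DataFrame above
start = datetime.strptime("2019-11-16 16:44:55.406", "%Y-%m-%d %H:%M:%S.%f")
end = datetime.strptime("2019-11-16 16:50:59.406", "%Y-%m-%d %H:%M:%S.%f")

# Equivalent of casting to long and subtracting: whole seconds between the two
diff_in_seconds = int((end - start).total_seconds())
print(diff_in_seconds)  # 364 (6 minutes and 4 seconds)
```

The same value divided by 60 or 3600 gives the minute and hour differences, exactly as in the Spark snippets that follow.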
Using unix_timestamp() Example: Alternatively, you can get the difference in seconds by using the PySpark unix_timestamp() function.
#Getting Time Difference using unix_timestamp()
from pyspark.sql.functions import *
df.withColumn('from_timestamp',to_timestamp(col('from_timestamp')))\
.withColumn('end_timestamp', current_timestamp())\
.withColumn('DiffInSeconds',unix_timestamp("end_timestamp") - unix_timestamp('from_timestamp')) \
.show(truncate=False)
Time Difference in Minutes
Here, I am using the DataFrame created above to derive the time difference in minutes from the DiffInSeconds column. Dividing seconds by 60 gives minutes.
df2.withColumn('DiffInMinutes',round(col('DiffInSeconds')/60))\
.show(truncate=False)
#Result
+---+-----------------------+-----------------------+-------------+-------------+
|id |from_timestamp |end_timestamp |DiffInSeconds|DiffInMinutes|
+---+-----------------------+-----------------------+-------------+-------------+
|1 |2019-07-01 12:01:19.111|2021-02-26 23:02:53.663|52401694 |873362.0 |
|2 |2019-06-24 12:01:19.222|2021-02-26 23:02:53.663|53006494 |883442.0 |
|3 |2019-11-16 16:44:55.406|2021-02-26 23:02:53.663|40457878 |674298.0 |
|4 |2019-11-16 16:50:59.406|2021-02-26 23:02:53.663|40457514 |674292.0 |
+---+-----------------------+-----------------------+-------------+-------------+
Time Difference in Hours
Here also, I am using the DataFrame created above to derive the time difference in hours from the DiffInSeconds column. Dividing seconds by 3600 gives hours.
df2.withColumn('DiffInHours',round(col('DiffInSeconds')/3600))\
.show(truncate=False)
#Result
+---+-----------------------+-----------------------+-------------+-----------+
|id |from_timestamp |end_timestamp |DiffInSeconds|DiffInHours|
+---+-----------------------+-----------------------+-------------+-----------+
|1 |2019-07-01 12:01:19.111|2021-02-26 23:02:56.163|52401697 |14556.0 |
|2 |2019-06-24 12:01:19.222|2021-02-26 23:02:56.163|53006497 |14724.0 |
|3 |2019-11-16 16:44:55.406|2021-02-26 23:02:56.163|40457881 |11238.0 |
|4 |2019-11-16 16:50:59.406|2021-02-26 23:02:56.163|40457517 |11238.0 |
+---+-----------------------+-----------------------+-------------+-----------+
Time Difference – When Just Time is Present
Let’s see another example of the difference between two timestamps when a column contains just the time 'HH:mm:ss.SSS' without a date. When no date is present along with the time, we first need to convert the time string to a timestamp using the to_timestamp() function before applying any SQL date functions. Note that the snippet below subtracts the to_timestamp column from the from_timestamp column, so rows where the end time is later produce negative differences.
#Difference between two timestamps when input has just timestamp
data= [("12:01:19.000","13:01:19.000"),
("12:01:19.000","12:02:19.000"),
("16:44:55.406","17:44:55.406"),
("16:50:59.406","16:44:59.406")]
df3 = spark.createDataFrame(data=data, schema=["from_timestamp","to_timestamp"])
df3.withColumn("from_timestamp",to_timestamp(col("from_timestamp"),"HH:mm:ss.SSS")) \
.withColumn("to_timestamp",to_timestamp(col("to_timestamp"),"HH:mm:ss.SSS")) \
.withColumn("DiffInSeconds", col("from_timestamp").cast("long") - col("to_timestamp").cast("long")) \
.withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
.withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
.show(truncate=False)
Yields below output:
+-------------------+-------------------+-------------+-------------+-----------+
|from_timestamp |to_timestamp |DiffInSeconds|DiffInMinutes|DiffInHours|
+-------------------+-------------------+-------------+-------------+-----------+
|1970-01-01 12:01:19|1970-01-01 13:01:19|-3600 |-60.0 |-1.0 |
|1970-01-01 12:01:19|1970-01-01 12:02:19|-60 |-1.0 |0.0 |
|1970-01-01 16:44:55|1970-01-01 17:44:55|-3600 |-60.0 |-1.0 |
|1970-01-01 16:50:59|1970-01-01 16:44:59|360 |6.0 |0.0 |
+-------------------+-------------------+-------------+-------------+-----------+
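To confirm the 1970-01-01 behavior shown above, the same time-only strings can be parsed with the Python standard library (a cross-check, not Spark code). The missing date portion is filled with a default date on both sides, so it cancels out in the subtraction:

```python
from datetime import datetime

# Parsing a time-only string defaults the date to 1900-01-01 in Python
# (Spark's to_timestamp defaults to 1970-01-01); either way, the date
# cancels out when we subtract.
fmt = "%H:%M:%S.%f"
from_ts = datetime.strptime("12:01:19.000", fmt)
to_ts = datetime.strptime("13:01:19.000", fmt)

# Same order as the Spark code above: from - to, hence the negative result
diff_in_seconds = int((from_ts - to_ts).total_seconds())
print(diff_in_seconds)  # -3600, matching the first row of the output
```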
Date & Time is not in PySpark Default Format
Let’s see another example of the difference between two timestamps when both date and time are present, but not in the PySpark default TimestampType format 'yyyy-MM-dd HH:mm:ss.SSS'. When dates are not in this format, Spark date functions return null, so first convert the input dates to Spark TimestampType using the to_timestamp() function with a matching pattern.
df3 = spark.createDataFrame(
data=[("1","07-01-2019 12:01:19.406")],
schema=["id","input_timestamp"]
)
df3.withColumn("input_timestamp",to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH:mm:ss.SSS")) \
.withColumn("current_timestamp",current_timestamp()) \
.withColumn("DiffInSeconds",current_timestamp().cast("long") - col("input_timestamp").cast("long")) \
.withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
.withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
.withColumn("DiffInDays",round(col("DiffInSeconds")/(24*3600))) \
.show(truncate=False)
Output:
+---+-----------------------+-----------------------+-------------+-------------+-----------+------------+
|id |input_timestamp |current_timestamp |DiffInSeconds|DiffInMinutes|DiffInHours|DiffInDays |
+---+-----------------------+-----------------------+-------------+-------------+-----------+------------+
|1  |2019-07-01 12:01:19.406|2021-02-26 23:25:32.878|52403053     |873384.0     |14556.0    |607.0       |
+---+-----------------------+-----------------------+-------------+-------------+-----------+------------+
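One pitfall worth noting in the DiffInDays expression: the parentheses around (24*3600) matter. Without them, standard operator precedence evaluates the division first and then multiplies by 3600, which produces a meaningless number. A quick plain-Python illustration:

```python
seconds = 52403053  # an example DiffInSeconds value

days = round(seconds / (24 * 3600))    # divide by 86400 seconds per day
not_days = round(seconds / 24 * 3600)  # precedence: (seconds / 24) * 3600

print(days)      # 607
print(not_days)  # 7860457950 -- not a day count
```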
SQL Example
Below is a PySpark SQL example to get the time difference in seconds, minutes, and hours. When using unix_timestamp() with a time in a string column, make sure the value does not include fractional seconds (SSS); with the default 'yyyy-MM-dd HH:mm:ss' pattern, a fractional-second string fails to parse.
spark.sql("select unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19') DiffInSeconds").show()
spark.sql("select (unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19'))/60 DiffInMinutes").show()
spark.sql("select (unix_timestamp('2019-07-02 12:01:19') - unix_timestamp('2019-07-01 12:01:19'))/3600 DiffInHours").show()
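The three queries above all operate on the same pair of literals, exactly one day apart. A quick standard-library check of the expected values (no Spark session needed):

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"
t1 = datetime.strptime("2019-07-01 12:01:19", fmt)
t2 = datetime.strptime("2019-07-02 12:01:19", fmt)

diff_seconds = int((t2 - t1).total_seconds())
print(diff_seconds)         # 86400 seconds
print(diff_seconds / 60)    # 1440.0 minutes
print(diff_seconds / 3600)  # 24.0 hours
```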
Complete Code:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
dates = [("1","2019-07-01 12:01:19.111"),
("2","2019-06-24 12:01:19.222"),
("3","2019-11-16 16:44:55.406"),
("4","2019-11-16 16:50:59.406")
]
df = spark.createDataFrame(data=dates, schema=["id","from_timestamp"])
from pyspark.sql.functions import *
df2=df.withColumn('from_timestamp',to_timestamp(col('from_timestamp')))\
.withColumn('end_timestamp', current_timestamp())\
.withColumn('DiffInSeconds',col("end_timestamp").cast("long") - col('from_timestamp').cast("long"))
df2.show(truncate=False)
df2.withColumn('DiffInMinutes',round(col('DiffInSeconds')/60))\
.show(truncate=False)
df2.withColumn('DiffInHours',round(col('DiffInSeconds')/3600))\
.show(truncate=False)
#Difference between two timestamps when input has just timestamp
data= [("12:01:19.000","13:01:19.000"),
("12:01:19.000","12:02:19.000"),
("16:44:55.406","17:44:55.406"),
("16:50:59.406","16:44:59.406")]
df3 = spark.createDataFrame(data=data, schema=["from_timestamp","to_timestamp"])
df3.withColumn("from_timestamp",to_timestamp(col("from_timestamp"),"HH:mm:ss.SSS")) \
.withColumn("to_timestamp",to_timestamp(col("to_timestamp"),"HH:mm:ss.SSS")) \
.withColumn("DiffInSeconds", col("from_timestamp").cast("long") - col("to_timestamp").cast("long")) \
.withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
.withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
.show(truncate=False)
#Date & Time is not in PySpark Default Format
df3 = spark.createDataFrame(
data=[("1","07-01-2019 12:01:19.406")],
schema=["id","input_timestamp"]
)
df3.withColumn("input_timestamp",to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH:mm:ss.SSS")) \
.withColumn("current_timestamp",current_timestamp()) \
.withColumn("DiffInSeconds",current_timestamp().cast("long") - col("input_timestamp").cast("long")) \
.withColumn("DiffInMinutes",round(col("DiffInSeconds")/60)) \
.withColumn("DiffInHours",round(col("DiffInSeconds")/3600)) \
.withColumn("DiffInDays",round(col("DiffInSeconds")/(24*3600))) \
.show(truncate=False)
Conclusion:
In this tutorial, you have learned how to calculate the difference between two timestamps in seconds, minutes, and hours using PySpark DataFrame and SQL time functions. See the related articles below for more on these functions.
Happy Learning !!
Related Articles:
- PySpark – Difference between two dates (days, months, years)
- PySpark SQL – Working with Unix Time | Timestamp
- PySpark to_timestamp() – Convert String to Timestamp type
- PySpark to_date() – Convert String to Date Format
- PySpark – How to Get Current Date & Timestamp
- PySpark SQL Date and Timestamp Functions
- PySpark SQL Types (DataType) with Examples