PySpark – Difference between two dates (days, months, years)

  • Post author:
  • Post category:PySpark
  • Post last modified:January 9, 2023
  • Reading time:6 mins read
Spark Difference between two dates

Using PySpark SQL functions datediff(), months_between() you can calculate the difference between two dates in days, months, and year, let’s see this by using a DataFrame example. You can also use these to calculate age.

datediff() Function

First Let’s see getting the difference between two dates using datediff() PySpark function.


from pyspark.sql.functions import *
data = [("1","2019-07-01"),("2","2019-06-24"),("3","2019-08-24")]
df=spark.createDataFrame(data=data,schema=["id","date"])

df.select(
      col("date"),
      current_date().alias("current_date"),
      datediff(current_date(),col("date")).alias("datediff")
    ).show()

Output:


+----------+------------+--------+
|      date|current_date|datediff|
+----------+------------+--------+
|2019-07-01|  2021-02-26|     606|
|2019-06-24|  2021-02-26|     613|
|2019-08-24|  2021-02-26|     552|
+----------+------------+--------+

months_between() Function

Now, Let’s see how to get month and year differences between two dates using months_between() function.


from pyspark.sql.functions import *
df.withColumn("datesDiff", datediff(current_date(),col("date"))) \
  .withColumn("montsDiff", months_between(current_date(),col("date"))) \
  .withColumn("montsDiff_round",round(months_between(current_date(),col("date")),2)) \
  .withColumn("yearsDiff",months_between(current_date(),col("date"))/lit(12)) \
  .withColumn("yearsDiff_round",round(months_between(current_date(),col("date"))/lit(12),2)) \
  .show()

Yields below output. Note that here we use round() function and lit() functions on top of months_between() to get the year between two dates.


+---+----------+---------+-----------+---------------+------------------+---------------+
| id|      date|datesDiff|  montsDiff|montsDiff_round|         yearsDiff|yearsDiff_round|
+---+----------+---------+-----------+---------------+------------------+---------------+
|  1|2019-07-01|      606|19.80645161|          19.81|1.6505376341666667|           1.65|
|  2|2019-06-24|      613|20.06451613|          20.06|1.6720430108333335|           1.67|
|  3|2019-08-24|      552|18.06451613|          18.06|1.5053763441666668|           1.51|
+---+----------+---------+-----------+---------------+------------------+---------------+

Let’s see another example of the difference between two dates when dates are not in PySpark DateType format yyyy-MM-dd. when dates are not in DateType format, all date functions return null. Hence, you need to first convert the input date to Spark DateType using to_date() function.


from pyspark.sql.functions import *
data2 = [("1","07-01-2019"),("2","06-24-2019"),("3","08-24-2019")]  
df2=spark.createDataFrame(data=data2,schema=["id","date"])
df2.select(
    to_date(col("date"),"MM-dd-yyyy").alias("date"),
    current_date().alias("endDate")
    )

SQL Example

Let’s see how to calculate the difference between two dates in years using PySpark SQL example. similarly you can calculate the days and months between two dates.


spark.sql("select round(months_between('2019-07-01',current_date())/12,2) as years_diff").show()

Complete Code:


from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()
data = [("1","2019-07-01"),("2","2019-06-24"),("3","2019-08-24")]

df=spark.createDataFrame(data=data,schema=["id","date"])

from pyspark.sql.functions import *

df.select(
      col("date"),
      current_date().alias("current_date"),
      datediff(current_date(),col("date")).alias("datediff")
    ).show()

df.withColumn("datesDiff", datediff(current_date(),col("date"))) \
      .withColumn("montsDiff", months_between(current_date(),col("date"))) \
      .withColumn("montsDiff_round",round(months_between(current_date(),col("date")),2)) \
      .withColumn("yearsDiff",months_between(current_date(),col("date"))/lit(12)) \
      .withColumn("yearsDiff_round",round(months_between(current_date(),col("date"))/lit(12),2)) \
      .show()

data2 = [("1","07-01-2019"),("2","06-24-2019"),("3","08-24-2019")]  
df2=spark.createDataFrame(data=data2,schema=["id","date"])
df2.select(
    to_date(col("date"),"MM-dd-yyyy").alias("date"),
    current_date().alias("endDate")
    )

spark.sql("select round(months_between('2019-07-01',current_date())/12,2) as years_diff").show()

Conclusion:

In this tutorial, you have learned how to calculate days, months, and years between two dates using PySpark Date and Time functions datediff(), months_between(). You can find more information about these functions at the following blog

Happy Learning !!

Related Articles:

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ @ LinkedIn

Leave a Reply