PySpark – Difference between two dates (days, months, years)


Using the PySpark SQL functions datediff() and months_between(), you can calculate the difference between two dates in days, months, and years. Let's see this with a DataFrame example. You can also use these functions to calculate age.

datediff() Function

First, let's get the difference between two dates using the datediff() PySpark function.


from pyspark.sql.functions import col, current_date, datediff

data = [("1","2019-07-01"),("2","2019-06-24"),("3","2019-08-24")]
df = spark.createDataFrame(data=data, schema=["id","date"])

df.select(
      col("date"),
      current_date().alias("current_date"),
      datediff(current_date(),col("date")).alias("datediff")
    ).show()

Output:


+----------+------------+--------+
|      date|current_date|datediff|
+----------+------------+--------+
|2019-07-01|  2021-02-26|     606|
|2019-06-24|  2021-02-26|     613|
|2019-08-24|  2021-02-26|     552|
+----------+------------+--------+

months_between() Function

Now, let's see how to get the month and year differences between two dates using the months_between() function.


from pyspark.sql.functions import col, current_date, datediff, months_between, round, lit
df.withColumn("datesDiff", datediff(current_date(),col("date"))) \
  .withColumn("montsDiff", months_between(current_date(),col("date"))) \
  .withColumn("montsDiff_round",round(months_between(current_date(),col("date")),2)) \
  .withColumn("yearsDiff",months_between(current_date(),col("date"))/lit(12)) \
  .withColumn("yearsDiff_round",round(months_between(current_date(),col("date"))/lit(12),2)) \
  .show()

Yields the output below. Note that round() and lit() are used on top of months_between() to get the number of years between the two dates.


+---+----------+---------+-----------+---------------+------------------+---------------+
| id|      date|datesDiff|  montsDiff|montsDiff_round|         yearsDiff|yearsDiff_round|
+---+----------+---------+-----------+---------------+------------------+---------------+
|  1|2019-07-01|      606|19.80645161|          19.81|1.6505376341666667|           1.65|
|  2|2019-06-24|      613|20.06451613|          20.06|1.6720430108333335|           1.67|
|  3|2019-08-24|      552|18.06451613|          18.06|1.5053763441666668|           1.51|
+---+----------+---------+-----------+---------------+------------------+---------------+

Let's see another example of the difference between two dates when the dates are not in the PySpark DateType format yyyy-MM-dd. When dates are not in DateType format, all date functions return null. Hence, you need to first convert the input date to Spark DateType using the to_date() function.


from pyspark.sql.functions import col, current_date, datediff, to_date

data2 = [("1","07-01-2019"),("2","06-24-2019"),("3","08-24-2019")]
df2 = spark.createDataFrame(data=data2, schema=["id","date"])
df2.select(
    to_date(col("date"),"MM-dd-yyyy").alias("date"),
    current_date().alias("endDate"),
    datediff(current_date(), to_date(col("date"),"MM-dd-yyyy")).alias("datediff")
  ).show()

SQL Example

Let's see how to calculate the difference between two dates in years using a PySpark SQL example. Similarly, you can calculate the difference in days and months.


spark.sql("select round(months_between('2019-07-01',current_date())/12,2) as years_diff").show()

Complete Code:


from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()
data = [("1","2019-07-01"),("2","2019-06-24"),("3","2019-08-24")]

df = spark.createDataFrame(data=data, schema=["id","date"])

from pyspark.sql.functions import col, current_date, datediff, months_between, round, lit, to_date

df.select(
      col("date"),
      current_date().alias("current_date"),
      datediff(current_date(),col("date")).alias("datediff")
    ).show()

df.withColumn("datesDiff", datediff(current_date(),col("date"))) \
      .withColumn("montsDiff", months_between(current_date(),col("date"))) \
      .withColumn("montsDiff_round",round(months_between(current_date(),col("date")),2)) \
      .withColumn("yearsDiff",months_between(current_date(),col("date"))/lit(12)) \
      .withColumn("yearsDiff_round",round(months_between(current_date(),col("date"))/lit(12),2)) \
      .show()

data2 = [("1","07-01-2019"),("2","06-24-2019"),("3","08-24-2019")]
df2 = spark.createDataFrame(data=data2, schema=["id","date"])
df2.select(
    to_date(col("date"),"MM-dd-yyyy").alias("date"),
    current_date().alias("endDate"),
    datediff(current_date(), to_date(col("date"),"MM-dd-yyyy")).alias("datediff")
  ).show()

spark.sql("select round(months_between('2019-07-01',current_date())/12,2) as years_diff").show()

Conclusion:

In this tutorial, you have learned how to calculate the number of days, months, and years between two dates using the PySpark date and time functions datediff() and months_between().

Happy Learning !!
