Spark SQL Date and Timestamp Functions

Spark SQL provides built-in standard Date and Timestamp (date plus time) functions in the DataFrame API; these come in handy when we need to operate on date and time. All of them accept input as a Date type, Timestamp type, or String. If the input is a String, it should be in a format that can be cast to a date, such as yyyy-MM-dd for dates and yyyy-MM-dd HH:mm:ss.SSSS for timestamps; the functions then return a date or timestamp respectively, and return null if the input string cannot be cast.

When possible, try to leverage the standard library functions, as they are more compile-time safe, handle nulls, and perform better than Spark UDFs. If your application is performance-critical, avoid custom UDFs at all costs, as they do not guarantee performance; a short comparison sketch follows the setup code below.

For readability, I've grouped the Date and Timestamp functions into the following sections.

Before you run any of the examples below, make sure you create a SparkSession and import the SQL functions.


import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExample")
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
import org.apache.spark.sql.functions._
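
To illustrate the point about UDFs, below is a minimal sketch of my own (assuming the SparkSession and imports created above) contrasting the built-in date_add() with a hand-written UDF doing the same job; the addDaysUdf helper is purely illustrative.


// Built-in function: null-safe and visible to the Catalyst optimizer.
val cmpDf = Seq(("2019-01-23"), (null)).toDF("input")
cmpDf.select(col("input"), date_add(col("input"), 4).as("builtin")).show()

// Hypothetical UDF doing the same job: opaque to the optimizer,
// and nulls must be handled by hand.
val addDaysUdf = udf((d: java.sql.Date) =>
  if (d == null) null else java.sql.Date.valueOf(d.toLocalDate.plusDays(4)))
cmpDf.select(col("input"), addDaysUdf(col("input").cast("date")).as("via_udf")).show()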

Spark SQL Date Functions

The table below lists each date function's signature and description; working examples in Scala follow further down.

Date Function Signature | Date Function Description
current_date(): Column | Returns the current date as a date column.
date_format(dateExpr: Column, format: String): Column | Converts a date/timestamp/string to a string value in the format specified by the second argument.
to_date(e: Column): Column | Converts the column into DateType by casting rules to DateType.
to_date(e: Column, fmt: String): Column | Converts the column into a DateType with a specified format.
add_months(startDate: Column, numMonths: Int): Column | Returns the date that is numMonths after startDate.
date_add(start: Column, days: Int): Column | Returns the date that is days days after start.
date_sub(start: Column, days: Int): Column | Returns the date that is days days before start.
datediff(end: Column, start: Column): Column | Returns the number of days from start to end.
months_between(end: Column, start: Column): Column | Returns the number of months between dates start and end. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
months_between(end: Column, start: Column, roundOff: Boolean): Column | Returns the number of months between dates end and start. If roundOff is set to true, the result is rounded off to 8 digits; otherwise it is not rounded.
next_day(date: Column, dayOfWeek: String): Column | Returns the first date later than the value of the date column that falls on the specified day of the week. For example, next_day('2015-07-27', "Sunday") returns 2015-08-02, the first Sunday after 2015-07-27.
trunc(date: Column, format: String): Column | Returns the date truncated to the unit specified by the format. For example, trunc("2018-11-19 12:01:19", "year") returns 2018-01-01. Format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month.
date_trunc(format: String, timestamp: Column): Column | Returns the timestamp truncated to the unit specified by the format. For example, date_trunc("year", "2018-11-19 12:01:19") returns 2018-01-01 00:00:00. Format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month; 'day', 'dd' to truncate by day; other options are 'second', 'minute', 'hour', 'week', and 'quarter'.
year(e: Column): Column | Extracts the year as an integer from a given date/timestamp/string.
quarter(e: Column): Column | Extracts the quarter as an integer from a given date/timestamp/string.
month(e: Column): Column | Extracts the month as an integer from a given date/timestamp/string.
dayofweek(e: Column): Column | Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through 7 for a Saturday.
dayofmonth(e: Column): Column | Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(e: Column): Column | Extracts the day of the year as an integer from a given date/timestamp/string.
weekofyear(e: Column): Column | Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday, and week 1 is the first week with more than 3 days, as defined by ISO 8601.
last_day(e: Column): Column | Returns the last day of the month to which the given date belongs. For example, input "2015-07-27" returns "2015-07-31", since July 31 is the last day of July 2015.
from_unixtime(ut: Column): Column | Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing that moment in the current system time zone, in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(ut: Column, f: String): Column | Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing that moment in the current system time zone, in the given format.
unix_timestamp(): Column | Returns the current Unix timestamp (in seconds) as a long.
unix_timestamp(s: Column): Column | Converts a time string in format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale.
unix_timestamp(s: Column, p: String): Column | Converts a time string with the given pattern to a Unix timestamp (in seconds).
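
Since unix_timestamp() and from_unixtime() don't get a worked example later in this post, here is a minimal sketch (assuming the SparkSession created above) that round-trips a time string through epoch seconds.


import org.apache.spark.sql.functions._
Seq(("2019-07-01 12:01:19"))
  .toDF("input")
  .select(
    col("input"),
    unix_timestamp(col("input")).as("epoch_seconds"),       // string -> seconds since 1970-01-01 UTC
    from_unixtime(unix_timestamp(col("input")), "MM-dd-yyyy HH:mm:ss").as("formatted")
  ).show(false)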

Spark SQL Timestamp Functions

Below are some of the Spark SQL Timestamp functions; these operate on both date and timestamp values. A description and example of each function follow.

The default format of the Spark Timestamp is yyyy-MM-dd HH:mm:ss.SSSS

Timestamp Function Signature | Timestamp Function Description
current_timestamp(): Column | Returns the current timestamp as a timestamp column.
hour(e: Column): Column | Extracts the hours as an integer from a given date/timestamp/string.
minute(e: Column): Column | Extracts the minutes as an integer from a given date/timestamp/string.
second(e: Column): Column | Extracts the seconds as an integer from a given date/timestamp/string.
to_timestamp(s: Column): Column | Converts to a timestamp by casting rules to TimestampType.
to_timestamp(s: Column, fmt: String): Column | Converts a time string with the given pattern to a timestamp.

Spark Date and Timestamp Window Functions

Below are the Date and Timestamp window functions.

Date & Time Window Function Syntax | Date & Time Window Function Description
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column | Bucketizes rows into one or more time windows given a timestamp-specifying column. Window starts are inclusive but window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows support microsecond precision; windows in the order of months are not supported.
window(timeColumn: Column, windowDuration: String, slideDuration: String): Column | Same as above, with the windows starting at 1970-01-01 00:00:00 UTC.
window(timeColumn: Column, windowDuration: String): Column | Generates tumbling time windows given a timestamp-specifying column, with the same inclusive/exclusive window bounds as above; the windows start at 1970-01-01 00:00:00 UTC.
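
The window() functions don't get a worked example below, so here is a minimal sketch of a tumbling 10-minute window; the event_time column name and the sample values are illustrative assumptions, not from the original examples.


import org.apache.spark.sql.functions._
Seq(("2019-01-23 12:01:00"), ("2019-01-23 12:04:30"), ("2019-01-23 12:11:15"))
  .toDF("event_time")
  // Bucket rows into non-overlapping 10-minute windows and count per bucket.
  .groupBy(window(col("event_time").cast("timestamp"), "10 minutes"))
  .count()
  .show(false)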

Spark Date Functions Examples

Below are the most-used examples of the Date functions.

current_date() and date_format()

We will see how to get the current date and convert a date into a specific format using date_format(), with a Scala example. The example below takes a date in the default 'yyyy-MM-dd' format and converts it to 'MM-dd-yyyy'.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"))
  .toDF("Input")
  .select( 
    current_date().as("current_date"), 
    col("Input"), 
    date_format(col("Input"), "MM-dd-yyyy").as("format") 
  ).show()

+------------+----------+----------+
|current_date|     Input|    format|
+------------+----------+----------+
|  2019-07-23|2019-01-23|01-23-2019|
+------------+----------+----------+

to_date()

Below example converts a string in 'MM/dd/yyyy' date format to a DateType column (yyyy-MM-dd) using to_date(), with a Scala example.


import org.apache.spark.sql.functions._
Seq(("04/13/2019"))
   .toDF("Input")
  .select( col("Input"), 
           to_date(col("Input"), "MM/dd/yyyy").as("to_date") 
   ).show()

+----------+----------+ 
|Input     |to_date   | 
+----------+----------+ 
|04/13/2019|2019-04-13| 
+----------+----------+

datediff()

Below example returns the difference between two dates using datediff() with Scala example.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
   .toDF("input")
   .select( col("input"), current_date(), 
       datediff(current_date(),col("input")).as("diff") 
    ).show()

+----------+--------------+--------+ 
| input    |current_date()| diff   | 
+----------+--------------+--------+ 
|2019-01-23| 2019-07-23   | 181    | 
|2019-06-24| 2019-07-23   | 29     | 
|2019-09-20| 2019-07-23   | -59    | 
+----------+--------------+--------+

months_between()

Below example returns the months between two dates using months_between() with Scala language.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
   .toDF("date")
  .select( col("date"), current_date(), 
       datediff(current_date(),col("date")).as("datediff"), 
       months_between(current_date(),col("date")).as("months_between")
   ).show()

+----------+--------------+--------+--------------+ 
| date     |current_date()|datediff|months_between| 
+----------+--------------+--------+--------------+ 
|2019-01-23| 2019-07-23   |     181|           6.0| 
|2019-06-24| 2019-07-23   |      29|    0.96774194| 
|2019-09-20| 2019-07-23   |     -59|   -1.90322581| 
+----------+--------------+--------+--------------+
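
The three-argument months_between() isn't covered above, so here is a minimal sketch (assuming Spark 2.4 or later, where the roundOff parameter was introduced) showing the unrounded result with roundOff = false.


import org.apache.spark.sql.functions._
Seq(("2019-06-24"))
  .toDF("date")
  .select(
    col("date"),
    // With roundOff = false the fractional part is not rounded to 8 digits.
    months_between(current_date(), col("date"), false).as("months_unrounded")
  ).show(false)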

trunc()

Below example truncates date at a specified unit using trunc() with Scala language.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
    .toDF("input")
    .select( col("input"), 
          trunc(col("input"),"Month").as("Month_Trunc"), 
          trunc(col("input"),"Year").as("Month_Year"), 
          trunc(col("input"),"Month").as("Month_Trunc") 
     ).show()

+----------+-----------+----------+ 
| input    |Month_Trunc|Year_Trunc| 
+----------+-----------+----------+ 
|2019-01-23| 2019-01-01|2019-01-01| 
|2019-06-24| 2019-06-01|2019-01-01| 
|2019-09-20| 2019-09-01|2019-01-01|
+----------+-----------+----------+
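
For comparison, here is a minimal sketch of date_trunc(), which, unlike trunc(), returns a timestamp and also supports sub-day units such as 'hour'.


import org.apache.spark.sql.functions._
Seq(("2019-06-24 12:01:19"))
  .toDF("input")
  .select(
    col("input"),
    date_trunc("month", col("input")).as("month_trunc"), // 2019-06-01 00:00:00
    date_trunc("hour", col("input")).as("hour_trunc")    // 2019-06-24 12:00:00
  ).show(false)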

add_months(), date_add(), date_sub()

Here we add and subtract days and months from a given input date.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20")).toDF("input")
  .select( col("input"), 
      add_months(col("input"),3).as("add_months"), 
      add_months(col("input"),-3).as("sub_months"), 
      date_add(col("input"),4).as("date_add"), 
      date_sub(col("input"),4).as("date_sub") 
   ).show()

+----------+----------+----------+----------+----------+ 
| input    |add_months|sub_months| date_add | date_sub | 
+----------+----------+----------+----------+----------+ 
|2019-01-23|2019-04-23|2018-10-23|2019-01-27|2019-01-19| 
|2019-06-24|2019-09-24|2019-03-24|2019-06-28|2019-06-20| 
|2019-09-20|2019-12-20|2019-06-20|2019-09-24|2019-09-16| 
+----------+----------+----------+----------+----------+

year(), month()

dayofweek(), dayofmonth(), dayofyear()

next_day(), weekofyear()


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select( col("input"), year(col("input")).as("year"), 
       month(col("input")).as("month"), 
       dayofweek(col("input")).as("dayofweek"), 
       dayofmonth(col("input")).as("dayofmonth"), 
       dayofyear(col("input")).as("dayofyear"), 
       next_day(col("input"),"Sunday").as("next_day"), 
       weekofyear(col("input")).as("weekofyear") 
   ).show()

+----------+----+-----+---------+----------+---------+----------+----------+
|     input|year|month|dayofweek|dayofmonth|dayofyear|  next_day|weekofyear| 
+----------+----+-----+---------+----------+---------+----------+----------+ 
|2019-01-23|2019|    1|        4|        23|       23|2019-01-27|         4| 
|2019-06-24|2019|    6|        2|        24|      175|2019-06-30|        26| 
|2019-09-20|2019|    9|        6|        20|      263|2019-09-22|        38| 
+----------+----+-----+---------+----------+---------+----------+----------+
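
quarter() and last_day() appear in the function table above but not in these examples, so here is a minimal sketch covering both.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select(
    col("input"),
    quarter(col("input")).as("quarter"),   // 1..4
    last_day(col("input")).as("last_day")  // last day of the input's month
  ).show(false)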

Spark Timestamp Functions Examples

Below are the most-used examples of the Timestamp functions.

current_timestamp()

Returns the current timestamp in the Spark default format yyyy-MM-dd HH:mm:ss.SSSS.


import org.apache.spark.sql.functions._
val df = Seq(1).toDF("seq")
val curDate = df.withColumn("current_date", current_date())
  .withColumn("current_timestamp", current_timestamp())
curDate.show(false)

Yields below output.


+---+------------+-----------------------+
|seq|current_date|current_timestamp      |
+---+------------+-----------------------+
|1  |2019-11-16  |2019-11-16 21:00:55.349|
+---+------------+-----------------------+

to_timestamp()

Converts a string timestamp to TimestampType, using the pattern given in the second argument.


import org.apache.spark.sql.functions._
  val dfDate = Seq(("07-01-2019 12 01 19 406"),
    ("06-24-2019 12 01 19 406"),
    ("11-16-2019 16 44 55 406"),
    ("11-16-2019 16 50 59 406")).toDF("input_timestamp")

  dfDate.withColumn("datetype_timestamp",
          to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH mm ss SSS"))
    .show(false)

Yields below output.


+-----------------------+-------------------+
|input_timestamp        |datetype_timestamp |
+-----------------------+-------------------+
|07-01-2019 12 01 19 406|2019-07-01 12:01:19|
|06-24-2019 12 01 19 406|2019-06-24 12:01:19|
|11-16-2019 16 44 55 406|2019-11-16 16:44:55|
|11-16-2019 16 50 59 406|2019-11-16 16:50:59|
+-----------------------+-------------------+
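
For completeness, here is a minimal sketch of the single-argument to_timestamp(), which relies on Spark's cast rules rather than an explicit pattern.


import org.apache.spark.sql.functions._
Seq(("2019-06-24 12:01:19"))
  .toDF("input_timestamp")
  // No pattern given: the string must already be in a castable format.
  .withColumn("datetype_timestamp", to_timestamp(col("input_timestamp")))
  .show(false)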

hour(), minute() and second()


import org.apache.spark.sql.functions._
  val df = Seq(("2019-07-01 12:01:19.000"),
    ("2019-06-24 12:01:19.000"),
    ("2019-11-16 16:44:55.406"),
    ("2019-11-16 16:50:59.406")).toDF("input_timestamp")

  df.withColumn("hour", hour(col("input_timestamp")))
    .withColumn("minute", minute(col("input_timestamp")))
    .withColumn("second", second(col("input_timestamp")))
    .show(false)

Yields below output.


+-----------------------+----+------+------+
|input_timestamp        |hour|minute|second|
+-----------------------+----+------+------+
|2019-07-01 12:01:19.000|12  |1     |19    |
|2019-06-24 12:01:19.000|12  |1     |19    |
|2019-11-16 16:44:55.406|16  |44    |55    |
|2019-11-16 16:50:59.406|16  |50    |59    |
+-----------------------+----+------+------+

Conclusion

In this post, I've consolidated the complete list of Spark Date and Timestamp functions, with descriptions and examples of the most commonly used ones.

Happy Learning !!

