You are currently viewing Spark SQL – Truncate Date Time by unit specified

In this tutorial, we will show you a Spark SQL DataFrame example of how to truncate Date and Time of a column using trunc() and date_trunc() functions and Scala language. We will also see how to use truncate when the DataFrame column has different date formats.

In this tutorial, we will show you a Spark SQL DataFrame example of how to truncate Date and Time of a DataFrame column using trunc() and date_trunc() functions with Scala language.

Truncating Date using trunc() Spark SQL function

Spark SQL DateFrame functions provide trunc() function to truncate Date at Year and Month units and returns Date in Spark DateType format “yyyy-MM-dd”. Note that Day doesn’t support by trunc() function and it returns null when used.


Seq("1").toDF("date").select(
    current_timestamp(),
    trunc(current_timestamp(),"Year").as("Year"),
    trunc(current_timestamp(),"Month").as("Month"),
    trunc(current_timestamp(),"Day").as("Day") // Not available so returns null
  ).show(false)

Output:


+-----------------------+----------+----------+----+
|current_timestamp()    |Year      |Month     |Day |
+-----------------------+----------+----------+----+
|2019-07-27 16:15:09.435|2019-01-01|2019-07-01|null|
+-----------------------+----------+----------+----+

Truncating Date and Time using date_trunc() Spark SQL function

Spark Dateframe SQL functions provide another truncate function date_trunc() to truncate at Year, Month, Day, Hour, Minute and Seconds units and returns Date in Spark DateType format “yyyy-MM-dd HH:mm:ss.SSSS”.


Seq("1").toDF("date").select(
    current_timestamp(),
    date_trunc("Year",current_timestamp()).as("Year"),
    date_trunc("Month",current_timestamp()).as("Month"),
    date_trunc("Day",current_timestamp()).as("Day"),
    date_trunc("Hour",current_timestamp()).as("Hour"),
    date_trunc("Minute",current_timestamp()).as("Minute")
  ).show(false)

Output:


+-----------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|current_timestamp()    |Year               |Month              |Day                |Hour               |Minute             |
+-----------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|2019-07-27 16:15:09.759|2019-01-01 00:00:00|2019-07-01 00:00:00|2019-07-27 00:00:00|2019-07-27 16:00:00|2019-07-27 16:15:00|
+-----------------------+-------------------+-------------------+-------------------+-------------------+-------------------+

Truncate when DataFrame column is in DateType format

Below is another example of trunc() function when Spark Date column has just Date without Time.


 Seq(("2019-01-23"),("2019-06-24"),("2019-09-20")).toDF("date").select(
    col("date"),
    trunc(col("date"),"Month").as("Month"),
    trunc(col("date"),"Year").as("Year")
  ).show()

Output:


+----------+----------+----------+
|      date|     Month|      Year|
+----------+----------+----------+
|2019-01-23|2019-01-01|2019-01-01|
|2019-06-24|2019-06-01|2019-01-01|
|2019-09-20|2019-09-01|2019-01-01|
+----------+----------+----------+

Truncate when Dataframe column, not in DateType format

In case, if your input Date is not in Spark DateType “yyyy-MM-dd” format, then first we should convert the Date String to DateType before we apply trunc() and date_trunc() function.


Seq(("01-23-2019"),("06-24-2019"),("09-20-2019")).toDF("date").select(
    col("date"),
    trunc(to_date(col("date"),"MM-dd-yyyy"),"Month").as("Month"),
    trunc(to_date(col("date"),"MM-dd-yyyy"),"Year").as("Year")
  ).show()

Output:


+----------+----------+----------+
|      date|     Month|      Year|
+----------+----------+----------+
|01-23-2019|2019-01-01|2019-01-01|
|06-24-2019|2019-06-01|2019-01-01|
|09-20-2019|2019-09-01|2019-01-01|
+----------+----------+----------+

Conclusion:

In this article, you have learned Spark SQL DataFrame example of how to truncate Date and Time of a column using Scala language and Date and Time functions. Also, you have learned the difference between trunc() and date_trunc() functions. You can find more information about these functions at the following blog

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium