PySpark to_timestamp() – Convert String to Timestamp type

Use to_timestamp() function to convert String to Timestamp (TimestampType) in PySpark. The converted time would be in a default format of MM-dd-yyyy HH:mm:ss.SSS. It is commonly used when you have a DataFrame with a column that contains date or timestamp values in string format and you want to convert them into a proper timestamp or date data type for further analysis or operations. Make sure to adjust the timestamp_format to match the format of the timestamp strings in your data. The format should follow the SimpleDateFormat pattern for date and time formats. I will explain how to use this function with a few examples.

Syntax – to_timestamp()


Syntax: to_timestamp(timestampString:Column) 
Syntax: to_timestamp(timestampString:Column,format:String) 
 

This function has the above two signatures that are defined in PySpark SQL Date & Timestamp Functions, the first syntax takes just one argument and the argument should be in Timestamp format ‘MM-dd-yyyy HH:mm:ss.SSS‘, when the format is not in this format, it returns null.

The second signature takes an additional String argument to specify the format of the input Timestamp; this supports formats specified in SimeDateFormat. Using this additional argument, you can cast String from any format to Timestamp type in PySpark.

Convert String to PySpark Timestamp type

In the below example, we convert the string pattern which is in PySpark default format to Timestamp type, since the input DataFrame column is in default Timestamp format, we use the first signature for conversion. And the second example uses the cast function to do the same.


from pyspark.sql.functions import *

df=spark.createDataFrame(
        data = [ ("1","2019-06-24 12:01:19.000")],
        schema=["id","input_timestamp"])
df.printSchema()

#Timestamp String to DateType
df.withColumn("timestamp",to_timestamp("input_timestamp")) \
  .show(truncate=False)

# Using Cast to convert TimestampType to DateType
df.withColumn('timestamp_string', \
         to_timestamp('timestamp').cast('string')) \
  .show(truncate=False)

In this snippet, we just add a new column timestamp by converting the input column from string to Timestamp type.


# Output
root
 |-- id: string (nullable = true)
 |-- timestamp: string (nullable = true)

+---+-----------------------+-------------------+
|id |input_timestamp        |timestamp          |
+---+-----------------------+-------------------+
|1  |2019-06-24 12:01:19.000|2019-06-24 12:01:19|
+---+-----------------------+-------------------+

Custom string format to Timestamp type

This example converts the input timestamp string from custom format to PySpark Timestamp type, to do this, we use the second syntax where it takes an additional argument to specify user-defined patterns for date-time formatting,


#when dates are not in Spark TimestampType format 'yyyy-MM-dd  HH:mm:ss.SSS'.
#Note that when dates are not in Spark Tiemstamp format, all Spark functions returns null
#Hence, first convert the input dates to Spark DateType using to_timestamp function
df.select(to_timestamp(lit('06-24-2019 12:01:19.000'),'MM-dd-yyyy HH:mm:ss.SSSS')) \
  .show()

#Displays
+-------------------------------------------------------------------+
|to_timestamp('06-24-2019 12:01:19.000', 'MM-dd-yyyy HH:mm:ss.SSSS')|
+-------------------------------------------------------------------+
|                                                2019-06-24 12:01:19|
+-------------------------------------------------------------------+

In case you want to convert string to date format use to_date() function. And here is another example to convert Timestamp to custom string pattern format.

SQL Example


#SQL string to TimestampType
spark.sql("select to_timestamp('2019-06-24 12:01:19.000') as timestamp")
#SQL CAST timestamp string to TimestampType
spark.sql("select timestamp('2019-06-24 12:01:19.000') as timestamp")
#SQL Custom string to TimestampType
spark.sql("select to_timestamp('06-24-2019 12:01:19.000','MM-dd-yyyy HH:mm:ss.SSSS') as timestamp")

Complete Example for quick reference


from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()

from pyspark.sql.functions import *

df=spark.createDataFrame(
        data = [ ("1","2019-06-24 12:01:19.000")],
        schema=["id","input_timestamp"])
df.printSchema()

#Timestamp String to DateType
df.withColumn("timestamp",to_timestamp("input_timestamp")) \
  .show(truncate=False)
  
# Using Cast to convert TimestampType to DateType
df.withColumn('timestamp', \
         to_timestamp('input_timestamp').cast('string')) \
  .show(truncate=False)
  

df.select(to_timestamp(lit('06-24-2019 12:01:19.000'),'MM-dd-yyyy HH:mm:ss.SSSS')) \
  .show(truncate=False)
  
#SQL string to TimestampType
spark.sql("select to_timestamp('2019-06-24 12:01:19.000') as timestamp")
#SQL CAST timestamp string to TimestampType
spark.sql("select timestamp('2019-06-24 12:01:19.000') as timestamp")
#SQL Custom string to TimestampType
spark.sql("select to_timestamp('06-24-2019 12:01:19.000','MM-dd-yyyy HH:mm:ss.SSSS') as timestamp")

Frequently Asked Questions on to_timestamp()

How to handle timezone information with to_timestamp() in PySpark?

We can specify a timezone when converting timestamps. For example, you can use to_timestamp with the timestamp_format and timezone functions to handle time zones.

What happens if the format specified in to_timestamp() doesn’t match the format of the input string?

If the format doesn’t match, the function may raise an error or produce incorrect results. It’s essential to provide the correct format for accurate conversion.

Is to_timestamp() case-sensitive when matching format elements?

Yes, to_timestamp() is case-sensitive for format elements. Format elements like “yyyy,” “MM,” and “dd” must match the case in the input string.

What is the result data type when using to_timestamp() in PySpark?

The result of using to_timestamp() is a column containing timestamps or dates, depending on the format specified, with the TimestampType or DateType as the data type, respectively.

Happy Learning !!

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ @ LinkedIn

Leave a Reply