PySpark SQL – Working with Unix Time | Timestamp

In PySpark SQL, unix_timestamp() is used to get the current time and to convert a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), while from_unixtime() converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representation of the timestamp.

Both unix_timestamp() and from_unixtime() can be used in PySpark SQL as well as on DataFrames, and both use the system's default timezone and locale.
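
For example, both functions can be called directly from a SQL statement as well as through the DataFrame API. Here is a minimal sketch (assuming an existing SparkSession named spark):


# Minimal sketch: calling both functions from a SQL statement
# (assumes a SparkSession named `spark` already exists)
spark.sql(
    "SELECT unix_timestamp() AS now_epoch, "
    "from_unixtime(unix_timestamp()) AS now_string"
).show(truncate=False)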

What is Unix Time (Epoch Time)

Unix time, also known as Epoch time, is the number of seconds elapsed since 1970-01-01 00:00:00 UTC. It is widely used on Unix-like operating systems.

Note that Unix Epoch time is expressed in whole seconds and does not support a fraction of a second (the SSS part of a timestamp pattern).
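
As a quick illustration outside Spark (a sketch using only Python's standard library), the current Unix time in whole seconds can be obtained like this:


import time

# Current Unix time; int() drops the fractional part,
# matching the whole-second granularity described above
epoch_seconds = int(time.time())
print(epoch_seconds)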

Before we start using Unix Epoch time, let's create a PySpark DataFrame with different dates and times. Note that our examples also don't include a fraction of a second (SSS).


# Create a DataFrame with three string columns holding different date/time formats
inputData = [("2019-07-01 12:01:19",
              "07-01-2019 12:01:19",
              "07-01-2019")]
columns = ["timestamp_1", "timestamp_2", "timestamp_3"]
df = spark.createDataFrame(data=inputData, schema=columns)
df.printSchema()
df.show(truncate=False)

To keep the example simple, I've named the columns timestamp_1, timestamp_2, and timestamp_3. Running printSchema() and show() on the DataFrame yields the output below.


root
 |-- timestamp_1: string (nullable = true)
 |-- timestamp_2: string (nullable = true)
 |-- timestamp_3: string (nullable = true)

+-------------------+-------------------+-----------+
|timestamp_1        |timestamp_2        |timestamp_3|
+-------------------+-------------------+-----------+
|2019-07-01 12:01:19|07-01-2019 12:01:19|07-01-2019 |
+-------------------+-------------------+-----------+

unix_timestamp() – Converts Date and Timestamp Column to Unix Time

The PySpark SQL function unix_timestamp() is used to get the current time and to convert a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the current timezone of the system.

Syntax:


1) unix_timestamp()
2) unix_timestamp(timestamp)
3) unix_timestamp(timestamp, format)
  1. Returns the current time as Unix timestamp seconds (epoch time).
  2. Returns the Unix seconds as a long from the date/time column, using the default format yyyy-MM-dd HH:mm:ss.
  3. Returns the Unix seconds as a long from the date/time column; the second argument specifies the date/time format of the first argument.

# Convert timestamp string columns to Unix time (seconds)
from pyspark.sql.functions import unix_timestamp, col
df2 = df.select( 
      unix_timestamp(col("timestamp_1")).alias("timestamp_1"), 
      unix_timestamp(col("timestamp_2"),"MM-dd-yyyy HH:mm:ss").alias("timestamp_2"), 
      unix_timestamp(col("timestamp_3"),"MM-dd-yyyy").alias("timestamp_3"), 
      unix_timestamp().alias("timestamp_4") 
   )
df2.printSchema()
df2.show(truncate=False)

This yields the output below. Notice that the example uses all three unix_timestamp() syntaxes.


root
 |-- timestamp_1: long (nullable = true)
 |-- timestamp_2: long (nullable = true)
 |-- timestamp_3: long (nullable = true)
 |-- timestamp_4: long (nullable = true)

+-----------+-----------+-----------+-----------+
|timestamp_1|timestamp_2|timestamp_3|timestamp_4|
+-----------+-----------+-----------+-----------+
|1562007679 |1562007679 |1561964400 |1577166257 |
+-----------+-----------+-----------+-----------+
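
The same conversions can also be written as a plain SQL query against a temporary view. Below is a sketch; the view name input_times is a placeholder, not part of the original example.


# Sketch: SQL equivalent of the DataFrame example above
# (the view name "input_times" is a made-up placeholder)
df.createOrReplaceTempView("input_times")
spark.sql("""
    SELECT unix_timestamp(timestamp_1) AS timestamp_1,
           unix_timestamp(timestamp_2, 'MM-dd-yyyy HH:mm:ss') AS timestamp_2,
           unix_timestamp(timestamp_3, 'MM-dd-yyyy') AS timestamp_3,
           unix_timestamp() AS timestamp_4
    FROM input_times
""").show(truncate=False)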

from_unixtime() – Converts Unix Time Seconds to Date and Timestamp

The PySpark SQL function from_unixtime() is used to convert a Unix timestamp to a string representing the date and timestamp; in other words, it converts Epoch time in seconds to a date and timestamp string.

Syntax:


1) from_unixtime(timestamp)
2) from_unixtime(timestamp, format)

This function has two signatures: the first returns the timestamp in the default format yyyy-MM-dd HH:mm:ss, and the second returns it in the format specified by the second argument.


# Convert Unix time seconds back to date/timestamp strings
df3 = df2.select(
    from_unixtime(col("timestamp_1")).alias("timestamp_1"),
    from_unixtime(col("timestamp_2"),"MM-dd-yyyy HH:mm:ss").alias("timestamp_2"),
    from_unixtime(col("timestamp_3"),"MM-dd-yyyy").alias("timestamp_3"),
    from_unixtime(col("timestamp_4")).alias("timestamp_4")
  )
df3.printSchema()
df3.show(truncate=False)

This yields the output below. The snippet uses both syntaxes of the from_unixtime() function.


+-------------------+-------------------+-----------+-------------------+
|timestamp_1        |timestamp_2        |timestamp_3|timestamp_4        |
+-------------------+-------------------+-----------+-------------------+
|2019-07-01 12:01:19|07-01-2019 12:01:19|07-01-2019 |2021-02-26 17:27:45|
+-------------------+-------------------+-----------+-------------------+
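
Note that from_unixtime() always returns a string column. If you need an actual TimestampType column instead, one alternative (a sketch, not part of the original example) is to cast the epoch-seconds column directly, since Spark interprets a long cast to timestamp as seconds since the epoch:


from pyspark.sql.functions import col

# Sketch: cast epoch seconds (long) straight to TimestampType;
# unlike from_unixtime(), the result is a timestamp, not a string
df2.select(col("timestamp_1").cast("timestamp").alias("ts_1")).printSchema()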

Source code for reference


from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
          .appName('SparkByExamples.com') \
          .getOrCreate()

inputData = [("2019-07-01 12:01:19",
              "07-01-2019 12:01:19",
              "07-01-2019")]
columns = ["timestamp_1", "timestamp_2", "timestamp_3"]
df = spark.createDataFrame(data=inputData, schema=columns)
df.printSchema()
df.show(truncate=False)

# Convert timestamp string columns to Unix time (seconds)
from pyspark.sql.functions import unix_timestamp, from_unixtime, col
df2 = df.select( 
      unix_timestamp(col("timestamp_1")).alias("timestamp_1"), 
      unix_timestamp(col("timestamp_2"),"MM-dd-yyyy HH:mm:ss").alias("timestamp_2"), 
      unix_timestamp(col("timestamp_3"),"MM-dd-yyyy").alias("timestamp_3"), 
      unix_timestamp().alias("timestamp_4") 
   )
df2.printSchema()
df2.show(truncate=False)

# Convert Unix time seconds back to date/timestamp strings
df3 = df2.select(
    from_unixtime(col("timestamp_1")).alias("timestamp_1"),
    from_unixtime(col("timestamp_2"),"MM-dd-yyyy HH:mm:ss").alias("timestamp_2"),
    from_unixtime(col("timestamp_3"),"MM-dd-yyyy").alias("timestamp_3"),
    from_unixtime(col("timestamp_4")).alias("timestamp_4")
  )
df3.printSchema()
df3.show(truncate=False)

Conclusion

In this article, you learned about Unix time, how to use the PySpark SQL function unix_timestamp() to convert dates and timestamps to Unix time in seconds, and how to use from_unixtime() to convert Unix time seconds back to dates and timestamps.
