What is time travel in the delta table on Databricks? In modern-day to day ETL activities, we see a huge amount of data trafficking into the data lake. There are always a few rows inserted, updated, and deleted. Time travel is a key feature present in Delta Lake technology in Databricks.
Delta Lake uses transaction logging to store the history of changes on your data and with this feature, you can access the historical version of data that is changing over time and helps to go back in time travel on the delta table and see the previous snapshot of the data and also helps in auditing, logging, and data tracking.
Table of contents
1. Challenges in Data Transition
- Auditing: Looking over the data changes is critical to keep data in compliance and for debugging any changes. Data Lake without a time travel feature is failed in such scenarios as we can’t roll back to the previous state once changes are done.
- Reproduce experiments & reports: ML engineers try to create many models using some given set of data. When they try to reproduce the model after a period of time, typically the source data has been modified and they struggle to reproduce their experiments.
- Rollbacks: In the case of Data transitions that do simple appends, rollbacks could be possible by date-based partitioning. But in the case of upserts or changes by mistake, it’s very complicated to roll back data to the previous state.
2. What is Delta Table in Databricks
Delta’s time travel capabilities in Azure Databricks simplify building data pipelines for the above challenges. As you write into a Delta table or directory, every operation is automatically versioned and stored in transactional logs. You can access the different versions of the data in two different ways:
Now, let us create a Delta table and perform some modifications on the same table and try to play with the Time Travel feature.
In Databricks the time travel with delta table is achieved by using the following.
- Using a timestamp
- Using a version number
Note: By default, all the tables that are created in Databricks are Delta tables.
3 Create a Delta Table in Databricks
Here, I am using the community Databricks version to achieve this (https://community.cloud.databricks.com/). Create a cluster and attach the cluster to your notebook.
Let’s create the table and insert a few rows to the table.
# create table CREATE TABLE students (id INT, name STRING, value DOUBLE); # insert rows to table INSERT INTO students VALUES (1, "Ted", 4.7), (2, "Tiffany", 5.5), (3, "Vini", 6.3); # describe table DESCRIBE EXTENDED students;
You will get the below output.
Here we have created a student table with some records and as you can see it’s by default provided as delta.
4. Update Delta Records and Check History
Let’s update the student delta table with id’s 1,3 and delete records with id 2. Add another cell to simulate the update and delete the row in the table
# update rows update students set value = value * 10 where id in (1,3); # delete rows delete from students where id = 2; # describe table describe history students;
You will see something below on databricks.
As you can see from the above screenshot, there are total 4 versions since the table is created, with the top record being the most recent change
- version0: Created the student table at 2022-11-14T12:09:24.000+0000
- version1: Inserted records into the table at 2022-11-14T12:09:29.000+0000
- version2: Updated the values of id’s 1,3 at 2022-11-14T12:09:35.000+0000
- version3: Deleted the record of id’s 2 at 2022-11-14T12:09:39.000+0000
Notice that the
describe result shows the
timestamp of the transaction that occurred,
metrics. Metrics in the results show the number of rows and files changed.
5. Query Delta Table in Time Travel
As we already know Delta tables in Databricks have the time travel functionality which can be explored either using timestamp or by version number.
Note: Regardless of what approach you use, it’s just a simple
SQL SELECT command with extending “
5.1. Using the timestamp
# using timestamp SELECT * FROM tableName TIMESTAMP as of "operation timestamp from history"
First, let’s see our initial table, the table before I run the update and delete. To get this let’s use the
Now let’s see how the data is updated and deleted after each statement.
From the above examples, I hope it is clear how to query the table back in time.
5.2. Using Version number
As every operation on the delta table is marked with a version number, and you can use the
version to travel back in time as well.
# using version SELECT * FROM tableName VERSION as of "VERSION NUMBER"
Below I have executed some queries to get the initial version of the table, after updating and deleting. The result would be the same as what we got with a
Time travel of Delta table in databricks improves developer productivity tremendously. It helps:
- Data scientists and ML experts manage their experiments better by going back to the source of truth.
- Data engineers simplify their pipelines and roll back bad writes.
- Data analysts do easy reporting.
- Rollback to the previous state.
- Spark Timestamp – Extract hour, minute and second
- Spark Timestamp Difference in seconds, minutes and hours
- Spark – What is SparkSession Explained
- Spark Read XML file using Databricks API
- Spark Performance Tuning & Best Practices
- Spark Shell Command Usage with Examples