Time Travel with Delta Tables in Databricks?

What is time travel in the delta table on Databricks? In modern-day to day ETL activities, we see a huge amount of data trafficking into the data lake. There are always a few rows inserted, updated, and deleted. Time travel is a key feature present in Delta Lake technology in Databricks.

Delta Lake uses transaction logging to store the history of changes on your data and with this feature, you can access the historical version of data that is changing over time and helps to go back in time travel on the delta table and see the previous snapshot of the data and also helps in auditing, logging, and data tracking.

1. Challenges in Data Transition

  • Auditing: Looking over the data changes is critical to keep data in compliance and for debugging any changes. Data Lake without a time travel feature is failed in such scenarios as we can’t roll back to the previous state once changes are done.
  • Reproduce experiments & reports: ML engineers try to create many models using some given set of data. When they try to reproduce the model after a period of time, typically the source data has been modified and they struggle to reproduce their experiments.
  • Rollbacks: In the case of Data transitions that do simple appends, rollbacks could be possible by date-based partitioning. But in the case of upserts or changes by mistake, it’s very complicated to roll back data to the previous state.

2. What is Delta Table in Databricks

Delta’s time travel capabilities in Azure Databricks simplify building data pipelines for the above challenges. As you write into a Delta table or directory, every operation is automatically versioned and stored in transactional logs. You can access the different versions of the data in two different ways:

Now, let us create a Delta table and perform some modifications on the same table and try to play with the Time Travel feature.

In Databricks the time travel with delta table is achieved by using the following.

  1. Using a timestamp
  2. Using a version number

Note: By default, all the tables that are created in Databricks are Delta tables.

3 Create a Delta Table in Databricks

Here, I am using the community Databricks version to achieve this (https://community.cloud.databricks.com/). Create a cluster and attach the cluster to your notebook.

Let’s create the table and insert a few rows to the table.


# create table
CREATE TABLE students
       (id INT, name STRING, value DOUBLE);

# insert rows to table
INSERT INTO students VALUES (1, "Ted", 4.7), 
                            (2, "Tiffany", 5.5),
                            (3, "Vini", 6.3);

# describe table
DESCRIBE EXTENDED students;

You will get the below output.

time travel databricks
create a delta table in databricks

Here we have created a student table with some records and as you can see it’s by default provided as delta.

4. Update Delta Records and Check History

Let’s update the student delta table with id’s 1,3 and delete records with id 2. Add another cell to simulate the update and delete the row in the table


# update rows
update students set value = value * 10
                    where id in (1,3);

# delete rows
delete from students where id = 2;

# describe table
describe history students;

You will see something below on databricks.

time travel databricks
Describe History on students

As you can see from the above screenshot, there are total 4 versions since the table is created, with the top record being the most recent change

  • version0: Created the student table at 2022-11-14T12:09:24.000+0000
  • version1: Inserted records into the table at 2022-11-14T12:09:29.000+0000
  • version2: Updated the values of id’s 1,3 at 2022-11-14T12:09:35.000+0000
  • version3: Deleted the record of id’s 2 at 2022-11-14T12:09:39.000+0000

Notice that the describe result shows the version, timestamp of the transaction that occurred, operation, parameters, and metrics. Metrics in the results show the number of rows and files changed.

5. Query Delta Table in Time Travel

As we already know Delta tables in Databricks have the time travel functionality which can be explored either using timestamp or by version number.

Note: Regardless of what approach you use, it’s just a simple SQL SELECT command with extending “as of

5.1. Using the timestamp

Syntax:


# using timestamp
SELECT * FROM tableName TIMESTAMP as of "operation timestamp from history"

First, let’s see our initial table, the table before I run the update and delete. To get this let’s use the timestamp column.

Time Travel using Operation Timestamp

Now let’s see how the data is updated and deleted after each statement.

Time Travel delta table to update and delete

From the above examples, I hope it is clear how to query the table back in time.

5.2. Using Version number

As every operation on the delta table is marked with a version number, and you can use the version to travel back in time as well.

Syntax:


# using version
SELECT * FROM tableName VERSION as of "VERSION NUMBER"

Below I have executed some queries to get the initial version of the table, after updating and deleting. The result would be the same as what we got with a timestamp.

Time Travel databricks

6. Conclusion

Time travel of Delta table in databricks improves developer productivity tremendously. It helps:

  • Data scientists and ML experts manage their experiments better by going back to the source of truth.
  • Data engineers simplify their pipelines and roll back bad writes.
  • Data analysts do easy reporting.
  • Rollback to the previous state.

Related Articles

References

Sriram

Data Engineer. I write about BigData Architecture, tools and techniques that are used to build Bigdata pipelines and other generic blogs.

Leave a Reply

You are currently viewing Time Travel with Delta Tables in Databricks?