Data Lake vs Data Warehouse

In this Data Lake vs Data Warehouse article, I will explain what is Data Lake and it’s differences with Data warehouse.

Advertisements

A Data Lake is a centralized repository of structured, semi-structured, unstructured, and binary data that allows you to store a large amount of data as-is in its original raw format.

Once the data stored in a lake, it cannot or should not be changed hence it is an immutable collection of Data. First, it’s been introduced by James Dixon, who was a chief technology officer at Pentaho

Most of us think Data Lake is similar to Data Warehouse, but these two are completely different, with one similarity being both are used to store a large collection of data in a single repository location.

Now, let’s see what data warehouse offers, In simple words, data stored in data warehouses are transformed data meaning created after applying some transformations and filtering from the OLTP database using ETL tools and it is created for a specific purpose basically for reporting and analytics.

Note that Data warehouse is not replacing by Data Lake, Both are used in the industry and the purpose is different. Let’s see the differences between Data Lake vs Data Warehouse and when to use each.

Data Lake vs Data Warehouse

[table id=37 /]

The above table gives you a brief idea of the difference between Data Lake vs Data Warehouse. Now, lets deep-dive into each difference.

Data Lake contains “Source of Truth” data

In a lake, data stored from various sources as-is in its original format, It is a single “Source of Truth” for data, whereas in a data warehouse that data loses its originality as it’s been transformed, aggregated, and filter using ETL tools. This is one of the major differences between Data Lake vs Data Warehouse.

Lake supports various “Types of Data”

Lake supports various types of non-curated Data.

Structured – Extracted data from RDBMS tables
Semi-structured – CSV, XML, JSON
Unstructured – Text, PDF, logs,
Binary – Image, Audio, Video files

Whereas the data warehouse contains structured data in rows and columns format and the data here is curated with high quality.

Data in Lake has not defined a “Purpose”

Data in the lakes have no purpose defined hence it can be used to derive a new purpose as data evolves and business wants a new product. When a business wants to derive a new purpose or product, the data from lake is transformed from various formats to structured, cleansed, grouped and finally load into a data warehouse which will be used by business analysts using analytical tools.

Data in Lake is available for all “Types of Users”

Data in lakes is available for data scientists, data engineers, business analysts users whereas data warehouse is used by only data analysts. If you notice data lake can also be used by data analyst but the data needs to be curated before use.

Data Lakes comes with “Huge Size & Low cost”

Data in the lake grows very fast as it stores data from various sources like IoT devices, web sites, mobile apps, social media, logs, and data providers. And the data warehouse is much smaller as it is filtered and aggregated data.

Data Lake stores “Real-time” data

Since data in lakes are in original raw format, as we receive the data, it can be stored in lakes immediately without delay whereas the data in data warehouse is delayed by few hrs or 1 days as it needs to be extracted, transformed and loaded to warehouse by nightly jobs.

Data Lake has a “Future Value”

When businesses would need to launch new products in the future, the data in the lake comes in handy as they can derive a new value out of it as per the future needs. Data warehouse can’t be reused for the future as it’s created for current market needs.

Data Lake has no “Schema”

As Data lakes supports various formats usually they do not have any specific schema, when a user wants a data for s specific purpose, he will define the schema at the time of read. Whereas data warehouse has a predefined structure and schema in a relational database table formats hence when preparing data user need to aware of this schema and data should be prepared to fit in this specific schema. Not following the schema results in data lose.

Design & Technology

Data warehouse uses ETL tools to extract, transform, and finally loads the data into high-cost relational databases whereas Data lake uses low-cost commodity hardware and stores the data in HDFS, AWS S3, and Azure blob storage, when data is needed for analytics it will be transformed and used.

References

https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc

This Post Has 4 Comments

NNK September 4, 2020

Thank you for reading Data lake vs Data warehouse differences.
Aditya Bairy September 4, 2020

Wonderfully explained, very helpful.
NNK September 3, 2020

Thanks, Arun. you are right. I’ve changed it. Thanks for mentioning. Hope you like the differences :)
Arun Kumar September 3, 2020

In the table Data Lake vs Data Warehouse the last row for Design and Technology has incorrect values… Looks like the values got interchanged..

Data Lake vs Data Warehouse

Data Lake contains “Source of Truth” data

Lake supports various “Types of Data”

Data in Lake has not defined a “Purpose”

Data in Lake is available for all “Types of Users”

Data Lakes comes with “Huge Size & Low cost”

Data Lake stores “Real-time” data

Data Lake has a “Future Value”

Data Lake has no “Schema”

Design & Technology

Related Articles

References

Leave a Reply

This Post Has 4 Comments