Data Lake vs Data Warehouse

  • Post author:
  • Post category:Data Lake
  • Post last modified:September 2, 2020

In this Data Lake vs Data Warehouse article, I will explain what is Data Lake and it’s differences with Data warehouse.

A Data Lake is a centralized repository of structured, semi-structured, unstructured, and binary data that allows you to store a large amount of data as-is in its original raw format.

Once the data stored in a lake, it cannot or should not be changed hence it is an immutable collection of Data. First, it’s been introduced by James Dixon, who was a chief technology officer at Pentaho

Most of us think Data Lake is similar to Data Warehouse, but these two are completely different, with one similarity being both are used to store a large collection of data in a single repository location.

Now, let’s see what data warehouse offers, In simple words, data stored in data warehouses are transformed data meaning created after applying some transformations and filtering from the OLTP database using ETL tools and it is created for a specific purpose basically for reporting and analytics.

Note that Data warehouse is not replacing by Data Lake, Both are used in the industry and the purpose is different. Let’s see the differences between Data Lake vs Data Warehouse and when to use each.

Data Lake vs Data Warehouse

[table “44” not found /]

The above table gives you a brief idea of the difference between Data Lake vs Data Warehouse. Now, lets deep-dive into each difference.

Data Lake contains “Source of Truth” data

In a lake, data stored from various sources as-is in its original format, It is a single “Source of Truth” for data, whereas in a data warehouse that data loses its originality as it’s been transformed, aggregated, and filter using ETL tools. This is one of the major differences between Data Lake vs Data Warehouse.

Lake supports various “Types of Data”

Lake supports various types of non-curated Data.

  • Structured – Extracted data from RDBMS tables
  • Semi-structured – CSV, XML, JSON
  • Unstructured – Text, PDF, logs,
  • Binary – Image, Audio, Video files

Whereas the data warehouse contains structured data in rows and columns format and the data here is curated with high quality.

Data in Lake has not defined a “Purpose”

Data in the lakes have no purpose defined hence it can be used to derive a new purpose as data evolves and business wants a new product. When a business wants to derive a new purpose or product, the data from lake is transformed from various formats to structured, cleansed, grouped and finally load into a data warehouse which will be used by business analysts using analytical tools.

Data in Lake is available for all “Types of Users”

Data in lakes is available for data scientists, data engineers, business analysts users whereas data warehouse is used by only data analysts. If you notice data lake can also be used by data analyst but the data needs to be curated before use.

Data Lakes comes with “Huge Size & Low cost”

Data in the lake grows very fast as it stores data from various sources like IoT devices, web sites, mobile apps, social media, logs, and data providers. And the data warehouse is much smaller as it is filtered and aggregated data.

Data Lake stores “Real-time” data

Since data in lakes are in original raw format, as we receive the data, it can be stored in lakes immediately without delay whereas the data in data warehouse is delayed by few hrs or 1 days as it needs to be extracted, transformed and loaded to warehouse by nightly jobs.

Data Lake has a “Future Value”

When businesses would need to launch new products in the future, the data in the lake comes in handy as they can derive a new value out of it as per the future needs. Data warehouse can’t be reused for the future as it’s created for current market needs.

Data Lake has no “Schema”

As Data lakes supports various formats usually they do not have any specific schema, when a user wants a data for s specific purpose, he will define the schema at the time of read. Whereas data warehouse has a predefined structure and schema in a relational database table formats hence when preparing data user need to aware of this schema and data should be prepared to fit in this specific schema. Not following the schema results in data lose.

Design & Technology

Data warehouse uses ETL tools to extract, transform, and finally loads the data into high-cost relational databases whereas Data lake uses low-cost commodity hardware and stores the data in HDFS, AWS S3, and Azure blob storage, when data is needed for analytics it will be transformed and used.

References

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

This Post Has 4 Comments

  1. Arun Kumar

    In the table Data Lake vs Data Warehouse the last row for Design and Technology has incorrect values… Looks like the values got interchanged..

    1. NNK

      Thanks, Arun. you are right. I’ve changed it. Thanks for mentioning. Hope you like the differences 🙂

  2. Aditya Bairy

    Wonderfully explained, very helpful.

    1. NNK

      Thank you for reading Data lake vs Data warehouse differences.

You are currently viewing Data Lake vs Data Warehouse
Photo by Franki Chamaki on Unsplash