Data Lake vs Data Warehouse

In this article, I will explain what a Data Lake is and how it differs from a Data Warehouse.

A Data Lake is a centralized repository of structured, semi-structured, unstructured, and binary data that allows you to store a large amount of data as-is in its original raw format.

Once data is stored in a lake, it cannot, or at least should not, be changed; a lake is therefore an immutable collection of data. The term was first introduced by James Dixon, who was chief technology officer at Pentaho.

Many of us think a Data Lake is similar to a Data Warehouse, but the two are quite different. Their one similarity is that both are used to store a large collection of data in a single repository.

Now, let’s see what a data warehouse offers. In simple words, the data stored in a data warehouse is transformed data: it is created by applying transformations and filtering to data from OLTP databases using ETL tools, and it is built for a specific purpose, typically reporting and analytics.
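To make the ETL flow concrete, here is a minimal sketch using Python's built-in sqlite3 module. Two in-memory SQLite databases stand in for the OLTP system and the warehouse; the table names (orders, sales_by_region), the columns, and the sample rows are assumptions made up for illustration, not part of any real system.

```python
import sqlite3

# Hypothetical OLTP source: one row per order (names and rows are made up).
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, status TEXT)")
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "east", 120.0, "paid"),
        (2, "east", 80.0, "cancelled"),
        (3, "west", 200.0, "paid"),
        (4, "west", 50.0, "paid"),
    ],
)

# Extract + transform: keep only paid orders and aggregate them per region.
rows = oltp.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders "
    "WHERE status = 'paid' GROUP BY region ORDER BY region"
).fetchall()

# Load: write the transformed result into a reporting table in the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, order_count INTEGER, revenue REAL)"
)
warehouse.executemany("INSERT INTO sales_by_region VALUES (?, ?, ?)", rows)
warehouse.commit()

report = warehouse.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
print(report)
```

Notice that the cancelled order and the individual order rows never reach the warehouse; only the filtered, aggregated result does.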

Note that the Data Lake is not replacing the Data Warehouse. Both are used in the industry, and they serve different purposes. Let’s see the differences between a Data Lake and a Data Warehouse and when to use each.

Data Lake vs Data Warehouse

| Characteristics | Data lake | Data warehouse |
|---|---|---|
| Data Source of Truth | Data is in raw format | Data is not in its original format |
| Type of Data & Quality | Stores non-curated structured, semi-structured, unstructured, and binary data | Stores only curated, high-quality structured data |
| Purpose | The purpose is not defined up front, so you can derive one later | Created for a specific purpose |
| Types of Users | Data scientists, data analysts, and data engineers | Data analysts and business analysts |
| Data Use | Predictive analytics, real-time analytics, data discovery | Reporting & analytics |
| Size & Cost | Uses lower-cost storage and stores all the data a business receives or produces; the data can grow to petabytes | Uses higher-cost storage, so it stores only filtered and grouped data |
| Real time | Data is pushed to the lake immediately | Data is usually not up to date; it is delayed by one or a few days |
| Schema | Schema-on-read (no schema at write time; the schema is defined when the data is read) | Schema-on-write (the schema is defined before writing, and data must follow it in order to be stored) |
| Value | Organizations can generate new business value from their data | Serves only the specific value defined at the time of creation |
| Design & Technology | HDFS, AWS S3, Azure Blob | RDBMS database |

The table above gives a brief idea of the differences between a Data Lake and a Data Warehouse. Now, let’s deep-dive into each difference.

Data Lake contains “Source of Truth” data

In a lake, data from various sources is stored as-is in its original format, so the lake is a single “Source of Truth” for the data. In a data warehouse, the data loses its originality because it has been transformed, aggregated, and filtered using ETL tools. This is one of the major differences between a Data Lake and a Data Warehouse.

Lake supports various “Types of Data”

A lake supports various types of non-curated data:

  • Structured – Extracted data from RDBMS tables
  • Semi-structured – CSV, XML, JSON
  • Unstructured – Text, PDF, logs
  • Binary – Image, Audio, Video files

In contrast, a data warehouse contains structured data in rows and columns, and that data is curated and of high quality.

Data in a Lake has no defined “Purpose”

Data in a lake has no predefined purpose, so it can be used to derive a new purpose as the data evolves and the business wants a new product. When that happens, the data from the lake is converted from its various formats into a structured form, cleansed, grouped, and finally loaded into a data warehouse, where business analysts work with it using analytical tools.
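The path from lake to warehouse described here (structure, cleanse, group, load) can be sketched in plain Python. This is only an illustration under stated assumptions: the two raw inputs, their field names (user, amount), and the helper functions to_records and cleanse are hypothetical and invented for this example.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical raw lake objects in two different formats (contents are made up).
raw_json_lines = '{"user": "a", "amount": "10.5"}\n{"user": "b", "amount": null}\n'
raw_csv = "user,amount\na,4.5\nc,7\n"

def to_records(json_text, csv_text):
    """Bring both raw formats into one structured shape: (user, amount)."""
    for line in json_text.splitlines():
        doc = json.loads(line)
        yield doc.get("user"), doc.get("amount")
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row["user"], row["amount"]

def cleanse(records):
    """Drop records with a missing user or amount and cast the rest to float."""
    for user, amount in records:
        if user and amount not in (None, ""):
            yield user, float(amount)

# Group: total amount per user, ready to be loaded into a warehouse table.
totals = defaultdict(float)
for user, amount in cleanse(to_records(raw_json_lines, raw_csv)):
    totals[user] += amount

warehouse_rows = sorted(totals.items())
print(warehouse_rows)
```

The raw inputs stay untouched in the lake, so a different cleansing rule or grouping can be applied later for another purpose.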

Data in Lake is available for all “Types of Users”

Data in a lake is available to data scientists, data engineers, and data analysts, whereas a data warehouse is used mainly by data analysts and business analysts. Note that analysts can also work with data in the lake, but the data needs to be curated before they use it.

Data Lakes come with “Huge Size & Low cost”

Data in the lake grows very fast because it comes from many sources such as IoT devices, websites, mobile apps, social media, logs, and data providers, and it is kept on lower-cost storage. A data warehouse is much smaller because it holds only filtered and aggregated data, and it typically uses higher-cost storage.

Data Lake stores “Real-time” data

Since data in a lake is kept in its original raw format, it can be stored in the lake immediately as it arrives, without delay. Data in a data warehouse, on the other hand, is delayed by a few hours or a day because it first needs to be extracted, transformed, and loaded into the warehouse, typically by nightly jobs.
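The difference in freshness can be illustrated with a small sketch. Here a temporary JSON Lines file stands in for the lake and a Python list stands in for a warehouse table; the function names (ingest_to_lake, read_lake, nightly_load) and the event fields are assumptions made up for this example.

```python
import json
import tempfile
from pathlib import Path

# Stand-ins for real storage: a temp file plays the lake, a list plays the warehouse.
lake_file = Path(tempfile.mkdtemp()) / "events.jsonl"
warehouse_table = []

def ingest_to_lake(event):
    """Append the raw event as-is; it is readable from the lake right away."""
    with lake_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def read_lake():
    """Return every raw event currently stored in the lake."""
    with lake_file.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def nightly_load():
    """Batch ETL job: read the lake, keep selected fields, refresh the warehouse."""
    warehouse_table[:] = [(e["device"], e["temp"]) for e in read_lake()]

ingest_to_lake({"device": "d1", "temp": 21.5, "raw": "sensor payload"})
nightly_load()  # the scheduled job runs once
ingest_to_lake({"device": "d2", "temp": 22.0, "raw": "sensor payload"})

lake_count = len(read_lake())
print(lake_count, len(warehouse_table))
```

The second event is already in the lake, but the warehouse will not see it until the next batch run.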

Data Lake has a “Future Value”

When a business needs to launch new products in the future, the data in the lake comes in handy because new value can be derived from it to fit those needs. A data warehouse generally cannot be reused this way because it was created for current market needs.

Data Lake has no “Schema”

Because data lakes support various formats, they usually do not have a specific schema; when a user wants data for a specific purpose, the user defines the schema at the time of read. A data warehouse, in contrast, has a predefined structure and schema in relational database tables, so whoever prepares the data needs to be aware of this schema and shape the data to fit it. Data that does not follow the schema can be rejected or lost.
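Here is a minimal sketch contrasting the two approaches, assuming an in-memory SQLite table as the warehouse and a list of JSON strings as the lake. The table name (readings), the record fields, and the read_with_schema helper are hypothetical and exist only for this illustration.

```python
import json
import sqlite3

# Schema-on-write: the table definition is fixed before any data is stored.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE readings (device TEXT NOT NULL, temp REAL NOT NULL)")

incoming = [
    {"device": "d1", "temp": 21.5},
    {"device": "d2", "humidity": 40},  # no "temp" field: does not fit the schema
]

rejected = []
for rec in incoming:
    try:
        warehouse.execute(
            "INSERT INTO readings VALUES (?, ?)", (rec.get("device"), rec.get("temp"))
        )
    except sqlite3.IntegrityError:
        # The NOT NULL constraint refuses the record that lacks "temp".
        rejected.append(rec)

stored = warehouse.execute("SELECT device, temp FROM readings").fetchall()

# Schema-on-read: the lake keeps every record as-is ...
lake_lines = [json.dumps(rec) for rec in incoming]

# ... and each reader chooses the fields it needs at read time.
def read_with_schema(lines, fields):
    return [tuple(json.loads(line).get(f) for f in fields) for line in lines]

humidity_view = read_with_schema(lake_lines, ("device", "humidity"))
print(stored, rejected, humidity_view)
```

The record without a temperature never makes it into the warehouse table, while the lake still holds it and can serve a reader that cares about humidity.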

Design & Technology

A data warehouse uses ETL tools to extract, transform, and finally load the data into high-cost relational databases. A data lake uses low-cost commodity hardware and stores the data in HDFS, AWS S3, or Azure Blob storage; when the data is needed for analytics, it is transformed and then used.

Naveen (NNK)
