In this Data Lake vs Data Warehouse article, I will explain what a Data Lake is and how it differs from a Data Warehouse.
A Data Lake is a centralized repository of structured, semi-structured, unstructured, and binary data that allows you to store a large amount of data as-is in its original raw format.
Once data is stored in a lake, it should not be changed, so a lake is treated as an immutable collection of data. The term was introduced by James Dixon, then chief technology officer at Pentaho.
Many of us think a Data Lake is similar to a Data Warehouse, but the two are quite different. The one similarity is that both store a large collection of data in a single repository.
Now, let’s see what a data warehouse offers. In simple words, a data warehouse stores transformed data: data extracted from OLTP databases, filtered and transformed using ETL tools, and organized for a specific purpose, typically reporting and analytics.
Note that the Data Lake is not replacing the Data Warehouse. Both are used in the industry, for different purposes. Let’s see the differences between a Data Lake and a Data Warehouse and when to use each.
Data Lake vs Data Warehouse
|Characteristics||Data lake||Data warehouse|
|Data Source of Truth||Data is in its original raw format||Data is transformed, not in its original format|
|Type of Data & Quality||Stores non-curated structured, semi-structured, unstructured, and binary data||Stores only curated, high-quality structured data|
|Purpose||No predefined purpose; a purpose can be derived later||Created for a specific purpose|
|Types of Users||Data scientists, data analysts, and data engineers||Data analysts and business analysts|
|Data Use||Predictive analytics, real-time analytics, data discovery||Reporting and analytics|
|Size & Cost||Uses lower-cost storage, stores all the data a business receives or produces, and can grow to petabytes||Uses higher-cost storage, so it stores only filtered and grouped data|
|Real time||Data is pushed to the lake immediately||Data is usually delayed by a few hours to a few days|
|Schema||Schema-on-read (no schema when writing; the schema is defined at read time)||Schema-on-write (the schema is defined before writing, and data must conform to it to be stored)|
|Value||Organizations can derive new business value from the data as needs change||Serves only the specific value defined when it was created|
|Design & Technology||HDFS, AWS S3, Azure Blob Storage||Relational databases (RDBMS)|
The above table gives you a brief idea of the differences between a Data Lake and a Data Warehouse. Now, let’s deep-dive into each difference.
Data Lake contains “Source of Truth” data
In a lake, data from various sources is stored as-is in its original format, which makes the lake a single “Source of Truth” for that data. In a data warehouse, the data loses its original form because it has been transformed, aggregated, and filtered using ETL tools. This is one of the major differences between a Data Lake and a Data Warehouse.
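As a minimal sketch of why the raw copy matters, the Python below keeps a few made-up order events untouched in a `lake` list and builds a `warehouse` summary by filtering and aggregating them. The event fields and the `build_warehouse_summary` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical raw events, stored in the "lake" exactly as received.
lake = [
    {"order_id": 1, "country": "US", "amount": 20.0, "status": "paid"},
    {"order_id": 2, "country": "US", "amount": 35.0, "status": "cancelled"},
    {"order_id": 3, "country": "DE", "amount": 50.0, "status": "paid"},
    {"order_id": 4, "country": "US", "amount": 10.0, "status": "paid"},
]


def build_warehouse_summary(events):
    """Filter and aggregate raw events into revenue per country."""
    revenue = defaultdict(float)
    for event in events:
        if event["status"] == "paid":  # filtering drops cancelled orders
            # aggregation drops the order-level detail
            revenue[event["country"]] += event["amount"]
    return dict(revenue)


warehouse = build_warehouse_summary(lake)
print(warehouse)  # {'US': 30.0, 'DE': 50.0}
print(len(lake))  # 4, the raw events are still intact
```

The summary can no longer answer a question such as "how many orders were cancelled?", while the raw events in the lake still can.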
Lake supports various “Types of Data”
A lake supports various types of non-curated data:
- Structured – Extracted data from RDBMS tables
- Semi-structured – CSV, XML, JSON
- Unstructured – Text, PDF, logs
- Binary – Image, Audio, Video files
A data warehouse, in contrast, contains structured data in rows and columns, and that data is curated for high quality.
Data in a Lake has no defined “Purpose”
Data in a lake has no predefined purpose, so it can be used to derive a new purpose as the data evolves and the business wants a new product. When that happens, the relevant data in the lake is converted from its various formats into structured form, cleansed, grouped, and finally loaded into a data warehouse, where business analysts work with it using analytical tools.
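The lake-to-warehouse path described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the raw JSON and CSV snippets, the `extract`, `transform`, and `load` helpers, and the `user_totals` table are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw lake content: the same kind of event in two formats.
raw_json_lines = '{"user": "ann", "amount": "12.5"}\n{"user": "bob", "amount": null}\n'
raw_csv = "user,amount\nann,7.5\ncid,3\n"


def extract(json_text, csv_text):
    """Read both raw formats into a common list of dicts."""
    rows = [json.loads(line) for line in json_text.splitlines() if line.strip()]
    rows.extend(csv.DictReader(io.StringIO(csv_text)))
    return rows


def transform(rows):
    """Cleanse (drop rows without an amount) and group totals per user."""
    totals = {}
    for row in rows:
        if row["amount"] in (None, ""):
            continue
        totals[row["user"]] = totals.get(row["user"], 0.0) + float(row["amount"])
    return sorted(totals.items())


def load(conn, totals):
    """Load the grouped result into a structured warehouse table."""
    conn.execute(
        "CREATE TABLE user_totals (user_name TEXT PRIMARY KEY, total REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO user_totals VALUES (?, ?)", totals)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(conn, transform(extract(raw_json_lines, raw_csv)))
result = conn.execute(
    "SELECT user_name, total FROM user_totals ORDER BY user_name"
).fetchall()
print(result)  # [('ann', 20.0), ('cid', 3.0)]
```

The raw strings are never modified, so a different product could later run a different `transform` over the same lake data.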
Data in a Lake is available to all “Types of Users”
Data in a lake is available to data scientists, data engineers, and data analysts, whereas a data warehouse is used mainly by data analysts and business analysts. Data analysts can also use a data lake, but the data needs to be curated before use.
Data Lakes come with “Huge Size & Low Cost”
Data in the lake grows very fast because it is collected from many sources, such as IoT devices, websites, mobile apps, social media, logs, and data providers, and it is kept on lower-cost storage. A data warehouse is much smaller because it holds only filtered and aggregated data, and it typically uses higher-cost storage.
Data Lake stores “Real-time” data
Because data in a lake stays in its original raw format, it can be stored as soon as it is received, without delay. Data in a data warehouse is typically delayed by a few hours to a few days because it first has to be extracted, transformed, and loaded, often by nightly jobs.
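The timing difference can be illustrated with a small simulation. It is a sketch, not a real ingestion pipeline: the `ingest` and `nightly_batch` functions, the event shape, and the dates are assumptions made for this example.

```python
from datetime import datetime, timedelta

lake = []       # receives every event on arrival
warehouse = []  # refreshed only when the batch job runs


def ingest(event):
    """Store the raw event in the lake as soon as it arrives."""
    lake.append(event)


def nightly_batch(cutoff):
    """Hypothetical nightly job: reload events received before the cutoff."""
    warehouse.clear()
    warehouse.extend(e for e in lake if e["received_at"] < cutoff)


midnight = datetime(2024, 1, 2, 0, 0)
ingest({"id": 1, "received_at": midnight - timedelta(hours=5)})
nightly_batch(cutoff=midnight)
ingest({"id": 2, "received_at": midnight + timedelta(hours=3)})

print([e["id"] for e in lake])       # [1, 2]
print([e["id"] for e in warehouse])  # [1]
```

Event 2 is already in the lake but will only reach the warehouse after the next batch run.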
Data Lake has a “Future Value”
When a business needs to launch new products in the future, the data in the lake comes in handy because new value can be derived from it as needs change. A data warehouse is harder to reuse this way, since it was built for the needs known at the time it was created.
Data Lake has no “Schema”
Because data lakes support various formats, they usually do not enforce a specific schema. When a user needs data for a specific purpose, they define the schema at read time (schema-on-read). A data warehouse, in contrast, has a predefined structure and schema in relational tables (schema-on-write), so whoever prepares the data needs to be aware of that schema and shape the data to fit it. Data that does not follow the schema can be rejected or lost.
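To make the two approaches concrete, here is a minimal sketch in Python. The sensor records, the `read_with_schema` helper, and the `readings` table are hypothetical, and an in-memory SQLite database stands in for a warehouse table with a fixed schema.

```python
import json
import sqlite3

# Schema-on-read: raw lines are stored untouched, and each reader
# picks its own fields and types when it reads them.
raw_lines = [
    '{"id": 1, "temp": "21.5", "unit": "C"}',
    '{"id": 2, "temp": "70.1", "unit": "F", "note": "sensor moved"}',
]


def read_with_schema(lines, schema):
    """Apply a reader-defined schema (field name -> converter) at read time."""
    return [
        {field: convert(record[field]) for field, convert in schema.items()}
        for record in map(json.loads, lines)
    ]


readings = read_with_schema(raw_lines, {"id": int, "temp": float})
print(readings)  # [{'id': 1, 'temp': 21.5}, {'id': 2, 'temp': 70.1}]

# Schema-on-write: the table structure exists before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, temp REAL NOT NULL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", (1, 21.5))

try:
    # A row that does not fit the schema is rejected at write time.
    conn.execute("INSERT INTO readings VALUES (?, ?)", (2, None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

A second reader could pass a different schema, for example `{"id": int, "unit": str}`, over the same raw lines without anything being rewritten, whereas the table only ever accepts rows that match its definition.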
Design & Technology
A data warehouse uses ETL tools to extract, transform, and finally load the data into higher-cost relational databases. A data lake uses low-cost commodity hardware or cloud storage and keeps the data in systems such as HDFS, AWS S3, or Azure Blob Storage; the data is transformed only when it is needed for analytics.
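As a rough illustration of how raw files might be laid out in lake storage, the sketch below writes objects under source and date prefixes. A temporary local directory stands in for HDFS, S3, or Blob Storage, and the `raw/<source>/<date>/<name>` key layout, `object_key`, and `put_object` are assumptions for this example rather than the API of any of those systems.

```python
import json
import tempfile
from pathlib import Path


def object_key(source, event_date, name):
    """Build a hypothetical partitioned key such as raw/web/2024-01-02/e1.json."""
    return f"raw/{source}/{event_date}/{name}"


def put_object(root, key, payload):
    """Write the payload as-is under the key (local stand-in for an object store)."""
    path = Path(root) / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return path


with tempfile.TemporaryDirectory() as root:
    put_object(root, object_key("web", "2024-01-02", "e1.json"), {"page": "/home"})
    put_object(root, object_key("iot", "2024-01-02", "e2.json"), {"temp": 21.5})
    stored = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*.json"))

print(stored)  # ['raw/iot/2024-01-02/e2.json', 'raw/web/2024-01-02/e1.json']
```

Grouping keys by source and date is one common way to let later jobs read only the slice of raw data they need.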