A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are designed to scale cost-effectively to terabytes and petabytes of data, making them suitable for handling massive and diverse datasets. The data typically comes from multiple diverse sources and can include structured data (like relational tables), semi-structured data (like JSON, XML, or logs), and unstructured data (like images, audio, or video).

A data lake helps you store everything in its original, untransformed state and defer transformation until the data is needed, a concept known as schema-on-read. This contrasts with a [data warehouse](../relational-data/data-warehousing.yml), which enforces structure and applies transformations as data is ingested, an approach known as schema-on-write.
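To make schema-on-read concrete, here's a minimal PySpark sketch. The `landing/events/` path, storage account, and column names are hypothetical; the point is that structure is applied only when the data is read, not when it's written:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The JSON files were landed as-is; no schema was enforced when they were written.
raw = spark.read.json("abfss://lake@contosolake.dfs.core.windows.net/landing/events/")

# Structure is imposed only now, at query time (schema-on-read).
raw.select("deviceId", "temperature", "eventTime").where("temperature > 40").show()
```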
Common data lake use cases include:

- **Data ingestion and movement**: Collect and consolidate data from cloud services, IoT devices, on-premises systems, and streaming sources into a single repository.
- **Big data processing**: Handle high-volume, high-velocity data at scale using distributed processing frameworks.
- **Analytics and machine learning**: Support exploratory analysis, advanced analytics, and AI model training and fine-tuning on large, diverse datasets.
- **Business intelligence and reporting**: Enable dashboards and reports by integrating curated subsets of lake data into warehouses or BI tools.
- **Data archiving and compliance**: Store historical or raw datasets for long-term retention, auditability, and regulatory needs.
## Advantages of a data lake

- **Retains raw data for future use**: A data lake is designed to retain data in its raw format, ensuring long-term availability for future use. This capability is particularly valuable in a big data environment, where the potential insights from the data might not be known in advance. Data can also be archived as needed without losing its raw state.
- **Self-service exploration**: Analysts and data scientists can query data directly, encouraging experimentation and discovery.
- **Flexible data support**: Unlike warehouses that require structured formats, lakes can natively handle structured, semi-structured, and unstructured data.
- **Scalable and performant**: In distributed architectures, data lakes enable parallel ingestion and distributed execution at scale, frequently outperforming traditional ETL pipelines in high-volume workloads (see the sketch after this list). The performance benefits stem from:
  - **Parallelism**: Distributed compute engines (such as Spark) partition data and execute transformations across multiple nodes concurrently, while traditional ETL frameworks often rely on sequential or limited multithreaded execution.
  - **Scalability**: Distributed systems scale horizontally by elastically adding compute and storage nodes, whereas traditional ETL pipelines typically depend on vertical scaling of a single host, which quickly hits resource limits.
- **Foundation for hybrid architectures**: Data lakes often coexist with warehouses in a lakehouse approach, combining raw storage with structured query performance.
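As a rough illustration of the parallelism point, the following PySpark sketch (hypothetical paths and column names) repartitions a dataset so the aggregation runs concurrently across executor cores, work that a single-host ETL tool would process serially:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parallel-transform-demo").getOrCreate()

# Hypothetical raw zone; each Parquet file maps to one or more input partitions.
sales = spark.read.parquet("abfss://lake@contosolake.dfs.core.windows.net/raw/sales/")

# Spark splits the data into partitions and aggregates them on many executor
# cores at once instead of working through the rows on a single machine.
daily_revenue = (
    sales.repartition(200, "order_date")  # spread the shuffle across the cluster
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet(
    "abfss://lake@contosolake.dfs.core.windows.net/curated/daily_revenue/"
)
```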
A modern data lake solution comprises two core elements:

- **Storage**: Built for durability, fault tolerance, infinite scalability, and high-throughput ingestion of diverse data types.
- **Processing**: Powered by engines such as Apache Spark in Azure Databricks or Microsoft Fabric, enabling large-scale transformations, analytics, and machine learning.

Additionally, mature solutions incorporate metadata management, security, and governance to ensure data quality, discoverability, and compliance.
## When you should use a data lake

We recommend using a data lake for exploratory analytics, advanced data science, and machine learning workloads. Because lakes retain data in its raw state and support schema-on-read, they allow teams to experiment with diverse data types and uncover insights that traditional warehouses may not capture.
### Data lake as a source for data warehouses

A data lake can act as the upstream source for a data warehouse. Raw data is ingested from source systems into the lake (extract and load), and modern warehouses such as the Fabric Warehouse then use built-in massively parallel processing (MPP) SQL engines to transform the raw data into a structured format, a pattern known as [extract, load, transform (ELT)](../relational-data/etl.yml#extract-load-and-transform-elt). This differs from traditional ETL pipelines, where data is extracted and transformed within the ETL engine before being loaded into the warehouse. Both approaches provide flexibility depending on the use case, balancing factors such as data quality, performance, and resource utilization while ensuring the warehouse is optimized for analytics.
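A minimal ELT sketch of this pattern, using Spark SQL as a stand-in for the warehouse's MPP engine; all paths, table names, and columns are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-demo").getOrCreate()

# Extract and Load: land the source extract in the lake unchanged.
spark.read.csv(
    "abfss://lake@contosolake.dfs.core.windows.net/landing/orders/", header=True
).write.mode("append").parquet(
    "abfss://lake@contosolake.dfs.core.windows.net/raw/orders/"
)

# Transform: the engine behind the warehouse reshapes the raw data with SQL
# after the load, rather than an upstream ETL tool doing it before the load.
spark.read.parquet(
    "abfss://lake@contosolake.dfs.core.windows.net/raw/orders/"
).createOrReplaceTempView("raw_orders")

curated = spark.sql("""
    SELECT customer_id,
           CAST(order_total AS DECIMAL(18, 2)) AS order_total,
           to_date(order_ts) AS order_date
    FROM raw_orders
    WHERE order_total IS NOT NULL
""")
curated.write.mode("overwrite").parquet(
    "abfss://lake@contosolake.dfs.core.windows.net/curated/orders/"
)
```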
### Event streaming and IoT scenarios

Data lakes are effective for event streaming and IoT use cases, where high-velocity data must be persisted at scale without upfront schema constraints. They can ingest and store both relational and nonrelational event streams, handle high volumes of small writes with low latency, and support massive parallel throughput. This makes them well suited for applications such as real-time monitoring, predictive maintenance, and anomaly detection.
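As a sketch of this write pattern, the following Structured Streaming job appends raw events to a lake path. The `rate` source stands in for a real feed (an actual pipeline would read from Event Hubs or Kafka through the matching Spark connector), and the paths are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iot-stream-demo").getOrCreate()

# Stand-in source: `rate` emits (timestamp, value) rows locally.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 100).load()
         .withColumn(
             "deviceId",
             F.concat(F.lit("device-"), (F.col("value") % 10).cast("string")),
         )
)

# Append the raw events to the lake as many small files, with no upfront
# schema design; checkpointing makes the stream restartable.
query = (
    events.writeStream.format("parquet")
          .option("path", "abfss://lake@contosolake.dfs.core.windows.net/raw/telemetry/")
          .option("checkpointLocation", "abfss://lake@contosolake.dfs.core.windows.net/checkpoints/telemetry/")
          .outputMode("append")
          .start()
)
query.awaitTermination()
```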
The following table compares data lakes and data warehouses.

| Capability | Data lake | Data warehouse |
|---|---|---|
| **Data transformation stage** | Transformation happens at query time, impacting overall processing time | Transformation happens during the ETL or ELT process |
| **Scalability** | Highly scalable and cost-effective for large volumes of diverse data | Scalable but more expensive, especially at large scale |
| **Cost** | Lower storage costs; compute costs vary based on usage | Higher storage and compute costs due to performance optimizations |
| **Use case fit** | Best for big data, machine learning, and exploratory analytics. In medallion architectures, the Gold layer serves reporting purposes | Ideal for business intelligence, reporting, and structured data analysis |
## Challenges of data lakes

- **Scalability and complexity**: Managing petabytes of raw, unstructured, and semi-structured data requires robust infrastructure, distributed processing, and careful cost management.
- **Processing bottlenecks**: As data volume and diversity increase, transformation and query workloads can introduce latency, requiring careful pipeline design and workload orchestration.
- **Data integrity risks**: Without strong validation and monitoring, errors or incomplete ingestions can compromise the reliability of the lake's contents.
- **Data quality and governance**: The variety of sources and formats makes it difficult to enforce consistent standards. Implementing metadata management, cataloging, and governance frameworks is critical.
- **Performance at scale**: Query performance and storage efficiency can degrade as the lake grows, requiring optimization strategies such as partitioning, indexing, and caching (see the sketch after this list).
- **Security and access control**: Ensuring appropriate permissions and auditing across diverse datasets to prevent misuse of sensitive data requires planning.
- **Discoverability**: Without proper cataloging, lakes can devolve into "data swamps" where valuable information is present but inaccessible or misunderstood.
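To make the partitioning strategy concrete, here's a minimal PySpark sketch with hypothetical paths and column names. Laying data out by date lets queries skip irrelevant files entirely:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

telemetry = spark.read.parquet(
    "abfss://lake@contosolake.dfs.core.windows.net/raw/telemetry/"
).withColumn("event_date", F.to_date("timestamp"))  # illustrative column names

# partitionBy lays files out as .../event_date=2024-01-15/..., so queries that
# filter on event_date read only the matching folders (partition pruning).
telemetry.write.partitionBy("event_date").mode("overwrite").parquet(
    "abfss://lake@contosolake.dfs.core.windows.net/curated/telemetry/"
)

# This scan touches a single date folder instead of the whole dataset.
one_day = spark.read.parquet(
    "abfss://lake@contosolake.dfs.core.windows.net/curated/telemetry/"
).where("event_date = DATE'2024-01-15'")
```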
## Technology choices

When you build a comprehensive data lake solution on Azure, consider the following technologies:

- [Azure Data Lake Storage](/azure/storage/blobs/data-lake-storage-introduction) combines Azure Blob Storage with data lake capabilities, which provides Apache Hadoop-compatible access, hierarchical namespace capabilities, and enhanced security for efficient big data analytics. It's designed to handle massive amounts of structured, semi-structured, and unstructured data.
- [Azure Databricks](/azure/databricks/introduction/) is a cloud-based data analytics and machine learning platform that combines the best of Apache Spark with deep integration into the Microsoft Azure ecosystem. It provides a collaborative environment where data engineers, data scientists, and analysts can work together to ingest, process, analyze, and model large volumes of data.
- [Azure Data Factory](/azure/data-factory/introduction) is Microsoft Azure's cloud-based data integration and extract, transform, load (ETL) service. You use it to move, transform, and orchestrate data workflows across different sources, whether in the cloud or on-premises.
- [Microsoft Fabric](/fabric/get-started/microsoft-fabric-overview) is Microsoft's end-to-end data analytics platform that unifies data movement, data science, real-time analytics, and business intelligence into a single software-as-a-service (SaaS) experience.

Each Microsoft Fabric tenant is automatically provisioned with a single logical data lake, known as OneLake. Built on Azure Data Lake Storage (ADLS) Gen2, OneLake provides a unified storage layer capable of handling both structured and unstructured data formats.
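For illustration, both stores are reachable from Spark through ABFS URIs. The account, workspace, and lakehouse names below are placeholders, and the OneLake endpoint shown assumes the `onelake.dfs.fabric.microsoft.com` pattern:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-paths-demo").getOrCreate()

# ADLS Gen2 path shape: abfss://<container>@<account>.dfs.core.windows.net/<path>
adls_df = spark.read.parquet(
    "abfss://data@contosolake.dfs.core.windows.net/curated/orders/"
)

# OneLake exposes the same ABFS driver behind a tenant-wide Fabric endpoint.
onelake_df = spark.read.parquet(
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/orders/"
)
```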
## Contributors
*This article is maintained by Microsoft. It was originally written by the following contributors.*