Commit 1c13412

Merge pull request #14616 from callard1/BigDataArchitecturesNew
[Canopy] New Big data architectures
2 parents 44c8ebc + 20b3dcc commit 1c13412

1 file changed

+23
-22
lines changed

docs/databases/guide/big-data-architectures.md

Lines changed: 23 additions & 22 deletions
@@ -3,7 +3,7 @@ title: Big Data Architectures
33
description: Learn how big data architectures manage the ingestion, processing, and analysis of data that's too large or complex for traditional database systems.
44
author: vibhareddyv
55
ms.author: vibhav
6-
ms.date: 03/07/2025
6+
ms.date: 09/12/2025
77
ms.topic: conceptual
88
ms.subservice: architecture-guide
99
ms.custom:
@@ -50,32 +50,19 @@ Most big data architectures include some or all of the following components:
5050

5151
- **Batch processing:** The datasets are large, so a big data solution often processes data files by using long-running batch jobs to filter, aggregate, and otherwise prepare data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files. You can use the following options:
5252

53-
- Run U-SQL jobs in Azure Data Lake Analytics.
54-
55-
- Use Hive, Pig, or custom MapReduce jobs in an Azure HDInsight Hadoop cluster.
56-
- Use Java, Scala, or Python programs in an HDInsight Spark cluster.
5753
- Use Python, Scala, or SQL language in Azure Databricks notebooks.
5854
- Use Python, Scala, or SQL language in Fabric notebooks.
5955

6056
- **Real-time message ingestion:** If the solution includes real-time sources, the architecture must capture and store real-time messages for stream processing. For example, you can have a simple data store that collects incoming messages for processing. However, many solutions need a message ingestion store to serve as a buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing semantics. This part of a streaming architecture is often referred to as *stream buffering*. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
6157

6258
- **Stream processing:** After the solution captures real-time messages, it must process them by filtering, aggregating, and preparing the data for analysis. The processed stream data is then written to an output sink.
6359

64-
- Azure Stream Analytics is a managed stream processing service that uses continuously running SQL queries that operate on unbounded streams.
65-
66-
- You can use open-source Apache streaming technologies, like Spark Streaming, in an HDInsight cluster or Azure Databricks.
60+
- You can use open-source Apache streaming technologies, like Spark Streaming, in Azure Databricks.
6761
- Azure Functions is a serverless compute service that can run event-driven code, which is ideal for lightweight stream processing tasks.
6862
- Fabric supports real-time data processing by using event streams and Spark processing.
6963

70-
- **Machine learning:** To analyze prepared data from batch or stream processing, you can use machine learning algorithms to build models that predict outcomes or classify data. These models can be trained on large datasets. You can use the resulting models to analyze new data and make predictions.
71-
72-
Use [Azure Machine Learning](/azure/machine-learning/overview-what-is-azure-machine-learning) to do these tasks. Machine Learning provides tools to build, train, and deploy models. Alternatively, you can use pre-built APIs from Azure AI services for common machine learning tasks, such as vision, speech, language, and decision-making tasks.
73-
7464
- **Analytical data store:** Many big data solutions prepare data for analysis and then serve the processed data in a structured format that analytical tools can query. The analytical data store that serves these queries can be a Kimball-style relational data warehouse. Most traditional business intelligence (BI) solutions use this type of data warehouse. Alternatively, you can present the data through a low-latency NoSQL technology, such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store.
7565

76-
- Azure Synapse Analytics is a managed service for large-scale, cloud-based data warehousing.
77-
78-
- HDInsight supports Interactive Hive, HBase, and Spark SQL. These tools can serve data for analysis.
7966
- Fabric provides various data stores, including SQL databases, data warehouses, lakehouses, and eventhouses. These tools can serve data for analysis.
8067
- Azure provides other analytical data stores, such as Azure Databricks, Azure Data Explorer, Azure SQL Database, and Azure Cosmos DB.
8168

@@ -93,7 +80,7 @@ Most big data architectures include some or all of the following components:
9380

9481
## Lambda architecture
9582

96-
When you work with large datasets, it can take a long time to run the type of queries that clients need. These queries can't be performed in real time. And they often require algorithms such as [MapReduce](https://en.wikipedia.org/wiki/MapReduce) that operate in parallel across the entire dataset. The query results are stored separately from the raw data and used for further querying.
83+
When you work with large datasets, it can take a long time to run the type of queries that clients need. These queries can't be performed in real time, and they often require distributed processing algorithms such as [MapReduce](https://en.wikipedia.org/wiki/MapReduce) that operate in parallel across the entire dataset. The query results are stored separately from the raw data and used for further querying.
9784

9885
One drawback to this approach is that it introduces latency. If processing takes a few hours, a query might return results that are several hours old. Ideally, you should get some results in real time, potentially with a loss of accuracy, and combine these results with the results from batch analytics.
9986

@@ -117,6 +104,10 @@ Eventually, the hot and cold paths converge at the analytics client application.
117104

118105
The raw data that's stored at the batch layer is immutable. Incoming data is appended to the existing data, and the previous data isn't overwritten. Changes to the value of a particular datum are stored as a new time-stamped event record. Time-stamped event records allow for recomputation at any point in time across the history of the data collected. The ability to recompute the batch view from the original raw data is important because it enables the creation of new views as the system evolves.
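
The append-only behavior described above can be illustrated with a minimal Python sketch. It is not part of the original article: the `Event` and `RawEventLog` names, the sensor keys, and the "latest value per key" batch view are all hypothetical, chosen only to show how time-stamped records allow a view to be recomputed as of any point in time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """An immutable, time-stamped change record (hypothetical shape)."""
    key: str
    value: float
    timestamp: datetime


class RawEventLog:
    """Append-only store: events are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def batch_view(self, as_of: Optional[datetime] = None) -> Dict[str, float]:
        """Recompute the latest value per key, optionally as of a past time."""
        view: Dict[str, float] = {}
        for event in sorted(self._events, key=lambda e: e.timestamp):
            if as_of is None or event.timestamp <= as_of:
                view[event.key] = event.value
        return view


log = RawEventLog()
log.append(Event("sensor-1", 20.0, datetime(2025, 1, 1, 9, 0)))
# A changed value is stored as a new record; the 9:00 record is kept.
log.append(Event("sensor-1", 22.5, datetime(2025, 1, 1, 10, 0)))
log.append(Event("sensor-2", 18.0, datetime(2025, 1, 1, 9, 30)))

latest = log.batch_view()
earlier = log.batch_view(as_of=datetime(2025, 1, 1, 9, 45))
```

Because no record is overwritten, `earlier` still reflects the 9:00 reading for `sensor-1`, and a different view (for example, an average per key) could be added later by writing another function over the same raw events.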
119106

107+
### Machine learning in Lambda architecture
108+
109+
Lambda architectures support machine learning workloads by providing both historical data for model training and real-time data for inference. The batch layer enables training on comprehensive historical datasets using [Azure Machine Learning](/azure/machine-learning/overview-what-is-azure-machine-learning) or Fabric Data Science workloads. The speed layer facilitates real-time model inference and scoring. This dual approach allows for models trained on complete historical data while providing immediate predictions on incoming data streams.
110+
120111
## Kappa architecture
121112

122113
A drawback to the Lambda architecture is its complexity. Processing logic appears in two different places, the cold and hot paths, via different frameworks. This process leads to duplicate computation logic and complex management of the architecture for both paths.
@@ -131,6 +122,10 @@ Similar to the Lambda architecture's batch layer, the event data is immutable an
131122

132123
If you need to recompute the entire dataset (equivalent to what the batch layer does in the Lambda architecture), you can replay the stream. This process typically uses parallelism to complete the computation in a timely fashion.
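
As a rough illustration of replaying a stream in parallel, the following Python sketch partitions a retained event list by key and runs the same processing function over each partition. The event shape, the `process_partition` and `replay` names, and the thread-pool approach are assumptions made for this example. A production system would typically replay from a durable log such as Kafka or Event Hubs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (key, amount); a hypothetical event shape


def process_partition(events: Iterable[Event]) -> Dict[str, int]:
    """The single processing path: total the amounts per key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in events:
        totals[key] += amount
    return dict(totals)


def replay(stream: List[Event], partitions: int = 2) -> Dict[str, int]:
    """Recompute the full result by replaying the stream in parallel."""
    buckets: List[List[Event]] = [[] for _ in range(partitions)]
    for event in stream:
        # All events for a key land in the same partition, in order.
        buckets[hash(event[0]) % partitions].append(event)

    merged: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        for partial in pool.map(process_partition, buckets):
            merged.update(partial)  # keys are disjoint across partitions
    return merged


stream = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
recomputed = replay(stream)
```

The replay reuses `process_partition` unchanged, which is the point of the Kappa approach: there is no separate batch implementation to keep in sync with the streaming logic.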
133124

125+
### Machine learning in Kappa architecture
126+
127+
Kappa architectures enable unified machine learning workflows by processing all data through a single streaming pipeline. This approach simplifies model deployment and maintenance since the same processing logic applies to both historical and real-time data. You can use Azure Machine Learning or Fabric Data Science workloads to build models that process streaming data, enabling continuous learning and real-time adaptation. The architecture supports online learning algorithms that update models incrementally as new data arrives.
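
To make the idea of incremental model updates concrete, here is a small, self-contained Python sketch of online learning. It doesn't use Azure Machine Learning or Fabric APIs. The `OnlineLinearModel` class, the simulated stream, and the learning rate are illustrative assumptions, and the technique shown is plain stochastic gradient descent on a one-feature linear model.

```python
from typing import Iterable, Tuple


class OnlineLinearModel:
    """One-feature linear model updated one event at a time with SGD."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.weight * x + self.bias

    def update(self, x: float, y: float) -> None:
        """Apply a single gradient step for squared error on (x, y)."""
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def simulated_stream(n: int) -> Iterable[Tuple[float, float]]:
    """Hypothetical event stream that follows y = 2x + 1 without noise."""
    for i in range(n):
        x = (i % 10) / 10
        yield x, 2.0 * x + 1.0


model = OnlineLinearModel()
for x, y in simulated_stream(5000):
    model.update(x, y)  # the model adapts as each event arrives
```

Each event is seen once and then discarded, so the model never needs the full dataset in memory. Replaying the stream from the beginning retrains the model with the same code path.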
128+
134129
## Lakehouse architecture
135130

136131
A data lake is a centralized data repository that stores structured data (database tables), semi-structured data (XML files), and unstructured data (images and audio files). This data is in its raw, original format and doesn't require predefined schema. A data lake can handle large volumes of data, so it's suitable for big data processing and analytics. Data lakes use low-cost storage solutions, which provide a cost-effective way to store large amounts of data.
@@ -144,10 +139,12 @@ The **Lakehouse architecture** combines the best elements of data lakes and data
144139
Common use cases for a lakehouse architecture include:
145140

146141
- **Unified analytics:** Ideal for organizations that need a single platform for both historical and real-time data analysis
147-
148-
- **Machine learning:** Supports advanced analytics and machine learning workloads by integrating data management capabilities
149142
- **Data governance:** Ensures compliance and data quality across large datasets
150143

144+
### Machine learning in Lakehouse architecture
145+
146+
Lakehouse architectures excel at supporting end-to-end machine learning workflows by providing unified access to both structured and unstructured data. Data scientists can use Fabric Data Science workloads to access raw data for exploratory analysis, feature engineering, and model training without complex data movement. The architecture supports the complete machine learning lifecycle, from data preparation and model development using Azure Machine Learning or Fabric notebooks, to model deployment and monitoring. The unified storage layer enables efficient collaboration between data engineers and data scientists while maintaining data lineage and governance.
147+
151148
## IoT
152149

153150
The IoT represents any device that connects to the internet and sends or receives data. IoT devices include PCs, mobile phones, smart watches, smart thermostats, smart refrigerators, connected automobiles, and heart monitoring implants.
@@ -174,7 +171,7 @@ Common types of processing include:
174171

175172
- Handling special types of nontelemetry messages from devices, such as notifications and alarms.
176173

177-
- Machine learning.
174+
- Machine learning for predictive maintenance, anomaly detection, and intelligent decision-making.
178175

179176
In the previous diagram, the gray boxes are components of an IoT system that aren't directly related to event streaming. They're included in the diagram for completeness.
180177

@@ -184,14 +181,18 @@ In the previous diagram, the gray boxes are components of an IoT system that are
184181

185182
- Some IoT solutions allow **command and control messages** to be sent to devices.
186183

184+
### Machine learning in IoT architecture
185+
186+
IoT architectures use machine learning for intelligent edge computing and cloud-based analytics. Edge devices can run lightweight models for real-time decision-making, while comprehensive models process aggregated data in the cloud using Azure Machine Learning or Fabric Data Science workloads. Common applications include predictive maintenance, anomaly detection, and automated response systems. The architecture supports both streaming analytics for immediate insights and batch processing for model training and refinement using historical IoT data.
187+
187188
## Next steps
188189

189190
- [IoT Hub](/azure/iot-hub/)
190-
- [Event Hubs](/azure/event-hubs/)
191-
- [Stream Analytics](/azure/stream-analytics/stream-analytics-introduction)
192191
- [Azure Data Explorer](/azure/data-explorer/)
193-
- [Fabric](/fabric/)
192+
- [Microsoft Fabric decision guide: Choose a data store](/fabric/fundamentals/decision-guide-data-store)
194193
- [Azure Databricks](/azure/databricks/)
194+
- [Azure Machine Learning](/azure/machine-learning/)
195+
- [Fabric Data Science](/fabric/data-science/)
195196

196197
## Related resources
197198
