Big Data Processing Platforms
Chapter 6
Contents
• Parallel Data Processing
• Distributed Data Processing
• Speed Consistency Volume (SCV)
Parallel Data Processing
• Parallel data processing involves the simultaneous execution of
multiple sub-tasks that collectively comprise a larger task.
• The goal is to reduce the execution time by dividing a single larger
task into multiple smaller tasks that run concurrently.
• Although parallel data processing can be achieved through multiple
networked machines, it is more typically achieved within the confines
of a single machine with multiple processors or cores.
• Parallel processing is mostly based on a shared-everything
architecture.
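The bullets above can be sketched in a few lines of Python. This is a minimal, hypothetical example (the task, summing chunks of numbers, and the worker count are assumptions, not part of the slides): a single larger task is divided into sub-tasks that run concurrently on the cores of one machine via `multiprocessing.Pool`.

```python
from multiprocessing import Pool

def subtask(chunk):
    # One sub-task of the larger task: process its own portion of the data.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Divide the single larger task into multiple smaller tasks.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        partials = pool.map(subtask, chunks)  # sub-tasks run concurrently
    return sum(partials)  # combine the partial results

if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))  # 499500
```

All workers run on the same machine and could in principle share memory, which is what makes this the shared-everything case rather than the distributed one.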
Parallel Data Processing cont..
Distributed Data Processing
• Distributed data processing is closely related to parallel data
processing in that the same principle of “divide-and-conquer” is
applied.
• However, distributed data processing is always achieved through
physically separate machines that are networked together as a cluster.
• Distributed systems are based on a shared-nothing architecture: the
nodes share nothing except the network switch.
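The shared-nothing principle can be sketched as follows. This is a simulation, not a real cluster: each `Node` object stands in for a physically separate machine, owns its own data partition, and exchanges only final results over the "network" (the node count and the summing job are hypothetical).

```python
class Node:
    """A simulated cluster node: owns its partition, shares no memory."""
    def __init__(self, partition):
        self.partition = list(partition)  # node-local data only

    def process(self):
        # Local computation; no access to any other node's data.
        return sum(self.partition)

def distribute(data, n_nodes=3):
    # Divide step: partition the dataset across the nodes.
    return [Node(data[i::n_nodes]) for i in range(n_nodes)]

def run_job(data, n_nodes=3):
    nodes = distribute(data, n_nodes)
    # Conquer step: only small results cross the network.
    return sum(node.process() for node in nodes)

print(run_job(range(100)))  # 4950
```

The same divide-and-conquer shape underlies real frameworks; the difference from the parallel case is that here nothing but the final per-node results is ever shared.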
Processing Workloads
• A processing workload in Big Data is defined as the
amount and nature of data that is processed within a
certain amount of time. Workloads are usually divided
into two types:
• Batch
• Transactional
Batch
• Batch processing, also known as offline processing, involves
processing data in batches and usually imposes delays, which in
turn result in high-latency responses.
• Batch workloads typically involve large quantities of data with
sequential reads/writes and comprise groups of read or write
queries.
• Queries can be complex and involve multiple joins. Strategic BI
and analytics are batch-oriented as they are highly read-intensive.
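A minimal sketch of a batch workload, with hypothetical data (the sales records, batch size, and the per-batch total are illustrative assumptions): records are read in large sequential groups, and no result is available until a whole batch has been processed, which is where the high latency comes from.

```python
def batches(records, batch_size):
    # Group records into batches for sequential, offline processing.
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def batch_report(records, batch_size=1000):
    # High latency: each total appears only after its whole batch is read.
    totals = []
    for batch in batches(records, batch_size):
        totals.append(sum(r["amount"] for r in batch))
    return totals

sales = [{"amount": 10} for _ in range(2500)]
print(batch_report(sales))  # [10000, 10000, 5000]
```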
Transactional
• Transactional processing is also known as online processing.
• Transactional workload processing follows an approach whereby
data is processed interactively without delay, resulting in
low-latency responses.
• Transactional workloads involve small amounts of data with reads
and writes.
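By contrast with the batch sketch, a transactional workload touches only a few records per request and responds immediately. A minimal illustration (the in-memory account store and the transfer operation are hypothetical):

```python
accounts = {"A": 100, "B": 50}  # small dataset, accessed interactively

def transfer(src, dst, amount):
    # One online transaction: a few reads and writes, low latency.
    if accounts[src] < amount:
        raise ValueError("insufficient funds")
    accounts[src] -= amount
    accounts[dst] += amount
    return accounts[src], accounts[dst]  # response returned without delay

print(transfer("A", "B", 30))  # (70, 80)
```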
Processing in Realtime Mode
• Realtime mode addresses the velocity characteristic of Big Data
datasets.
• Within Big Data processing, realtime processing is also called
event or stream processing, as the data either arrives
continuously (stream) or at intervals (event).
• Another related term, interactive mode, falls within the category
of realtime. Interactive mode generally refers to query
processing in realtime. Operational BI/analytics generally fall
into this category.
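Realtime processing can be sketched with a Python generator, assuming a hypothetical stream of sensor readings and a running-average statistic: each event is handled the moment it arrives, and a result is available after every single event rather than after a whole batch.

```python
def stream_processor(events):
    # Process each event immediately on arrival (continuous stream).
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count  # running statistic, available in realtime

readings = [4.0, 6.0, 8.0]
print(list(stream_processor(readings)))  # [4.0, 5.0, 6.0]
```

Because the generator yields after every event, a consumer (such as a dashboard) sees low-latency updates, addressing the velocity characteristic described above.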
Speed Consistency Volume (SCV)
• Speed – Speed refers to how quickly the data can be
processed once it is generated. In the case of realtime
analytics, data is processed comparatively faster than
batch analytics. This generally excludes the time taken to
capture data and focuses only on the actual data
processing, such as generating statistics or executing an
algorithm.
• Consistency – Consistency refers to the accuracy and the
precision of the results. Results are deemed accurate if they
are close to the correct value and precise if close to each
other. A more consistent system will make use of all
available data, resulting in high accuracy and precision as
compared to a less consistent system that makes use of
sampling techniques, which can result in lower accuracy.
Speed Consistency Volume (SCV)
• Volume – Volume refers to the amount
of data that can be processed. Big
Data’s velocity characteristic results in
fast growing datasets leading to huge
volumes of data that need to be
processed in a distributed manner.
Processing such voluminous data in its
entirety while ensuring speed and
consistency is not possible.
Speed Consistency Volume (SCV)
• If speed (S) and consistency (C) are required, it is not possible to process
high volumes of data (V) because large amounts of data slow down data
processing.
• If consistency (C) and processing of high volumes of data (V) are required,
it is not possible to process the data at high speed (S), as achieving
high-speed data processing requires smaller data volumes.
• If high-volume (V) data processing coupled with high-speed (S) data
processing is required, the processed results will not be consistent (C),
since high-speed processing of large amounts of data involves sampling
the data, which may reduce consistency.
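The speed-versus-consistency trade-off can be made concrete with a small sketch (the dataset, the 10% sampling rate, and the fixed seed are hypothetical): computing a mean over a sample touches far less data than the exact computation, at the cost of an approximate result.

```python
import random

def mean_exact(data):
    # Full volume (V): consistent result, but every record is processed.
    return sum(data) / len(data)

def mean_sampled(data, rate=0.1, seed=42):
    # Sampling for speed (S): less data processed, approximate result (C).
    rng = random.Random(seed)
    sample = [x for x in data if rng.random() < rate]
    return sum(sample) / len(sample)

data = list(range(100_000))
print(mean_exact(data))    # 49999.5
print(mean_sampled(data))  # close to, but generally not exactly, 49999.5
```

The sampled estimate is usually near the true mean but not identical to it, which is exactly the consistency that is given up when speed and volume are prioritised.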
Realtime Big Data Processing
Event Stream Processing
• During ESP, an incoming stream of events, generally from a single
source and ordered by time, is continuously analyzed.
• Other (memory resident) data sources can also be incorporated
into the analysis for performing richer analytics.
• The processing results can be fed to a dashboard or can act as a
trigger for another application to perform a preconfigured action or
further analysis.
• ESP focuses more on speed than complexity.
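A minimal ESP sketch, assuming a single time-ordered stream of temperature events and a hypothetical threshold and window size: the stream is analyzed continuously, only a small sliding window is kept in memory, and a result that crosses the threshold triggers an action (here, recording an alert that could feed a dashboard or another application).

```python
def esp(events, threshold=30.0, window=3):
    # Continuously analyze a single, time-ordered event stream.
    window_buf, alerts = [], []
    for t, value in events:  # events arrive ordered by time
        window_buf.append(value)
        if len(window_buf) > window:
            window_buf.pop(0)  # keep only the recent sliding window
        avg = sum(window_buf) / len(window_buf)
        if avg > threshold:
            alerts.append((t, avg))  # trigger a preconfigured action
    return alerts

stream = [(1, 28.0), (2, 31.0), (3, 35.0), (4, 36.0)]
print(esp(stream))  # alerts at t=3 and t=4
```

The analysis per event is deliberately simple (one running average), reflecting ESP's emphasis on speed over analytical complexity.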
Complex Event Processing
• During CEP, a number of realtime events, often coming from
disparate sources and arriving at different time intervals, are
analyzed simultaneously for the detection of patterns and
initiation of action.
• CEP focuses more on complexity, providing rich analytics.
However, as a result, speed of execution may be adversely
affected.
• In general, CEP is a superset of ESP, and often the output of ESP
results in the generation of synthetic events that can be fed into
CEP.
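A minimal CEP sketch under stated assumptions: the event sources (an authentication log and a banking feed), the event fields, and the detected pattern (a failed login followed by a large withdrawal on the same account within a time window) are all hypothetical. Unlike the ESP sketch, events from disparate sources are merged and correlated, trading some speed for a richer, multi-event pattern.

```python
def cep(streams, max_gap=60):
    # Merge events from disparate sources into a single time order.
    events = sorted((e for s in streams for e in s), key=lambda e: e["t"])
    pending, alerts = {}, []
    for e in events:
        if e["type"] == "login_failed":
            pending[e["account"]] = e["t"]  # remember the first half
        elif e["type"] == "withdrawal" and e["amount"] >= 1000:
            t0 = pending.get(e["account"])
            # Pattern: failed login then large withdrawal within max_gap.
            if t0 is not None and e["t"] - t0 <= max_gap:
                alerts.append(e["account"])  # initiate an action
    return alerts

auth = [{"t": 10, "type": "login_failed", "account": "A"}]
bank = [{"t": 40, "type": "withdrawal", "account": "A", "amount": 5000}]
print(cep([auth, bank]))  # ['A']
```

Correlating across sources and time is what makes the analysis "complex"; a synthetic event produced by an ESP stage (for example, a high-average alert) could equally well be one of the input streams here.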