
CA3618 - INTRODUCTION TO BIG DATA

Unit I: Introduction to Big data

 Big Data refers to extremely large datasets that are complex and grow rapidly, making them
difficult to store, manage, and analyze using traditional data processing tools. These datasets
are characterized by their size, variety, and speed.

 Big data is not different in kind from ordinary data; the difference is its scale, which is what makes traditional tools inadequate.

 Big data is a term that describes the massive volumes of structured and unstructured data that an organization encounters on a daily basis.

 Big data analytics is the application of advanced analytic techniques to very large, heterogeneous
data sets, which can contain structured, semi-structured, and unstructured data from many
different sources, with sizes ranging from terabytes to zettabytes.

Examples of Big Data

1. Social Media (Facebook)

 Data Generated: 500+ Terabytes per day


Facebook generates enormous volumes of data daily. This data is primarily generated through:

o Photo and Video Uploads: Billions of photos and videos are uploaded every day by
users from across the world.
o Messages: Billions of private messages are exchanged between users.
o Comments and Posts: The platform's activity includes status updates, comments, likes,
and shares, contributing massive amounts of data.

 Big Data Challenge: Managing and processing this data for various purposes such as user
engagement, advertisement targeting, content recommendations, and user behavior analysis.

2. Aviation Industry (Jet Engine Data)

 Data Generated: 10+ Gigabytes per 30 minutes of flight


Modern jet engines are equipped with sensors that generate data at an incredibly high rate
during each flight. This data includes information on:

o Engine Performance: Data on temperature, pressure, speed, fuel consumption, and


vibration.
o Maintenance Metrics: Information that can be used to predict maintenance needs
and avoid malfunctions.
o Flight Path Data: Location, altitude, and airspeed metrics.

 Big Data Challenge: Managing, storing, and analyzing this vast data in real-time, which
helps improve flight safety, optimize fuel consumption, and predict maintenance schedules.

3. Stock Market (New York Stock Exchange)

 Data Generated: Around 1 Terabyte per day


The NYSE generates massive amounts of data each day, which includes:
o Stock Trading Data: Transaction records, including price, volume, and time for each
stock trade.
o Market Data: Information such as market fluctuations, stock price changes, and

trend analysis.
 Big Data Challenge: Analyzing this data in real-time to make quick trading decisions,
predict stock trends, and improve financial modeling.

Characteristics of Big Data

Volume

The term "volume" refers to the sheer amount of data being generated and stored. The magnitude of
the data plays a critical role in determining its value: only when the amount of data is extremely
vast is it referred to as 'Big Data.'

Velocity

It refers to the speed at which data is generated and needs to be processed. Big Data is not only
about collecting data but also about analyzing it in real time or near-real time to make timely
decisions.

Variety

It refers to the different types of data Big Data encompasses. Big Data sources include a wide
variety of data types such as:

 Structured data is data that has been organized into a predefined format, typically with a
specified length and schema, so it fits neatly into the rows and columns of a relational database.

 Semi-structured data is partially organized: it does not follow the traditional relational
structure, but tags or markers give it some shape. Log files are a typical example of this sort of data.

 Unstructured data has no predefined organization. It usually refers to data that
doesn't fit cleanly into a relational database's standard row and column structure. Texts,
pictures, videos etc. are examples of unstructured data which can't be stored in the form
of rows and columns. Examples of all three types appear in the sketch below.
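
To make the three types concrete, here is a small Python sketch with one invented value of each kind; the field names, the log line, and the review text are illustrative only.

```python
# Illustrative examples of the three data types (all values are made up).

# Structured: fixed schema, fits a relational table's rows and columns
order_row = {"order_id": 1001, "customer": "Asha", "amount": 499.0}

# Semi-structured: no rigid schema, but tags/markers give it some structure (e.g., a log line)
log_line = "2025-01-15 10:32:07 INFO user=asha action=checkout items=3"

# Unstructured: free text, images, video -- no rows and columns at all
review_text = "Loved the bed cover, the fabric feels premium and the colour is great."

print(order_row["amount"], log_line.split()[2], len(review_text))
```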

Applications of BIG DATA

Big Data plays a crucial role across various industries by helping organizations harness large
volumes of data for strategic decision-making, operational efficiency, and creating better customer
experiences.

Below are detailed examples of how Big Data is utilized across different sectors:

1. Tracking Customer Spending Habits and Shopping Behavior

 Retail Sector: Large retail stores like Amazon, Walmart, and Big Bazaar collect extensive
data on customer spending habits, shopping frequency, preferred brands, and frequently
purchased products. By analyzing this data, companies can:

 Identify the most popular products, ensuring they are stocked up.
 Predict future demand for specific products based on trends.
 Adjust production rates accordingly.
 Banks can also leverage this data to offer personalized promotions to customers based
on their purchasing behavior. For example, offering discounts or cashback when
customers use bank cards for products they frequently buy.

2. Recommendation Systems

 E-commerce: E-commerce platforms like Amazon, Flipkart, and Walmart track customers'
previous searches and purchases to recommend related products.
 For example, if a customer buys a bed cover, the platform might suggest other home
decor items. Additionally, when the customer searches for a product like a "bed
cover," targeted advertisements for similar products appear on various sites (Google,
YouTube, etc.).
 YouTube: YouTube uses Big Data to recommend videos based on previous viewing history,
which helps users discover new content.
 For instance, if someone watches tutorials on Big Data, they may be shown ads or
recommendations for Big Data courses or related content.

3. Smart Traffic System

 Urban Planning and Transportation: Big Data is crucial for developing smart traffic
systems. Through data gathered from traffic cameras, GPS-enabled vehicles (e.g., Uber, Ola),
and road sensors, traffic management systems can:

 Identify congestion areas in real-time.


 Recommend alternative routes to reduce traffic and travel time.

 Minimize fuel consumption by suggesting less congested roads.

4. Secure Air Traffic Systems

 Aviation: Aircraft sensors generate vast amounts of data about the plane's performance,
including speed, moisture, and environmental conditions. This data helps:

 Monitor the aircraft's status in real-time, ensuring smooth operation.


 Predict the lifespan of components (e.g., engines) and identify when repairs or
replacements are needed.
 Optimize flight routes and fuel efficiency by analyzing weather and other external
conditions.

5. Auto-Driving Cars

 Autonomous Vehicles: Self-driving cars rely on Big Data collected from multiple sensors,
cameras, and other devices. This data helps the car:

 Identify obstacles, nearby vehicles, and traffic signals.


 Calculate speed, stopping distance, and other parameters for safe driving.
 Make decisions in real-time, such as when to slow down, turn, or stop.

6. Virtual Personal Assistant Tools

 Personalized Assistance: Tools like Siri (Apple), Cortana (Microsoft), and Google Assistant
use Big Data to respond to user queries. These virtual assistants analyze data like:

 The user's location, time of day, weather, and personal preferences.


 Based on this data, they can answer questions such as "Do I need to carry an
umbrella today?" by checking the weather data for the user's location.

7. Internet of Things (IoT)

 Manufacturing and Healthcare: IoT devices generate massive amounts of data that can be
analyzed for operational efficiency and predictive maintenance.

 Manufacturing: Sensors on machines monitor their operational status, and Big


Data analytics predict when machines need repairs, preventing costly
breakdowns.
 Healthcare: IoT-enabled devices track patients' health metrics (e.g., heart rate,
blood pressure) and trigger alerts when a parameter goes beyond a safe range.
This enables doctors to intervene remotely and provide timely care.

8. Education Sector

 Targeted Learning: Online educational platforms analyze user data, such as search queries
or videos watched on specific subjects, to target prospective students with advertisements or
course recommendations.
 For instance, if someone watches tutorials on Python, they might see ads for coding boot
camps or online courses in Python.

9. Energy Sector

 Smart Energy Management: Smart meters installed in households and manufacturing units
collect data on electricity usage in real-time. This data is then analyzed to:

 Determine peak and off-peak times for electricity consumption.


 Recommend optimal times for heavy machinery to run to reduce electricity costs.
 Help consumers manage their electricity usage more efficiently and lower their bills.

10. Media and Entertainment Sector

 Consumer Preferences: Streaming services like Netflix, Spotify, and Amazon Prime collect
data on user behavior, such as:

 What types of videos, movies, or music users engage with most.


 How long they spend on the platform and their content preferences.
 Based on this data, these platforms recommend content, optimize user interfaces, and
even plan future content strategies.
 They can also target advertisements based on user preferences, increasing the
relevance of the ads shown.

Benefits of Big Data Processing

1. Better Decision-Making

 Big Data analytics provides valuable insights, enabling data-driven decisions.


 Example: E-commerce platforms recommend products based on user behavior.

2. Enhanced Customer Experience

 Personalized services can be offered by analyzing customer preferences.


 Example: Netflix suggests shows/movies based on viewing history.

3. Fraud Detection and Risk Management

 Analyzing large datasets helps detect fraudulent activities in real time.


 Example: Banks identify unusual transaction patterns to prevent fraud.

4. Healthcare Advancements

 Improves diagnosis, treatment, and patient care by analyzing large medical datasets.
 Example: Big Data in genomics helps predict and prevent diseases.

5. Innovation and Product Development

 Helps companies innovate faster by understanding market trends and customer needs.
 Example: Automobiles with smart features driven by data insights (like Tesla).

6. Improved Security

 Real-time monitoring and analysis of data enhance cyber security.


 Example: Identifying and neutralizing network threats instantly.

Tools Used in Big Data

There are a number of tools used in Big Data. The most popular tools are:

1. Apache Hadoop
 An open-source framework that stores large data sets and distributes their processing
across clusters of computers.
 It is one of the most powerful big data technologies, able to scale from a single
server to thousands of machines.
 It brings flexibility to data processing and enables faster processing of large data sets.

2. Apache STORM
 Storm is a free, open-source big data computation system.
 It is one of the best big data tools, offering a distributed, real-time, fault-tolerant
processing system.
 It performs parallel calculations that run across a cluster of machines.

3. Qubole
 Qubole is an autonomous big data management platform.
 It is self-managing and self-optimizing, allowing the data team to focus on business
outcomes.

4. Apache Cassandra
 The Apache Cassandra database is widely used to manage large amounts of data
effectively.
 It is one of the best big data tools for applications that cannot afford to lose data,
even when an entire data center is down.
 Support contracts and services for Cassandra are available from third parties.

5. Pentaho
 Pentaho provides big data tools to extract, prepare, and blend data.
 It offers visualizations and analytics that change the way a business is run.
 This big data tool turns big data into big insights.

6. Apache Flink
 Apache Flink is one of the best open-source data analytics tools for stream processing
of big data.
 It supports distributed, high-performing, always-available, and accurate data streaming
applications.
 It provides results that are accurate even for out-of-order or late-arriving data.

7. OpenRefine
 OpenRefine is a powerful big data tool.
 It is a big data analytics software that helps to work with messy data, cleaning it and
transforming it from one format into another.
 It also allows extending it with web services and external data.
8. RapidMiner
 RapidMiner is one of the best open-source data analytics tools.
 It is used for data prep, machine learning, and model deployment.
 It offers a suite of products to build new data mining processes and set up predictive
analysis.
9. Kaggle
 Kaggle is the world's largest big data community.
 It helps organizations and researchers to post their data & statistics. It is the best place
to analyze data seamlessly.
10. Apache Hive
 Hive is an open-source big data software tool.
 It allows programmers to analyze large data sets stored in Hadoop.
 It helps with querying and managing large data sets very quickly.

Five Challenges in Big Data

1. Lack of Proper Understanding

Problem:

 Many employees lack a clear understanding of data storage, processing, and its
importance.
 Only data professionals are aware of the technical aspects, while others in the
organization may remain uninformed.

Example:

If employees don’t understand the importance of maintaining a backup, crucial customer data
might be lost in case of a server failure. This can disrupt business operations and harm
customer trust.

Solution:

 Workshops and Seminars: Conduct Big Data training sessions for all employees
involved in data handling.
 Basic Training Programs: Ensure every employee understands key data concepts
and storage protocols.
 Company-wide Awareness: Foster a data-driven culture across all levels of the
organization.

2. Data Growth Issues

Problem:

 The volume of data is increasing exponentially, with most of it being unstructured


(e.g., text files, videos, images, social media content).
 Traditional storage systems cannot handle this massive data efficiently.

Example:

An e-commerce company stores customer data, purchase history, product images, and user
reviews. As its customer base grows, managing these datasets becomes increasingly
challenging, leading to slower processing times and higher costs.

Solution:

1. Compression: Reduce data size to save storage space.

2. Deduplication: Remove duplicate data to avoid redundancy (see the short sketch after this list).
3. Data Tiering: Classify and store data based on its importance in appropriate storage
tiers (public cloud, private cloud, flash storage).
4. Big Data Tools: Use advanced tools like Hadoop, NoSQL databases, and Apache
Spark to manage and process large data sets efficiently.
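
As a rough illustration of the deduplication idea in item 2, the sketch below fingerprints each record with a content hash and keeps only the first copy; the record contents are invented for the example.

```python
# A minimal sketch of content-hash deduplication (records are made up).
import hashlib

records = [
    b"customer=shreya;item=bed cover;qty=1",
    b"customer=rahul;item=lamp;qty=2",
    b"customer=shreya;item=bed cover;qty=1",   # exact duplicate of the first record
]

unique = {}
for record in records:
    digest = hashlib.sha256(record).hexdigest()   # fingerprint of the record's content
    if digest not in unique:
        unique[digest] = record                   # store only the first occurrence

print(f"{len(records)} records in, {len(unique)} unique records stored")
```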

3. Confusion While Big Data Tool Selection

Problem:

 Companies struggle to choose the right tools for data storage and analysis due to the
wide variety of options available.
 Common dilemmas include:
o Hadoop vs. Apache Spark for data processing
o HBase vs. Cassandra for data storage

Example:

A company may select Hadoop for real-time analytics, but it later realizes that Apache
Spark would have been a better choice for faster processing and machine learning tasks.
This mistake leads to delays and additional expenses.

Solution:

1. Hire Experienced Professionals: Bring in data experts with experience in selecting


and implementing Big Data tools.
2. Seek Big Data Consulting: Consultants can assess the company’s needs and
recommend the best tools and strategies.
3. Evaluate Use Cases: Understand the company's specific data requirements and
match them with appropriate tools.

4. Lack of Data Professionals

Problem:

 There is a shortage of skilled professionals such as data scientists, data analysts, and
data engineers who can handle Big Data tools and extract meaningful insights.
 The rapid evolution of data handling technologies has outpaced the development of
an adequately trained workforce.

Example:

A healthcare company implementing a Big Data project to analyze patient records may
struggle to find experienced professionals who can use tools like Hadoop, Spark, and Python
for predictive analytics. This results in project delays and incomplete analysis.

Solution:

1. Invest in Recruitment and Training: Hire experienced professionals and provide


regular training for existing employees to upskill them.
2. AI/ML-Powered Tools: Use tools with built-in artificial intelligence (AI) and
machine learning (ML) features that require minimal expertise.
3. Collaborate with Universities: Partner with educational institutions to develop
training programs and create a pipeline of skilled data professionals.

5. Securing Data

Problem:

 As data volumes increase, securing large datasets becomes a major challenge.


 Companies are so focused on storing, processing, and analyzing data that they may
neglect data security.
 Unprotected data repositories are vulnerable to breaches, hacking, and theft.

Example:

A retail company storing customer payment details in its data center experiences a data
breach. Sensitive information is stolen, leading to financial loss of over $3.7 million and
damage to the company’s reputation.

Solution:

1. Data Encryption: Encrypt sensitive data to protect it from unauthorized access during
storage and transmission (a minimal sketch follows this list).
2. Identity and Access Control: Ensure only authorized personnel can access critical
data.
3. Endpoint Security: Protect devices used to access and store data from malware and
other threats.
4. Real-time Security Monitoring: Continuously monitor data for any unusual activity
or potential breaches.
5. Data Segregation: Store sensitive data in isolated environments, ensuring it's better
protected.
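
As a rough sketch of point 1, the snippet below encrypts a sensitive record before storage using the third-party `cryptography` package (assumed to be installed); the key handling and the record contents are simplified placeholders, not a production design.

```python
# Minimal at-rest encryption sketch (assumes `pip install cryptography`; record is made up).
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, obtain this from a key management service
cipher = Fernet(key)

record = b"card_number=4111111111111111;name=Shreya"
token = cipher.encrypt(record)     # ciphertext is safe to store in the data repository
print(token[:20], b"...")

assert cipher.decrypt(token) == record   # only holders of the key recover the plaintext
```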

File System Concept

 A File System is a method and data structure that the operating system uses to manage files
on a storage device (like hard drives or SSDs).
 It defines how data is stored, organized, and accessed, including how files and directories are
created, named, stored, and modified.
 In the context of Big Data, file systems are crucial because they handle large volumes of data
that need to be stored efficiently and retrieved quickly for processing.

Distributed File System (DFS) Concept

 A Distributed File System (DFS) is a file system that allows files to be stored and accessed
across multiple machines or nodes in a network, rather than being confined to a single
machine. In the context of Big Data, DFS is crucial for efficiently managing and processing
vast amounts of data across a distributed environment.
 DFS allows data to be stored in a distributed manner, spread across multiple devices.
 This distributed architecture provides high scalability, fault tolerance, and efficient access to
large datasets, which are essential for Big Data applications.
 A DFS typically integrates with frameworks like Hadoop, Spark, and NoSQL databases to
support large-scale data storage and analytics.

Components of DFS

DFS relies on two core components to function efficiently:

1. Location Transparency:

 This is achieved through the namespace component.


 Location Transparency ensures that users and applications don’t need to know
the actual physical location of data.
 They simply refer to files by their names and paths, and the DFS handles the task
of retrieving data from the right location.
 This makes the file system easier to interact with, and users don't need to worry
about which node or server the data resides on.

2. Redundancy:

 Redundancy is implemented through file replication.

 DFS ensures that copies of data are stored across different nodes to protect against
data loss.
 If one node or server fails, the replicated copy on another node ensures that the
data is still available. This contributes to high availability and fault tolerance.

Features of DFS

DFS provides several features that make it efficient for handling large datasets:

1. Transparency:
DFS ensures that the underlying complexities are hidden from the user:
o Structure Transparency: Users do not need to know the number or location of file
servers. The file system automatically manages this.

o Access Transparency: Local and remote files are accessed in the same way. There
should be no distinction for the client when accessing files stored on local machines
versus files stored remotely across a network.
o Naming Transparency: The name given to a file should not reveal its location. Once
a file is named, its name should remain consistent even if the file is moved from one
node to another in the distributed system.

o Replication Transparency: If a file is replicated across nodes, the system handles


accessing the correct replica, so users are unaware of the multiple copies.
2. User Mobility:
DFS ensures that when a user logs into any node, their home directory and associated data
are automatically made available. This is especially useful in cloud-based and distributed
computing environments.

3. Performance:
DFS aims to deliver performance comparable to a centralized file system by efficiently
distributing the load across multiple nodes. This includes CPU time, storage access time, and
network access time. DFS strives to balance these elements to achieve fast access times.

4. Simplicity and Ease of Use:


The user interface of DFS is designed to be simple, with minimal commands and
straightforward file management, so users do not need to worry about the distribution details.

5. High Availability:
DFS ensures data is always available, even in the case of node, link, or storage drive failures.
It uses data replication and fault-tolerant mechanisms to keep the system running smoothly.

Working of DFS

A Distributed File System works through the following mechanisms:

1. Distribution

 Data Distribution: DFS splits data into smaller blocks or chunks, and these blocks are
distributed across multiple nodes or clusters. Each node has its own computing power,
allowing for parallel processing of data.

 Parallel Processing: Since the data is distributed across nodes, each node can work on its
portion of data simultaneously, providing significant computational power and faster
processing times.

2. Replication

 Data Replication: DFS replicates data blocks across different nodes or clusters. By copying
data onto multiple machines (often in separate racks or physical locations), the system
achieves fault tolerance. If one cluster or node fails, the data is still available from another
replica.

 Fault Tolerance: In the event of a failure, the DFS system can retrieve the data from its
replicated copies, ensuring continuous access to the data. This feature makes the system
robust and reliable.
o Challenges with Data Replication: One challenge in DFS is maintaining
consistency between replicas. When a data block is modified on one node, the
changes must be reflected in all replicas across the system, which can be complex,
especially with frequent updates.

 High Concurrency: The replication of data also enables high concurrency, meaning
multiple clients or nodes can access and process the same data simultaneously. Each node
uses its own computational resources to process a part of the data, improving overall system
performance.
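
The toy sketch below mimics the two mechanisms just described: a file is split into fixed-size blocks and each block is placed on several nodes. The block size, node names, and round-robin placement are simplifying assumptions for illustration, not the actual placement policy of HDFS or any particular DFS.

```python
# Toy model of DFS block splitting and replication (all parameters are illustrative).
BLOCK_SIZE = 8           # bytes; real systems use blocks of 64-128 MB
REPLICATION_FACTOR = 3
NODES = ["node-1", "node-2", "node-3", "node-4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION_FACTOR):
    placement = {}
    for i, block in enumerate(blocks):
        # rotate through the cluster so replicas of one block land on distinct nodes
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        placement[f"block-{i}"] = {"data": block, "replicas": replicas}
    return placement

file_data = b"hello distributed file systems!"
for name, info in place_replicas(split_into_blocks(file_data)).items():
    print(name, "->", info["replicas"])
```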

3. Fault Tolerance and High Concurrency

Fault Tolerance: DFS is designed to handle failures gracefully. If a node or rack fails, data is still
accessible from other replicas. Data replication is a good way to achieve fault tolerance and high
concurrency; but it’s very hard to maintain frequent changes. Assume that someone changed a data
block on one cluster; these changes need to be updated on all data replica of this block.

High Concurrency: DFS allows for multiple clients to simultaneously access and process the same
data. Each client may access different parts of the data or the same data in parallel, enhancing the
system's performance by utilizing the computational power of each node.

Advantages of Distributed File System (DFS)

• Scalability:
DFS allows you to scale up your infrastructure by adding more racks or clusters as your data
storage and computational needs grow. As more nodes are added, the system can handle
larger amounts of data and more client requests, which is essential in Big Data environments.

• Fault Tolerance:
Data replication will help to achieve fault tolerance in the following cases:
• Cluster is down
• Rack is down
• Rack is disconnected from the network.
• Job failure or restart.

• High Concurrency:
DFS enables high concurrency by allowing multiple clients to access and process data
simultaneously across various nodes. By replicating data across multiple clusters, the system
can serve many requests at the same time, maximizing the computing power of each node.

 DFS allows multiple users to access and store data.

 It allows data to be shared remotely.
 It improves file availability, access time, and network efficiency.
 It improves the capacity to scale the amount of data stored and the ability to exchange
data between nodes.
 A Distributed File System keeps data accessible (transparently to the user) even if a server or disk fails.

Disadvantages of Distributed File System (DFS)

• In a Distributed File System, nodes and connections need to be secured, so security is a greater
concern than on a single machine.
• There is a possibility of losing messages and data in the network while they move from one node
to another.
• Database connections in a Distributed File System are complicated.
• Handling the database is also harder in a Distributed File System than in a single-user
system.
• There is a chance of overloading if all nodes try to send data at once.

Scalable Computing Over the Internet

 Scalable Computing in Big Data refers to the ability of a system to handle the growing
volume of data and increased computational demands by adding resources such as computing
power, storage, or network bandwidth.
 Scalable computing is crucial in Big Data environments, where data grows exponentially,
and real-time processing is often required.

Benefits of Scalable Computing in Big Data:

 Efficient handling of massive data volumes


 Cost-effectiveness with resource optimization
 Enhanced system performance and availability
 Adaptability to changing data demands

Real-World Examples:

1. E-commerce (e.g., Amazon, Flipkart): Real-time scalable computing enables processing


millions of transactions and user data simultaneously.
2. Social Media Platforms (e.g., Facebook, Twitter): Handle and analyze massive amounts of
user-generated data in real time, ensuring a seamless experience for billions of users.
3. Healthcare and Genomics: Scalable computing is used for processing vast genomic datasets
and for real-time analysis of health records.

Scalable Computing Over the Internet

• The Age of Internet Computing


• High-Performance Computing
• High-Throughput Computing
• Three New Computing Paradigms
• Computing Paradigm Distinctions
• Distributed System Families
• Degrees of Parallelism

1. The Age of Internet Computing

In the past, computers worked on their own or in small networks. With the internet’s rise, computing
became global. Now, computers can share resources and power through the internet.

 Before: Personal computers did most work.


 Now: Cloud computing, web services, and apps do the work online.
Example: Instead of installing software, we use Google Docs or stream videos on YouTube.

2. High-Performance Computing (HPC)

HPC = Super-fast computing for complex tasks.


It focuses on solving big problems quickly by using supercomputers or clusters (groups of powerful
computers working together).

 Used for: Weather prediction, rocket design, and medical research.


 How it works: Splits one complex task into smaller tasks and runs them all at the same time.
Example: Simulating climate changes over the next 50 years.

3. High-Throughput Computing (HTC)

HTC = Completing a lot of tasks over time.


It’s not about speed but about handling thousands or millions of smaller tasks continuously.

 Used for: Analyzing big data, processing bank transactions, bioinformatics (DNA
sequencing).
 How it works: Queues up tasks and processes them one by one across multiple computers.
Example: A bank processing millions of transactions daily.

4. Three New Computing Paradigms


These are new ways of using computers that have changed how we solve problems:

1. Cloud Computing
o On-demand resources like storage, servers, and applications over the internet.
o Example: Google Drive, AWS (Amazon Web Services).
2. Grid Computing
o Uses computers from different locations to work on one big task.
3. Edge Computing
o Processes data closer to where it’s created (like sensors or IoT devices) to reduce
delay.
o Example: Smart traffic lights in cities or autonomous vehicles.

5. Computing Paradigm Distinctions


Paradigm                    | What it Does                                     | Example
Cloud Computing             | On-demand resources over the internet            | Google Cloud, Dropbox
Grid Computing              | Sharing distributed computers for a common task  | CERN's Large Hadron Collider
High-Performance Computing  | Supercomputers for fast calculations             | NASA simulations
High-Throughput Computing   | Processing lots of small tasks                   | DNA sequencing
Edge Computing              | Local data processing on devices                 | Smart homes, autonomous cars

6. Distributed System Families
Distributed systems are groups of computers working together to solve tasks.
Here are the main types:

1. Cluster Computing
o Many computers act like a single machine.
o Example: Google’s data centers for search engine results.
2. Grid Computing
o Computers from different locations connect to work on big tasks.
3. Peer-to-Peer (P2P) Systems
o No central server; computers share resources equally.
o Example: BitTorrent for file sharing.
4. Cloud Systems
o Provides scalable services like storage and virtual machines.
o Example: Microsoft Azure, AWS.

7. Degrees of Parallelism

Parallelism means doing many tasks at the same time to speed things up. There are different levels
of parallelism:

1. Bit-Level Parallelism

o Processes smaller chunks of data faster.


o Example: Modern processors are 64-bit, processing more data at once than older 32-
bit ones.

2. Instruction-Level Parallelism

o Executes multiple instructions in a single clock cycle.


o Example: A CPU pipeline can run several instructions simultaneously.
3. Task Parallelism

o Runs different tasks on separate processors at the same time.


o Example: Video games running graphics, audio, and physics calculations
simultaneously.

4. Data Parallelism
o Splits a large dataset into smaller parts and processes them at once.
o Example: Analyzing large customer databases in parallel to find patterns.
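
The sketch below illustrates data parallelism in Python: the dataset is partitioned, each partition is processed by a separate worker process, and the partial results are combined. The dataset and the word-counting task are made up for the example.

```python
# A minimal data-parallelism sketch using a process pool (dataset is synthetic).
from multiprocessing import Pool

def count_words(chunk):
    # each worker processes only its own partition of the data
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    dataset = ["big data needs scalable processing"] * 100_000
    partitions = [dataset[i::4] for i in range(4)]   # four roughly equal partitions

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, partitions)   # processed in parallel

    print(sum(partial_counts))   # combine the partial results
```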

Popular models for big data

1. MapReduce

MapReduce is a programming model used for processing large amounts of data across many
computers. It splits data processing into two main steps:

1. Map – Breaks the input into smaller parts (key-value pairs).


2. Reduce – Combines the results of the Map step to produce the final output.

It is mostly used in Hadoop to handle big data efficiently.

Working of MapReduce:

1. Map Stage:
o The map task takes input data, typically stored in a distributed file system like
Hadoop Distributed File System (HDFS).
o It processes the input line-by-line and converts each line into a set of key-value pairs
(tuples).
o These key-value pairs are the output of the map task, and they represent the
intermediate results.
2. Shuffle Stage:
o This stage is responsible for sorting and transferring the data produced by the map
function to the appropriate reducers.
o The shuffle phase groups the data by key, ensuring that all values for the same key
are brought together. This is where the data is sorted and distributed.
3. Reduce Stage:
o The reduce task takes the shuffled and grouped key-value pairs as input.
o It then processes the data, combining or aggregating the values based on the keys.
o The output of the reduce stage is a smaller set of key-value pairs, which represents the
final result of the computation.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.

• The framework manages all the details of data passing, such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with the data on local disks, which reduces network
traffic. After the tasks are completed, the cluster collects and reduces the data to form the final
result and sends it back to the Hadoop server.
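
To make the three stages concrete, the sketch below runs the classic word-count example on a single machine, with plain Python functions standing in for the Map, Shuffle, and Reduce stages; a real job would be submitted to Hadoop or a similar framework.

```python
# Single-machine illustration of Map -> Shuffle -> Reduce (word count).
from collections import defaultdict

def map_phase(line):
    # emit one (word, 1) key-value pair per word in the line
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # group all values that share the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # aggregate the values for one key
    return key, sum(values)

lines = ["big data is big", "data drives decisions"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
result = dict(reduce_phase(key, values) for key, values in grouped.items())
print(result)   # {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}
```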

Advantages of MapReduce:

 Scalability: It is easy to scale data processing over multiple machines in a cluster. As the
amount of data grows, the model allows applications to be scaled by simply increasing the
number of nodes.
 Fault Tolerance: MapReduce provides built-in fault tolerance by reassigning tasks if a
failure occurs.
 Data Locality: The data is processed where it resides (on local disks), reducing network
traffic and improving performance.
 Parallelism: The map and reduce functions run in parallel, which significantly speeds up the
processing of large datasets.

2. Directed Acyclic Graph


A Directed Acyclic Graph (DAG) is a popular model used to represent tasks and their
dependencies in big data processing.

A Directed Acyclic Graph (DAG) is a directed graph that has no directed cycles. It consists of
vertices (or nodes) connected by edges, where each edge has a direction pointing from one vertex to
another.

Key Characteristics of DAG:

1. Directed: Each edge has a direction, meaning it points from one vertex to another.

2. Acyclic: There are no cycles in the graph, meaning it’s impossible to start from a vertex and
return to it by following a series of directed edges.
3. Topologically Ordered: A DAG can always be arranged in a way such that for every
directed edge u → v, vertex u comes before vertex v in the ordering.

Example of a DAG

Consider a directed graph on the vertices 0 through 5:

 There are directed edges between nodes.
 No cycles exist (i.e., no node can be reached again by following the edges forward).
 We can therefore apply topological sorting.

Topological Sorting of DAG

Topological sorting is a linear ordering of vertices such that for every directed edge (u → v), vertex
u comes before v.

Possible topological sorts for the above graph:

1. 5 → 4 → 2 → 3 → 1 → 0
2. 4 → 5 → 2 → 3 → 1 → 0

There can be multiple valid topological orderings.
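
Since the original figure is not reproduced here, the sketch below assumes an edge list on vertices 0 through 5 that is consistent with the two orderings above, and computes one topological order with Kahn's algorithm.

```python
# Topological sort (Kahn's algorithm) on an assumed DAG consistent with the orderings above.
from collections import deque

edges = {5: [2, 0], 4: [0, 1], 2: [3], 3: [1], 1: [], 0: []}   # assumed edge list

def topological_sort(graph):
    indegree = {v: 0 for v in graph}
    for targets in graph.values():
        for v in targets:
            indegree[v] += 1
    queue = deque(v for v, d in indegree.items() if d == 0)   # vertices with no prerequisites
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order

print(topological_sort(edges))   # [5, 4, 2, 0, 3, 1] -- one of several valid orderings
```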

Applications of DAG

1. Workflow Scheduling in Big Data Processing


o DAGs are widely used in big data frameworks like Apache Spark, Apache Airflow,
and TensorFlow to represent execution tasks in a pipeline.
o Tasks are executed sequentially based on dependency order.
2. Data Processing Pipelines
o DAGs help in structuring complex data processing workflows where each task
depends on the completion of others (e.g., ETL processes).

3. Job Scheduling in Operating Systems
o Used to determine the execution order of jobs in operating systems and database
query optimization.
4. Artificial Intelligence and Machine Learning
o Neural networks and probabilistic graphical models (e.g., Bayesian Networks) use
DAG structures for inference and learning.
5. Blockchain and Cryptocurrencies
o Some cryptocurrencies (e.g., IOTA and Nano) use a DAG instead of a blockchain to
improve scalability and reduce transaction costs.

3. Message Passing
 Process communication is the mechanism provided by the operating system that allows
processes to communicate with each other.
 One of the models of process communication is the message passing model.
 The message passing model allows multiple processes to read and write data to a message
queue without being directly connected to each other.

How Message Passing Works

1. Processes communicate via a message queue instead of directly sharing memory.


2. Messages are stored in a queue until the recipient retrieves them.
3. Processes can send, receive, or wait for messages as needed.

Example:

 Process P1 sends a message to a queue.


 Process P2 retrieves and processes the message.
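
A minimal sketch of this model using Python's multiprocessing queue is shown below; the process names P1 and P2 mirror the example above, and the message text is invented.

```python
# Message passing between two processes through a shared queue.
from multiprocessing import Process, Queue

def p1(queue):
    queue.put("order #42 placed")        # P1 sends a message to the queue

def p2(queue):
    message = queue.get()                # P2 blocks until a message arrives
    print("P2 processed:", message)

if __name__ == "__main__":
    q = Queue()
    sender = Process(target=p1, args=(q,))
    receiver = Process(target=p2, args=(q,))
    sender.start(); receiver.start()
    sender.join(); receiver.join()
```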

Advantages

 No Shared Memory Required – Works well in distributed systems.


 Asynchronous Communication – Messages can be processed later.
 Fault Tolerance – Messages remain in the queue even if a process crashes.
 Scalability – Easily scales across multiple machines.
 Security – No risk of data corruption due to shared memory issues.

Disadvantages

 Slower than Shared Memory – Message transmission has overhead.


 Requires Synchronization – Sender and receiver coordination needed.
 Message Queue Overhead – Managing large queues can be complex.

Step-by-Step Approach to Workflow Orchestration for Big Data

 Workflow orchestration in Big Data involves managing and automating complex data
pipelines to ensure efficient execution and coordination of tasks.
 It involves managing dependencies between tasks, handling failures, optimizing
performance, and ensuring smooth data flow across different systems and services.

Below is a step-by-step guide to orchestrating workflows in a Big Data environment.

Step 1: Define Workflow Requirements

 Identify business objectives and key deliverables.


 Determine the data sources, formats, and expected volume.
 Define dependencies between tasks.
 Establish performance, latency, and fault tolerance requirements.

Step 2: Choose the Right Orchestration Tool

Some popular tools include:

 Apache Airflow – Best for complex DAG-based workflows.


 Apache Oozie – Suitable for Hadoop ecosystem workflows.
 Luigi – Great for dependency management.
 Argo Workflows – Kubernetes-native orchestration.
 AWS Step Functions, Azure Data Factory, Google Cloud Composer – Cloud-based
solutions.

Step 3: Design the Workflow

 Break the workflow into smaller tasks.

 Define task dependencies (sequential, parallel, and conditional).
 Consider error handling and retry mechanisms.
 Design the workflow using Directed Acyclic Graphs (DAGs).

Step 4: Implement and Configure Tasks

 Use Python, SQL, or Spark jobs to implement data processing tasks (a minimal Airflow
sketch follows this list).

 Define scheduling parameters.
 Use version control (Git) to track changes.
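
As a rough illustration of Steps 3 and 4, the sketch below defines a three-task ETL pipeline as an Airflow DAG. It assumes Apache Airflow 2.x (2.4 or later for the `schedule` argument) is installed; the dag_id, task names, schedule, and the extract/transform/load callables are hypothetical placeholders, not a prescribed pipeline.

```python
# Minimal Airflow DAG sketch: extract -> transform -> load (all names are placeholders).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def transform():
    print("clean and aggregate the extracted data")

def load():
    print("write the results to the target store")

with DAG(
    dag_id="daily_etl_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",          # run once per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load   # sequential dependencies form the DAG
```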

Step 5: Test the Workflow

 Run unit tests on individual tasks.


 Perform integration tests for the end-to-end pipeline.
 Simulate failure scenarios and test fault tolerance mechanisms.

Step 6: Deploy the Workflow

 Use CI/CD (Continuous Integration/Continuous Deployment) pipelines for automated


deployment.
 Deploy in a staging environment before production.
 Ensure monitoring tools (e.g., Prometheus, Grafana) are in place.

Step 7: Monitor and Optimize

 Set up logging and alerts for failures.


 Monitor execution time and resource utilization.
 Optimize scheduling and parallelism for efficiency.
 Implement auto-scaling based on workload demand.

Step 8: Maintain and Evolve

 Regularly update workflow definitions to accommodate new data sources.


 Apply security best practices (e.g., role-based access control, data encryption).
 Continuously improve based on performance metrics and feedback.

****************************************

