Big Data
Big Data refers to extremely large datasets that are complex and grow rapidly, making them
difficult to store, manage, and analyze using traditional data processing tools. These datasets
are characterized by their size, variety, and speed.
Big data is a term that describes the massive volumes of structured and unstructured data that a
company encounters on a daily basis.
Big data analytics is the application of advanced analytic techniques to very large, heterogeneous
data sets, which can contain structured, semi-structured, and unstructured data drawn from many
sources and ranging in size from terabytes to zettabytes.
For example, a social media platform such as Facebook generates enormous amounts of data every day:
o Photo and Video Uploads: Billions of photos and videos are uploaded every day by
users from across the world.
o Messages: Billions of private messages are exchanged between users.
o Comments and Posts: The platform's activity includes status updates, comments, likes,
and shares, contributing massive amounts of data.
Big Data Challenge: Managing and processing this data for various purposes such as user
engagement, advertisement targeting, content recommendations, and user behavior analysis.
Another example is aviation, where aircraft sensors continuously generate data about speed, moisture, and environmental conditions.
Big Data Challenge: Managing, storing, and analyzing this vast data in real-time, which
helps improve flight safety, optimize fuel consumption, and predict maintenance schedules.
A further example is the stock market, where exchanges generate enormous volumes of trade and price data used for trend analysis.
Big Data Challenge: Analyzing this data in real-time to make quick trading decisions,
predict stock trends, and improve financial modeling.
Characteristics of Big Data
Volume
'Volume' refers to the sheer amount of data. The magnitude of data plays a critical role in
determining its worth: when the amount of data is extremely vast, it is referred to as 'Big Data.'
Velocity
It refers to the speed at which data is generated and needs to be processed. Big Data is not only
about collecting data but also about analyzing it in real time or near-real time to make timely
decisions.
Variety
It refers to the different types of data Big Data encompasses. Big Data sources include a wide
variety of data types such as:
Structured data is data that has been organized into a fixed format, typically with a defined
length and schema, such as the rows and columns of a relational table.
Semi-structured data is partially organized data that does not follow a rigid, traditional
structure but still carries some internal markers; log files are a typical example of this sort of data.
Unstructured data is data that has not been organized at all. It usually refers to data that
doesn't fit cleanly into a relational database's standard row and column structure. Texts,
pictures, videos, etc. are examples of unstructured data that cannot be stored in the form
of rows and columns.
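To make the distinction concrete, the short Python sketch below (illustrative only; the file contents and field names are made up) shows how each kind of data is typically handled:

import csv, io, json, re

# Structured data: fixed rows and columns, like a relational table or CSV export.
csv_text = "customer_id,city\n101,Pune\n102,Delhi\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["customer_id"], row["city"])             # known, named columns

# Semi-structured data: self-describing but flexible, e.g. a JSON log line.
log_line = '{"ts": "2024-01-01T10:00:00", "level": "ERROR", "msg": "disk full"}'
event = json.loads(log_line)                            # fields may vary per record
print(event["level"], event["msg"])

# Unstructured data: free text with no fixed schema; structure must be extracted.
review = "Great phone, but the battery drains too fast!"
words = re.findall(r"\w+", review.lower())              # naive tokenization
print(len(words), "tokens")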
Applications of BIG DATA
Big Data plays a crucial role across various industries by helping organizations harness large
volumes of data for strategic decision-making, operational efficiency, and creating better customer
experiences.
Below are detailed examples of how Big Data is utilized across different sectors:
1. Retail Sector: Large retail stores like Amazon, Walmart, and Big Bazaar collect extensive
data on customer spending habits, shopping frequency, preferred brands, and frequently
purchased products. By analyzing this data, companies can:
Identify the most popular products, ensuring they are stocked up.
Predict future demand for specific products based on trends.
Adjust production rates accordingly.
Banks can also leverage this data to offer personalized promotions to customers based
on their purchasing behavior. For example, offering discounts or cashback when
customers use bank cards for products they frequently buy.
2. Recommendation Systems
E-commerce: E-commerce platforms like Amazon, Flipkart, and Walmart track customers'
previous searches and purchases to recommend related products.
For example, if a customer buys a bed cover, the platform might suggest other home
decor items. Additionally, when the customer searches for a product like a "bed
cover," targeted advertisements for similar products appear on various sites (Google,
YouTube, etc.).
YouTube: YouTube uses Big Data to recommend videos based on previous viewing history,
which helps users discover new content.
For instance, if someone watches tutorials on Big Data, they may be shown ads or
recommendations for Big Data courses or related content.
3. Urban Planning and Transportation: Big Data is crucial for developing smart traffic
systems. Through data gathered from traffic cameras, GPS-enabled vehicles (e.g., Uber, Ola),
and road sensors, traffic management systems can:
Minimize fuel consumption by suggesting less congested roads.
4. Aviation: Aircraft sensors generate vast amounts of data about the plane's performance,
including speed, moisture, and environmental conditions. This data helps improve flight
safety, optimize fuel consumption, and predict maintenance schedules.
5. Auto-Driving Cars
Autonomous Vehicles: Self-driving cars rely on Big Data collected from multiple sensors,
cameras, and other devices. This data helps the car perceive its surroundings, detect
obstacles, and make driving decisions in real time.
6. Personalized Assistance: Tools like Siri (Apple), Cortana (Microsoft), and Google Assistant
use Big Data to respond to user queries. These virtual assistants analyze data such as voice
queries, search history, and location in order to answer user requests.
7. Manufacturing and Healthcare: IoT devices generate massive amounts of data that can be
analyzed for operational efficiency and predictive maintenance.
8. Education Sector
Targeted Learning: Online educational platforms analyze user data, such as search queries
or videos watched on specific subjects, to target prospective students with advertisements or
course recommendations.
For instance, if someone watches tutorials on Python, they might see ads for coding boot
camps or online courses in Python.
9. Energy Sector
Smart Energy Management: Smart meters installed in households and manufacturing units
collect data on electricity usage in real-time. This data is then analyzed to optimize energy
consumption, forecast demand, and detect unusual usage patterns.
10. Consumer Preferences: Streaming services like Netflix, Spotify, and Amazon Prime collect
data on user behavior, such as viewing history, search queries, and listening habits, in order
to recommend personalized content.
Benefits of Big Data
1. Better Decision-Making
3. Fraud Detection and Risk Management
5. Healthcare Advancements
Improves diagnosis, treatment, and patient care by analyzing large medical datasets.
Example: Big Data in genomics helps predict and prevent diseases.
6. Faster Innovation
Helps companies innovate faster by understanding market trends and customer needs.
Example: Automobiles with smart features driven by data insights (like Tesla).
7. Improved Security
There are a number of tools used in Big Data. The most popular tools are:
1. Apache Hadoop
An open-source platform that stores and distributes large data sets using computer
clusters.
It's one of the most powerful big data technologies, with the ability to grow from a
single server to thousands of computers.
It brings flexibility to data processing and allows for faster processing.
2. Apache STORM
Storm is a free, open-source big data computation system.
It is one of the best big data tools, offering a distributed, real-time, fault-tolerant
processing system.
It uses parallel calculations that run across a cluster of machines.
3. Qubole
Qubole is an autonomous big data management platform.
It is a self-managing, self-optimizing tool that allows the data team to focus on
business outcomes.
4. Apache Cassandra
The Apache Cassandra database is widely used today to provide effective
management of large amounts of data.
It is one of the best big data tools and is most suitable for applications that cannot
afford to lose data, even when an entire data center is down.
Support contracts and services for Cassandra are available from third parties.
5. Pentaho
Pentaho provides big data tools to extract, prepare and blend data.
It offers visualizations and analytics that change the way any business is run.
This big data tool allows turning big data into big insights.
6. Apache Flink
Apache Flink is one of the best open source data analytics tools for stream processing
big data.
It supports distributed, high-performing, always-available, and accurate data streaming
applications.
It provides results that are accurate, even for out-of-order or late-arriving data.
7. OpenRefine
OpenRefine is a powerful big data tool.
It is big data analytics software that helps users work with messy data, cleaning it and
transforming it from one format into another.
It can also be extended with web services and external data.
8. RapidMiner
RapidMiner is one of the best open-source data analytics tools.
It is used for data prep, machine learning, and model deployment.
It offers a suite of products to build new data mining processes and set up predictive
analysis.
9. Kaggle
Kaggle is the world's largest big data community.
It helps organizations and researchers to post their data & statistics. It is the best place
to analyze data seamlessly.
10. Apache Hive
Hive is an open-source big data software tool.
It allows programmers to analyze large data sets on Hadoop.
It helps with querying and managing large datasets quickly.
5 Challenges in BIG DATA
1. Lack of Proper Understanding of Big Data
Problem:
Many employees lack a clear understanding of data storage, processing, and its
importance.
Only data professionals are aware of the technical aspects, while others in the
organization may remain uninformed.
Example:
If employees don’t understand the importance of maintaining a backup, crucial customer data
might be lost in case of a server failure. This can disrupt business operations and harm
customer trust.
Solution:
Workshops and Seminars: Conduct Big Data training sessions for all employees
involved in data handling.
Basic Training Programs: Ensure every employee understands key data concepts
and storage protocols.
Company-wide Awareness: Foster a data-driven culture across all levels of the
organization.
2. Data Growth Issues
Problem:
Data volumes grow rapidly over time, which makes storing and managing the data
increasingly difficult and expensive.
Example:
An e-commerce company stores customer data, purchase history, product images, and user
reviews. As its customer base grows, managing these datasets becomes increasingly
challenging, leading to slower processing times and higher costs.
Solution:
2. Deduplication: Remove duplicate data to avoid redundancy (a small sketch follows this list).
3. Data Tiering: Classify and store data based on its importance in appropriate storage
tiers (public cloud, private cloud, flash storage).
4. Big Data Tools: Use advanced tools like Hadoop, NoSQL databases, and Apache
Spark to manage and process large data sets efficiently.
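As a small illustration of the deduplication idea, the Python sketch below (the record layout is hypothetical and not tied to any particular product) drops exact duplicate records by hashing them:

import hashlib, json

def deduplicate(records):
    """Yield each record only once, keyed by a hash of its canonical JSON form."""
    seen = set()
    for rec in records:
        key = hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield rec

# Hypothetical customer records; the first and third are exact duplicates.
records = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Ravi", "city": "Delhi"},
    {"id": 1, "name": "Asha", "city": "Pune"},
]
print(list(deduplicate(records)))    # the duplicate record appears only once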
3. Confusion While Selecting Big Data Tools
Problem:
Companies struggle to choose the right tools for data storage and analysis due to the
wide variety of options available.
Common dilemmas include:
o Hadoop vs. Apache Spark for data processing
o HBase vs. Cassandra for data storage
Example:
A company may select Hadoop for real-time analytics, but it later realizes that Apache
Spark would have been a better choice for faster processing and machine learning tasks.
This mistake leads to delays and additional expenses.
Solution:
4. Lack of Skilled Data Professionals
Problem:
There is a shortage of skilled professionals such as data scientists, data analysts, and
data engineers who can handle Big Data tools and extract meaningful insights.
The rapid evolution of data handling technologies has outpaced the development of
an adequately trained workforce.
Example:
A healthcare company implementing a Big Data project to analyze patient records may
struggle to find experienced professionals who can use tools like Hadoop, Spark, and Python
for predictive analytics. This results in project delays and incomplete analysis.
Solution:
5. Securing Data
Problem:
Companies store huge amounts of sensitive data, and data security is often neglected or
postponed, leaving this data exposed to breaches.
Example:
A retail company storing customer payment details in its data center experiences a data
breach. Sensitive information is stolen, leading to financial loss of over $3.7 million and
damage to the company’s reputation.
Solution:
File System Concept
A File System is a method and data structure that the operating system uses to manage files
on a storage device (like hard drives or SSDs).
It defines how data is stored, organized, and accessed, including how files and directories are
created, named, stored, and modified.
In the context of Big Data, file systems are crucial because they handle large volumes of data
that need to be stored efficiently and retrieved quickly for processing.
A Distributed File System (DFS) is a file system that allows files to be stored and accessed
across multiple machines or nodes in a network, rather than being confined to a single
machine. In the context of Big Data, DFS is crucial for efficiently managing and processing
vast amounts of data across a distributed environment.
DFS allows data to be stored in a distributed manner, spread across multiple devices.
This distributed architecture provides high scalability, fault tolerance, and efficient access to
large datasets, which are essential for Big Data applications.
A DFS typically integrates with frameworks like Hadoop, Spark, and NoSQL databases to
support large-scale data storage and analytics.
Components of DFS
1. Location Transparency:
Clients can access files without knowing the physical location of the data; the same
file can be reached regardless of which node actually stores it.
2. Redundancy:
DFS ensures that copies of data are stored across different nodes to protect against
data loss.
If one node or server fails, the replicated copy on another node ensures that the
data is still available. This contributes to high availability and fault tolerance.
Features of DFS
DFS provides several features that make it efficient for handling large datasets:
1. Transparency:
DFS ensures that the underlying complexities are hidden from the user:
o Structure Transparency: Users do not need to know the number or location of file
servers. The file system automatically manages this.
o Access Transparency: Local and remote files are accessed in the same way. There
should be no distinction for the client when accessing files stored on local machines
versus files stored remotely across a network.
o Naming Transparency: The name given to a file should not reveal its location. Once
a file is named, its name should remain consistent even if the file is moved from one
node to another in the distributed system.
3. Performance:
DFS aims to deliver performance comparable to a centralized file system by efficiently
distributing the load across multiple nodes. This includes CPU time, storage access time, and
network access time. DFS strives to balance these elements to achieve fast access times.
5. High Availability:
DFS ensures data is always available, even in the case of node, link, or storage drive failures.
It uses data replication and fault-tolerant mechanisms to keep the system running smoothly.
Working of DFS
1. Distribution
Data Distribution: DFS splits data into smaller blocks or chunks, and these blocks are
distributed across multiple nodes or clusters. Each node has its own computing power,
allowing for parallel processing of data.
Parallel Processing: Since the data is distributed across nodes, each node can work on its
portion of data simultaneously, providing significant computational power and faster
processing times.
2. Replication
Data Replication: DFS replicates data blocks across different nodes or clusters. By copying
data onto multiple machines (often in separate racks or physical locations), the system
achieves fault tolerance. If one cluster or node fails, the data is still available from another
replica.
Fault Tolerance: In the event of a failure, the DFS system can retrieve the data from its
replicated copies, ensuring continuous access to the data. This feature makes the system
robust and reliable.
o Challenges with Data Replication: One challenge in DFS is maintaining
consistency between replicas. When a data block is modified on one node, the
changes must be reflected in all replicas across the system, which can be complex,
especially with frequent updates.
High Concurrency: The replication of data also enables high concurrency, meaning
multiple clients or nodes can access and process the same data simultaneously. Each node
uses its own computational resources to process a part of the data, improving overall system
performance.
Fault Tolerance: DFS is designed to handle failures gracefully. If a node or rack fails, data is still
accessible from other replicas. Data replication is a good way to achieve fault tolerance and high
concurrency, but it is hard to maintain under frequent changes: if someone changes a data block on
one cluster, those changes must be updated on every replica of that block.
High Concurrency: DFS allows for multiple clients to simultaneously access and process the same
data. Each client may access different parts of the data or the same data in parallel, enhancing the
system's performance by utilizing the computational power of each node.
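The block-splitting and replication behaviour described above can be sketched in a few lines of Python (a toy model with made-up block size and node names; this is not how HDFS is actually implemented):

import itertools

BLOCK_SIZE = 4        # toy block size in characters (HDFS, for example, defaults to 128 MB)
REPLICATION = 3       # number of copies kept of each block
NODES = ["node-1", "node-2", "node-3", "node-4"]      # hypothetical cluster nodes

def split_into_blocks(data, size=BLOCK_SIZE):
    """Split the input into fixed-size blocks, like a DFS splitting a large file."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes in round-robin fashion."""
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for block_id, _ in enumerate(blocks):
        start = next(ring)
        placement[block_id] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks("a very large file pretending to be big data")
for block_id, replicas in place_replicas(blocks).items():
    print(f"block {block_id} -> {replicas}")
# If one node fails, every block still has copies on the other nodes it was placed on.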
Advantages of DFS
• Scalability:
DFS allows you to scale up your infrastructure by adding more racks or clusters as your data
storage and computational needs grow. As more nodes are added, the system can handle
larger amounts of data and more client requests, which is essential in Big Data environments.
• Fault Tolerance:
Data replication will help to achieve fault tolerance in the following cases:
• Cluster is down
• Rack is down
• Rack is disconnected from the network.
• Job failure or restart.
• High Concurrency:
DFS enables high concurrency by allowing multiple clients to access and process data
simultaneously across various nodes. By replicating data across multiple clusters, the system
can serve many requests at the same time, maximizing the computing power of each node.
Disadvantages of DFS
• In a Distributed File System, nodes and connections need to be secured; therefore, we can say that
security is at stake.
• There is a possibility of loss of messages and data in the network while they move from one node
to another.
• Database connections in a Distributed File System are complicated.
• Handling the database is also not easy in a Distributed File System compared to a single-user
system.
• There is a chance of overloading if all nodes try to send data at once.
Scalable Computing Over the Internet
Scalable Computing in Big Data refers to the ability of a system to handle the growing
volume of data and increased computational demands by adding resources such as computing
power, storage, or network bandwidth.
Scalable computing is crucial in Big Data environments, where data grows exponentially,
and real-time processing is often required.
Real-World Examples:
In the past, computers worked on their own or in small networks. With the internet’s rise, computing
became global. Now, computers can share resources and power through the internet.
2. High-Performance Computing (HPC)
Used for: Analyzing big data, processing bank transactions, bioinformatics (DNA
sequencing).
How it works: Queues up tasks and processes them one by one across multiple computers.
Example: A bank processing millions of transactions daily.
1. Cloud Computing
o On-demand resources like storage, servers, and applications over the internet.
o Example: Google Drive, AWS (Amazon Web Services).
2. Grid Computing
o Uses computers from different locations to work on one big task.
3. Edge Computing
o Processes data closer to where it’s created (like sensors or IoT devices) to reduce
delay.
o Example: Smart traffic lights in cities or autonomous vehicles.
6. Distributed System Families
Distributed systems are groups of computers working together to solve tasks.
Here are the main types:
1. Cluster Computing
o Many computers act like a single machine.
o Example: Google’s data centers for search engine results.
2. Grid Computing
o Computers from different locations connect to work on big tasks.
3. Peer-to-Peer (P2P) Systems
o No central server; computers share resources equally.
o Example: BitTorrent for file sharing.
4. Cloud Systems
o Provides scalable services like storage and virtual machines.
o Example: Microsoft Azure, AWS.
7. Degrees of Parallelism
Parallelism means doing many tasks at the same time to speed things up. There are different levels
of parallelism:
1. Bit-Level Parallelism
2. Instruction-Level Parallelism
4. Data Parallelism
o Splits a large dataset into smaller parts and processes them at once.
o Example: Analyzing large customer databases in parallel to find patterns.
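A minimal Python sketch of data parallelism (the dataset and the per-chunk work are invented for illustration):

from multiprocessing import Pool

def count_high_value(chunk):
    """Work done on one slice of the data: count purchases above a threshold."""
    return sum(1 for purchase in chunk if purchase > 1000)

if __name__ == "__main__":
    purchases = [120, 2500, 870, 4300, 99, 1500, 310, 7800]    # toy dataset
    chunks = [purchases[i::4] for i in range(4)]               # split into 4 parts
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_high_value, chunks)    # process the parts in parallel
    print(sum(partial_counts), "high-value purchases")         # prints: 4 high-value purchases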
Popular Models for Big Data
1. MapReduce
MapReduce is a programming model used for processing large amounts of data across many
computers. It splits data processing into two main steps: a Map step and a Reduce step.
Working of MapReduce:
1. Map Stage:
o The map task takes input data, typically stored in a distributed file system like
Hadoop Distributed File System (HDFS).
o It processes the input line-by-line and converts each line into a set of key-value pairs
(tuples).
o These key-value pairs are the output of the map task, and they represent the
intermediate results.
2. Shuffle Stage:
o This stage is responsible for sorting and transferring the data produced by the map
function to the appropriate reducers.
o The shuffle phase groups the data by key, ensuring that all values for the same key
are brought together. This is where the data is sorted and distributed.
3. Reduce Stage:
o The reduce task takes the shuffled and grouped key-value pairs as input.
o It then processes the data, combining or aggregating the values based on the keys.
o The output of the reduce stage is a smaller set of key-value pairs, which represents the
final result of the computation.
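To make the three stages concrete, here is a small pure-Python simulation of a word-count job (it only imitates the MapReduce flow on one machine; a real Hadoop job would distribute these functions across the cluster):

from collections import defaultdict

def map_phase(line):
    """Map: turn one input line into (key, value) pairs, here (word, 1)."""
    return [(word, 1) for word in line.lower().split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group values by key so a reducer sees all values for one key together."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values for a key, here by summing the counts."""
    return key, sum(values)

lines = ["big data needs big tools", "big data grows fast"]
mapped = [pair for line in lines for pair in map_phase(line)]       # Map stage
grouped = shuffle_phase(mapped)                                     # Shuffle stage
result = dict(reduce_phase(k, v) for k, v in grouped.items())       # Reduce stage
print(result)   # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'grows': 1, 'fast': 1}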
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks, which reduces network
traffic. After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result and sends it back to the Hadoop server.
Advantages of MapReduce:
Scalability: It is easy to scale data processing over multiple machines in a cluster. As the
amount of data grows, the model allows applications to be scaled by simply increasing the
number of nodes.
Fault Tolerance: MapReduce provides built-in fault tolerance by reassigning tasks if a
failure occurs.
Data Locality: The data is processed where it resides (on local disks), reducing network
traffic and improving performance.
Parallelism: The map and reduce functions run in parallel, which significantly speeds up the
processing of large datasets.
2. Directed Acyclic Graph (DAG)
A Directed Acyclic Graph (DAG) is a directed graph that has no directed cycles. It consists of
vertices (or nodes) connected by edges, where each edge has a direction pointing from one vertex to
another.
1. Directed: Each edge has a direction, meaning it points from one vertex to another.
2. Acyclic: There are no cycles in the graph, meaning it’s impossible to start from a vertex and
return to it by following a series of directed edges.
3. Topologically Ordered: A DAG can always be arranged in a way such that for every
directed edge u → v, vertex u comes before vertex v in the ordering.
Example of a DAG
Topological sorting is a linear ordering of vertices such that for every directed edge (u → v), vertex
u comes before v.
1. 5 → 4 → 2 → 3 → 1 → 0
2. 4 → 5 → 2 → 3 → 1 → 0
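A small Python sketch of topological sorting (the edge list below is an assumed six-vertex graph, chosen to be consistent with the orderings listed above):

from collections import deque

# Assumed DAG on vertices 0..5; each edge (u, v) means u must come before v.
edges = [(5, 2), (5, 0), (4, 0), (4, 1), (2, 3), (3, 1)]

def topological_sort(num_vertices, edges):
    """Kahn's algorithm: repeatedly output vertices that have no remaining incoming edges."""
    indegree = [0] * num_vertices
    adjacent = {v: [] for v in range(num_vertices)}
    for u, v in edges:
        adjacent[u].append(v)
        indegree[v] += 1
    queue = deque(v for v in range(num_vertices) if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adjacent[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order    # for a DAG, this contains every vertex exactly once

print(topological_sort(6, edges))   # prints [4, 5, 2, 0, 3, 1], one valid topological order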
Applications of DAG
3. Job Scheduling in Operating Systems
o Used to determine the execution order of jobs in operating systems and database
query optimization.
4. Artificial Intelligence and Machine Learning
o Neural networks and probabilistic graphical models (e.g., Bayesian Networks) use
DAG structures for inference and learning.
5. Blockchain and Cryptocurrencies
o Some cryptocurrencies (e.g., IOTA and Nano) use a DAG instead of a blockchain to
improve scalability and reduce transaction costs.
3. Message Passing
Process communication is the mechanism provided by the operating system that allows
processes to communicate with each other.
One of the models of process communication is the message passing model.
The message passing model allows multiple processes to read and write data to a message
queue without being directly connected to each other.
Example:
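A minimal sketch of the message passing model using Python's multiprocessing.Queue (the messages and process roles are made up for illustration):

from multiprocessing import Process, Queue

def producer(queue):
    """Writer process: puts messages on the shared queue without knowing who will read them."""
    for i in range(3):
        queue.put(f"record-{i}")
    queue.put(None)                  # sentinel value signalling "no more messages"

def consumer(queue):
    """Reader process: takes messages off the queue without knowing who wrote them."""
    while True:
        message = queue.get()
        if message is None:
            break
        print("processed", message)

if __name__ == "__main__":
    queue = Queue()                  # the message queue shared by both processes
    p = Process(target=producer, args=(queue,))
    c = Process(target=consumer, args=(queue,))
    p.start(); c.start()
    p.join(); c.join()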
Advantages
Disadvantages
Step-by-Step Approach to Workflow Orchestration for Big Data
Workflow orchestration in Big Data involves managing and automating complex data
pipelines to ensure efficient execution and coordination of tasks.
It involves managing dependencies between tasks, handling failures, optimizing
performance, and ensuring smooth data flow across different systems and services.
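As a rough illustration of orchestrating dependent tasks, the Python sketch below runs a tiny pipeline in dependency order and retries failed tasks (the task names, dependencies, and retry count are invented; real pipelines typically rely on an orchestrator such as Apache Airflow):

tasks = {
    "ingest":    {"depends_on": [],                     "action": lambda: print("ingest raw data")},
    "clean":     {"depends_on": ["ingest"],             "action": lambda: print("clean and validate")},
    "aggregate": {"depends_on": ["clean"],              "action": lambda: print("aggregate metrics")},
    "report":    {"depends_on": ["clean", "aggregate"], "action": lambda: print("publish report")},
}

def run_workflow(tasks, max_retries=2):
    """Run each task once all of its dependencies have completed, retrying on failure."""
    done = set()
    while len(done) < len(tasks):
        progressed = False
        for name, task in tasks.items():
            if name in done or not all(dep in done for dep in task["depends_on"]):
                continue
            for attempt in range(max_retries + 1):
                try:
                    task["action"]()                   # execute the task
                    done.add(name)
                    progressed = True
                    break
                except Exception as exc:               # a failure triggers a retry
                    print(f"{name} failed (attempt {attempt + 1}): {exc}")
        if not progressed:
            raise RuntimeError("workflow is stuck: unmet, failing, or circular dependencies")

run_workflow(tasks)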
Step 6: Deploy the Workflow
****************************************