BDA notes part 1
Big Data is a collection of data that is huge in volume and grows exponentially with time. Data
in petabytes (10^15 bytes) is called Big Data. It is stated that almost 90% of today’s data has
been generated in the past few years.
Big Data is bringing about changes in our lives because it allows diverse and heterogeneous
data to be fully integrated and analysed to help us make decisions.
Big Data is the term for a collection of data sets so large and complex that it becomes difficult
to process them using on-hand database management tools or traditional data processing applications.
Examples:
90% of the world’s data has been created in the last two years.
Walmart handles more than 1 million customer transactions every hour.
Facebook stores, accesses and analyses 30+ petabytes of user-generated data.
230+ million tweets are created every day.
Big Data is very important for medium-sized to large organizations because it enables them to
gather, store, manage and manipulate extremely large amounts of data, arriving at extremely
high velocity and in an extremely wide variety.
Applications of Big Data Analytics:
1. Improved Decision Making:
Rather than making decisions blindly, companies consider big data analytics before reaching
any conclusion. Big Data Analytics has boosted the decision-making process to a great extent.
2. New Products and Services:
Big Data Analytics is used by various firms to create new products and services for their
customers. Through big data, companies analyse different customers’ opinions about their
products and how their products are perceived.
3. Big Data in Educational Sector:
Big Data benefits the educational sector in managing data related to students. Analysing
students’ capabilities based on this data can help teachers nurture their future in a
better way.
4. Price Optimization:
Through big data, companies analyse which prices have yielded the maximum profit under
various historical market conditions. Through big data solutions, they set their product’s
price according to customers’ willingness to pay under different circumstances.
5. Recommendation Engines:
Online searching has been made easy with the help of recommendation engines powered by Big
Data Analytics. Companies analyse every customer’s data and then make recommendations
accordingly. These recommendations are largely based on the customer’s activities during
their last visit to the platform and on their real-time activities.
6. Healthcare:
Big data enhances the overall operational efficiency of healthcare companies. Big Data Analytics
can help them find better cures for diseases by recognizing unknown connections
and hidden patterns.
7. Fraud Detection:
Customer information can be analysed to predict general trends and spot fraudulent
behaviour.
8. Agriculture:
Big Data provides granular data on rainfall patterns, water cycles and enables farmers to
make smart decisions such as what crops to plant for better profitability and when to
harvest.
Big Data Analytics is the use of advanced analytical techniques against very large, diverse
datasets that include structured, semi-structured and unstructured data from different sources
and of different sizes.
Types of Big Data Analytics:
1. Descriptive Analysis:
As the name suggests, it describes: it explains what is happening based on incoming data.
2. Predictive Analysis:
As the name suggests, it predicts: it forecasts what might happen in the future based on data
trends and patterns.
3. Prescriptive Analysis:
Determines the best course of action based on data insights. It goes beyond prediction by
recommending actions to achieve desired outcomes.
e.g. Google’s self-driving cars (they analyse sensor data, traffic patterns and road conditions to
make real-time driving decisions; if an obstacle is detected, the system prescribes actions
like slowing down, changing lanes, or stopping to ensure safety).
4. Diagnostic Analysis:
Examines data to determine why something happened, for example by drilling down into past
data to find the cause of a sudden drop in sales.
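To make the four types concrete, here is a minimal Python sketch over a made-up list of daily sales figures (the numbers, the promotion flags, and the simple rules are all hypothetical, chosen only to illustrate the difference between the four questions):
```python
# Toy illustration of descriptive, diagnostic, predictive and prescriptive
# analysis on a made-up list of daily sales (hypothetical data).
from statistics import mean

daily_sales = [120, 130, 125, 90, 95, 140, 150]   # units sold per day
promo_days  = [False, False, False, False, False, True, True]

# Descriptive: what is happening?
print("Average daily sales:", mean(daily_sales))

# Diagnostic: why did it happen? Compare promotion vs non-promotion days.
promo = [s for s, p in zip(daily_sales, promo_days) if p]
no_promo = [s for s, p in zip(daily_sales, promo_days) if not p]
print("With promotion:", mean(promo), "Without:", mean(no_promo))

# Predictive: what might happen next? Naive linear trend over the week.
slope = (daily_sales[-1] - daily_sales[0]) / (len(daily_sales) - 1)
forecast = daily_sales[-1] + slope
print("Naive forecast for tomorrow:", round(forecast, 1))

# Prescriptive: what should we do? A simple rule on top of the forecast.
action = "run a promotion" if forecast < mean(daily_sales) else "keep current plan"
print("Recommended action:", action)
```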
Big Data Architecture:
A big data architecture is designed to handle the ingestion, processing and analysis of data that
is too large or complex for traditional database systems.
Ingestion: Used in data capture. It collects different types of data from different sources or
platforms and analyses the data, whether structured or unstructured, and where it comes from.
Data Storage: Used to store data at rest for batch analysis, whereas real-time message ingestion
is used to capture and buffer real-time (streaming) data.
Batch Processing: Stored data is handed over to batch processing and divided into batches. The
batch jobs pass the data to the analytical data store for analysis before forwarding it for
further processing or insights.
e.g. When a 50 MB video recording from a camera is uploaded as a WhatsApp status, it is
automatically compressed to 5-6 MB due to processing in the analytical data store. This
happens because an algorithm or compression technique is applied which reduces the file
size while maintaining acceptable quality.
Machine Learning: It processes both batch and streaming data. It analyses data in batches at
scheduled intervals and also processes streaming data for instant insights.
During streaming, if the internet speed drops or data runs out, the system automatically lowers
the video quality to ensure smooth playback. Afterwards, during a photo upload, if a photo fails
but shows as "processing", data analytics and reporting tools help track details like the device,
location and upload time.
Orchestration: Automates workflows (eliminating the need for manual intervention) and ensures
that tasks run in the correct sequence, with proper coordination and management.
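As a rough illustration of orchestration, the sketch below runs hypothetical pipeline stages (ingest, store, batch-process, report) in a fixed order and aborts if any stage fails; real orchestrators such as Apache Airflow or Oozie provide this at scale with scheduling and retries.
```python
# Minimal orchestration sketch: run pipeline stages in order, stop on failure.
# The stage functions are hypothetical placeholders, not a real pipeline.

def ingest():
    print("ingesting raw events...")
    return True

def store():
    print("writing events to storage...")
    return True

def batch_process():
    print("running batch aggregation...")
    return True

def report():
    print("publishing analytical results...")
    return True

PIPELINE = [("ingest", ingest), ("store", store),
            ("batch_process", batch_process), ("report", report)]

def orchestrate(stages):
    """Run each stage in sequence; abort the run if any stage reports failure."""
    for name, stage in stages:
        if not stage():
            print(f"stage '{name}' failed, aborting downstream stages")
            return False
    print("pipeline completed")
    return True

orchestrate(PIPELINE)
```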
1. Data Capture: It refers to the process of collecting data from a variety of sources. This
includes everything from social media posts to sensor readings.
2. Data Storage: It is the process of storing the data in a way that makes it accessible for
future analysis.
3. Data Processing: This is where algorithms are used to analyse the data and extract
insights.
4. Data Visualization: It is the process of representing the data in a way that is easy for
humans to understand.
e.g. flow charts, use-case diagrams, graphs or other charts are used for data visualization.
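A compact sketch of the four steps on made-up temperature readings (the file name readings.csv and the values are hypothetical; matplotlib is assumed to be available for the visualization step):
```python
# Sketch of the four lifecycle steps on made-up sensor readings.
import csv
import matplotlib.pyplot as plt

# 1. Data capture: pretend these values arrived from a temperature sensor.
readings = [21.5, 22.1, 23.0, 22.8, 24.2, 25.0]

# 2. Data storage: persist them so they can be analysed later.
with open("readings.csv", "w", newline="") as f:
    csv.writer(f).writerows([[i, r] for i, r in enumerate(readings)])

# 3. Data processing: extract a simple insight (here, a moving average).
window = 3
moving_avg = [sum(readings[i:i + window]) / window
              for i in range(len(readings) - window + 1)]

# 4. Data visualization: present the result as a chart.
plt.plot(readings, label="raw readings")
plt.plot(range(window - 1, len(readings)), moving_avg, label="moving average")
plt.xlabel("sample")
plt.ylabel("temperature (°C)")
plt.legend()
plt.show()
```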
Challenges of Big Data:
1. Quick Data Growth: The amount of data stored in companies’ data centers and databases is
increasing rapidly. As these datasets grow exponentially with time, they become extremely
difficult to handle.
2. Storage: Such large amounts of data are difficult for organizations to store and manage
without appropriate tools and technology.
3. Syncing across data sources: When an organization imports data from different data
sources, data from one source might not be up to date compared with data from another
source.
4. Security: Securing these huge datasets is one of the daunting challenges of Big Data.
Some big data stores can be attractive targets for hackers or advanced persistent threats.
5. Unreliable Data: Big data cannot be completely accurate and may contain some
redundant or incomplete data.
6. Miscellaneous Challenges: More challenges exist, such as generating insights in a timely
manner or recruiting and retaining big data professionals.
Data Stream Management System (DSMS):
It is a specialized system designed to process and manage continuous data streams in real-
time. Unlike traditional database management systems (DBMS) that store and process static
data, a DSMS continuously ingests, analyses, and queries dynamic data streams.
Key features: continuous, real-time processing of unbounded streams; standing (continuous)
queries alongside ad-hoc queries; bounded working storage with approximate results; archival
storage for historical analysis.
Components:
1. Data Stream: A continuous flow of data coming from sources like sensors, social media,
or transactions. It never stops and keeps updating in real-time.
2. Stream processor: The brain of the DSMS. It processes incoming data, applies filters,
aggregates information, and runs computations in real-time.
3. Standing queries: Queries that run continuously on streaming data, updating results as
new data arrives. Example: A query that always shows the average temperature from
sensors.
4. Adhoc queries: One-time queries that analyse the current data stream. Example: A user
asks, "What was the peak website traffic in the last hour?"
5. Archival storage: A place where old data is stored permanently for historical analysis and
backup. Example: A database keeping records of all financial transactions.
6. Limited working storage: A small temporary memory space used to process real-time
data, as storing everything is impossible. Example: Only keeping the last 10 minutes of
sensor readings to detect trends.
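A minimal sketch of how a standing query can run on top of limited working storage, assuming a made-up stream of temperature readings and keeping only the last five values in memory:
```python
# Standing query sketch: maintain the average of the last N sensor readings,
# keeping only a bounded window in memory (limited working storage).
from collections import deque

WINDOW = 5                      # only the last 5 readings are kept
working_storage = deque(maxlen=WINDOW)

def on_new_reading(value):
    """Called for every element of the stream; updates the standing query."""
    working_storage.append(value)
    avg = sum(working_storage) / len(working_storage)
    print(f"reading={value:>5.1f}  current average={avg:.2f}")

# Simulated data stream (hypothetical temperature readings).
stream = [20.1, 20.5, 21.0, 35.2, 21.3, 21.1, 20.9]
for reading in stream:
    on_new_reading(reading)
```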
Drivers of Big Data:
Big data is driven by several key factors that make it grow and become more important.
More Data Sources: Every day, people and machines create huge amounts of data
through social media, online shopping, smart devices, and sensors. The more sources we
have, the bigger the data gets.
Faster Internet & Technology: With better internet speeds and advanced technologies
like cloud computing, data can be collected, stored, and processed quickly.
Cheaper Storage: Storing large amounts of data used to be expensive, but now it's much
cheaper, allowing companies to keep and analyse more information.
Artificial Intelligence (AI) & Machine Learning: AI systems learn from big data,
improving their accuracy and making predictions, which in turn drives the need for even
more data.
The Internet of Things (IoT): Smart devices like fitness trackers, home assistants, and
self-driving cars are constantly generating data, adding to the big data explosion.
Data Stream Models:
A data stream model is a way to handle and process continuous, fast-flowing data in real
time. Unlike traditional databases, where data is stored and then analysed, data stream
models analyse data as it arrives.
1. Time-Based Model:
Processes data that arrives within a fixed time window (e.g., the readings from the last 10
minutes); see the sketch after this list.
2. Count-Based Model:
Divides data into fixed chunks and processes each batch separately.
5. Sketch-Based Model:
Uses approximations to handle large data streams efficiently.
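A small sketch contrasting the count-based and time-based models on made-up, timestamped events (the window sizes of 3 events and 6 seconds are arbitrary):
```python
# Sketch contrasting a count-based window (last N events) with a
# time-based window (events from the last T seconds). Events are made up.
from collections import deque

events = [  # (timestamp in seconds, value)
    (0, 10), (1, 12), (3, 11), (6, 15), (7, 14), (12, 20),
]

# Count-based model: always keep the last 3 events, regardless of time.
count_window = deque(maxlen=3)
for ts, value in events:
    count_window.append((ts, value))
print("count-based window (last 3 events):", list(count_window))

# Time-based model: keep only events from the last 6 seconds.
HORIZON = 6
latest_ts = events[-1][0]
time_window = [(ts, v) for ts, v in events if latest_ts - ts < HORIZON]
print("time-based window (last 6 seconds):", time_window)
```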
Streaming Methods:
Streaming methods are techniques used to process and analyse continuous data streams in
real-time. Instead of storing data first and then analysing it, these methods handle data as it
arrives.
1. Batch Processing:
3. Micro-Batch Processing:
o A mix of batch and real-time processing where small chunks of data are processed
frequently (see the sketch after this list).
4. Window-Based Processing:
Processes only the data that falls inside a fixed or sliding window, such as the last 5 minutes
of events.
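As referenced under micro-batch processing above, here is a minimal sketch in which events are buffered into small fixed-size batches and each batch is processed as soon as it fills up (the batch size of 4 and the event values are arbitrary):
```python
# Micro-batch sketch: events are collected into small fixed-size batches
# and each batch is processed as soon as it is full. Event values are made up.

BATCH_SIZE = 4
buffer = []

def process_batch(batch):
    """Stand-in for real batch logic: here we just report a sum."""
    print(f"processing {len(batch)} events, total={sum(batch)}")

def on_event(value):
    buffer.append(value)
    if len(buffer) >= BATCH_SIZE:
        process_batch(buffer)
        buffer.clear()

# Simulated event stream.
for event in [3, 7, 2, 8, 5, 1, 9, 4, 6]:
    on_event(event)

# Flush whatever is left when the stream (or interval) ends.
if buffer:
    process_batch(buffer)
```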
Data Synopsis:
Data synopsis is a technique used to create a small, summarized version of large data sets. It
helps in quickly analysing and processing data without storing or handling the full dataset.
This is especially useful in real-time data streams, where data is too large to store entirely.
1. Sampling:
2. Sketching:
o Example: Estimating the number of unique visitors on a website without storing all
IP addresses (a toy sketch of this idea appears after this list).
3. Histogram:
o Divides data into ranges and counts how many values fall into each range.
4. Wavelet Transform:
5. Sliding Windows:
Keeps and summarizes only the most recent portion of the stream, discarding older items as
new ones arrive.
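The sketching example above (counting unique visitors without storing every IP address) can be approximated with a tiny probabilistic counter. The code below is a toy Flajolet-Martin-style estimate based on trailing zero bits of hash values; the IP addresses are made up, and real systems use refined variants such as HyperLogLog:
```python
# Toy Flajolet-Martin-style sketch: estimate the number of distinct items
# in a stream by tracking the maximum number of trailing zero bits seen
# in their hash values. The IP addresses below are made up.
import hashlib

def trailing_zeros(n):
    if n == 0:
        return 32
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

max_zeros = 0
stream = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3",
          "10.0.0.4", "10.0.0.2", "10.0.0.5"]

for ip in stream:
    h = int(hashlib.md5(ip.encode()).hexdigest(), 16) & 0xFFFFFFFF
    max_zeros = max(max_zeros, trailing_zeros(h))

# The estimate is 2^R, where R is the largest run of trailing zeros observed.
print("estimated distinct visitors:", 2 ** max_zeros)
print("actual distinct visitors:  ", len(set(stream)))
```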
Summarization Techniques:
When dealing with large amounts of data, it’s not always possible to store or analyse
everything. These techniques help by reducing data while keeping the most important
information.
1. Sampling means selecting a small part of the data that represents the whole dataset.
Instead of analysing every piece of data, we work with a smaller, manageable sample.
🔹 Example:
Imagine a company receives 1 million customer reviews. Instead of analysing all of them, they
randomly pick 10,000 reviews to understand customer sentiment (a combined sketch of
sampling and filtering follows at the end of this section).
2. Filtering removes irrelevant or unnecessary data, keeping only what is important.
🔹 Example:
A weather monitoring system collects temperature, humidity, and wind speed data. If a
researcher is only interested in temperature, they filter out the other data.
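A combined sketch of both techniques, assuming hypothetical review strings and sensor records:
```python
# Sampling and filtering sketch on made-up records.
import random

# 1. Sampling: keep a small random subset of a large set of reviews.
reviews = [f"review #{i}" for i in range(1_000_000)]
sample = random.sample(reviews, k=10_000)       # 1% of the reviews
print("sampled reviews:", len(sample))

# 2. Filtering: keep only the field of interest from mixed sensor records.
records = [
    {"temperature": 21.5, "humidity": 60, "wind_speed": 12},
    {"temperature": 23.1, "humidity": 55, "wind_speed": 8},
    {"temperature": 19.8, "humidity": 70, "wind_speed": 15},
]
temperatures = [r["temperature"] for r in records]
print("temperatures only:", temperatures)
```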