IFETCE R – 2023
ACADEMIC YEAR 2025-2026
IFET COLLEGE OF ENGINEERING
DEPARTMENT OF AI&DS
23AD55401 – Big Data Engineering
MODULES
UNIT I INTRODUCTION
Introduction to Big Data- Characteristics of Big Data & Its Applications- Types of Analytics
(Descriptive, Diagnostic, Predictive, Prescriptive)- Analytics flow for Big Data -Big Data
Stack – Data Acquisition & Ingestion - Web Data & Its Applications - Business Intelligence -
Big Data Analytics vs. Web Data Analytics.
1.1 Introduction to Big Data:
Big data refers to extremely large and complex datasets that traditional data
processing systems struggle to handle efficiently. It's characterized by high volume, velocity,
and variety, making it challenging to store, manage, and analyze using conventional
methods. Essentially, it's the vast amount of data generated from various sources that requires
specialized tools and techniques for meaningful insights.
1.1.1 Importance of big data:
Improved decision-making:
Big data analytics can reveal hidden patterns and trends, enabling organizations to make more
informed and data-driven decisions.
Enhanced customer understanding:
By analyzing customer data, businesses can gain a deeper understanding of their needs,
preferences, and behaviors, leading to better products and services.
Increased efficiency and productivity:
Big data can identify bottlenecks and inefficiencies in processes, allowing for optimization
and improved resource allocation.
Innovation and new product development:
Big data insights can spark innovation and the development of new products and services
tailored to market demands.
1.1.2 Examples of Big Data in action:
Social media analysis:
Companies analyze social media posts to understand public sentiment, identify trends, and
tailor marketing campaigns.
Healthcare:
Hospitals use big data to analyze patient data for disease prediction, personalized treatment
plans, and improved patient care.
E-commerce:
Retailers analyze purchase history and browsing behavior to personalize recommendations
and offer targeted promotions.
Finance:
Banks and financial institutions use big data to detect fraud, manage risk, and improve
financial modeling.
Transportation:
Cities use big data to optimize traffic flow, manage public transportation, and improve
infrastructure planning.
1.2 Characteristics of Big Data & Its Applications:
1.2.1 Characteristics of Big Data:
Volume:
Refers to the massive scale of data being generated, often measured in terabytes, petabytes, or
even zettabytes.
Velocity:
Highlights the speed at which data is generated and processed, with real-time or near real-
time analysis becoming crucial.
Variety:
Emphasizes the diverse range of data formats, including structured (databases), unstructured
(social media posts, videos), and semi-structured (log files).
Veracity:
Focuses on the trustworthiness and quality of the data, ensuring accuracy and reliability.
Value:
Underlines the potential insights and business value that can be derived from analyzing big
data.
Figure 1.1 5V’s of Big Data
1.2.2 Applications of Big Data
Big data is a valuable and powerful tool for every sector today. It is used by almost all
organizations and has many use cases. Let us look at some applications of big data.
Travel and tourism: Big data helps in predicting requirements such as demand for travel
facilities and services. Businesses that use it have seen significant improvements.
Finance and Banking: This sector extensively uses big data to understand customer
behaviour through patterns and other trends.
Healthcare: There has been a revolution in the healthcare sector, thanks to big data.
Through predictive analytics, healthcare personnel are able to provide personalized
services to patients, thereby improving outcomes.
Telecommunication and multimedia: Given how much data is generated in this
sector daily, big data technologies are required to handle such huge data.
1.3 Types of Analytics
Analytics is used in almost every industry; many of the technological changes you see
every day are driven by it.
The main types of analytics are:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
Figure 1.2 Analytics types
1. Descriptive Analytics:
Purpose: Answers the question "What happened?" by summarizing historical
data.
Focus: Reveals patterns, trends, and key metrics from past events.
Techniques: Data aggregation, visualization (charts, graphs, dashboards),
summary statistics (mean, median, mode, etc.), segmentation, and filtering.
Examples: Generating reports on monthly sales figures, analyzing website
traffic trends over time, and tracking key performance indicators (KPIs) like
revenue or conversion rates.
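For illustration, the summary statistics named above can be sketched in Python using only the standard library. The monthly sales figures here are hypothetical, chosen just to show the technique:

```python
# Descriptive analytics sketch: summarizing hypothetical monthly sales
# figures with Python's standard-library statistics module.
import statistics

monthly_sales = [120, 135, 128, 150, 149, 160]  # hypothetical figures

summary = {
    "total": sum(monthly_sales),
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "min": min(monthly_sales),
    "max": max(monthly_sales),
}
print(summary)
```

A real descriptive-analytics report would compute the same aggregates over data pulled from a warehouse and feed them into a dashboard.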
2. Diagnostic Analytics:
Purpose: Answers the question "Why did it happen?" by delving into data to
understand the root causes of events or outcomes.
Focus: Investigating anomalies, patterns, and causal relationships within data.
Techniques: Data discovery, drill-down/drill-through reports, correlation
analysis, root cause analysis (e.g., "5 Whys"), anomaly detection, and
hypothesis testing.
Examples: Understanding why sales dropped in a specific region, identifying
factors leading to increased customer churn, and analyzing why a particular
marketing campaign failed to meet targets.
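As a small illustration of correlation analysis in diagnostics, the sketch below computes a Pearson correlation by hand to check whether a hypothetical price increase moved together with falling unit sales (the numbers are invented for the example):

```python
# Diagnostic analytics sketch: hand-rolled Pearson correlation used to
# probe a possible root cause (price vs. units sold). Data is hypothetical.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

prices = [10, 11, 12, 13, 14]   # hypothetical weekly price
units  = [100, 95, 88, 80, 75]  # hypothetical units sold

r = pearson(prices, units)
print(round(r, 3))  # a strong negative value flags price as a candidate cause
```

Correlation alone does not prove causation; in practice it narrows the hypotheses that techniques like the "5 Whys" then investigate.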
3. Predictive Analytics:
Purpose: Answers the question "What will happen?" by forecasting future
outcomes based on historical data and patterns.
Focus: Utilizing statistical techniques, machine learning, and data mining to
predict future events or behaviors.
Techniques: Regression analysis (linear, logistic), time-series forecasting,
classification models (decision trees, random forests), and neural networks.
Examples: Predicting future sales trends, forecasting demand for products or
services, assessing credit risk, and anticipating customer churn.
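For illustration, the simplest technique listed above, linear regression, can be sketched with a hand-rolled ordinary least-squares fit. The quarterly sales figures are hypothetical:

```python
# Predictive analytics sketch: ordinary least-squares trend line over
# hypothetical quarterly sales, then a one-step-ahead forecast.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

quarters = [1, 2, 3, 4]
sales    = [100, 110, 125, 135]  # hypothetical quarterly sales

slope, intercept = fit_line(quarters, sales)
forecast_q5 = slope * 5 + intercept  # extrapolate to the next quarter
print(forecast_q5)
```

Production systems would use a library such as scikit-learn and validate the model on held-out data, but the underlying idea is the same.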
4. Prescriptive Analytics:
Purpose: Answers the question "What should we do?" by recommending
specific actions to optimize outcomes based on predictions and business
objectives.
Focus: Generating actionable insights and guiding optimal decision-making by
considering multiple scenarios, constraints, and business rules.
Techniques: Optimization algorithms, simulation modeling (e.g., Monte
Carlo), decision analysis (decision trees, utility theory), and prescriptive
machine learning (reinforcement learning).
Examples: Optimizing supply chain routes to minimize costs and maximize
efficiency, recommending personalized marketing offers to specific customer
segments, and determining optimal pricing strategies.
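In its simplest form, prescriptive analytics is an optimization over candidate actions. The sketch below, under an assumed linear demand curve (purely hypothetical), picks the revenue-maximizing price by brute-force search:

```python
# Prescriptive analytics sketch: brute-force search for the revenue-
# maximizing price under a hypothetical linear demand model.
def demand(price):
    # Assumed demand curve: demand = 500 - 20 * price (hypothetical)
    return max(0, 500 - 20 * price)

candidate_prices = [p / 2 for p in range(2, 50)]  # 1.0 .. 24.5
best_price = max(candidate_prices, key=lambda p: p * demand(p))
print(best_price)
```

Real prescriptive systems replace the toy demand function with a fitted model and the brute-force loop with an optimization solver, but the structure (predict outcomes, then choose the best action) is the same.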
1.3.1 The relationship between the different types of analytics
Each type of analytics builds upon the previous, with descriptive analytics forming
the foundation.
Diagnostic analytics provides deeper insights into the causes of patterns observed
through descriptive analytics.
Predictive analytics utilizes insights from descriptive and diagnostic analytics to
forecast future outcomes.
Prescriptive analytics leverages all previous insights to recommend optimal actions
and guide decision-making.
1.4 Analytics Flow for Big Data
The analytics flow for big data refers to the process of collecting, storing, processing,
and analyzing large and complex data sets to gain insights and make better decisions. It
typically includes the following steps:
1. Data collection: Data is collected from various sources such as social media, IoT
devices, and sensors. The data can be structured, semi-structured, or unstructured and
may need to be cleaned and transformed before it can be analyzed.
2. Data storage: The data is stored in a centralized repository such as a data lake,
Hadoop Distributed File System (HDFS), or NoSQL database.
3. Data processing: The data is processed using technologies such as Hadoop
MapReduce, stream processing, and machine learning to extract insights and prepare
it for analysis.
4. Data analysis: The data is analyzed using tools such as SQL, data visualization, and
machine learning algorithms to gain insights and make better decisions.
5. Data governance: Data governance policies and procedures are put in place to ensure
data is accurate, complete, consistent and compliant with regulations.
6. Data security: Security measures such as data encryption, access controls, and
incident response are implemented to protect sensitive information and prevent
unauthorized access.
7. Data visualization: The data is transformed into interactive and easy-to-understand
visualizations using tools such as Tableau, QlikView and Power BI.
8. Decision-making: Insights from the data are used to make better decisions and take
action.
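The steps above can be sketched end to end in miniature. The example below uses hypothetical in-memory records in place of real sources and storage, but walks the same collect, clean, analyze, decide flow:

```python
# Minimal sketch of the collect -> clean -> analyze -> decide flow,
# using hypothetical in-memory records in place of real systems.
raw_records = [
    {"region": "north", "amount": "120"},
    {"region": "south", "amount": "80"},
    {"region": "north", "amount": None},  # dirty record
    {"region": "south", "amount": "95"},
]

# Collection + cleaning: drop incomplete rows, coerce types.
clean = [
    {"region": r["region"], "amount": float(r["amount"])}
    for r in raw_records if r["amount"] is not None
]

# Analysis: aggregate sales by region.
totals = {}
for r in clean:
    totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]

# Decision: flag the weaker region for follow-up.
weakest = min(totals, key=totals.get)
print(totals, weakest)
```

At big-data scale the list comprehension becomes a distributed job (e.g., on Spark) and the dictionary becomes a warehouse table, but the flow is unchanged.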
1.5 Big Data Stack
The big data stack refers to the combination of technologies and tools that are
used to collect, store, process, and analyze large and complex data sets.
It typically includes the following layers:
1. Data Ingestion: This layer is responsible for collecting data from various sources
such as social media, IoT devices, and sensors. Technologies such as Apache Kafka,
Apache NiFi, and Apache Storm are commonly used for data ingestion.
2. Data Storage: This layer is responsible for storing the data in a centralized repository
such as a data lake, Hadoop Distributed File System (HDFS), or NoSQL database.
Technologies such as Apache Hadoop, Apache Cassandra, and MongoDB are
commonly used for data storage.
3. Data Processing: This layer is responsible for processing the data using technologies
such as Hadoop MapReduce, Apache Spark, and Apache Storm. This layer also
includes machine learning libraries like Apache Mahout and MLlib.
4. Data Analysis: This layer is responsible for analyzing the data using tools such as
SQL, data visualization, and machine learning algorithms. Technologies such as
Apache Hive, Apache Pig, and Apache Impala are commonly used for SQL-based
analysis, while tools such as Tableau, QlikView, and Power BI are commonly used for
data visualization.
5. Data Governance: This layer is responsible for data governance policies and
procedures to ensure data is accurate, complete, consistent and compliant with
regulations. Technologies like Apache Atlas, Apache Ranger, and Apache Sentry are
commonly used for data governance.
6. Data Visualization: This layer is responsible for transforming the data into
interactive and easy-to-understand visualizations using tools such as Tableau,
QlikView, and Power BI.
7. Data Security: This layer is responsible for implementing security measures such as
data encryption, access controls, and incident response to protect sensitive
information and prevent unauthorized access. Technologies such as Apache Ranger,
Apache Knox, and Apache Sentry are commonly used for data security.
8. Operationalization: This layer is responsible for deploying and managing the big
data stack in production environments. Technologies such as Apache Ambari, Apache
ZooKeeper, and Apache Mesos are commonly used for operationalization.
1.6 Data Acquisition & Ingestion
1.6.1 Data Acquisition:
Data acquisition is the process of collecting raw data from its source. This
could involve anything from sensors and databases to social media feeds and APIs.
Examples:
Collecting website clickstream data.
Reading data from IoT sensors.
Extracting data from social media platforms.
Key aspects:
Identifying relevant data sources.
Choosing appropriate methods for capturing data (e.g., APIs, web scraping,
direct sensor readings).
Understanding the format and structure of the incoming data.
Implementation of data acquisition (BDE, from the German Betriebsdatenerfassung, i.e., operational data collection):
BDE terminals: These are specialized devices installed at workstations or directly at
production machines, allowing employees to input operational data and access
relevant information in real-time.
Barcodes: Used to identify and track production orders, materials, or products swiftly
and accurately, minimizing data entry errors.
Integration with other systems: Modern BDE systems are often integrated
with MES (Manufacturing Execution Systems), ERP (Enterprise Resource
Planning) systems, and control station software to ensure seamless data exchange and
comprehensive monitoring of production processes.
Benefits of data acquisition:
Increased transparency: Providing a clear overview of production processes and
real-time reporting of malfunctions.
Resource optimization: Enhancing the utilization of machines and personnel.
Improved cost accounting: Providing accurate data for better cost calculation and
control.
Enhanced production planning and control: Offering valuable feedback and
enabling data-driven optimization of processes.
Better customer satisfaction: Through more accurate estimations of delivery times
and improved product quality.
Challenges and limitations
Initial setup costs: Implementing BDE requires investments in infrastructure,
software, and training.
Employee training: Ensuring effective use of the system requires proper training
programs and materials.
Data quality issues: Duplicate records, inaccurate details, and formatting errors can
affect the validity of insights.
Integration with existing systems: Adapting existing processes and ensuring smooth
integration with legacy systems can be complex.
1.6.2 Data Ingestion:
Data ingestion is the process of moving and preparing acquired data for storage
and analysis within a system.
Key aspects:
Data Transformation: Converting data into a usable format (e.g., cleaning,
standardizing, enriching).
Data Validation: Ensuring data quality and consistency.
Data Loading: Moving the transformed and validated data into a storage
system (e.g., data lake, data warehouse).
Different approaches: Data ingestion can be done in batches or in real-time,
depending on the application.
Examples:
Loading transaction data into a data warehouse.
Processing sensor data in real-time for anomaly detection.
Moving data from a legacy system to a cloud-based data warehouse.
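The transform, validate, load steps above can be sketched in a few lines. The example below uses hypothetical sensor readings and an in-memory list standing in for a warehouse; the field names and validation thresholds are assumptions for illustration:

```python
# Data ingestion sketch: transform, validate, and load hypothetical
# sensor readings into an in-memory "store" standing in for a warehouse.
store = []

def ingest(record):
    # Transform: standardize field names and units (Fahrenheit -> Celsius).
    celsius = (record["temp_f"] - 32) * 5 / 9
    row = {"sensor": record["id"], "temp_c": round(celsius, 2)}
    # Validate: reject physically implausible readings.
    if not -50 <= row["temp_c"] <= 60:
        return False
    # Load: append to the target store.
    store.append(row)
    return True

ingest({"id": "s1", "temp_f": 72.5})   # accepted
ingest({"id": "s2", "temp_f": 900.0})  # rejected by validation
print(store)
```

In a real pipeline the same three stages run inside a framework such as Apache NiFi or Spark, and the load step writes to HDFS, a data lake, or a warehouse.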
Importance of data ingestion:
Better data quality: Organizations may detect and fix mistakes in their data by
merging information from several sources.
Improved decision-making: Businesses may make more informed decisions by seeing
patterns and trends in their data that would go unnoticed without a single, cohesive
picture of it.
Automated business processes: Organizations can save time and money by
automating business processes through the integration of data from many sources.
1.6.3 Types of Data Ingestion
Different data ingestion types, including real-time, batch, and a combination of the
two (micro-batching), have been designed to suit different IT infrastructures and
business needs. The main techniques for data ingestion are:
1. Real-Time Data Ingestion
Figure 1.3 Real-Time Data Ingestion
Real-time data ingestion is the process of collecting and sending data from source
systems as it is generated, using solutions like Change Data Capture (CDC). It is one of
the most popular forms of ingestion, particularly for streaming services. CDC continually
monitors transactions and redo logs and transports the changed data, all without
impeding normal database activity. Real-time ingestion is essential for time-sensitive use
cases where organizations must respond quickly to fresh data, such as stock market
trading or power grid monitoring. Real-time pipelines are also required to surface new
insights and support fast operational decisions: data is extracted, processed, and stored
as soon as it is created to enable prompt decision-making.
2. Batch-Based Data Ingestion
Batch-based data ingestion is the practice of gathering and sending data in batches at
regular intervals. For repetitive procedures, batch ingestion has the advantage of
transporting data at regularly scheduled times. The ingestion layer can gather data
according to trigger events, simple schedules, or any other logical ordering. Batch-based
ingestion is advantageous when an organization needs to gather particular data points
on a daily basis, or simply does not need data for real-time decision-making.
Figure 1.4 Batch-Based data ingestion
3. Micro-Batching
Micro-batching is a data ingestion technique that falls between real-time and batch-
based approaches. It involves collecting and processing data in small, predefined batches at
regular intervals, typically ranging from milliseconds to seconds. This approach combines the
advantages of both real-time and batch processing while addressing some of their
limitations.
In micro-batching, data is collected continuously, but instead of processing individual events
instantaneously, they are grouped into small batches before processing. This allows for more
efficient resource utilization compared to processing each event in real-time. At the same
time, it offers lower latency compared to traditional batch processing, as the processing
intervals are much shorter.
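A minimal sketch of the micro-batching idea: events are buffered and flushed as a small batch once a size threshold is reached. A real system (e.g., Spark Structured Streaming) would also flush on a timer; this toy version flushes on size only:

```python
# Micro-batching sketch: buffer incoming events and process them in
# small batches instead of one at a time.
class MicroBatcher:
    def __init__(self, batch_size, process):
        self.batch_size = batch_size
        self.process = process  # callback invoked once per batch
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.process(self.buffer)
            self.buffer = []  # start a fresh buffer for the next batch

batches = []
mb = MicroBatcher(batch_size=3, process=batches.append)
for event in range(7):
    mb.add(event)
mb.flush()  # drain the final partial batch
print(batches)
```

Note how seven individual events become three processing calls; this is exactly the resource-utilization benefit described above, at the cost of a small added latency per event.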
1.7 Web Data & Its Applications
1.7.1 Web data:
Web data is a valuable resource for big data engineering due to its volume, variety,
and velocity. It encompasses various sources like social media, news websites, and online
forums, providing insights into market trends, consumer behavior, and emerging
technologies. Big data engineering techniques are then used to collect, store, process, and
analyze this web data to extract meaningful information for various applications.
Data Collection: Web scraping, APIs, and web crawlers are used to gather data from
diverse online sources.
Data Storage: NoSQL databases (like MongoDB and Cassandra) and cloud storage
solutions (like AWS, Azure, and Google Cloud) are used to handle the massive
volume and variety of web data.
Data Processing: Frameworks like Hadoop and Spark are employed for distributed
processing of web data, enabling efficient analysis of large datasets.
Data Analysis: Techniques like data mining, machine learning, and statistical analysis
are applied to web data to identify patterns, trends, and insights.
1.7.2 Applications of Web data
Personalized Recommendations: E-commerce sites use browsing history and
purchase data to suggest products.
Sentiment Analysis: Social media data is analyzed to gauge public opinion about
products, brands, or events.
Market Research: Web data provides insights into market trends, consumer
preferences, and competitor analysis.
Fraud Detection: Financial institutions analyze transaction data and user behavior to
identify fraudulent activities.
Targeted Advertising: Analyzing web data allows for more effective and targeted
advertising campaigns.
Demand Forecasting: Web data can help predict future demand for products and
services.
Content Recommendation: Platforms like Netflix use user data to recommend
content.
Network Optimization: Telecom companies analyze network traffic data to improve
network performance and customer experience.
Risk Management: Insurance companies use web data to assess risk and personalize
insurance products.
1.7.3 Big Data Technologies:
Hadoop: A framework for distributed storage and processing of large datasets.
Spark: An open-source cluster-computing framework known for its speed and
efficiency in processing large datasets.
Kafka: A distributed streaming platform for building real-time data pipelines.
NoSQL databases: Databases like MongoDB and Cassandra are used for storing
unstructured and semi-structured web data.
Cloud platforms (AWS, Azure, Google Cloud): Provide scalable infrastructure for
big data storage and processing.
Python, Java, Scala: Programming languages commonly used in big data
engineering.
1.8 Business Intelligence in Big Data Engineering
Figure 1.5 Business Intelligence in big data
Business intelligence (BI) in the context of big data engineering
involves leveraging big data technologies to enhance and extend traditional BI capabilities. It
bridges the gap between raw, massive datasets and actionable insights for business decision-
making.
Here's a breakdown:
1.8.1 Traditional Business Intelligence:
Focuses on structured data and established methods like OLAP and data mining.
Uses tools for reporting, querying, and data visualization to provide insights.
Often relies on data warehouses and data marts.
Primarily deals with understanding past and present performance.
1.8.2 Big Data's Role:
Big data analytics extends BI by incorporating advanced techniques like machine
learning and predictive analytics.
It handles large volumes of structured, semi-structured, and unstructured data.
Big data technologies (Hadoop, Spark, NoSQL) enable processing and analysis of this
diverse data.
Aims to discover hidden patterns, trends, and correlations, often leading to predictive
insights.
1.8.3 Integration of BI and Big Data:
Enhanced Insights:
Big data analytics provides deeper insights by uncovering patterns and trends not readily
apparent in traditional BI.
Predictive Power:
Big data enables predictive analytics, allowing businesses to forecast future trends and make
proactive decisions.
Improved Decision Making:
By combining traditional BI with big data insights, businesses can make more informed and
data-driven decisions.
Data Engineering Foundation:
Data engineering plays a crucial role in building the infrastructure (data pipelines, data lakes,
data warehouses) necessary to support both BI and big data analytics.
Centralized Data:
Data engineering helps create centralized data warehouses for analytics, consolidating data
from various sources.
1.8.4 Examples of Applications:
Retail: Analyzing sales data, customer behavior, and inventory to optimize pricing
and promotions.
Finance: Detecting fraud, assessing risk, and improving investment strategies.
Healthcare: Identifying disease patterns, personalizing treatment, and improving
patient outcomes.
Marketing: Targeted advertising, customer segmentation, and campaign
optimization.
1.8.5 Key Components:
Data Warehousing: A centralized repository for storing and managing large amounts
of data.
Data Pipelines: Automated systems for moving and transforming data.
Data lakes: Repositories for storing raw, unstructured data.
BI tools: Software for reporting, querying, and visualizing data (e.g., Tableau, Power
BI).
Data Mining: Algorithms for extracting patterns and knowledge from data.
Machine Learning: Algorithms for predictive modeling and pattern recognition.
1.9 Big Data Analytics vs Web Data Analytics
Big Data Analytics and Web Data Analytics, while related, represent distinct areas
within the field of data analysis, particularly when considering their application in big data
engineering:
1.9.1 Big Data Analytics:
Scope:
Deals with massive, diverse datasets (structured, semi-structured, unstructured) that exceed
the capabilities of traditional data processing tools. This includes data from various sources
like IoT devices, enterprise systems, social media, and more.
Focus:
Extracting valuable insights, patterns, and trends from these large and complex datasets to
support strategic decision-making across various industries and business functions.
Techniques:
Employs advanced analytical techniques such as machine learning, data mining, statistical
modeling, and predictive analytics.
Tools:
Utilizes distributed computing frameworks like Hadoop and Spark, NoSQL databases, and
specialized big data analytics platforms.
Goal:
To uncover hidden opportunities, optimize operations, improve efficiency, and drive
innovation at an enterprise-wide level.
Big Data Analytics Life Cycle:
The Big Data Analytics life cycle is divided into nine phases:
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and filtration
4. Data Extraction
5. Data Munging (Validation and Cleaning)
6. Data Aggregation & Representation (Storage)
7. Exploratory Data Analysis
8. Data Visualization (Preparation for Modeling and Assessment)
9. Utilization of analysis results.
Figure 1.6 Life cycle of Big Data Analytics
1.9.2 Web Data Analytics:
Scope:
Specifically focuses on data generated from websites and web applications. This includes
user behavior data, traffic patterns, conversion rates, clickstream data, and more.
Focus:
Understanding user interactions, optimizing website performance, improving user experience,
and enhancing online marketing and sales strategies.
Techniques:
Primarily uses methods like web traffic analysis, A/B testing, funnel analysis, and user
segmentation.
Tools:
Relies on web analytics platforms (e.g., Google Analytics), A/B testing tools, and content
management system (CMS) insights.
Goal:
To improve website usability, increase engagement, drive conversions, and achieve specific
online business objectives.
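One of the techniques named above, funnel analysis, reduces to computing step-to-step conversion rates. The sketch below uses hypothetical counts for a visit, add-to-cart, purchase funnel:

```python
# Funnel analysis sketch: step-to-step conversion rates for a
# hypothetical visit -> cart -> purchase funnel.
funnel = [("visit", 1000), ("add_to_cart", 300), ("purchase", 90)]

rates = []
for (prev_step, prev_n), (step, n) in zip(funnel, funnel[1:]):
    rates.append((f"{prev_step} -> {step}", n / prev_n))
print(rates)
```

A web analytics platform computes the same ratios over clickstream events; the step with the lowest rate is the drop-off point to optimize first.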
1.9.3 Key Aspects of Web Data Analytics in Big Data Engineering:
Data Collection:
This involves gathering data from various web sources, including website traffic, user
interactions, and other online activities.
Data Storage and Processing:
Big data technologies like Hadoop and Spark are used to store and process massive amounts
of web data efficiently.
Data Analysis:
Methods like descriptive, diagnostic, predictive, and prescriptive analytics are applied to
uncover patterns, trends, and correlations in the data.
Actionable Insights:
The goal is to translate the analyzed data into actionable insights that can be used to improve
website design, content, marketing campaigns, and overall user experience.
Tools and Technologies:
Various tools like Google Analytics, Adobe Analytics, and Amplitude are used for web
analytics, often integrated with big data platforms.
1.9.4 Benefits of Web Data Analytics in Big Data Engineering:
Improved Website Performance:
By analyzing user behavior, websites can be optimized for better navigation, faster loading
times, and improved user experience.
Enhanced Marketing Strategies:
Web data analytics helps in understanding customer preferences and targeting campaigns
more effectively.
Increased Conversions and Sales:
By identifying areas where users drop off or abandon tasks, websites can be optimized to
improve conversion rates and drive more sales.
Better Decision Making:
Data-driven insights enable businesses to make informed decisions about website design,
content, and marketing strategies.
Examples:
BT Group:
Uses Amazon Managed Service for Apache Flink to gain real-time insights into call patterns
on their network, enabling faster issue resolution.
Flutter Entertainment:
Leverages Amazon Redshift to scale their data infrastructure and maintain a consistent user
experience as their data volume grows.