Data Analytics
Unit - II | Data Analysis
By - Er. Monu Kumar
B.Tech(CSE), M.Tech(CSE), NET JRF, Ph.D(CSE)*
Introduction to Data Analysis
Introduction to Data Analysis
● Definition: Data analysis involves cleaning, transforming, modeling,
and visualizing data to uncover valuable insights and support
decision-making in engineering applications.
● Importance:
○ Optimizing processes and designs.
○ Predicting system behavior.
○ Identifying potential failures and anomalies.
○ Making data-driven decisions.
Introduction to Data Analysis
● Key Areas:
○ Regression Modeling
○ Multivariate Analysis
○ Bayesian Modeling & Networks
○ Support Vector & Kernel Methods
○ Time Series Analysis
○ Rule Induction
○ Neural Networks
○ Fuzzy Logic
○ Stochastic Search Methods
Introduction to Data Analysis
Introduction to Data Analysis
A simple infographic illustrating the data analysis process:
Step 1: Define why you need data analysis.
Step 2: Begin collecting data from sources.
Step 3: Clean through unnecessary data.
Step 4: Begin analyzing the data.
Step 5: Interpret results and apply them.
Regression Modeling
Regression Modeling
Regression analysis is a statistical method used to examine the relationship
between one or more independent variables and a dependent variable. It
quantifies how changes in independent variables influence the dependent variable,
allowing for prediction and insight into data relationships.
● Simple Linear Regression: Models the relationship between a single
independent variable and a dependent variable using a straight line.
● Multiple Linear Regression: Considers the influence of multiple independent
variables on the dependent variable.
● Polynomial Regression: Captures curvilinear relationships when a straight
line is insufficient.
● Logistic Regression: Predicts the probability of a categorical outcome (e.g.,
yes/no).
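A minimal Python sketch, assuming scikit-learn and a tiny made-up dataset, of fitting a simple linear regression and a logistic regression (variable names and values are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data: hours studied (x), exam score (continuous), pass/fail (categorical).
x = np.array([[1], [2], [3], [4], [5], [6]])
y_score = np.array([52, 55, 61, 64, 70, 74])
y_pass = np.array([0, 0, 0, 1, 1, 1])

# Simple linear regression: fits a straight line y = b0 + b1 * x.
lin = LinearRegression().fit(x, y_score)
print("intercept:", lin.intercept_, "slope:", lin.coef_[0])
print("predicted score for 7 hours:", lin.predict([[7]])[0])

# Logistic regression: predicts the probability of the categorical outcome.
log = LogisticRegression().fit(x, y_pass)
print("P(pass | 7 hours):", log.predict_proba([[7]])[0, 1])
```

Multiple and polynomial regression follow the same pattern, with extra predictor columns or polynomial features added to x.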
Regression Modeling: Types
Regression Modeling
Mathematical Process (Simple Linear Regression)
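A standard sketch of that process: the model assumes a straight-line relationship plus random error, and the coefficients are estimated by least squares.

```latex
\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n
\]
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
```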
Regression Modeling
Regression Modeling
Diagram – Regression Line
Regression Modeling
Key Points
● Regression helps in prediction, forecasting, and understanding
relationships.
● Accuracy depends on assumptions: linearity, independence,
normality, homoscedasticity.
● Extended forms include polynomial regression, ridge/lasso
regression, and logistic regression for classification.
Multivariate Analysis
Multivariate Analysis (MVA)
Multivariate Analysis is a collection of statistical techniques used to
examine data that arises from more than one dependent variable or
multiple independent variables simultaneously.
Unlike univariate (single variable) or bivariate (two variables) analysis,
multivariate analysis helps understand relationships, patterns, and
dependencies across multiple dimensions at once.
Example: A researcher studies how income, education, and age
(independent variables) together affect spending behavior and savings
(dependent variables).
Multivariate Analysis (MVA)
Objectives
1. Identify relationships among multiple variables.
2. Reduce data dimensionality (Principal Component Analysis).
3. Classify & group data (Cluster Analysis, Discriminant Analysis).
4. Predict outcomes based on multiple predictors (Multivariate
Regression).
5. Find hidden structures in large datasets.
Multivariate Analysis (MVA)
Types of Multivariate Analysis
1. Multivariate Regression Analysis – Predicts multiple dependent variables
from multiple independent variables; in matrix form, Y = XB + E.
2. Principal Component Analysis (PCA) – Reduces data dimensions while
retaining variance.
3. Factor Analysis – Identifies hidden (latent) variables influencing observed
data.
4. Discriminant Analysis – Classifies data into categories.
5. Cluster Analysis – Groups similar data points together.
6. MANOVA (Multivariate Analysis of Variance) – Extension of ANOVA for
multiple dependent variables.
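As a concrete illustration of dimensionality reduction, a minimal sketch assuming scikit-learn and a small synthetic dataset (the data and variable count are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic multivariate data: 200 observations of 5 correlated variables,
# generated from 2 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

# PCA projects the data onto the directions of maximum variance.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("shape before:", X.shape, "after:", scores.shape)
```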
Multivariate Analysis (MVA)
Mathematical Process (Generalized View)
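In the generalized view, with Y the n × q matrix of dependent variables, X the n × p matrix of predictors, B the p × q coefficient matrix, and E the error matrix, the multivariate linear model and its least-squares estimate are:

```latex
\[
\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E},
\qquad
\hat{\mathbf{B}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Y}
\]
```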
Multivariate Analysis (MVA)
Multivariate Analysis (MVA)
Key Points
● Multivariate analysis deals with datasets having multiple
dimensions.
● It helps in prediction, classification, and pattern recognition.
● Widely applied in finance, marketing, biology, medicine, and AI/ML.
● Techniques like PCA, clustering, MANOVA are part of this field.
Bayes' Theorem
Bayes' Theorem
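In its standard form, for a hypothesis H and observed data (evidence) D:

```latex
\[
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
\]
```

Here P(H) is the prior, P(D | H) the likelihood, P(D) the evidence, and P(H | D) the posterior.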
Bayesian Modeling
Bayesian Modeling
Definition
Bayesian modeling in data analytics is a statistical approach where
uncertainty in data is represented using probability distributions, and
Bayes’ theorem is used to update beliefs about parameters or
hypotheses as new data becomes available.
Unlike traditional (frequentist) methods that provide point estimates,
Bayesian modeling gives a distribution of possible outcomes (posterior
distribution), allowing analysts to incorporate prior knowledge and
quantify uncertainty more effectively.
Bayesian Modeling
Why Use Bayesian Modeling in Data Analytics?
1. Handles uncertainty naturally (through probability distributions).
2. Updates knowledge continuously as new data arrives.
3. Incorporates prior knowledge (domain expertise, historical data).
4. Useful in prediction, decision-making, classification, anomaly
detection, risk assessment.
Bayesian Modeling
Mathematical Framework
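A sketch of the framework for parameters θ and data D: the posterior combines the prior with the likelihood, and the evidence acts as a normalizing constant.

```latex
\[
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
\;\propto\; p(D \mid \theta)\, p(\theta),
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta
\]
```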
Bayesian Modeling
Example in Data Analytics
Customer behavior analysis:
● Prior belief: Customers usually buy product A with probability 0.3.
● Collect new data: Out of 100 customers, 40 buy product A.
● Bayesian model updates the prior with the data → new posterior belief:
probability of purchase ≈ 0.38, with a credible interval quantifying the uncertainty.
This helps businesses predict sales, personalize recommendations,
and manage risks with quantified uncertainty.
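A minimal sketch of this update in Python, assuming SciPy and assuming the prior belief is encoded as a Beta(3, 7) distribution (mean 0.3; the prior strength is a made-up choice). With that prior the posterior mean comes out near the ≈ 0.38 figure above:

```python
from scipy import stats

# Assumed prior: Beta(3, 7), whose mean is 3 / (3 + 7) = 0.3.
prior_a, prior_b = 3, 7

# New data: 40 of 100 customers buy product A.
buys, n = 40, 100

# Conjugate Beta-Binomial update: posterior is Beta(a + successes, b + failures).
posterior = stats.beta(prior_a + buys, prior_b + (n - buys))

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```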
Bayesian Modeling: Bayesian Updating in Data Analytics
Bayesian Modeling
Key Takeaways
● Bayesian modeling is a powerful tool in modern data analytics.
● It allows continuous learning as new data arrives.
● It provides probabilistic insights instead of just point estimates.
● Applications: fraud detection, healthcare analytics, customer
segmentation, demand forecasting, and AI/ML models.
Inference Problem
Inference Problem
Support Vector and Kernel
Methods
Support Vector and Kernel Methods
Definition
● Support Vector Machines (SVMs): Supervised learning models that
classify data by finding the optimal hyperplane that maximizes the
margin between different classes.
● Kernel Methods: Mathematical techniques that implicitly map data
into a higher-dimensional feature space, enabling SVMs to handle
non-linear relationships.
Support Vector and Kernel Methods
Key concepts
● Hyperplane: A decision boundary separating different classes in feature
space.
● Support Vectors: Data points closest to the hyperplane, defining the
margin.
● Kernel Function: A function that computes the inner product between
data points in the transformed feature space. Examples include linear,
polynomial, and Radial Basis Function (RBF) kernels.
● Kernel Trick: The ability of kernel methods to work in higher-dimensional
spaces without explicitly computing the transformations.
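A minimal sketch, assuming scikit-learn and a synthetic two-class dataset that is not linearly separable, of an SVM with an RBF kernel:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with a curved decision boundary.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps points into a higher-dimensional space (kernel trick).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("support vectors per class:", clf.n_support_)
print("test accuracy:", clf.score(X_test, y_test))
```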
Support Vector and Kernel Methods
Example: Classification in flotation processes
● Scenario: Classifying froth images from a sulfur flotation process.
● Approach: SVMs, combined with kernel methods (e.g., RBF kernel and
multiple-kernel functions), can be used to classify froth images into
different appearance categories based on textural features extracted
from the images.
● Application: Accurate classification of froth images, aiding in process
monitoring and optimization.
Analysis of Time Series
Analysis of Time Series
Definition: Time series analysis involves studying data points collected
over time to identify trends, seasonality, cycles, and random fluctuations.
Key aspects
● Linear Systems Analysis: Utilizes linear models to understand and
predict time series behavior, assuming linear relationships between
variables.
● Nonlinear Dynamics: Explores complex and non-linear patterns in
time series data, employing models like Threshold Autoregressive
(TAR) and Autoregressive Conditional Heteroskedasticity (ARCH)
models.
Analysis of Time Series
Components of Time Series
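A time series is typically decomposed into trend (T), seasonal (S), cyclical (C), and irregular (I) components, combined either additively or multiplicatively:

```latex
\[
Y_t = T_t + S_t + C_t + I_t
\qquad \text{or} \qquad
Y_t = T_t \times S_t \times C_t \times I_t
\]
```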
Analysis of Time Series
Example: Financial forecasting
● Scenario: Predicting stock prices, interest rates, or economic indicators.
● Approach: Linear time series models like ARMA (Autoregressive Moving
Average) or ARIMA (Autoregressive Integrated Moving Average) can be
used for forecasting when relationships are relatively stable. Nonlinear
models like ARCH or GARCH may be more suitable for modeling
volatility and capturing non-linear patterns in financial time series.
● Application: Informing investment decisions, risk management
strategies, and economic planning.
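A minimal sketch, assuming statsmodels and a synthetic price-like series; the ARIMA order (1, 1, 1) is an illustrative choice, not a recommendation:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic "price" series: a random walk with drift, observed daily.
rng = np.random.default_rng(0)
prices = pd.Series(100 + np.cumsum(0.1 + rng.normal(scale=1.0, size=250)),
                   index=pd.date_range("2024-01-01", periods=250, freq="D"))

# Fit ARIMA(p, d, q); here (1, 1, 1) is used purely for illustration.
fitted = ARIMA(prices, order=(1, 1, 1)).fit()

# Forecast the next 10 observations.
print(fitted.forecast(steps=10))
```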
Rule Induction
Rule Induction
Definition: Rule induction is a machine learning technique that extracts
classification rules from data, typically in the form of "IF-THEN" statements.
Key concepts
● Sequential Covering: A greedy algorithm that iteratively discovers rules
covering the positive instances of a class, removes the covered instances
from the training data, and repeats the process until all positive
instances are covered.
● Learn-One-Rule: A sub-routine within sequential covering that grows
individual rules by adding conjuncts (conditions) until the rule achieves a
desired accuracy, measured by the ratio of correctly covered positive
instances to all covered instances.
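A toy Python sketch of sequential covering with a greedy learn-one-rule step, on a tiny made-up sensor dataset (attribute names, values, and labels are illustrative only):

```python
# Toy dataset: each record is a dict of attribute values plus a True/False class label.
data = [
    ({"temp": "high", "vibration": "high"}, True),
    ({"temp": "high", "vibration": "low"},  True),
    ({"temp": "low",  "vibration": "high"}, False),
    ({"temp": "low",  "vibration": "low"},  False),
    ({"temp": "high", "vibration": "high"}, True),
    ({"temp": "low",  "vibration": "low"},  False),
]

def covers(rule, record):
    # A rule is a dict of attribute = value conditions; all must hold.
    return all(record.get(attr) == val for attr, val in rule.items())

def precision(rule, examples):
    # Fraction of covered instances that are positive (the rule's accuracy measure).
    covered = [label for rec, label in examples if covers(rule, rec)]
    return sum(covered) / len(covered) if covered else 0.0

def learn_one_rule(examples):
    # Grow one rule by greedily adding the condition that most improves precision.
    rule = {}
    candidates = {(a, v) for rec, _ in examples for a, v in rec.items()}
    improved = True
    while improved and precision(rule, examples) < 1.0:
        improved = False
        for attr, val in candidates:
            trial = {**rule, attr: val}
            if precision(trial, examples) > precision(rule, examples):
                rule, improved = trial, True
    return rule

def sequential_covering(examples):
    # Learn a rule, remove the positives it covers, and repeat until none remain.
    rules, remaining = [], list(examples)
    while any(label for _, label in remaining):
        rule = learn_one_rule(remaining)
        if not rule:
            break  # no useful condition found
        rules.append(rule)
        remaining = [(rec, label) for rec, label in remaining
                     if not (covers(rule, rec) and label)]
    return rules

print(sequential_covering(data))  # e.g. [{'temp': 'high'}] -> IF temp = high THEN failure
```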
Rule Induction
Example: Predictive maintenance
● Scenario: Predicting machine failures in a factory based on sensor
readings.
● Approach: Rule induction algorithms can analyze historical sensor
data leading to failures and non-failures, generating rules like: "IF
cylinder temperature > 852 °C for 3 consecutive hours THEN machine
failure likely within 24 hours".
● Application: Early alerts of imminent machine breakdowns, enabling
preventative maintenance and reducing downtime.
Rule Induction
Applications
● Expert systems
● Credit scoring
● Medical decision systems.
Neural Networks
Neural Networks
Definition
Neural networks (also Artificial Neural Networks or ANNs) are
computational models inspired by the structure and function of biological
neural networks, capable of learning from data and modeling complex
relationships.
Neural Networks
Key aspects
● Learning and Generalization: The process of adjusting network
weights (parameters) based on training data to minimize errors and
enable accurate predictions on unseen data (generalization).
● Competitive Learning: An unsupervised learning paradigm where
network nodes compete to respond to input data, with the winning
node adapting its weights.
● Principal Component Analysis (PCA) and Neural Networks: PCA can
be used as a preprocessing step to reduce dimensionality and
simplify the input to neural networks, potentially improving training
efficiency and generalization performance.
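A minimal sketch, assuming scikit-learn and its built-in digits dataset, of PCA used as a preprocessing step in front of a small feed-forward network:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Handwritten-digit images: 8 x 8 pixels = 64 input features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA reduces the 64 pixel features to 20 components before the network sees them.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```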
Neural Networks
Example: Image recognition
● Scenario: Classifying images (e.g., identifying objects or faces).
● Approach: Convolutional Neural Networks (CNNs) are a type of ANN
particularly well-suited for image processing, learning features like
edges and shapes through multiple layers of processing, and making
classifications in the output layer.
● Application: Facial recognition systems, medical image object
segmentation, and more.
Neural Networks
Applications
● Speech recognition
● NLP
● Computer vision.
Fuzzy Logic
Fuzzy Logic
Definition
Fuzzy logic is a form of logic that allows propositions to be represented with
degrees of truthfulness and falsehood, enabling reasoning with imprecise
and ambiguous information.
Key aspects
● Fuzzy Models from Data: Extracting fuzzy rules and membership
functions from data to represent relationships and build fuzzy models.
● Fuzzy Decision Trees: A generalization of traditional decision trees that
handle attributes with numeric-symbolic values, offering a more
expressive and understandable representation of induced knowledge.
Fuzzy Logic
Example: Control systems
● Scenario: Designing a control system for a process where precise
mathematical models are difficult to obtain.
● Approach: Fuzzy logic can be used to emulate human reasoning and
control actions, encapsulating expertise in terms of linguistic
descriptions and fuzzy inference rules.
● Application: Linear and non-linear control in various engineering
domains, including chemical engineering, robotics, and vehicular
technology.
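A toy sketch of such reasoning in plain Python (no fuzzy-logic library): triangular membership functions for a temperature input and a weighted-average (Sugeno-style) combination of two linguistic rules. All membership ranges and rule outputs are made-up illustrations.

```python
def triangular(x, a, b, c):
    # Membership rises linearly from 0 at a to 1 at b, then falls back to 0 at c.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fan_speed(temperature_c):
    # Degrees of membership in the linguistic sets "warm" and "hot".
    warm = triangular(temperature_c, 15, 25, 35)
    hot = triangular(temperature_c, 25, 40, 55)

    # Rules: IF temperature is warm THEN fan speed = 40 %
    #        IF temperature is hot  THEN fan speed = 90 %
    weights, outputs = [warm, hot], [40.0, 90.0]
    if sum(weights) == 0:
        return 0.0  # neither rule fires
    return sum(w * o for w, o in zip(weights, outputs)) / sum(weights)

for t in (20, 30, 45):
    print(f"{t} C -> fan speed {fan_speed(t):.1f} %")
```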
Fuzzy Logic
Applications
● Control systems (AC, washing machines)
● Decision-making.
Stochastic Search Methods
Stochastic Search Methods
Definition
Stochastic search methods, also known as stochastic optimization,
employ randomness to explore the design space and find optimal
solutions for complex optimization problems.
Key concepts
● Exploration vs. Exploitation: Balancing the search for novel solutions
(exploration) with refining promising existing solutions (exploitation).
● Heuristics: Rules or guidelines, often inspired by natural processes,
that help guide the search process towards optimal solutions.
Stochastic Search Methods
Example: Optimizing hyperparameters in machine learning models
● Scenario: Finding the optimal combination of hyperparameters (e.g.,
kernel function and regularization term C) for a Support Vector
Machine model.
● Approach: Stochastic search methods like Particle Swarm
Optimization (PSO) or Genetic Algorithms (GA) can be used to explore
the hyperparameter space efficiently, identifying the best combination
for the given dataset.
● Application: Improved model performance and generalization ability
of machine learning models.
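A minimal sketch of this idea using plain random search (a simple stochastic search standing in for PSO or GA, which would need extra libraries), assuming scikit-learn and a synthetic dataset:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Stochastic exploration of the SVM hyperparameter space: candidate kernels,
# with C and gamma drawn at random from log-uniform distributions.
param_distributions = {
    "kernel": ["linear", "rbf", "poly"],
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=30, cv=5, random_state=0)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```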
Stochastic Search Methods
Methods include:
● Genetic Algorithms
● Simulated Annealing
● Particle Swarm Optimization (PSO)
Stochastic Search Methods
Applications
● Feature selection
● Scheduling
● AI game playing.
Stochastic Search Methods
Random search path exploring multiple valleys before finding the global
optimum.