Concept Learning
Core Difference
Statistical reasoning asks "What does the data tell us?", while probabilistic reasoning asks "How likely is this event?"
1. Simple probability. P(A). The probability that an event (say, A) will occur.
2. Joint probability. P(A and B). P(A ∩ B). The probability of events A and B occurring together.
3. Conditional probability. P(A|B), read "the probability of A given B." The probability that event
A will occur given event B has occurred.
4. The probability distribution for a discrete random variable is described with a probability
mass function (PMF).
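The first three notions above can be sketched numerically. This is a toy illustration with made-up trial data; the dictionary layout and variable names are mine, not from the notes:

```python
# Toy trials (hypothetical data) recording whether events A and B occurred.
trials = [
    {"A": True,  "B": True},
    {"A": True,  "B": False},
    {"A": False, "B": True},
    {"A": False, "B": True},
]

n = len(trials)
p_a = sum(t["A"] for t in trials) / n                   # simple probability P(A)
p_a_and_b = sum(t["A"] and t["B"] for t in trials) / n  # joint probability P(A and B)
p_b = sum(t["B"] for t in trials) / n
p_a_given_b = p_a_and_b / p_b                           # conditional P(A|B) = P(A and B) / P(B)
```

Here P(A) = 2/4, P(A and B) = 1/4, and P(A|B) = (1/4) / (3/4) = 1/3.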
Discrete vs Continuous
A discrete variable takes a finite or countable set of values; a continuous variable can take any value within a range.
Machine learning is similar to data mining, but data mining is the science of discovering unknown
patterns and relationships in data, whereas machine learning applies previously inferred knowledge to
new data to make decisions in real-life applications.
Computers approximate complex functions from historical data.
Rules are not explicitly programmed but are learned from data.
Learning—A Two-Step Process
1. Model construction:
A training set is used to create the model.
The model is represented as classification rules, decision trees, or mathematical formulas
2. Model usage:
The test set is used to estimate how well the model classifies future or unknown objects
Importance of Machine Learning
Some tasks cannot be defined well, except by examples (e.g., recognizing people).
Relationships and correlations can be hidden within large amounts of data; Machine Learning/Data
Mining may be able to find these relationships.
Human designers often produce machines that do not work as well as desired in the environments in
which they are used.
The amount of knowledge available about certain tasks might be too large for explicit encoding by
humans (e.g., medical diagnosis).
Environments change over time.
New knowledge about tasks is constantly being discovered by humans. It may be difficult to
continuously redesign systems "by hand".
Areas of influence for Machine Learning
Statistics: How best to use samples drawn from unknown probability distributions to help decide
from which distribution some new sample is drawn?
Brain Models: Non-linear elements with weighted inputs (Artificial Neural Networks) have been
suggested as simple models of biological neurons.
Adaptive Control Theory: How to deal with controlling a process having unknown parameters that
must be estimated during operation?
Psychology: How to model human performance on various learning tasks?
Artificial Intelligence: How to write algorithms to acquire the knowledge humans are able to
acquire, at least as well as humans?
Evolutionary Models: How to model certain aspects of biological evolution to improve the
performance of computer programs?
The Machine Learning framework consists of problem definition, data collection, preprocessing,
feature engineering, model selection, training, evaluation, and deployment. It ensures systematic
development of predictive models from raw data to final prediction.
A set of attributes used to describe a given object is called an attribute vector (or feature vector).
The distribution of data involving one attribute (or variable) is called univariate.
The type of an attribute is determined by the set of possible values the attribute can have. Attributes
can be nominal, binary, ordinal, or numeric.
A dataset is a complete collection of related data used for analysis or machine learning. A data object
is a single instance or record within that dataset.
Attribute types in ML
Nominal: categories, states, or "names of things"
E.g., Hair_color = {black, brown, blond, red, grey, white}
Binary: nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important, e.g., gender
Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative)
Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is
not known. E.g., Size = {small, medium, large}, grades
Numeric: quantity (integer or real-valued), measured on a scale of equal-sized units; values have order
Interval-scaled: no true zero-point. E.g., temperature in °C or °F, calendar dates
Ratio-scaled: inherent zero-point. E.g., temperature in Kelvin, length, counts
Discrete attribute: has only a finite or countably infinite set of values. E.g., zip codes,
profession. Sometimes represented as integer variables.
Note: binary attributes are a special case of discrete attributes.
Continuous attribute: has real numbers as attribute values. E.g., temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables (float, double, long
double).
Measures of central tendency
1. Mean
Since the mean is sensitive to extreme values (outliers), we can use the median as a more robust
measure of central tendency. Alternatively, we may remove outliers using statistical methods or use
a trimmed mean.
2. Median
3. Mode
4. Midrange
Measures of data dispersion
1. Standard deviation
2. Range
3. Five-number summary: min, Q1, median, Q3, max
Box-plot whisker bounds: Q1 − 1.5×IQR and Q3 + 1.5×IQR
4. IQR = Q3 − Q1
Outlier: usually, a value falling at least 1.5×IQR above the third quartile or below the first
quartile
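As a sketch, the five-number summary and the 1.5×IQR outlier rule can be computed directly. The helper below is mine (not from the notes) and uses simple linear-interpolation quantiles, so other quartile conventions will give slightly different values:

```python
# Five-number summary with linear-interpolation quantiles (one of several
# common conventions), plus the 1.5*IQR outlier rule.
def five_number_summary(values):
    xs = sorted(values)
    n = len(xs)
    def quantile(p):
        idx = p * (n - 1)
        lo, hi = int(idx), min(int(idx) + 1, n - 1)
        frac = idx - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac
    return xs[0], quantile(0.25), quantile(0.5), quantile(0.75), xs[-1]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # 100 is an obvious outlier
mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

For this sample, Q1 = 3.25, Q3 = 7.75, IQR = 4.5, and only 100 falls outside the fences.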
Basic properties of the standard deviation
Measures spread about the mean and should be used only when the mean is chosen as the measure
of the center
σ = 0 only when there is no spread, that is, when all observations have the same value; otherwise
σ > 0
Proximity refers to similarity or dissimilarity between objects
Measures of Data Quality
Accuracy: How well does a piece of information reflect reality? [Correct/wrong]
Completeness: Does it fulfill your expectations of what’s comprehensive? [recorded/not]
Consistency: Does information stored in one place match relevant data stored elsewhere?
Timeliness: Is your information available when you need it?
Validity: Is information in a specific format, does it follow business rules?
Uniqueness: Is this the only instance in which this information appears in the dataset?
Why Data Preprocessing?
Data in the real world is dirty
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
Noisy: containing errors or outliers that deviate from the expected
Inconsistent: containing discrepancies in codes or names; lack of compatibility (e.g., some attributes
representing a given concept may have different names in different databases)
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data
A multi-dimensional measure of data quality:
A well-accepted multi-dimensional view: accuracy, completeness, consistency,
timeliness, believability, value added, interpretability, accessibility
Broad categories: intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
Data cleaning fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration is the integration of multiple databases, data cubes, or files
Data transformation
Normalization (scaling to a specific range)
Aggregation.
Data comparison
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical
results
Data discretization: with particular importance, especially for numerical data
Data aggregation, dimensionality reduction, data compression, generalization
Important for Big Data Analysis
Data discretization (for numerical data)
It refers to transforming continuous data into discrete interval values
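One common discretization scheme is equal-width binning. The sketch below is a minimal illustration (the function name and example values are mine):

```python
# Equal-width discretization: split [min, max] into k intervals of equal
# width and map each continuous value to its interval index.
def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        # clamp the maximum value into the last bin
        b = min(int((v - lo) / width), k - 1)
        labels.append(b)
    return labels

temps = [10.0, 12.5, 20.0, 35.0, 39.9]
bins = equal_width_bins(temps, 3)   # 3 intervals of width ~9.97
```

Values 10.0 and 12.5 land in bin 0, 20.0 in bin 1, and 35.0 and 39.9 in bin 2.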
Data preparation, cleaning, and transformation comprise the majority (about 90%) of the work in a
data mining application.
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
How to Handle Missing Data
Ignore the tuple: usually done when the class label is missing (assuming the task is classification);
not effective in certain cases
Fill in the missing value manually: tedious and often infeasible
Use a global constant to fill in the missing value: e.g., "unknown", a new class?! Simple, but not
recommended, as this constant may form an interesting-looking pattern and mislead the decision
process
Use the attribute mean: for all samples belonging to the same class, fill in the missing value with
the mean value of the attribute
Use the most probable value: fill in the missing value by predicting it from the correlation of the
available values
Except for the first two approaches, the filled-in values may be incorrect; the last two methods are
the most common
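The class-conditional mean approach above can be sketched in a few lines. The records and helper name here are hypothetical, purely for illustration:

```python
# Class-conditional mean imputation: fill a missing attribute with the mean
# of that attribute among samples of the same class.
records = [
    {"age": 30,   "cls": "A"},
    {"age": 40,   "cls": "A"},
    {"age": None, "cls": "A"},   # missing value
    {"age": 20,   "cls": "B"},
]

def impute_class_mean(rows, attr, label):
    groups = {}
    for r in rows:
        if r[attr] is not None:
            groups.setdefault(r[label], []).append(r[attr])
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    for r in rows:
        if r[attr] is None:
            r[attr] = means[r[label]]
    return rows

impute_class_mean(records, "age", "cls")
```

The missing class-A age is filled with (30 + 40) / 2 = 35.0 rather than the global mean.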
Data imbalance (class imbalance) happens when the number of samples in one class is much larger
than in another class in a dataset.
For example:
Fraud detection: 99% non-fraud, 1% fraud
Medical diagnosis: 95% healthy, 5% disease
Why is Data Imbalance a Problem?
1. The model becomes biased toward the majority class
2. Poor recall for minority class
3. Misleading accuracy score
4. Important events (fraud, disease, defects) get missed
How to solve data imbalance problem
1. Data-Level Methods (Resampling)
✅ A. Oversampling (Increase minority class)
Random oversampling
SMOTE (Synthetic Minority Over-sampling Technique) – generates synthetic samples
ADASYN
✔ Good when dataset is small
❌ Risk of overfitting (if simple duplication)
✅ B. Undersampling (Reduce majority class)
Random undersampling
Tomek links
NearMiss
✔ Faster training
❌ May lose useful information
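The resampling idea can be sketched with seeded random oversampling by duplication; the function and toy data below are mine, not from any particular library:

```python
# Random oversampling: duplicate randomly chosen minority-class samples
# until every class has as many samples as the largest class.
import random

def random_oversample(samples, label_of, seed=0):
    rng = random.Random(seed)           # seeded for reproducibility
    by_class = {}
    for s in samples:
        by_class.setdefault(label_of(s), []).append(s)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for cls, items in by_class.items():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced

data = [("x1", 0), ("x2", 0), ("x3", 0), ("x4", 1)]   # 3:1 imbalance
balanced = random_oversample(data, label_of=lambda s: s[1])
```

The minority class is duplicated until both classes have 3 samples; this is exactly the setting where simple duplication risks overfitting, which is why SMOTE generates synthetic samples instead.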
2. Algorithm-Level Methods
✅ A. Use Class Weights
Most ML models support weighting:
Logistic Regression
Decision Trees
Random Forest
Neural Networks
This penalizes mistakes on minority class more.
✅ B. Use Specialized Algorithms
Balanced Random Forest
XGBoost with scale_pos_weight
LightGBM with is_unbalance=True
3. Evaluation Strategy Fix
Never rely on accuracy alone.
Use:
Precision
Recall
F1-score
ROC-AUC
PR-AUC (better for heavy imbalance)
Confusion matrix is very important.
When to Use What?
Situation → Best Method
Small dataset → SMOTE
Huge dataset → Undersampling
Deep learning → Class weights
Severe imbalance (1:1000) → Combine oversampling + class weighting
Fraud/medical tasks → Focus on Recall + PR-AUC
Best Practice (Recommended)
1. Start with class weights
2. Try SMOTE
3. Evaluate using F1 / PR-AUC
4. Tune decision threshold (very important!)
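Step 1 (class weights) can be sketched with the "balanced" heuristic weight_c = n_samples / (n_classes × n_c), the same formula scikit-learn uses for class_weight="balanced"; the helper name is mine:

```python
# "Balanced" class weights: rarer classes get proportionally larger weights,
# so mistakes on the minority class are penalized more during training.
def balanced_class_weights(labels):
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * nc) for c, nc in counts.items()}

weights = balanced_class_weights([0] * 90 + [1] * 10)   # 90:10 imbalance
```

The minority class gets weight 100 / (2 × 10) = 5.0, versus 100 / (2 × 90) ≈ 0.56 for the majority class.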
Feature Selection is a process that chooses an optimal subset of features according to a certain
criterion.
Why we need Feature Selection (FS)?
To improve performance (in terms of speed, predictive power, simplicity of the model).
To visualize the data for model selection.
To reduce dimensionality and remove noise.
Supervised learning is the prevalent method for constructing predictors from data: learning a
function that maps an input to an output based on example input-output pairs.
Supervised learning algorithms are broadly categorized into classification and regression.
Classification problem: the target variable is categorical.
Classification algorithms are used to predict/classify discrete values such as male or female, true
or false, spam or not spam, etc.
Regression problem: the target variable is continuous.
Regression algorithms are used to predict continuous values such as price, salary, age, etc.
For example, you can estimate the selling price of a house in a given city.
Classification is a two-step process: a model construction (learning) phase and a model usage
(applying) phase.
In ML, single classification assigns each instance to one of several mutually exclusive classes, while
multi-label classification allows an instance to belong to multiple classes simultaneously
Approaches to solve the multi-label classification problem
1. Problem transformation: transforms a multi-label classification problem into multiple
single-label classification problems, which can be called adapting the data to the algorithm.
Binary Relevance (BR): multi-label into independent single-label (binary) problems.
Advantages
Computationally efficient
Disadvantages
Does not capture the dependence relations among the class variables
Label Power-set (LP): multi-label into a single multi-class problem; consider each label set as a class.
Advantages
Learns the full joint of the class variables and each of the new class values maps to a label
combination
Disadvantages
The number of classes in the transformed problem can be exponential (|Y_LP| = O(2^d)), and learning
a multi-class classifier over exponentially many choices is expensive
Classifier Chains (CC): resolves the BR limitation by modeling label correlations along a chain.
Limitation: the result can vary for different chain orders. Solution: use an ensemble of chains.
2. Adapted algorithm: performs multi-label classification directly, rather than transforming the
problem into different subsets of the problem.
3. Ensemble approaches: learning multiple classifier systems; trains multiple hypotheses to solve
the same problem.
Note: multi-label problems can essentially be broken down into sets of binary problems without much
loss of information.
Regression methods
Regression analysis is the process of estimating a functional relationship between X and Y.
A regression equation is often used to predict a value of Y for a given value of X.
Another way to study the relationship between two variables is correlation. It involves measuring the
direction and the strength of the linear relationship
Logistic regression estimates the probability of an event occurring, such as voting or not voting,
based on a given data set of independent variables.
It is a supervised ML algorithm widely used for binary classification tasks, such as identifying
whether an email is spam or not.
Binary logistic regression is for a dichotomous criterion (i.e., 2-level variable)
The other type of logistic regression is multinomial logistic regression, which is for a
multi-categorical criterion (i.e., a variable with more than 2 levels).
KNN
The k-Nearest Neighbors (KNN) family of classification and regression algorithms is often
referred to as memory-based learning or instance-based learning.
Sometimes, it is also called lazy learning.
It is a non-parametric, supervised learning classifier, which uses proximity to make classifications
or predictions about the grouping of individual data
K-NN does not build a model from the training data.
The nearest-neighbor assumption is that the properties of any particular input X are likely to be
similar to those of points in the neighborhood of X.
How to choose the value of k for the KNN Algorithm?
The value of k in KNN decides how many neighbors the algorithm looks at when making a prediction.
Choosing the right k is important for good results.
If the data has lots of noise or outliers, using a larger k can make the predictions more stable.
But if k is too large, the model may become too simple and miss important patterns, and this is
underfitting.
So, k should be picked carefully based on the data.
Three points are required to deal with KNN
The set of stored records
Distance Metric to compute the distance between records
The value of k, the number of nearest neighbors to retrieve
The most commonly used distance function is the Euclidean distance:
D(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
K-nearest neighbors algorithm steps
Step 1. Determine the parameter k, the number of nearest neighbors
Step 2. Calculate the distance between the query instance and all the training samples
Step 3. Sort the distances and determine the nearest neighbors based on the k-th minimum distance
Step 4. Gather the category Y of the nearest neighbors
Step 5. Use the simple majority of the categories of the nearest neighbors as the prediction value
of the query instance
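The steps above can be sketched as a tiny k-NN classifier; the training points and function name here are made up for illustration:

```python
# Minimal k-NN classifier: compute Euclidean distances to all training
# samples, take the k nearest, and vote by simple majority.
import math
from collections import Counter

def knn_predict(train, query, k):
    # train: list of (feature_vector, label) pairs
    dists = sorted((math.dist(x, query), y) for x, y in train)
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
label = knn_predict(train, query=(1.5, 1.5), k=3)
```

With k = 3, the two nearest "A" points outvote the single "B" point, so the query is labeled "A". Note that no model is built from the training data, which is exactly why k-NN is called lazy learning.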
Decision Tree for classification
Decision tree structure
Root node: beginning of a tree and represents entire population being analyzed.
It has no incoming edges and zero or more outgoing edges
Internal node: denotes a test on an attribute
Each of which has exactly one incoming edge and two or more outgoing edges
Branch: represents an outcome of the test
Leaf nodes: represent class labels or class distribution
Each of which has exactly one incoming and no outgoing edges.
The algorithm for the decision tree involves three general phases as stated below:
Phase I: Find Splitting Criteria based on all the sample set at the splitting point (node) (attribute
selection measure)
Phase II: Split the sample data based on the splitting criteria to form branches; each successor node
receives its subset of the samples
Phase III: Repeat Phases I and II iteratively until the stopping criterion is fulfilled
Attribute selection measures
Information theory
Information Gain (ID3 algorithm)
Gain Ratio (C4.5 algorithm)
Gini Index (CART algorithm)
Information theory provides a mathematical basis for measuring the information content.
Information Gain (ID3): all attributes are assumed to be categorical
Stopping Criteria
Some conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning
There are no samples left in the branching direction
Some user-defined stopping criterion met, such as
Depth threshold
The number of samples in the node becomes below some value
Strengths
In practice, one of the most popular methods. Why?
Very comprehensible: the tree structure specifies the entire decision structure
Easy for decision makers to understand the model's rationale
Relatively easy to implement
Very fast to run (to classify examples) with large data sets
Weaknesses
Splits are simple and not necessarily optimal
Overfitting (may not have the same performance when deployed; complex and overly specific
conditions)
Underfitting (may over-generalize during training; too few conditions)
Overfitting means that the model performs poorly on new examples (e.g., testing examples) because it
is too highly tuned to the specific (non-general) nuances of the training examples
Underfitting means that the model performs poorly on new examples because it is too simplistic to
distinguish between them (i.e., it has not picked up the important patterns from the training
examples)
Bayesian Classification
P(h|X) = P(X|h) P(h) / P(X)
where h is the hypothesis and X is the observed data
Compared to Decision tree, Bayesian classifiers have also exhibited high accuracy and speed when
applied to large databases
The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis.
This greatly reduces the computation cost: only the class distribution needs to be counted.
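Bayes' rule can be worked through with made-up numbers (the priors and likelihoods below are purely illustrative):

```python
# Worked Bayes-rule sketch: P(h|X) = P(X|h) P(h) / P(X),
# where P(X) = sum over h of P(X|h) P(h) for the competing hypotheses.
priors = {"spam": 0.3, "ham": 0.7}          # P(h)
likelihoods = {"spam": 0.8, "ham": 0.1}     # P(X|h) for an observed feature X

evidence = sum(likelihoods[h] * priors[h] for h in priors)           # P(X)
posterior = {h: likelihoods[h] * priors[h] / evidence for h in priors}
best = max(posterior, key=posterior.get)    # maximum a posteriori hypothesis
```

Here P(X) = 0.8 × 0.3 + 0.1 × 0.7 = 0.31, so P(spam|X) ≈ 0.77 and "spam" is the maximum a posteriori class despite its smaller prior.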
SVM
Support Vector Machine (SVM) is a relatively new class of successful supervised learning methods
for classification and regression
SVMs can learn linear and non-linear functions, and they have an efficient training algorithm.
Maximizing the margin is the clever way SVMs prevent overfitting.
The largest number of data points a model can shatter is called the Vapnik-Chervonenkis (VC)
dimension.
The model does not need to shatter all sets of data points of size h; one set is sufficient.
SVMs are linear classifiers that find a hyperplane to separate two classes of data.
In 2D space the decision boundary is a line; in 3D space it is a plane; in higher dimensions it is a
hyperplane.
Our aim is to find a hyperplane f(x) = sign(w·x + b) that correctly classifies our data.
A good separation is achieved by the hyperplane that has the largest distance to the neighboring data
points of both classes.
The vectors (points) that constrain the width of the margin are the support vectors.
Misclassification may happen
Non-Separable Data:
Slack Variables
The slack variable measures how far a data point is from the margin or on the incorrect side of the
hyperplane.
The kernel trick is a technique in ML that allows you to implicitly map data into a higher-
dimensional space without explicitly performing the transformation.
This method is particularly useful in scenarios where the transformation to a higher dimensional space
is computationally expensive or even impossible to perform directly.
A kernel function is some function that corresponds to an inner product in some expanded feature
space.
The SVM kernel is a function that takes a low-dimensional input space and transforms it into a
higher-dimensional space, i.e., it converts a non-separable problem into a separable problem.
A mapping Φ sends data into a different space to enable linear separation.
Kernel functions convert linear learning machines (such as linear SVM) into non-linear ones (such as
kernel SVM) by computing an inner product between mapped data points.
Kernel functions are very powerful. They allow SVM models to perform separations even with very
complex boundaries.
Why use kernels?
Make non-separable problem separable.
Map data into better representational space
Kernels help find a hyperplane in the higher-dimensional space without increasing the
computational cost
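A concrete kernel makes this tangible: the RBF (Gaussian) kernel K(x, z) = exp(−γ‖x − z‖²) corresponds to an inner product in an infinite-dimensional feature space that is never computed explicitly. The function below is a minimal sketch (parameter value chosen arbitrarily):

```python
# RBF (Gaussian) kernel: an inner product in an implicit high-dimensional
# feature space, evaluated directly in the original space.
import math

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel((1.0, 2.0), (1.0, 2.0))   # identical points: similarity 1.0
k_far = rbf_kernel((0.0, 0.0), (3.0, 4.0))    # distant points: similarity near 0
```

The kernel value acts as a similarity measure, which is exactly what the SVM optimization needs; the mapping Φ itself never appears in the computation.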
Strengths
Training is relatively easy
No local optima, unlike in neural networks
It scales relatively well to high dimensional data
Tradeoff between classifier complexity and error can be controlled explicitly
Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
Weaknesses
Need to choose a “good” kernel function.
ANN
Artificial Neural Networks (ANNs) are Neural Networks (NN) inspired by biological neural networks.
A neural network is composed of several processing elements called neurons that are connected,
and their output depends on the strength of the connections.
Learning involves adapting weight factors that represent connection strength.
In general
Neural networks (NN) or artificial neural networks (ANN)
The resulting model from neural computing is called an artificial neural network (ANN) or neural
network
NN represents a brain metaphor for information processing
Computer technology that attempts to build computers that will operate like a human brain. The
machines possess simultaneous memory storage and work with ambiguous information
• Each input has an associated weight w, which can be modified to model synaptic learning.
Properties of ANN
Learning from examples: labeled or unlabeled
Adaptivity: changing the connection strengths to learn things
Non-linearity: the non-linear activation functions are essential
Fault tolerance: if one of the neurons or connections is damaged, the whole network still works
quite well
Thus, they might be better alternatives than classical solutions for problems characterized by: high
dimensionality, noisy, imprecise or imperfect data; and a lack of a clearly stated mathematical solution
or algorithm
ANNs model
Neural Network learns by adjusting the weights so as to be able to correctly classify the training
data and hence, after testing phase, to classify unknown data.
Neural Network needs long time for training.
Neural Network has a high tolerance to noisy and incomplete data
Bias can be incorporated as another weight clamped to a fixed input of +1.0
This extra free variable (bias) makes the neuron more powerful
All data must be normalized (i.e., all attribute values in the database are rescaled to lie in the
interval [0, 1] or [−1, 1]); neural networks work with data in the range (0, 1) or (−1, 1).
Two basic data normalization techniques
1. Max-Min normalization
Rescales data into a fixed range using minimum and maximum values
2. Decimal Scaling normalization
Normalizes data by dividing by a power of 10 so that values fall within (-1,1).
Comparison Table
Feature | Min-Max | Decimal Scaling
Formula | (x − min) / (max − min) | x / 10^j
Output range | Usually [0, 1] | (−1, 1)
Needs min/max? | Yes | No
Sensitive to outliers? | Yes | Yes
Common in ML? | Very common | Rare
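Both techniques can be sketched in a few lines; the helper names and sample values are mine:

```python
# Min-max normalization: rescale values linearly into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Decimal scaling: divide by 10^j, with j the smallest power of ten that
# brings every absolute value below 1.
def decimal_scaling_normalize(values):
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

raw = [200, 300, 400, 600, 1000]
scaled_mm = min_max_normalize(raw)
scaled_ds = decimal_scaling_normalize(raw)
```

Min-max maps 200..1000 onto [0, 1], while decimal scaling divides everything by 10^4, giving values in (−1, 1) without needing the min and max.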
One of the most popular neural network models is the multi-layer perceptron (MLP).
In an MLP, neurons are arranged in layers: one input layer, one output layer, and one or more
hidden layers.
Performance evaluation
How to obtain a reliable estimate of performance?
The performance of a model may depend on other factors besides the learning algorithm:
Class distribution
Cost of misclassification
Size of training and test sets
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance among competing models?
Absolute and Mean Square Error
These refer to the error committed when classifying an object into the desired class: the difference
between the desired value and the predicted value
For all these measures, smaller values usually indicate a better fitting model.
Accuracy
Accuracy is a measure of how well a binary classifier correctly identifies or excludes a condition.
It’s the proportion of correct predictions among the total number of cases examined.
It assumes equal cost for all classes
Misleading in unbalanced datasets
It doesn’t differentiate between different types of errors
Example
Cancer dataset: 10,000 instances; 9,990 are normal, 10 are ill. If our model classifies all
instances as normal, accuracy will be 99.9%.
Medical diagnosis: 95% healthy, 5% disease.
E-commerce: 99% don't buy, 1% buy.
Security: 99.999% of citizens are not terrorists.
Limitation of accuracy
Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
Accuracy is misleading because the model does not detect any Class 1 example
Binary classification Confusion Matrix
The notion of a confusion matrix can be usefully extended to the multiclass case:
the (i, j) cell indicates how many of the i-labeled examples were predicted to be j
• There are four quadrants in the confusion matrix, which are symbolized as below.
• True Positive (TP) : The number of instances that were positive (+) and correctly classified as
positive (+).
• False Negative (FN): The number of instances that were positive (+) and incorrectly classified as
negative (-).
• False Positive (FP): The number of instances that were negative (-) and incorrectly classified as (+).
• True Negative (TN): The number of instances that were negative (-) and correctly classified as (-).
Given a dataset of P positive instances and N negative instances:
TN rate = TN / (TN + FP)
FN rate = FN / (FN + TP)
Error rate = 1 − success rate
Accuracy = success rate, and loss = error rate
Precision (how precise we are in the detection)
Of all patients where we classified y = 1, what fraction actually has the disease?
Recall (how good we are at detecting), also called Sensitivity in some fields, measures the
proportion of actual positives that are correctly identified as such (e.g., the percentage of sick
people who are identified as having the condition).
Of all patients that actually have the disease, what fraction did we correctly detect as having the
disease?
Sensitivity: measures the classifier's ability to detect positive classes (its positivity)
2 Important Definitions
From the confusion matrix:
                | Predicted Positive | Predicted Negative
Actual Positive | TP                 | FN
Actual Negative | FP                 | TN
✅ True Positive Rate (TPR)
Also called Recall or Sensitivity
TPR = TP / (TP + FN)
✅ False Positive Rate (FPR)
FPR = FP / (FP + TN)
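The metrics above can be computed directly from the four confusion-matrix counts. This is a sketch with made-up counts (helper name mine):

```python
# Standard classification metrics from confusion-matrix counts.
def classification_metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also TPR / sensitivity
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fpr": fpr,
            "accuracy": accuracy, "f1": f1}

m = classification_metrics(tp=8, fn=2, fp=4, tn=86)
```

For these counts, recall is 8/10 = 0.8, precision is 8/12 ≈ 0.67, and accuracy is 0.94, illustrating how a high accuracy can coexist with mediocre precision on the positive class.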
3 What Does ROC Curve Show?
X-axis → FPR
Y-axis → TPR
Each point = different decision threshold
Threshold Effect
If threshold is very low:
Almost everything predicted positive
TPR high
FPR high
If threshold is very high:
Almost everything predicted negative
TPR low
FPR low
ROC shows tradeoff between TPR and FPR.
✅ 4 What is AUC?
AUC = Area Under the ROC Curve
It measures overall model performance.
0≤AUC≤1
5 Geometric Meaning of AUC
AUC = probability that:
A randomly chosen positive sample
is ranked higher than
a randomly chosen negative sample.
✅ 6 How to Calculate AUC (Simple Example)
Suppose ROC points are:
FPR TPR
0 0
0.2 0.6
0.5 0.8
1 1
We compute area using trapezoidal rule.
Example, first trapezoid:
Area = (0.2 − 0) × (0 + 0.6) / 2 = 0.06
Add all segments; the total area is the AUC.
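The trapezoidal-rule computation above, applied to all three segments, looks like this as a short sketch:

```python
# AUC via the trapezoidal rule over (FPR, TPR) points sorted by FPR.
def auc_trapezoid(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

roc_points = [(0, 0), (0.2, 0.6), (0.5, 0.8), (1, 1)]
auc = auc_trapezoid(roc_points)   # 0.06 + 0.21 + 0.45 = 0.72
```

The three trapezoids contribute 0.06, 0.21, and 0.45, so the AUC for this ROC curve is 0.72.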
✅ 7 Why Use AUC?
✔ Evaluates ranking quality
✔ Independent of classification threshold
✔ Good for imbalanced datasets
✔ Measures discrimination ability
❌ Limitations
✖ Only for binary classification (basic ROC)
✖ Does not reflect actual probability calibration
✖ Can be misleading in highly skewed data
✅ 8 ROC vs Accuracy
Accuracy | ROC-AUC
Depends on threshold | Independent of threshold
Bad for imbalanced data | Better for imbalanced data
Measures exact predictions | Measures ranking ability
✅ 9 When to Use ROC-AUC
Use when:
Binary classification
Class imbalance
Need threshold-independent evaluation
Comparing classifiers
Examples:
Disease detection
Fraud detection
Spam filtering
✅ ROC for Multiclass
Use:
1. One-vs-Rest (OvR)
2. Micro-average AUC
3. Macro-average AUC
Key Exam Differences
ROC vs Precision-Recall Curve
ROC better when classes balanced
PR curve better when positive class is rare
5-Mark Exam Definition
ROC curve plots True Positive Rate against False Positive Rate at various thresholds. The Area Under
the Curve (AUC) summarizes overall classification performance, where a higher AUC indicates better
model discrimination ability.
Quick Memory Trick
AUC ≈ How good is the model at ranking positives above negatives?
Unsupervised Learning
For unsupervised learning, we observe only the features X1, X2, …, Xn. We are not interested in
prediction, because we do not have an associated response variable Y.
Unsupervised learning aims to find the underlying structure or the distribution of data, so that we can
explore the data to find some intrinsic structures in it.
It means that the data have no target attribute, but we want to find some intrinsic structures in them
that can be used to cluster the input data in classes based on their statistical properties only.
Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the
analysis, such as prediction of a response.
However, techniques for unsupervised learning are of growing importance in a number of fields:
Subgroups of breast cancer patients grouped by their gene expression measurements,
Groups of shoppers characterized by their browsing and purchase histories,
Movies grouped by the ratings assigned by movie viewers. It is often easier to obtain unlabeled
data from a lab instrument or a computer than labeled data, which can require human
intervention.
Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a data set that
determine the intrinsic grouping in a set of unlabeled data
Clustering makes this concrete: we must define what it means for two or more observations to be
similar (near) or different (far away). Objects should be similar to one another within the same
cluster and dissimilar to the objects in other clusters.
Each cluster is represented by a single point, known as the centroid (or cluster center) of the
cluster.
The centroid is computed as the mean of all data points in the cluster: C_j = (1/|S_j|) Σ_{x_i ∈ S_j} x_i
The cluster boundary is decided by the farthest data point in the cluster
Example1: groups people of similar sizes together to make “small”, “medium”, and “large” T-Shirts.
Example2: In marketing, segment customers according to their similarities To do targeted
marketing.
Example 3: Given a collection of text documents, we want to organize them according to their content
similarities, to produce a topic hierarchy.
Clustering quality
Inter-cluster distance maximized
Intra-cluster distance minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
How is clustering subjective? Clustering finds similarity groups in data, and what counts as
"similar" depends on the chosen distance function and the application.
Distance functions
Distance functions for numeric attributes
Most commonly used functions are
Euclidean distance and
Manhattan (city block) distance
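The two distance functions can be sketched side by side (function names are mine):

```python
# Euclidean distance: straight-line distance between two numeric vectors.
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Manhattan (city block) distance: sum of absolute coordinate differences.
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

d_e = euclidean((0, 0), (3, 4))   # the 3-4-5 right triangle
d_m = manhattan((0, 0), (3, 4))
```

Between (0, 0) and (3, 4), the Euclidean distance is 5.0 while the Manhattan distance is 7, since the latter forbids diagonal shortcuts.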
Distance functions
Distance functions for binary and nominal attributes
Binary attribute: has two values or states but no ordering relationships,
e.g., Gender: male and female.
Information theory explains how information is measured, while statistical learning explains how models learn from data.
1. Information Theory
Information Theory studies the quantification, storage, and transmission of information. In ML,
it is used to measure uncertainty, impurity, and information gain.
2. Key Concepts in Information Theory
2.1 Information
Information is the reduction in uncertainty after observing an event.
If an event is rare, it carries more information.
2.2 Entropy
Entropy measures the uncertainty or impurity in a dataset.
H(S) = −Σ_i p_i log₂(p_i)
Where:
p_i = probability of class i
Example
If a dataset has:
50% Pass
50% Fail
H = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1
👉 Maximum uncertainty
2.3 Information Gain
Information Gain measures the reduction in entropy after splitting data on an attribute.
IG(S, A) = H(S) − Σ_v (|S_v| / |S|) H(S_v)
Used in Decision Trees to select the best feature.
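The entropy and information-gain formulas can be sketched in Python; the Pass/Fail labels mirror the example in 2.2, and the split shown is a made-up perfect split:

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class probabilities
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # IG = H(S) minus the size-weighted entropy of the subsets after a split
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# 50% Pass / 50% Fail -> maximum uncertainty of 1 bit
print(entropy(["Pass", "Fail", "Pass", "Fail"]))  # 1.0

# A split that separates the classes perfectly removes all uncertainty
print(information_gain(["Pass", "Fail", "Pass", "Fail"],
                       [["Pass", "Pass"], ["Fail", "Fail"]]))  # 1.0
```

A decision-tree learner evaluates this gain for every candidate attribute and splits on the one with the highest value.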
2.4 Mutual Information
Measures how much information one variable provides about another.
I(X; Y) = H(X) − H(X|Y)
Used in:
Feature selection
Dependency analysis
3. Role of Information Theory in ML
Concept ML Application
Entropy Decision tree splitting
Information Gain Feature selection
Mutual Information Dependency measurement
KL-Divergence Model comparison
4. Statistical Learning
Statistical Learning is the framework for understanding how models learn patterns from data using
statistical principles.
5. Key Concepts in Statistical Learning
5.1 Population vs Sample
Population: Entire dataset
Sample: Subset used for learning
5.2 Model
A mathematical function that maps inputs to outputs.
y = f(x) + ε
Where:
ε = noise
5.3 Empirical Risk Minimization (ERM)
Learning by minimizing average loss on training data.
f̂ = argmin_f (1/n) Σ_i L(y_i, f(x_i))
5.4 Bias–Variance Tradeoff
Bias: Error due to oversimplification
Variance: Error due to sensitivity to data
Goal: Balance both.
5.5 Overfitting and Underfitting
Overfitting: Model fits noise
Underfitting: Model too simple
6. Role of Statistical Learning in ML
Concept Application
Regression Prediction
Classification Decision making
Hypothesis testing Model validation
Regularization Overfitting control
2. Data Preprocessing
Data preprocessing is the process of cleaning, transforming, and organizing raw data into a suitable
format for machine learning models.
📌 “Garbage in → Garbage out”
Good preprocessing leads to better models.
3. Steps in Data Preprocessing
3.1 Data Cleaning
Removes or fixes incorrect, incomplete, or inconsistent data.
a) Handling Missing Values
Methods:
Remove rows/columns
Replace with mean, median, or mode
Use prediction methods
Example:
Missing exam score → replace with class average
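Mean imputation can be sketched in Python; the scores below are invented for illustration, with `None` marking the missing exam score:

```python
def impute_mean(values, missing=None):
    # Replace each missing entry with the mean of the observed values
    observed = [v for v in values if v is not missing]
    mean = sum(observed) / len(observed)
    return [mean if v is missing else v for v in values]

scores = [70, None, 90, 80]   # one student's exam score is missing
print(impute_mean(scores))    # [70, 80.0, 90, 80]
```

Median or mode imputation follows the same pattern, swapping the statistic used for the fill value.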
b) Handling Noise
Noise = random errors or outliers
Methods:
Smoothing
Outlier detection
Binning
c) Handling Duplicate Data
Remove repeated records
3.2 Data Integration
Combines data from multiple sources.
Example:
Student academic records + attendance system
3.3 Data Transformation
Converts data into a suitable format.
a) Normalization / Scaling
Ensures features are on the same scale.
Example Methods:
Min–Max Scaling
Z-score Normalization
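Both scaling methods can be sketched in Python (the values are illustrative; the z-score version uses the population standard deviation):

```python
def min_max_scale(values):
    # Maps values into [0, 1]: (x - min) / (max - min)
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # Centers to mean 0 and divides by the standard deviation
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

Min-max scaling is sensitive to outliers (a single extreme value compresses everything else), which is one reason z-score normalization is often preferred.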
b) Encoding Categorical Data
Converts non-numeric data into numbers.
Examples:
Label Encoding
One-Hot Encoding
Gender: Male → 1, Female → 0
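One-hot encoding can be sketched in Python (the Gender column is illustrative):

```python
def one_hot(values):
    # One column per category; a 1 marks the category present in that row
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Columns are sorted alphabetically: [Female, Male]
print(one_hot(["Male", "Female", "Male"]))  # [[0, 1], [1, 0], [0, 1]]
```

Unlike label encoding, one-hot encoding does not impose a false ordering on the categories, at the cost of one extra column per category.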
c) Feature Construction
Creating new features from existing ones.
Example:
Total marks = test1 + test2 + final
3.4 Data Reduction
Reduces data size while preserving important information.
Methods:
Feature selection
Dimensionality reduction (PCA)
Sampling
3.5 Data Discretization
Converts continuous values into categories.
Example:
Score → Low, Medium, High
3.6 Data Splitting
Divide data into:
Training set (70–80%)
Testing set (20–30%)
Sometimes also:
Validation set
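A minimal hold-out split can be sketched with the standard library; the 80/20 ratio and the fixed seed are arbitrary choices for the sketch:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    # Shuffle a copy, then slice off the last test_ratio fraction for testing
    rows = data[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

train, test = train_test_split(list(range(10)))
print(len(train), len(test))  # 8 2
```

Fixing the seed makes the split reproducible, so the same rows land in the test set on every run.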
4. Importance of Data Preprocessing
✔ Improves model accuracy
✔ Reduces training time
✔ Avoids bias and errors
✔ Makes data ML-ready
5. Example (Student Performance Dataset)
Raw Data:
Study Hours = 3, Attendance = ?, Score = 75
After Preprocessing:
Study Hours = 3, Attendance = 80, Score = 75
6. Summary (Exam-Oriented)
Data is raw information used in ML
Preprocessing prepares data for learning
Key steps: cleaning, transformation, reduction
Good preprocessing improves ML performance
Concept Learning in Machine Learning
Concept Learning is a fundamental topic in Machine Learning (ML) that deals with learning general
concepts or patterns from specific examples. It is one of the earliest and simplest forms of
supervised learning.
Concept Learning is the task of inferring a Boolean-valued function from labeled training
examples.
A concept is a rule or function that classifies objects as positive (True) or negative (False).
Example:
“A day is a good day for playing football if the weather is sunny and temperature is
moderate.”
2. Key Components of Concept Learning
2.1 Instance Space (X)
The set of all possible examples.
Example:
8. Example (Simple)
If all positive examples are:
Sunny
Mild temperature
Then the learned concept might be:
Sky = Sunny AND Temperature = Mild
9. Applications of Concept Learning
Email spam classification
Medical diagnosis (disease vs no disease)
Student performance prediction
Fault detection systems
10. Advantages and Limitations
Advantages
✔ Simple and interpretable
✔ Good for teaching ML fundamentals
Limitations
✖ Works mainly with Boolean concepts
✖ Sensitive to noise
✖ Not scalable for large datasets
11. Summary
Concept learning is about learning rules from examples
It is a form of supervised learning
Key ideas: instance space, hypothesis space, version space
Algorithms: Find-S and Candidate Elimination
Forms the foundation for more advanced ML techniques
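The Find-S algorithm named in the summary can be sketched in Python; the (Sky, Temperature) records are a toy data set matching the example in section 8:

```python
def find_s(examples):
    # Find-S: start from the first positive example (most specific hypothesis)
    # and generalize each attribute that disagrees with a later positive.
    # "?" means "any value is acceptable" for that attribute.
    hypothesis = None
    for attributes, label in examples:
        if not label:             # Find-S ignores negative examples
            continue
        if hypothesis is None:    # first positive example: copy it exactly
            hypothesis = list(attributes)
        else:                     # generalize mismatching attributes
            hypothesis = [h if h == a else "?"
                          for h, a in zip(hypothesis, attributes)]
    return hypothesis

data = [(("Sunny", "Mild"), True),
        (("Rainy", "Cold"), False),
        (("Sunny", "Mild"), True)]
print(find_s(data))  # ['Sunny', 'Mild']
```

The learned hypothesis reads as "Sky = Sunny AND Temperature = Mild", matching the concept in section 8.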
Probabilistic vs Statistical Reasoning in Machine Learning
In Machine Learning (ML), probabilistic reasoning and statistical reasoning are closely related but
serve different purposes. Understanding the distinction is important for exams, assignments, and real-
world ML system design.
1. Overview
Aspect Probabilistic Reasoning Statistical Reasoning
Focus Handling uncertainty Learning from data samples
Main Question What is the likelihood of an event? What can we infer from data?
Basis Probability theory Statistics
Output Probability distributions Estimates, models, tests
2. Probabilistic Reasoning
Probabilistic reasoning uses probability theory to represent and reason under uncertainty. It
models uncertainty explicitly using probabilities.
2.2 Key Concepts
Random variables
Prior probability
Conditional probability
Bayes’ Theorem
Joint and marginal distributions
Bayes’ Theorem:
P(H|D) = P(D|H) P(H) / P(D)
2.3 Example
Spam Classification
P(Spam) = 0.3
P("free" | Spam) = 0.7
P("free" | Not Spam) = 0.1
Compute:
P(Spam | "free")
The model reasons probabilistically to decide whether an email is spam.
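Plugging the numbers above into Bayes' Theorem can be sketched in Python; the evidence P("free") is expanded by the law of total probability:

```python
def posterior(prior, likelihood, likelihood_not):
    # Bayes' theorem: P(H|D) = P(D|H) P(H) / P(D),
    # with P(D) = P(D|H) P(H) + P(D|not H) P(not H)
    evidence = likelihood * prior + likelihood_not * (1 - prior)
    return likelihood * prior / evidence

# P(Spam) = 0.3, P("free" | Spam) = 0.7, P("free" | Not Spam) = 0.1
p = posterior(prior=0.3, likelihood=0.7, likelihood_not=0.1)
print(round(p, 2))  # 0.75
```

So seeing the word "free" raises the spam probability from the 0.3 prior to 0.75, enough to classify the email as spam under a 0.5 threshold.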
2.4 ML Algorithms Using Probabilistic Reasoning
Naïve Bayes Classifier
Bayesian Networks
Hidden Markov Models (HMM)
Probabilistic Graphical Models
2.5 Strengths and Limitations
Strengths
✔ Explicit uncertainty modeling
✔ Works well with missing data
✔ Strong theoretical foundation
Limitations
✖ Requires correct probability assumptions
✖ Computationally expensive for large models
3. Statistical Reasoning
Statistical reasoning focuses on drawing conclusions from data, often using samples to make
inferences about a population.
3.2 Key Concepts
Sampling
Estimation (mean, variance)
Hypothesis testing
Confidence intervals
Regression analysis
3.3 Example
Student Performance Prediction
Using past student scores:
Compute average score
Fit a regression model
Test whether study hours significantly affect performance
This is statistical inference.
3.4 ML Algorithms Using Statistical Reasoning
Linear Regression
Logistic Regression
k-Nearest Neighbors (k-NN)
Support Vector Machines (SVM)
3.5 Strengths and Limitations
Strengths
✔ Data-driven
✔ Scalable to large datasets
✔ Widely used in practice
Limitations
✖ Assumes data represent the population
✖ Often does not model uncertainty explicitly
4. Key Differences (Exam-Oriented)
Feature Probabilistic Statistical
Uncertainty Explicitly modeled Often implicit
Prior knowledge Uses priors Rarely uses priors
Output Probability of outcomes Parameter estimates
Decision making Bayesian Frequentist
5. Relationship between the Two
Probabilistic reasoning is about modeling uncertainty
Statistical reasoning is about learning parameters from data
Modern ML often combines both
📌 Example:
Bayesian Linear Regression
Probability → Model uncertainty
Statistics → Estimate parameters
6. Simple Analogy
Probabilistic reasoning:
“There is a 70% chance it will rain tomorrow.”
Statistical reasoning:
“Based on 10 years of data, rainfall increases in July.”
7. Summary
Probabilistic reasoning answers “how likely?”
Statistical reasoning answers “what can we infer from data?”
Both are essential for ML
Bayesian ML bridges the two approaches
How machine learning works
1. Basic Idea
👉 Data → Learning → Prediction/Decision
Instead of writing rules by hand, we:
1. Give the machine data
2. The machine learns patterns
3. It uses those patterns to predict or decide on new data
2. Main Steps of Machine Learning
Step 1: Data Collection
Data is gathered from different sources.
Examples:
Student scores and attendance
Emails (spam or not spam)
Images and text
Sensor or log data
Step 2: Data Preparation (Preprocessing)
Raw data is cleaned and prepared.
Includes:
Removing missing or incorrect values
Encoding text or categories into numbers
Normalizing or scaling data
Splitting data into:
o Training set
o Testing set
Step 3: Choosing a Model
A model is a mathematical representation of patterns.
Examples of models:
Linear Regression
Decision Tree
Naïve Bayes
Neural Networks
Step 4: Training the Model
The model learns by:
Making predictions on training data
Comparing predictions with actual answers
Adjusting parameters to reduce error
This process is called optimization.
Step 5: Evaluation
The trained model is tested on unseen data.
Evaluation metrics:
Accuracy
Precision
Recall
Mean Squared Error (MSE)
Step 6: Prediction / Deployment
The model is used in real life to:
Predict outcomes
Classify new data
Support decision-making
3. Types of Machine Learning
3.1 Supervised Learning
Uses labeled data (input + correct output).
Examples:
Student performance prediction
Spam detection
Algorithms:
Linear Regression
Logistic Regression
SVM
k-NN
3.2 Unsupervised Learning
Uses unlabeled data.
Examples:
Customer clustering
Anomaly detection
Algorithms:
K-Means
Hierarchical Clustering
PCA
3.3 Reinforcement Learning
Learns by trial and error using rewards.
Examples:
Game playing (Chess, Go)
Robotics
4. Simple Example (Student Performance)
1. Input data:
o Study hours
o Attendance
o Assignment scores
2. Output:
o Final grade
3. Model learns the relationship:
More study hours → Higher grade
4. Predicts grade for a new student.
5. Why Machine Learning Works
✔ Large amount of data
✔ Powerful algorithms
✔ Increased computing power
6. Real-World Applications
Healthcare (disease prediction)
Education (student performance analysis)
Banking (fraud detection)
Security (intrusion detection)
Recommendation systems (YouTube, Netflix)
7. Limitations of Machine Learning
Needs quality data
Can be biased
Not always explainable
Requires expertise
8. Summary
Machine Learning lets systems learn from data
Works through data → model → learning → prediction
Used in many real-world applications
Foundation of Artificial Intelligence (AI)
Traditional Programming vs Machine Learning Approach
Traditional Programming and Machine Learning (ML) are two different ways of solving problems
using computers. The key difference lies in how rules and decisions are created.
1. Basic Idea
Traditional Programming
👉 Rules + Data → Output
Humans write explicit rules.
The computer follows those rules exactly.
Machine Learning
👉 Data + Output → Rules (Model)
The machine learns rules automatically from data.
Humans provide data and learning algorithms.
2. Working Process Comparison
Aspect Traditional Programming Machine Learning
Rule creation Written by programmer Learned from data
Flexibility Low High
Adaptability Needs reprogramming Learns automatically
Handling complexity Difficult Efficient
Data dependency Low High
5. Summary
ML problems are mainly:
o Supervised
o Unsupervised
o Semi-supervised
o Reinforcement
Each class solves different types of problems
Choosing the right class is crucial for model success
Areas of Influence of Machine Learning
Machine Learning (ML) has a wide area of influence across many sectors because it enables systems
to learn from data, make predictions, and improve over time. Below are the major domains where
ML plays a significant role, explained clearly and with examples (exam-oriented).
1. Education 🎓
ML improves teaching, learning, and administration.
Applications
Student performance prediction
Dropout risk analysis
Personalized learning systems
Automated grading
Example:
Predicting at-risk students in private colleges using historical academic data.
2. Healthcare 🏥
ML assists in diagnosis, treatment, and patient management.
Applications
Disease prediction (diabetes, cancer)
Medical image analysis (X-ray, MRI)
Drug discovery
Patient monitoring systems
3. Banking and Finance 💰
ML helps in risk management and automation.
Applications
Fraud detection
Credit scoring
Stock price prediction
Customer behavior analysis
4. Business and Marketing 📊
ML enhances decision-making and customer engagement.
Applications
Sales forecasting
Recommendation systems
Customer segmentation
Demand prediction
Example:
Product recommendations on Amazon or Netflix.
5. Cyber security and Networking 🔐
ML detects and prevents security threats.
Applications
Intrusion Detection Systems (IDS)
Malware detection
Anomaly detection in networks
Spam filtering
6. Agriculture 🌾
ML supports smart farming and food security.
Applications
Crop yield prediction
Disease detection in plants
Weather forecasting
Smart irrigation systems
7. Transportation and Smart Cities 🚦
ML improves efficiency and safety.
Applications
Traffic prediction
Self-driving vehicles
Route optimization
Smart parking systems
8. Manufacturing and Industry 🏭
ML enhances productivity and quality.
Applications
Predictive maintenance
Defect detection
Supply chain optimization
Robotics automation
9. Natural Language Processing (NLP)
ML enables machines to understand human language.
Applications
Speech recognition
Machine translation
Chatbots and virtual assistants
Sentiment analysis
10. Computer Vision
ML allows machines to interpret images and videos.
Applications
Face recognition
Surveillance systems
Object detection
Medical image diagnosis
11. E-Commerce and Retail 🛒
ML improves user experience and operations.
Applications
Price optimization
Recommendation engines
Customer churn prediction
Inventory management
12. Government and Public Services
ML supports policy and service delivery.
Applications
Crime prediction
Tax fraud detection
Population analysis
Disaster management
13. Entertainment and Media 🎬
ML personalizes content and improves production.
Applications
Music and video recommendation
Game AI
Content moderation
14. Research and Science 🔬
ML accelerates discovery and innovation.
Applications
Climate modeling
Astronomical data analysis
Scientific simulations
15. Summary Table (Exam-Ready)
Area Influence of ML
Education Performance prediction, personalization
Healthcare Diagnosis, imaging
Area Influence of ML
Finance Fraud detection, forecasting
Security Intrusion & anomaly detection
Agriculture Yield & disease prediction
Transport Autonomous systems
Industry Predictive maintenance
NLP & Vision Language & image understanding
16. Conclusion
Machine Learning influences almost every field where data is available. Its ability to learn patterns,
make predictions, and improve decisions makes it a core technology of modern society.
Supervised Learning
Classification & Regression: Rule-Based and Instance-Based Learning
Supervised learning is a major class of machine learning where the model learns from labeled data.
Two important problem types under supervised learning are classification and regression, and two
important learning approaches are rule-based and instance-based learning.
1. Supervised Learning
Supervised learning is a learning process where the training dataset contains input features and
their corresponding correct outputs (labels).
👉 Goal: Learn a function
f: X → Y
2. Classification and Regression
2.1 Classification
Output: Discrete / categorical
Assigns inputs to predefined classes
Examples
Email → Spam / Not Spam
Student → Pass / Fail
Disease → Yes / No
Common Algorithms
Decision Tree
Naïve Bayes
k-NN
SVM
2.2 Regression
Output: Continuous / numerical
Predicts numeric values
Examples
Student final score
House price
Temperature
Common Algorithms
Linear Regression
Polynomial Regression
k-NN Regression
3. Rule-Based Learning
Rule-based learning creates explicit IF–THEN rules from training data to make predictions.
Example Rule (Classification)
IF Attendance > 80% AND StudyHours > 3
THEN Result = Pass
How Rule-Based Learning Works
1. Analyze labeled training data
2. Extract decision rules
3. Apply rules to classify or predict new data
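The example rule above can be sketched directly as a Python function; the thresholds come from the IF–THEN rule, and the fallback to Fail is an assumption for the sketch:

```python
def rule_based_result(attendance, study_hours):
    # IF Attendance > 80% AND StudyHours > 3 THEN Pass, ELSE Fail
    if attendance > 80 and study_hours > 3:
        return "Pass"
    return "Fail"

print(rule_based_result(attendance=90, study_hours=4))  # Pass
print(rule_based_result(attendance=60, study_hours=5))  # Fail
```

A rule-induction algorithm learns a whole ordered list of such rules from the training data rather than having a programmer write them.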
Algorithms Using Rule-Based Learning
Decision Trees
Rule Induction Algorithms (e.g., RIPPER, CN2)
Advantages
✔ Easy to understand and interpret
✔ Human-readable rules
Limitations
✖ Struggles with noisy data
✖ Rules may become complex
4. Instance-Based Learning (Lazy Learning)
Instance-based learning stores training instances and makes predictions by comparing new data
with stored examples.
Also called lazy learning because no explicit model is built during training.
Example (k-Nearest Neighbor – k-NN)
Store all student records
For a new student:
o Find the k most similar students
o Predict class/value based on neighbors
Distance Measure (Example)
Distance = √((x₁ − y₁)² + (x₂ − y₂)²)
Algorithms Using Instance-Based Learning
k-Nearest Neighbor (k-NN)
Case-Based Reasoning
Advantages
✔ Simple and flexible
✔ Adapts easily to new data
Limitations
✖ High memory usage
✖ Slow prediction time
5. Rule-Based vs Instance-Based Learning
Feature Rule-Based Learning Instance-Based Learning
Model Explicit rules No explicit model
Training time Higher Very low
Prediction time Fast Slower
Interpretability High Low
Memory usage Low High
6. Classification & Regression with Learning Types
Learning Type Classification Regression
Rule-Based Decision Trees Regression Trees
Instance-Based k-NN Classification k-NN Regression
7. Educational Example
Problem: Predict student result
Classification: Pass / Fail
Regression: Final score
Rule-Based:
IF attendance ≥ 75% → Pass
Instance-Based:
Compare with similar past students
8. Summary (Exam-Oriented)
Supervised learning uses labeled data
Classification → categorical output
Regression → numerical output
Rule-based learning uses IF–THEN rules
Instance-based learning uses similarity between data point
Supervised Learning: Classification & Regression
Rule-Based and Instance-Based Learning
K-Nearest Neighbor, Decision Tree, Bayesian Classification, and Support Vector Machine
Supervised learning uses labeled data to learn a mapping between input features and output labels. It
mainly solves classification and regression problems using different learning approaches.
1. Supervised Learning
Supervised Learning is a machine learning approach where each training example consists of an
input and a known output (label). The model learns from these examples to predict outputs for new,
unseen data.
Example:
Predicting whether a student will Pass or Fail based on attendance, study hours, and assignment
scores.
2. Classification and Regression
2.1 Classification
Output: Discrete / categorical
Assigns data points to predefined classes
Example:
Email → Spam / Not Spam
Student → Pass / Fail
2.2 Regression
Output: Continuous / numerical
Predicts real-valued outputs
Example:
Student final score
House price
3. Rule-Based Learning
Rule-based learning generates explicit IF–THEN rules from training data to perform classification
or regression.
Example
IF Attendance ≥ 80% AND StudyHours ≥ 3
THEN Result = Pass
Characteristics
Human-readable rules
Knowledge is easy to interpret
Example Algorithm
Decision Tree (converted into rules)
4. Instance-Based Learning
Instance-based learning stores training data and makes predictions by comparing new instances with
stored examples using a similarity or distance measure.
📌 Also called lazy learning.
Example
A new student’s result is predicted by comparing them with similar past students.
Example Algorithm
k-Nearest Neighbor (k-NN)
5. k-Nearest Neighbor (k-NN)
K-Nearest Neighbor (k-NN) is an instance-based supervised learning algorithm that classifies or
predicts a value based on the k most similar data points.
How It Works
1. Choose a value for k
2. Compute distance between new data and all training data
3. Select k nearest neighbors
4. Predict:
o Majority class (classification)
o Average value (regression)
Example
If k = 3 and among nearest neighbors:
2 students passed
1 student failed
👉 Prediction = Pass
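The k-NN procedure above can be sketched in Python; the student records (study hours, attendance) are invented for illustration:

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    # Sort labeled points by Euclidean distance to the query,
    # then take a majority vote among the k nearest neighbors.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# (study_hours, attendance) -> result
students = [((2, 60), "Fail"), ((8, 90), "Pass"),
            ((7, 85), "Pass"), ((1, 50), "Fail")]
print(knn_predict(students, query=(6, 80), k=3))  # Pass
```

Note there is no training phase: all the work happens at prediction time, which is exactly why k-NN is called a lazy learner and why it slows down on large data sets.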
Advantages & Limitations
✔ Simple and effective
✖ Slow for large datasets
6. Decision Tree
A Decision Tree is a rule-based supervised learning algorithm that uses a tree structure to make
decisions based on feature values.
Example
Is Attendance ≥ 75%?
├─ Yes → Pass
└─ No → Fail
Characteristics
Easy to interpret
Can handle classification and regression
Example Use Case
Student performance prediction, medical diagnosis.
7. Bayesian-Based Classification
Bayesian Classification is a probabilistic approach based on Bayes’ Theorem to predict the class of
an instance.
Bayes’ Theorem
P(C|X) = P(X|C) P(C) / P(X)
Naïve Bayes Classifier (Most Common)
Assumes features are independent.
Example
Classifying emails as spam:
P(Spam | “free”) is calculated using probabilities
Email classified into the class with the highest probability
Advantages & Limitations
✔ Fast and efficient
✖ Independence assumption may be unrealistic
8. Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised learning algorithm that finds an optimal
hyperplane that best separates data points of different classes.
Key Idea
Maximize the margin between classes
Uses support vectors (critical data points)
Example
Separating students into Pass and Fail groups using study hours and attendance.
Characteristics
Effective in high-dimensional spaces
Can perform classification and regression (SVR)
Advantages & Limitations
✔ High accuracy
✖ Computationally expensive
9. Summary Table (Exam-Oriented)
Algorithm Learning Type Problem Type Example
Rule-Based Rule-based Classification IF–THEN rules
k-NN Instance-based Class/Reg Similar students
Decision Tree Rule-based Class/Reg Pass/Fail tree
Naïve Bayes Probabilistic Classification Spam detection
SVM Model-based Class/Reg Optimal separation
10. Conclusion
Supervised learning solves classification and regression
Rule-based learning is interpretable
Instance-based learning relies on similarity
k-NN, Decision Trees, Bayesian classifiers, and SVM are widely used supervised learning
algorithms