Concept Learning

Information theory focuses on efficient communication systems and encompasses data compression and error correction techniques, which are essential for machine learning and data analytics. It involves concepts like entropy to measure uncertainty in data, aiding in decision-making and model evaluation. Machine learning leverages these principles to improve performance through systematic data processing, feature selection, and model training.

Information theory is a branch of science that deals with communication systems for the efficient and reliable transmission of information.


Communication theory deals with systems for transmitting information from one point to another; its central results are the laws of data compression (the entropy H) and of data transmission (the channel capacity C).
A stream of symbols from the sender to the receiver is generally called a message, and if a source
emits independent signals then the source is called a memory-less source.
In the case of n sequential (binary) forks with m final choices (destinations), m = 2^n words, so n = log2(m).
Information is thus an ordered sequence of symbols whose meaning is to be interpreted.
Information theory entails two broad techniques:
 Data Compression (source coding): More frequent events should have shorter encodings
 Error Correction (channel coding): should be able to infer the encoded event even if the message is
corrupted by noise.
Both techniques require building probabilistic models of the data sources.
This is why information theory is relevant to machine learning and data analytics.
It helps to measure the amount of information in a data source and improves the results of predictive models that might be built on this data. It applies as follows:
 Information theory and machine learning: codewords
Information theory represents data as codewords, sequences of binary digits (bits), using various techniques to map each symbol (data element) to a corresponding codeword.
 Variable Length Codes
Variable-length codes are encoding schemes that use codes of different lengths to represent symbols
in a message.
One traditional way of assigning codewords to data elements would be to assign codes with fixed
lengths, or every data element gets assigned a code of the same length.
Examples 1
If we have 4 values of a given variable, with the values 'Promotion', 'Priority Customer', 'Repeat buy', and 'None', and they are equally probable, they can safely be mapped to codewords of equal length, say 00, 01, 10, and 11.
This means they have a fixed-length encoding: every data element is assumed to have an equal likelihood of occurrence, and every codeword has an equal, fixed length (i.e. 2).
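As a minimal sketch (the symbol names are taken from the example above; the codebook itself is assumed), a fixed-length code is just a lookup table:

```python
# Hypothetical 2-bit fixed-length codebook for four equally likely symbols.
codebook = {"Promotion": "00", "Priority Customer": "01",
            "Repeat buy": "10", "None": "11"}

def encode(symbols, codebook):
    """Concatenate the fixed-length codeword of each symbol."""
    return "".join(codebook[s] for s in symbols)

message = encode(["Promotion", "None", "Repeat buy"], codebook)
# Every symbol costs exactly 2 bits, so 3 symbols take 6 bits.
```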
Variable length coding is a technique used in data compression, where different symbols are encoded
with different lengths of binary code words.
 The goal of variable length coding is to assign shorter code words to more frequent symbols and
longer code words to less frequent symbols, to achieve a more efficient representation of the data.
 The expected length of code words in variable length coding is the average length of the code words
used to represent the symbols of the random variable.
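The expected length can be computed directly. The sketch below assumes an illustrative distribution over four symbols and the codeword lengths of a prefix code (e.g. 0, 10, 110, 111):

```python
# Assumed symbol probabilities and variable-length codeword lengths.
probs   = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {"a": 1,   "b": 2,    "c": 3,     "d": 3}

# Expected codeword length: L = sum over symbols of p(s) * len(s).
expected_length = sum(probs[s] * lengths[s] for s in probs)
# 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits per symbol,
# versus 2 bits for a fixed-length code over four symbols.
```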
Entropy is a measure of the uncertainty or randomness associated with a random variable.
 It quantifies the average amount of information required to specify an outcome of the random
variable, the higher the entropy, the more uncertain or random the variable is.
 So, Entropy is simply the average (expected) amount of the information from the event.
Does Entropy have range from 0 to 1?
 No. However, the range is set based on the number of outcomes.
 Equation for calculating the range of Entropy:
 0≤Entropy≤log(n), where n is number of outcomes
 Entropy 0 (minimum entropy) occurs when one of the probabilities is 1 and the rest are 0
 Entropy log(n) (maximum entropy) occurs when all the probabilities have the equal value 1/n
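These bounds can be checked numerically. A minimal sketch computing Shannon entropy in bits for an assumed four-outcome variable:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2 p), skipping zero terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Minimum entropy: one outcome is certain.
h_min = entropy([1.0, 0.0, 0.0, 0.0])
# Maximum entropy for n = 4 outcomes: the uniform distribution, log2(4) = 2 bits.
h_max = entropy([0.25, 0.25, 0.25, 0.25])
```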
 Information theory in statistical learning refers to tools for understanding data.
Understanding Entropy in Machine Learning in 4 Steps:
 Step 1: Entropy measures uncertainty or disorder in a dataset. In machine learning, it quantifies how
mixed or random the labels or classes in a dataset are.
 Step 2: To calculate entropy, consider the distribution of class labels in the dataset. For a binary classification problem, the formula is H = -p * log2(p) - (1 - p) * log2(1 - p), where p represents the proportion of one class in the dataset.
 Step 3: Information gain: Entropy is critical in decision tree algorithms. Information gain, a metric
used in decision trees, measures the reduction in entropy achieved by splitting a dataset on a particular
feature.
 Step 4: Entropy in Feature Selection and Model Evaluation: Entropy is used in feature selection and
model evaluation.
Features with high information gain are preferred for splitting datasets, as they contribute more to
reducing uncertainty.
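The steps above can be sketched for a two-way split; the labels here are illustrative, and the entropy helper generalizes the binary formula to any label set:

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting `parent` into two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical split: a perfectly separating feature yields the maximal gain.
parent = ["yes", "yes", "no", "no"]
gain = information_gain(parent, ["yes", "yes"], ["no", "no"])
```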
Entropy is a foundational concept in machine learning that serves as a tool for quantifying uncertainty
and aiding decision-making.
By understanding how entropy works and its applications in various aspects of machine learning, we
can make better-informed choices in data preprocessing, feature selection, and model evaluation.
Information theory for machine learning is the process of knowledge discovery to improve the
performance of the machine learning models.
There are two modes of applying information theory in machine learning:
 Apply fundamental information measures and concepts like entropy, mutual information, and KL
divergence as objective functions or regularization terms in an optimization problem.
 Develop new algorithms and techniques using concepts from sophisticated information theory, such
as rate-distortion theory and coding theory
Learning programs
 A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience
E.
 Learning is a fundamental aspect of human cognition and development, enabling individuals to
adapt to their environment, solve problems, make decisions, and improve their capabilities over time.
Learning is used when:
 Human expertise does not exist (navigating on Mars),
 Humans are unable to explain their expertise (speech recognition)
 Solution changes in time (routing on a computer network)
 Solution needs to be adapted to particular cases (user biometrics)
 For example, a software agent interacting with the world makes observations, takes actions, and is
rewarded or punished; it should learn to choose actions in such a way as to obtain a lot of reward,
 Learning is a process of gaining knowledge or understanding of, or skill in by study, instruction, or
experience

Core Difference
Aspect | Statistical Reasoning | Probabilistic Reasoning
Based On | Observed data | Mathematical probability
Focus | Inference from data | Measuring uncertainty
Direction | Data → Conclusion | Model → Likelihood
Question Type | “What does the data tell us?” | “How likely is this event?”
Used In | Surveys, research, hypothesis testing | AI, Bayesian models, decision-making
Proportion | We don’t know the proportion | We know the proportion

 Statistical reasoning = Learning from data


 Probabilistic reasoning = Thinking about uncertainty

1. Simple probability. P(A). The probability that an event (say, A) will occur.
2. Joint probability. P(A and B). P(A ∩ B). The probability of events A and B occurring together.
3. Conditional probability. P(A|B), read "the probability of A given B." The probability that event
A will occur given event B has occurred.
4. The probability distribution for a discrete random variable is described with a probability
mass function (PMF).

5. If a random variable is continuous, then its probability distribution is described with a probability density function (PDF).
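A quick numeric illustration of conditional probability, with assumed values:

```python
# Assumed toy probabilities: P(A|B) = P(A and B) / P(B).
p_b = 0.4          # probability that event B occurs
p_a_and_b = 0.1    # joint probability of A and B

p_a_given_b = p_a_and_b / p_b   # conditional probability of A given B
```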

Discrete vs Continuous
Discrete Variable | Continuous Variable
Uses PMF | Uses PDF
P(X = x) > 0 possible | P(X = x) = 0
Uses summation (Σ) | Uses integration (∫)

The need for probability distributions:
 To calculate confidence intervals for parameters
 To calculate critical regions for hypothesis tests
 To calculate learning association
Machine Learning is the study of methods for programming computers to learn, and build machines
that automatically learn from experience
Machine Learning is the study of data-driven methods capable of imitating, understanding, and
aiding human and biological information-processing tasks
It aims to select, explore, and extract useful knowledge from complex, often non-linear data, building
a computational model capable of describing unknown patterns or correlations, and in turn, solving
challenging problems
 Therefore, the goal of machine learning is to build computer systems that can adapt and learn from
their experience.

Steps in Machine Learning


 Data collection “training data”, mostly with “labels” provided by a “teacher”;
 Data pre-processing clean data to have homogeneity
 Feature engineering Select representative features to improve performance
 Modeling choose the class of models that can describe the data
 Estimation/Selection find the model that best explains the data: simple and fits well;
 Validation evaluate the learned model and compare it to the solution found using other model
classes;
 Operation Apply the learned model to new “test” data or real-world instances

Machine learning is similar to data mining; however, data mining is the science of discovering unknown patterns and relationships in data, whereas machine learning applies previously inferred knowledge to new data to:
 Make decisions in real-life applications
 Computers approximate complex functions from historical data
 Rules are not explicitly programmed but learned from data
Learning—A Two-Step Process
1. Model construction:
 A training set is used to create the model.
 The model is represented as classification rules, decision trees, or mathematical formula
2. Model usage:
 The test set is used to see how well it works for classifying future or unknown objects
Importance of Machine Learning
 Some tasks cannot be defined well, except by examples (e.g., recognizing people).
 Relationships and correlations can be hidden within large amounts of data; Machine Learning/Data Mining may be able to find these relationships.
 Human designers often produce machines that do not work as well as desired in the environments in
which they are used.
 The amount of knowledge available about certain tasks might be too large for explicit encoding by
humans (e.g., medical diagnostic).
 Environments change over time.
 New knowledge about tasks is constantly being discovered by humans. It may be difficult to continuously redesign systems “by hand”.
Area of influence for Machine Learning
 Statistics: How best to use samples drawn from unknown probability distributions to help decide
from which distribution some new sample is drawn?
 Brain Models: Non-linear elements with weighted inputs (Artificial Neural Networks) have been
suggested as simple models of biological neurons.
 Adaptive Control Theory: How to deal with controlling a process having unknown parameters that
must be estimated during operation?
 Psychology: How to model human performance on various learning tasks?
 Artificial Intelligence: How to write algorithms to acquire the knowledge humans are able to
acquire, at least, as well as humans?
 Evolutionary Models: How to model certain aspects of biological evolution to improve the
performance of computer programs

The Machine Learning framework consists of problem definition, data collection, preprocessing,
feature engineering, model selection, training, evaluation, and deployment. It ensures systematic
development of predictive models from raw data to final prediction.

Problem → Data → Preprocessing → Feature Engineering → Model → Training → Evaluation →


Deployment

A set of attributes used to describe a given object is called an attribute vector (or feature vector).
The distribution of data involving one attribute (or variable) is called univariate.
The type of an attribute is determined by the set of possible values the attribute can have. Attributes
can be nominal, binary, ordinal, or numeric.

A dataset is a complete collection of related data used for analysis or machine learning. A data object
is a single instance or record within that dataset.

Attribute types in ML
 Nominal categories, states, or “names of things”
Eg. Hair_color= {black, brown, blond, red, grey, white}
 Binary: a nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important, e.g. gender
Asymmetric binary: outcomes not equally important, e.g. a medical test (positive vs. negative)
 Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known. E.g. size = {small, medium, large}, grades
 Numeric: quantity (integer or real-valued), measured on a scale of equal-sized units; values have order
 Interval-scaled: no true zero-point. E.g. temperature in C˚ or F˚, calendar dates
 Ratio-scaled: inherent zero-point. E.g. temperature in Kelvin, length, counts
 Discrete attribute: has only a finite or countably infinite set of values. E.g. zip codes, profession. Sometimes represented as integer variables.
Note: binary attributes are a special case of discrete attributes
 Continuous attribute: has real numbers as attribute values. E.g. temperature, height, or weight. In practice, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables (float, double, long double).
Measure of central tendency
1. Mean
If the mean is sensitive to extreme values (outliers), we can use the median as a robust measure of central tendency. Alternatively, we may remove outliers using statistical methods or use a trimmed mean.
2. Median
3. Mode
4. Midrange
Measures of data dispersion
1. Standard deviation
2. Range
3. Five-number summary: min, Q1, median, Q3, max
In a boxplot, the whiskers extend at most to the fences: lower fence = Q1 - 1.5 × IQR, upper fence = Q3 + 1.5 × IQR
4. IQR = Q3 - Q1
 Outlier: usually, a value falling at least 1.5 × IQR above the third quartile or below the first quartile
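A sketch of the 1.5 × IQR outlier rule using Python's statistics module (quartile conventions vary; this uses the module's default exclusive method) on an assumed toy sample:

```python
import statistics

def iqr_outliers(data):
    """Flag values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR (Tukey's fences)."""
    q1, _median, q3 = statistics.quantiles(sorted(data), n=4)  # exclusive method
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # assumed sample with one extreme value
outliers = iqr_outliers(data)
```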
Basic properties of the standard deviation
 Measures spread about the mean and should be used only when the mean is chosen as the measure of the center
 σ = 0 only when there is no spread, that is, when all observations have the same value; otherwise σ > 0
Proximity refers to a similarity or dissimilarity
Measures of Data Quality
Accuracy: How well does a piece of information reflect reality? [Correct/wrong]
Completeness: Does it fulfill your expectations of what’s comprehensive? [recorded/not]
Consistency: Does information stored in one place match relevant data stored elsewhere?
Timeliness: Is your information available when you need it?
Validity: Is information in a specific format, does it follow business rules?
Uniqueness: Is this the only instance in which this information appears in the dataset?
Why Data Preprocessing?
Data in the real world is dirty
 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
 Noisy: containing errors or outliers that deviate from the expected
 Inconsistent: containing discrepancies in codes or names; lack of compatibility (e.g., some attributes representing a given concept may have different names in different databases)
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 Data warehouse needs consistent integration of quality data
 A multi-dimensional measure of data quality:
 A well-accepted multi-dimensional view: accuracy, completeness, consistency,
timeliness, believability, value added, interpretability, accessibility
 Broad categories: intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
 Data cleaning fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
 Data integration is integration of multiple databases, data cubes, files, or notes
 Data transformation
 Normalization (scaling to a specific range)
 Aggregation.
 Data comparison
 Data reduction
 Obtains reduced representation in volume but produces the same or similar analytical
results
 Data discretization: with particular importance, especially for numerical data
 Data aggregation, dimensionality reduction, data compression, generalization
 Important for Big Data Analysis
 Data discretization (for numerical data)
 It refers to transforming continuous data into discrete interval values
Data preparation, cleaning, and transformation comprise the majority of the work in a data mining
application (90%).
Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
How to Handle Missing Data
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective in certain cases
 Fill in the missing value manually: tedious and infeasible
 Use a global constant to fill in the missing value: e.g., “unknown”, a new class?! Simple but not recommended, as this constant may form some interesting pattern and mislead the decision process
 Use the attribute mean: for all samples belonging to the same class, fill in the missing value with the mean value of the attribute
 Use the most probable value: fill in the missing value by predicting it from the correlation of the available values
Except for the first two approaches, the filled-in values may be incorrect; the last two are the most common.
Data imbalance (class imbalance) happens when the number of samples in one class is much larger
than in another class in a dataset.
For example:
 Fraud detection: 99% non-fraud, 1% fraud
 Medical diagnosis: 95% healthy, 5% disease
Why is Data Imbalance a Problem?
1. The model becomes biased toward the majority class
2. Poor recall for minority class
3. Misleading accuracy score
4. Important events (fraud, disease, defects) get missed
How to solve data imbalance problem
1. Data-Level Methods (Resampling)
✅ A. Oversampling (Increase minority class)
 Random oversampling
 SMOTE (Synthetic Minority Over-sampling Technique) – generates synthetic samples
 ADASYN
✔ Good when dataset is small
❌ Risk of overfitting (if simple duplication)
✅ B. Undersampling (Reduce majority class)
 Random undersampling
 Tomek links
 NearMiss
✔ Faster training
❌ May lose useful information
2. Algorithm-Level Methods
✅ A. Use Class Weights
Most ML models support weighting:
 Logistic Regression
 Decision Trees
 Random Forest
 Neural Networks
This penalizes mistakes on minority class more.
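One common weighting heuristic (the “balanced” scheme used, for example, by scikit-learn's class_weight='balanced') can be sketched as follows; the 9:1 labels are illustrative:

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight(class) = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical 9:1 imbalanced labels: the minority class gets a 9x larger weight.
weights = balanced_class_weights([0] * 90 + [1] * 10)
# {0: 100/(2*90) ≈ 0.556, 1: 100/(2*10) = 5.0}
```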
✅ B. Use Specialized Algorithms
 Balanced Random Forest
 XGBoost with scale_pos_weight
 LightGBM with is_unbalance=True
3. Evaluation Strategy Fix
Never rely on accuracy alone.
Use:
 Precision
 Recall
 F1-score
 ROC-AUC
 PR-AUC (better for heavy imbalance)
Confusion matrix is very important.
When to Use What?
Situation | Best Method
Small dataset | SMOTE
Huge dataset | Undersampling
Deep learning | Class weights
Severe imbalance (1:1000) | Combine oversampling + class weighting
Fraud/medical tasks | Focus on Recall + PR-AUC
Best Practice (Recommended)
1. Start with class weights
2. Try SMOTE
3. Evaluate using F1 / PR-AUC
4. Tune decision threshold (very important!)
Feature Selection is a process that chooses an optimal subset of features according to a certain
criterion.
Why we need Feature Selection (FS)?
 To improve performance (in terms of speed, predictive power, simplicity of the model).
 To visualize the data for model selection.
 To reduce dimensionality and remove noise.
Supervised learning is the prevalent method for constructing predictors from data,
Learning a function that maps an input to an output based on example input-output pairs.
Supervised learning algorithms are broadly categorized into classification and regression,
 Classification problem, if the target variable is categorical,
Classification algorithms are used to predict/Classify the discrete values such as Male or Female, True
or False, Spam or Not Spam, etc.
 Regression problem, if the target variable is continuous
Regression algorithms are used to predict continuous values such as price, salary, age, etc.
For example, you can estimate the selling price of a house in a given city.
Classification is a two-step process: a model construction (learning) phase and a model usage
(applying) phase.
In ML, single classification assigns each instance to one of several mutually exclusive classes, while
multi-label classification allows an instance to belong to multiple classes simultaneously
Approaches to solve the multi-label classification problem
1. Problem transformation: transforms a multi-label classification problem into multiple single-label classification problems; this can be called adapting the data to the algorithm.
 Binary Relevance (BR): decomposes the multi-label problem into independent single-label (binary) problems.
Advantages: computationally efficient
Disadvantages: does not capture the dependence relations among the class variables
 Label Powerset (LP): converts the multi-label problem into a single multi-class problem by treating each label set as a class.
Advantages: learns the full joint distribution of the class variables, and each of the new class values maps to a label combination
Disadvantages: the number of values in the new class can be exponential (|YLP| = O(2^d)), and learning a multi-class classifier over exponentially many choices is expensive
 Classifier Chains (CC): resolves the BR limitation by modeling label correlations as a chain of classifiers.
Limitation: the result can vary for different chain orders. Solution: use an ensemble of chains.
2. Adapted algorithm: performs multi-label classification directly, rather than transforming the problem into different subsets of the problem.
3. Ensemble approaches: multiple classifier systems train multiple hypotheses to solve the same problem.
Note: multi-label problems can essentially be broken down into sets of binary problems without much loss of information.
Regression methods
Regression analysis is the process of estimating a functional relationship between X and Y.
A regression equation is often used to predict a value of Y for a given value of X.
Another way to study the relationship between two variables is correlation. It involves measuring the
direction and the strength of the linear relationship
Logistic regression estimates the probability of an event occurring, such as voting or not voting,
based on a given data set of independent variables.
It is a supervised ML algorithm widely used for binary classification tasks, such as identifying
whether an email is spam or not.
Binary logistic regression is for a dichotomous criterion (i.e., 2-level variable)
The other type of logistic regression is multinomial logistic regression, which is for a multi-categorical criterion (i.e., a variable with more than 2 levels).
KNN
 The k-Nearest Neighbors (KNN) family of classification and regression algorithms is often
referred to as memory-based learning or instance-based learning.
Sometimes, it is also called lazy learning.
 It is a non-parametric, supervised learning classifier, which uses proximity to make classifications
or predictions about the grouping of individual data
 K-NN does not build a model from the training data.
 The nearest-neighbor is that the properties of any particular input X are likely to be similar to those
of points in the neighborhood of X
How to choose the value of k for the KNN Algorithm?
The value of k in KNN decides how many neighbors the algorithm looks at when making a prediction.
Choosing the right k is important for good results.
If the data has lots of noise or outliers, using a larger k can make the predictions more stable.
But if k is too large, the model may become too simple and miss important patterns, and this is
underfitting.
So, k should be picked carefully based on the data.
Three points are required to deal with KNN
 The set of stored records
 Distance Metric to compute the distance between records
 The value of k, the number of nearest neighbors to retrieve
The most commonly used distance function is the Euclidean distance:
D(x, y) = sqrt( Σ_i (x_i - y_i)^2 )
K-nearest neighbor’s algorithm steps
 Step1. Determine the parameter k number of nearest neighbors
 Step2. Calculate the distance between the query instance and all the training samples
 Step3. Sort the distance and determine nearest neighbors based on the k-th minimum distance
 Step 4. Gather the category Y of the nearest neighbors
 Step 5. Use the simple majority of the category of nearest neighbors as the prediction value of the
query instance
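The five steps above can be sketched in a few lines; the toy records and labels below are assumed:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, records, labels, k):
    """Steps 1-5: compute distances, sort, take the k nearest, majority vote."""
    by_distance = sorted(range(len(records)),
                         key=lambda i: euclidean(query, records[i]))
    votes = Counter(labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

# Assumed toy data: two well-separated clusters in 2D.
records = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels  = ["A", "A", "A", "B", "B", "B"]
pred = knn_predict((2, 2), records, labels, k=3)
```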
Decision Tree for classification
Decision tree structure
 Root node: beginning of a tree and represents entire population being analyzed.
It has no incoming edges and zero or more outgoing edges
 Internal node: denotes a test on an attribute
Each of which has exactly one incoming edge and two or more outgoing edges
 Branch: represents an outcome of the test
 Leaf nodes: represent class labels or class distribution
Each of which has exactly one incoming and no outgoing edges.
The algorithm for the decision tree involves three general phases as stated below:
 Phase I: Find Splitting Criteria based on all the sample set at the splitting point (node) (attribute
selection measure)
 Phase II: Split all the sample data based on the splitting criteria and form branches and each
successor nodes will have the samples
 Phase III: Do phase one and phase two iteratively until the stopping criterion is fulfilled
Attribute selection measures
 Information theory
 Information Gain (ID3 algorithm)
 Gain Ratio (C4.5 algorithm)
 Gini Index (CART algorithm)
 Information theory provides a mathematical basis for measuring the information content.
Information Gain: all attributes are assumed to be categorical
Stopping Criteria
Some conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning
 There are no samples left in the branching direction
 Some user-defined stopping criterion met, such as
Depth threshold
The number of samples in the node becomes below some value
Strengths
 In practice, one of the most popular methods. Why?
 Very comprehensible: the tree structure specifies the entire decision logic
 Easy for decision makers to understand the model’s rationale
 Relatively easy to implement
 Very fast to run (to classify examples) with large data sets
Weaknesses
 Splits are simple and not necessarily optimal
 Overfitting: the model performs poorly on new examples (e.g., testing examples) because it is too highly tuned to the specific (non-general) nuances of the training examples (complex, overly specific conditions); it may not achieve the same performance when deployed
 Underfitting: the model performs poorly on new examples because it is too simplistic to distinguish between them, i.e., it has over-generalized during training (too few conditions) and has not picked up the important patterns from the training examples
Bayesian Classification
P(h|X) = P(X|h) P(h) / P(X)
where h is a hypothesis and X is the training data (evidence)
Compared to Decision tree, Bayesian classifiers have also exhibited high accuracy and speed when
applied to large databases
The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis
 This greatly reduces the computation cost:-only counts the class distribution
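A numeric sketch of Bayes’ rule with assumed prior and likelihoods, using the law of total probability for P(X):

```python
# Assumed toy numbers illustrating P(h|X) = P(X|h) * P(h) / P(X).
p_h = 0.01               # prior P(h)
p_x_given_h = 0.9        # likelihood P(X|h)
p_x_given_not_h = 0.05   # P(X|not h)

# Total probability: P(X) = P(X|h)P(h) + P(X|not h)P(not h).
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
posterior = p_x_given_h * p_h / p_x   # P(h|X), about 0.154 here
```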
SVM
Support Vector Machine (SVM) is a relatively new class of successful supervised learning methods
for classification and regression
SVMs can represent both linear and non-linear functions, and they have an efficient training algorithm.
Margin maximization is the SVM’s clever way to prevent overfitting.
The largest number of data points h that a model class can shatter is called its Vapnik-Chervonenkis (VC) dimension.
The model does not need to shatter all sets of data points of size h; one such set is sufficient.
 SVMs are linear classifiers that find a hyperplane to separate two classes of data,
In the 2D space, the decision boundary is a line; in the 3D space, it is a plane; in higher dimensions, it is a hyperplane.
 Our aim is to find a hyperplane f(x) = sign(w•x + b) that correctly classifies our data.
A good separation is achieved by the hyperplane that has the largest distance to the neighboring data
points of both classes.
 The vectors (points) that constrain the width of the margin are the support vectors.
Misclassification may happen
Non-Separable Data:
Slack Variables
The slack variable measures how far a data point is from the margin or on the incorrect side of the
hyperplane.
The kernel trick is a technique in ML that allows you to implicitly map data into a higher-
dimensional space without explicitly performing the transformation.
This method is particularly useful in scenarios where the transformation to a higher dimensional space
is computationally expensive or even impossible to perform directly.
A kernel function is some function that corresponds to an inner product in some expanded feature
space.
The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts a non-separable problem into a separable one.
The mapping function Φ maps data into a different space to enable linear separation.
The kernel trick converts linear learning machines (such as linear SVM) into non-linear ones (such as kernel SVM) by computing inner products between mapped data points without performing the mapping explicitly.
 Kernel functions are very powerful. They allow SVM models to perform separations even with very
complex boundaries.
Why use kernels?
 Make non-separable problem separable.
 Map data into better representational space
Kernel helps to find a hyperplane in the higher dimensional space without increasing the
computational cost
Strengths
 Training is relatively easy
No local optima, unlike in neural networks
 It scales relatively well to high dimensional data
 Tradeoff between classifier complexity and error can be controlled explicitly
 Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
Weaknesses
 Need to choose a “good” kernel function.
ANN
Artificial Neural Networks (ANNs) are Neural Networks (NN) inspired by biological neural networks.
 A neural network is composed of several processing elements called neurons that are connected,
and their output depends on the strength of the connections.
Learning involves adapting weight factors that represent connection strength.
In general
Neural networks (NN) or artificial neural networks (ANN)
 The resulting model from neural computing is called an artificial neural network (ANN) or neural
network
 NN represents a brain metaphor for information processing
 Computer technology that attempts to build computers that will operate like a human brain. The
machines possess simultaneous memory storage and work with ambiguous information
• Each input has an associated weight w, which can be modified to model synaptic learning.
Properties of ANN
 Learning: from examples, labeled or unlabeled
 Adaptivity: changing the connection strengths to learn things
 Non-linearity: the non-linear activation functions are essential
 Fault tolerance: if one of the neurons or connections is damaged, the whole network still works quite
well
Thus, they might be better alternatives than classical solutions for problems characterized by: high
dimensionality, noisy, imprecise or imperfect data; and a lack of a clearly stated mathematical solution
or algorithm
ANNs model
 Neural Network learns by adjusting the weights so as to correctly classify the training
data and hence, in the testing phase, to classify unknown data.
 Neural Network needs long time for training.
 Neural Network has a high tolerance to noisy and incomplete data
Bias can be incorporated as another weight clamped to a fixed input of +1.0
 This extra free variable (bias) makes the neuron more powerful
All data must be normalized.
 (i.e., all values of attributes in the database are changed to fall in the interval [0, 1] or [-1,
1]) A neural network works with data in the range (0, 1) or (-1, 1)
 Two basic data normalization techniques
1. Max-Min normalization
Rescales data into a fixed range using minimum and maximum values
2. Decimal Scaling normalization
Normalizes data by dividing by a power of 10 so that values fall within (-1,1).
Comparison Table
Feature | Min–Max | Decimal Scaling
Formula | (x − min)/(max − min) | x / 10^j
Output Range | usually [0, 1] | (−1, 1)
Needs min/max? | Yes | No
Sensitive to outliers? | Yes | Yes
Common in ML? | Very common | Rare
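Both normalization techniques can be sketched in a few lines of Python (the helper names are illustrative):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    # rescale into [new_min, new_max] using the observed min and max
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def decimal_scaling(values):
    # divide by the smallest power of 10 that brings every |value| below 1
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(data))  # j = 4 -> [0.02, 0.03, 0.04, 0.06, 0.1]
```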
One of the most popular neural network models is the multi-layer perceptron (MLP).
 In an MLP, neurons are arranged in layers. There is one input layer, one output layer, and one or
more hidden layers.
Performance evaluation
 How to obtain a reliable estimate of performance?
 Performance of a model may depend on other factors besides the learning algorithm:
 Class distribution
 Cost of misclassification
 Size of training and test sets
 Metrics for Performance Evaluation
 How to evaluate the performance of a model?
 Methods for Performance Evaluation
 How to obtain reliable estimates?
 Methods for Model Comparison
 How to compare the relative performance among competing models?
Absolute and Mean Square Error
These refer to the error committed when classifying an object to the desired class, i.e., the difference
between the desired value and the predicted value.
 For all these measures, smaller values indicate a better-fitting model.
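A minimal sketch of both measures in Python (function names are illustrative):

```python
def mean_absolute_error(y_true, y_pred):
    # average of |desired - predicted|
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    # average of (desired - predicted)^2; penalizes large errors more
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 2) / 3 = 0.8333...
print(mean_squared_error(y_true, y_pred))   # (0.25 + 0 + 4) / 3 = 1.4166...
```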
Accuracy
Accuracy is a measure of how well a binary classifier correctly identifies or excludes a condition.
It’s the proportion of correct predictions among the total number of cases examined.
 It assumes equal cost for all classes
 Misleading in unbalanced datasets
 It doesn’t differentiate between different types of errors
 Examples of unbalanced data:
 Cancer dataset: 10,000 instances; 9,990 are normal, 10 are ill. A model that classifies
everything as normal is still 99.9% accurate.
 Medical diagnosis: 95% healthy, 5% diseased.
 E-Commerce: 99% don't buy, 1% buy.
 Security: 99.999% of citizens are not terrorists.
Limitation of accuracy
 Consider a 2-class problem
 Number of Class 0 examples = 9990
 Number of Class 1 examples = 10
 If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
 Accuracy is misleading because the model detects none of the Class 1 examples
Binary classification Confusion Matrix
A confusion matrix tabulates predicted labels against actual labels. The notion extends usefully to the
multiclass case: cell (i, j) indicates how many of the i-labeled examples were predicted to be j.
• There are four quadrants in the confusion matrix, which are symbolized as below.
• True Positive (TP) : The number of instances that were positive (+) and correctly classified as
positive (+).
• False Negative (FN): The number of instances that were positive (+) and incorrectly classified as
negative (-).
• False Positive (FP): The number of instances that were negative (-) and incorrectly classified as (+).
• True Negative (TN): The number of instances that were negative (-) and correctly classified as (-).
 Given a dataset of P positive instances and N negative instances:
TN rate = TN/(TN + FP)
FN rate = FN/(FN + TP)
Error rate = 1 − success rate
Accuracy = success rate; Loss = error rate
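These rates can be computed with one small helper (pure Python; the function name and the counts are illustrative):

```python
def binary_metrics(tp, fn, fp, tn):
    # all rates derived from the four confusion-matrix quadrants
    total = tp + fn + fp + tn
    return {
        "accuracy":   (tp + tn) / total,
        "tpr":        tp / (tp + fn),   # recall / sensitivity
        "fpr":        fp / (fp + tn),
        "tn_rate":    tn / (tn + fp),   # specificity
        "fn_rate":    fn / (fn + tp),
        "error_rate": (fp + fn) / total,
    }

m = binary_metrics(tp=40, fn=10, fp=5, tn=45)
print(m["accuracy"], m["error_rate"])  # 0.85 and 0.15: they sum to 1
```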
Precision (how precise we are in detection)
 Of all patients where we predicted 𝑦 = 1, what fraction actually has the disease?
 Precision = TP/(TP + FP)
Recall (how good we are at detecting)
 Recall (also called Sensitivity in some fields) measures the proportion of actual positives that are
correctly identified as such (e.g. the percentage of sick people who are identified as having the
condition).
 Of all patients that actually have the disease, what fraction did we correctly detect as having the
disease?
 Recall = TP/(TP + FN)
 Sensitivity measures the classifier's ability to detect positive classes.
Step 1: Start With the Confusion Matrix (Multiclass)
Suppose we have 3 classes: A, B, C
Example Confusion Matrix:
Actual \ Predicted A B C
A 40 5 5
B 3 30 7
C 2 6 32
Total samples:
40+5+5+3+30+7+2+6+32=130
✅ 1 Accuracy (Multiclass)
Accuracy = Correct predictions / Total samples
Correct predictions = diagonal sum
40+30+32=102
Accuracy=102/130
Accuracy=0.7846≈78.46%
✅ 2 Precision (Multiclass)
For multiclass, we calculate precision for each class separately.
Treat one class as "positive" and others as "negative".
🔹 Precision for Class A
Precision = TP / (TP + FP)
For class A:
TP(A) = 40
FP(A) = Predicted A but not actual A
= 3 (from B) + 2 (from C) = 5
Precision A = 40/(40 + 5) = 0.8889
🔹 Precision for Class B
TP(B) = 30
FP(B) = 5 (from A) + 6 (from C) = 11
Precision B = 30/(30 + 11) = 0.7317
🔹 Precision for Class C
TP(C) = 32
FP(C) = 5 (from A) + 7 (from B) = 12
Precision C = 32/(32 + 12) = 32/44 = 0.7273
✅ 3 Recall (Multiclass)
Recall = TP / (TP + FN)
🔹 Recall for Class A
FN(A) = 5 + 5 = 10
Recall A = 40/(40 + 10) = 0.80
🔹 Recall for Class B
FN(B) = 3 + 7 = 10
Recall B = 30/(30 + 10) = 0.75
🔹 Recall for Class C
FN(C) = 2 + 6 = 8
Recall C = 32/(32 + 8) = 0.80
✅ 4 F1-Score (Per Class)
F1 = 2 · Precision · Recall / (Precision + Recall)
🔹 F1 for Class A
F1 A = 2(0.8889)(0.80)/(0.8889 + 0.80) = 0.842
🔹 F1 for Class B
F1 B = 2(0.7317)(0.75)/(0.7317 + 0.75) = 0.741
🔹 F1 for Class C
F1 C = 2(0.7273)(0.80)/(0.7273 + 0.80) = 0.762
🎯 Final Step: Averaging (Important for Exams)
There are 3 types of averaging:
1 Macro Average
Simple average across classes.
Precision macro = (P_A + P_B + P_C)/3
Same for Recall and F1.
All classes treated equally.
2 Weighted Average
Weighted by number of samples in each class.
Used when dataset is imbalanced.
3 Micro Average
Calculate global TP, FP, FN first.
In multiclass:
Micro Precision=Micro Recall=Micro F1=Accuracy
(when using single-label classification)
🧠 Important Exam Notes
Accuracy
Good when classes are balanced.
Precision
Important when:
 False positives are costly
Example:
Spam detection
Recall
Important when:
 Missing positive cases is costly
Example:
Disease detection
F1
Used when:
 Need balance between precision and recall
 Data is imbalanced
🎓 Final Exam Summary Formula Sheet
For class i:
Precision_i = TP_i/(TP_i + FP_i)
Recall_i = TP_i/(TP_i + FN_i)
F1_i = 2 · Precision_i · Recall_i/(Precision_i + Recall_i)
Accuracy = (∑ diagonal)/Total
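The whole worked example can be checked in a few lines (pure Python; variable and function names are illustrative):

```python
# rows = actual class, columns = predicted class (the matrix from the example)
confusion = {
    "A": {"A": 40, "B": 5,  "C": 5},
    "B": {"A": 3,  "B": 30, "C": 7},
    "C": {"A": 2,  "B": 6,  "C": 32},
}
classes = list(confusion)
total = sum(sum(row.values()) for row in confusion.values())  # 130
accuracy = sum(confusion[c][c] for c in classes) / total      # 102/130

def precision(c):
    # TP / (TP + FP): of everything predicted as c, how much was really c
    predicted_c = sum(confusion[a][c] for a in classes)
    return confusion[c][c] / predicted_c

def recall(c):
    # TP / (TP + FN): of everything that really is c, how much we caught
    return confusion[c][c] / sum(confusion[c].values())

def f1(c):
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)

macro_precision = sum(precision(c) for c in classes) / len(classes)
print(round(accuracy, 4), round(precision("A"), 4), round(f1("A"), 3))
# 0.7846 0.8889 0.842
```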
1 What is Cross-Validation?
Cross-validation (CV) is a technique used to evaluate the performance of a machine learning model
by splitting the data into multiple training and testing sets.
It helps estimate how well the model will perform on unseen data.
Why Do We Need It?
If we:
 Train and test on same data → overfitting
 Use one random split → unstable result
Cross-validation:
 Uses data efficiently
 Reduces variance
 Gives more reliable performance estimate
✅ 2 k-Fold Cross-Validation (Most Important)
Steps:
1. Split dataset into k equal folds
2. Choose 1 fold as test set
3. Use remaining (k−1) folds as training set
4. Train model
5. Test model
6. Repeat k times
7. Average performance
Example: 5-Fold CV
Dataset = 100 samples
k=5
Each fold = 20 samples
Accuracy = (Acc1 + Acc2 + Acc3 + Acc4 + Acc5)/5
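The fold construction and averaging above can be sketched as (pure Python; names and scores are illustrative):

```python
def k_fold_indices(n, k):
    # split sample indices 0..n-1 into k roughly equal, disjoint folds
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_score(fold_scores):
    # final performance = average over the k per-fold scores
    return sum(fold_scores) / len(fold_scores)

folds = k_fold_indices(100, 5)  # five folds of 20 samples each
mean_acc = cv_score([0.82, 0.78, 0.80, 0.84, 0.76])
print([len(f) for f in folds], mean_acc)
```

In each of the 5 rounds, one fold is held out for testing and the other 80 samples are used for training.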
3 Special Types of Cross-Validation
1. Stratified k-Fold
Used for classification.
Ensures each fold has same class distribution as original dataset.
Very important when dataset is imbalanced.
2. Leave-One-Out Cross-Validation (LOOCV)
Special case of k-fold:
k = n
Each iteration:
 Train on n-1 samples
 Test on 1 sample
Advantage:
 Uses almost all data for training
Disadvantage:
 Very slow for large datasets
3. Repeated k-Fold
Repeat k-fold multiple times with different splits.
More stable estimate.
✅ 4 How Performance is Calculated
For each fold:
 Compute Accuracy
 Precision
 Recall
 F1
 MSE (for regression)
Final performance:
Mean = (1/k) ∑_{i=1}^{k} Score_i
Sometimes also compute:
Standard Deviation
To measure stability.
✅ 5 When to Use Cross-Validation
✔ Small dataset
✔ Model selection
✔ Hyperparameter tuning
✔ Comparing models
✅ 6 Cross-Validation vs Train-Test Split
Train-Test Split Cross-Validation
Single split Multiple splits
Faster Slower
Less reliable More reliable
More variance Less variance
✅ 7 Cross-Validation for Model Selection
Used to choose:
 Best k in KNN
 Best depth in Decision Tree
 Best C in SVM
 Best learning rate in ANN
Procedure:
1. Try different parameter values
2. Use k-fold CV for each
3. Choose best average performance
✅ 8 Bias-Variance Perspective
 Small k (e.g., 5) → Higher bias, lower variance
 Large k (e.g., 10) → Lower bias, higher variance
 LOOCV → Very low bias, high variance
Common choice:
k=5 or 10
✅ 9 Advantages
✔ Efficient use of data
✔ Reliable estimate
✔ Reduces overfitting risk
✔ Helps hyperparameter tuning
❌ Disadvantages
✖ Computationally expensive
✖ Slow for large datasets
✖ Not ideal for time-series (need special method)
✅ 🔟 Special Case: Time Series
Use:
 Rolling validation
 Forward chaining
Do NOT shuffle time-series data.
🎯 Common Exam Questions
1. Define cross-validation.
2. Explain k-fold cross-validation.
3. What is stratified k-fold?
4. Difference between LOOCV and k-fold.
5. Why use cross-validation?
6. Bias-variance tradeoff in k selection.
7. Cross-validation vs bootstrap.
🎓 5-Mark Answer Template
Cross-validation is a model evaluation technique where the dataset is divided into k folds. The model
is trained on k−1 folds and tested on the remaining fold. This process is repeated k times and the final
performance is the average of all folds. It provides a more reliable estimate of model performance
compared to a single train-test split.
1 What is ROC?
ROC = Receiver Operating Characteristic curve
It is a graph that shows the performance of a binary classifier at different classification thresholds.
It plots:
TPR (True Positive Rate)
vs
FPR (False Positive Rate)
2 Important Definitions
From confusion matrix:
Predicted Positive Predicted Negative
Actual Positive TP FN
Actual Negative FP TN
✅ True Positive Rate (TPR)
Also called Recall or Sensitivity
TPR = TP/(TP + FN)
✅ False Positive Rate (FPR)
FPR = FP/(FP + TN)
3 What Does ROC Curve Show?
 X-axis → FPR
 Y-axis → TPR
Each point = different decision threshold
Threshold Effect
If threshold is very low:
 Almost everything predicted positive
 TPR high
 FPR high
If threshold is very high:
 Almost everything predicted negative
 TPR low
 FPR low
ROC shows tradeoff between TPR and FPR.
✅ 4 What is AUC?
AUC = Area Under the ROC Curve
It measures overall model performance.
0≤AUC≤1
5 Geometric Meaning of AUC
AUC = probability that:
A randomly chosen positive sample
is ranked higher than
a randomly chosen negative sample.
✅ 6 How to Calculate AUC (Simple Example)
Suppose ROC points are:
FPR TPR
0 0
0.2 0.6
0.5 0.8
1 1
We compute area using trapezoidal rule.
Example, first trapezoid:
Area = (0.2 − 0) × (0 + 0.6)/2 = 0.06
Add all segments: 0.06 + 0.21 + 0.45 = 0.72.
Total area = AUC.
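The trapezoidal-rule computation can be sketched as (pure Python; the function name is illustrative):

```python
def auc_trapezoid(fpr, tpr):
    # sum trapezoid areas between consecutive ROC points
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

fpr = [0.0, 0.2, 0.5, 1.0]
tpr = [0.0, 0.6, 0.8, 1.0]
print(auc_trapezoid(fpr, tpr))  # 0.06 + 0.21 + 0.45 = 0.72
```

A diagonal ROC (random guessing) gives AUC = 0.5 with the same routine.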
✅ 7 Why Use AUC?
✔ Evaluates ranking quality
✔ Independent of classification threshold
✔ Good for imbalanced datasets
✔ Measures discrimination ability
❌ Limitations
✖ Only for binary classification (basic ROC)
✖ Does not reflect actual probability calibration
✖ Can be misleading in highly skewed data
✅ 8 ROC vs Accuracy
Accuracy | ROC-AUC
Depends on threshold | Independent of threshold
Bad for imbalanced data | Better for imbalanced data
Measures exact predictions | Measures ranking ability
✅ 9 When to Use ROC-AUC
Use when:
 Binary classification
 Class imbalance
 Need threshold-independent evaluation
 Comparing classifiers
Examples:
 Disease detection
 Fraud detection
 Spam filtering
✅ ROC for Multiclass
Use:
1. One-vs-Rest (OvR)
2. Micro-average AUC
3. Macro-average AUC
Key Exam Differences
ROC vs Precision-Recall Curve
 ROC better when classes balanced
 PR curve better when positive class is rare
5-Mark Exam Definition
ROC curve plots True Positive Rate against False Positive Rate at various thresholds. The Area Under
the Curve (AUC) summarizes overall classification performance, where a higher AUC indicates better
model discrimination ability.
Quick Memory Trick
AUC ≈ How good is the model at ranking positives above negatives?
Unsupervised Learning
For unsupervised learning, we observe only the features X1, X2, …, Xn. There is no associated
response variable Y, so we are not interested in prediction.
Unsupervised learning aims to find the underlying structure or the distribution of data, so that we can
explore the data to find some intrinsic structures in it.
 It means that the data have no target attribute, but we want to find some intrinsic structures in them
that can be used to cluster the input data in classes based on their statistical properties only.
Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the
analysis, such as prediction of a response.
However, techniques for unsupervised learning are of growing importance in a number of fields:
 Subgroups of breast cancer patients grouped by their gene expression measurements,
 Groups of shoppers characterized by their browsing and purchase histories,
 Movies grouped by the ratings assigned by movie viewers. It is often easier to obtain unlabeled
data from a lab instrument or a computer than labeled data, which can require human
intervention.
Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a data set that
determine the intrinsic grouping in a set of unlabeled data
To make this concrete, we must define what it means for two or more observations to be similar (near)
or different (far away): objects should be similar to one another within the same cluster and dissimilar
to the objects in other clusters.
Each cluster is represented by a single point, known as the centroid (or cluster center) of the cluster.
 The centroid is computed as the mean of all data points in the cluster: Cj = (1/|Cj|) ∑ xi, over the xi in cluster j
 The cluster boundary is decided by the farthest data point in the cluster
Example 1: group people of similar sizes together to make "small", "medium", and "large" T-shirts.
Example 2: in marketing, segment customers according to their similarities, to do targeted
marketing.
 Example 3: given a collection of text documents, organize them according to their content
similarities, to produce a topic hierarchy.
Clustering quality
Inter-clusters distance  maximized
Intra-clusters distance  minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.
How is clustering subjective? Clustering finds similarity groups in data, and what counts as "similar" depends on the chosen features and distance function.
A typical cluster analysis consists of four steps:
Procedures of Cluster Analysis
1. Feature extraction is the process of using one or more transformations of the input features to
generate new principal features.
It can be elaborated in the context of dimensionality reduction and data visualization.
2. Clustering algorithm design or selection
 The clustering step is usually combined with the selection of a corresponding proximity (i.e,
the closeness or distance) measure and
 The construction of a clustering criterion function (i.e, finding the optimal partitioning of a
data set according to some criterion function or algorithm).
 Proximity measures refer to the measures of Similarity and Dissimilarity.
 Clustering criterion: once a proximity measure is chosen, the construction of a
clustering criterion function makes the partition of clusters as an optimization problem
3. Cluster validation (Assessment of results)
 Given a dataset, any clustering algorithm can usually generate clusters, no matter whether the
structure exists or not.
 Thus, effective evaluation standards and criteria are essential to provide the users with a degree
of confidence in the clustering results.
 These evaluation methods should be objective and have no preference for any algorithm.
4. Results Interpretation
 The final target of clustering is to supply users with meaningful perceptions from the original
dataset, with the aim that they can effectively solve the problems faced.
 Experts in different domains interpret the data groupings.
 Further analyses, even experiments, may be required to ensure the reliability of the extracted
knowledge.
Clustering results are crucially dependent on the measure of similarity (or distance) between “points”
to be clustered
Cluster analysts are usually interested in both shape and level differences among groups.
It aims to explore and analyze patterns from data samples and divide them into broad groups.
Application of Cluster Analysis
 General application of clustering
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters and explain them in spatial data mining
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns
Clustering Algorithms
 Type of clustering
 Exclusive Clustering: K-means (Partitioning clustering)
 Iteratively re-assign points to the nearest cluster center to partition the observations into a
pre-specified number of clusters.
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships
4) If a convergence criterion is not met, go to 2)
Given k, the k-means algorithm is implemented in 4 steps:
1. Select K centroid (Can be K values randomly, or K data points randomly)
2. Partition objects into k subsets. An object is assigned to cluster j if its distance to the mean of
cluster j is smaller than its distance to the other cluster means
3. Compute the new centroids of the clusters of the current partition. The centroid of the jth
cluster is the mean point of the data points assigned to cluster j in the previous step
4. Go back to Step 2; stop when the process converges.
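The steps above can be sketched directly (pure Python on 2-D points; names and data are illustrative):

```python
import math, random

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 1: pick k seeds
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # step 2: nearest centroid
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        new_centroids = [                          # step 3: recompute means
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:             # step 4: stop when stable
            return centroids, clusters
        centroids = new_centroids
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
       (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: the two obvious groups
```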
Strengths:
 Simple: easy to understand and to implement
 Efficient: Time complexity: O(tkn), where n is the number of data points, k is the number of
clusters, and t is the number of iterations.
Since both k and t are usually small, k-means is considered a linear algorithm.
K-means is the most popular clustering algorithm.
Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to
complexity.
Weakness
 The algorithm is only applicable if the mean is defined.
 For categorical data, use k-modes, where the centroid is represented by the most frequent values.
 The user needs to specify k.
 The algorithm is sensitive to outliers
 Outliers are data points that are very far away from other data points. Outliers could be
errors in the data recording or some special data points with very different values.
Common way to represent clustering
 Use the centroid of each cluster to represent the cluster.
 Compute the radius and
 standard deviation of the cluster to determine its spread in each dimension
 The centroid representation alone works well if the clusters are of a hyper-spherical
shape.
 If clusters are elongated or are of other shapes, centroids are not sufficient
 Using a classified model
 All the data points in a cluster are regarded as having the same class label, e.g., the
cluster ID.
 run a supervised learning algorithm on the data to find a classification model.
 Use frequent values to represent a cluster
 This method is mainly for clustering of categorical data (e.g., k-modes clustering).
 In text clustering, a small set of frequent words in each cluster is selected to represent the
cluster.
 Clusters of arbitrary shapes
 Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their
centroid together with spreads.
 Irregular shape clusters are hard to represent. They may not be useful in some applications.
• Using centroids is not suitable (upper figure) in general
 K-means clusters may be more useful (lower figure), e.g., for making 2 size T-shirts.
 Hierarchical Clustering: Agglomerative clustering, divisive clustering
 Start with each point as its own cluster and iteratively merge the closest clusters,
 We do not know in advance how many clusters we want; in fact, we end up with a tree-
like visual representation of the observations, called a dendrogram that allows us to view
at once the clustering obtained for each possible number of clusters, from 1 to n.
Hierarchical clustering is an alternative approach that does not require us to commit to a particular
choice of K. In agglomerative (bottom-up) clustering, the most common type, the dendrogram is built
starting from the leaves, combining clusters up to the trunk. A dendrogram is a nested sequence of
clusters (a cluster tree).
Types of hierarchical clustering
 Agglomerative (bottom-up) clustering: It builds the dendrogram (tree) from the bottom level,
and
 Merges the most similar (or nearest) pair of clusters
 Stops when all the data points are merged into a single cluster (i.e., the root cluster).
 Agglomerative clustering is more popular than divisive methods.
 At the beginning, each data point forms a cluster (also called a node).
 Merge nodes/clusters that have the least distance.
 Go on merging
 Eventually, all nodes belong to one cluster
 Divisive (top-down) clustering: It starts with all data points in one cluster, the root.
 Splits the root into a set of child clusters. Each child cluster is recursively divided further
 Stops when only singleton clusters of individual data points remain, i.e., each cluster
with only a single point
Complexity of hierarchical clustering
 A distance matrix is used for deciding which clusters to merge/split
 At least quadratic in the number of data points
 Not usable for large datasets
Strength
 No assumptions on the number of clusters
 Any desired number of clusters can be obtained by 'cutting' the dendrogram at the
proper level
 Hierarchical clustering may correspond to meaningful taxonomies
 Example in biological sciences (e.g., phylogeny reconstruction, etc), web (e.g., product
catalogs) etc
 Mean-shift clustering
 Estimate modes of pdf
 Spectral clustering
 Split the nodes in a graph based on assigned links with similarity weights
 Overlapping Clustering: Fuzzy C-means
 Probabilistic Clustering: Mixture of Gaussian models
Fuzzy C-means clustering
Fuzzy c-means clustering is a soft clustering approach, where each data point is assigned a likelihood
or probability of belonging to each cluster.
One data point may belong to two or more clusters with different membership degrees.
The algorithm depends on a parameter m which corresponds to the degree of fuzziness of the solution.
Measuring the distance of two clusters
A few ways to measure distances of two clusters.
Results in different variations of the algorithm.
 Single link minimum distance between any objects
The distance between two clusters is the distance between two closest data points in the two clusters,
one data point from each cluster.
Strengths of single-link clustering
 Can handle non-elliptical shapes
Limitations of single-link clustering
 Sensitive to noise and outliers
 It produces long, elongated clusters
 Complete link maximum distance between any objects
The distance is defined by the two most dissimilar objects
Strengths of complete-link clustering
 More balanced clusters (with equal diameter)
 Less susceptible to noise
Limitations of complete-link clustering
 Tends to break large clusters
 All clusters tend to have a similar diameter; small clusters are merged with larger ones
 Average link
A compromise between single and complete link that reduces the sensitivity of complete-link
clustering to outliers.
The distance between two clusters is the average distance of all pair-wise distances between the data
points in the two clusters.
Strength
 Less susceptible to noise and outliers
Limitation
 Biased towards globular clusters
 Centroids
The distance between two clusters is the distance between their centroids
Ward’s distance for clusters
 Similar to group average and centroid distance
 Less susceptible to noise and outliers
 Biased towards globular clusters
 Hierarchical analogue of k-means
Can be used to initialize k-means
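Agglomerative clustering with a pluggable cluster-distance (linkage) function can be sketched as (pure Python; names and data are illustrative):

```python
import math

def single_link(c1, c2):
    # distance between the two closest points, one from each cluster
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # distance between the two most dissimilar points
    return max(math.dist(a, b) for a in c1 for b in c2)

def agglomerative(points, target_k, linkage=single_link):
    clusters = [[p] for p in points]       # start: one cluster per point
    while len(clusters) > target_k:        # merge the closest pair each round
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0.0,), (0.5,), (0.9,), (5.0,), (5.4,)]
print(sorted(len(c) for c in agglomerative(pts, 2)))  # [2, 3]
```

Swapping `single_link` for `complete_link` (or an average-link function) changes only the merge criterion, not the overall procedure.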
Distance functions
Distance functions for numeric attributes
Most commonly used functions are
 Euclidean distance and
 Manhattan (city block) distance
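Both distance functions can be written directly (pure Python; names are illustrative):

```python
import math

def euclidean(x, y):
    # straight-line distance: sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # city-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0 (a 3-4-5 right triangle)
print(manhattan(p, q))  # 7.0
```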
Distance functions for binary and nominal attributes
Binary attribute: has two values or states but no ordering relationships,
e.g., Gender: male and female.
Information Theory and Statistical Learning (Machine Learning)
Information Theory and Statistical Learning provide the mathematical foundations for many
machine learning algorithms. Information theory explains how information is measured, while
statistical learning explains how models learn from data.
1. Information Theory
Information Theory studies the quantification, storage, and transmission of information. In ML, it
is used to measure uncertainty, impurity, and information gain.
2. Key Concepts in Information Theory
2.1 Information
Information is the reduction in uncertainty after observing an event.
If an event is rare, it carries more information.
2.2 Entropy
Entropy measures the uncertainty or impurity in a dataset.
H(S) = −∑ p_i log2(p_i)
Where:
 pi = probability of class i
Example
If a dataset has:
 50% Pass
 50% Fail
H = −(0.5 log2(0.5) + 0.5 log2(0.5)) = 1
👉 Maximum uncertainty
2.3 Information Gain
Information Gain measures the reduction in entropy after splitting data on an attribute.
IG(S, A) = H(S) − ∑ (|S_v|/|S|) · H(S_v)
Used in Decision Trees to select the best feature.
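Entropy and information gain can be computed as follows (pure Python; the tiny dataset is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum p_i log2(p_i) over the class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature):
    # IG(S, A) = H(S) - sum |S_v|/|S| * H(S_v) over the attribute's values
    n = len(labels)
    ig = entropy(labels)
    for v in set(feature):
        subset = [y for y, f in zip(labels, feature) if f == v]
        ig -= len(subset) / n * entropy(subset)
    return ig

labels = ["Pass", "Pass", "Fail", "Fail"]
study  = ["High", "High", "Low",  "Low"]   # perfectly predictive attribute
print(entropy(labels))                      # 1.0 (the 50/50 case above)
print(information_gain(labels, study))      # 1.0: all uncertainty removed
```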
2.4 Mutual Information
Measures how much information one variable provides about another.
I(X; Y) = H(X) − H(X|Y)
Used in:
 Feature selection
 Dependency analysis
3. Role of Information Theory in ML
Concept ML Application
Entropy Decision tree splitting
Information Gain Feature selection
Mutual Information Dependency measurement
KL-Divergence Model comparison
4. Statistical Learning
Statistical Learning is the framework for understanding how models learn patterns from data using
statistical principles.
5. Key Concepts in Statistical Learning
5.1 Population vs Sample
 Population: Entire dataset
 Sample: Subset used for learning
5.2 Model
A mathematical function that maps inputs to outputs.
y=f(x)+ϵ
Where:
 ϵ = noise
5.3 Empirical Risk Minimization (ERM)
Learning by minimizing average loss on training data.
f̂ = argmin_f (1/n) ∑ L(y_i, f(x_i))
5.4 Bias–Variance Tradeoff
 Bias: Error due to oversimplification
 Variance: Error due to sensitivity to data
Goal: Balance both.
5.5 Overfitting and Underfitting
 Overfitting: Model fits noise
 Underfitting: Model too simple
6. Role of Statistical Learning in ML
Concept Application
Regression Prediction
Classification Decision making
Hypothesis testing Model validation
Regularization Overfitting control
7. Relationship between Information Theory and Statistical Learning
Information Theory Statistical Learning
Measures uncertainty Learns from data
Focuses on entropy Focuses on risk minimization
Guides feature selection Guides model selection
📌 Example:
Decision Trees use entropy (information theory) and training data (statistical learning) together.
8. Simple Example (Student Dataset)
 Information Theory:
Which feature (attendance, study hours) gives the highest information gain?
 Statistical Learning:
How well does the model generalize to new students?
9. Summary (Exam-Oriented)
 Information theory quantifies information and uncertainty
 Entropy and information gain are core concepts
 Statistical learning focuses on learning models from data
 Both together form the foundation of ML algorithms
Data and Data Preprocessing in Machine Learning
Data is the foundation of Machine Learning. The quality of data and how it is prepared directly
affect the performance of ML models. Data preprocessing is the step where raw data is transformed
into a clean, usable format.
1. Data
Data is a collection of raw facts, observations, or measurements that can be processed to extract
meaningful information.
Types of Data
1.1 Structured Data
 Organized in rows and columns
 Stored in databases or spreadsheets
Examples:
Student records, exam scores, attendance
1.2 Unstructured Data
 No fixed format
Examples:
Text documents, emails, images, videos
1.3 Semi-Structured Data
 Partial structure
Examples:
XML, JSON files, log files
Data in Machine Learning
In ML, data usually consists of:
 Features (attributes / inputs)
 Labels (outputs) (in supervised learning)
Example:
Study Hours Attendance Score
3 85% 78
2. Data Preprocessing
Data preprocessing is the process of cleaning, transforming, and organizing raw data into a suitable
format for machine learning models.
📌 “Garbage in → Garbage out”
Good preprocessing leads to better models.
3. Steps in Data Preprocessing
3.1 Data Cleaning
Removes or fixes incorrect, incomplete, or inconsistent data.
a) Handling Missing Values
Methods:
 Remove rows/columns
 Replace with mean, median, or mode
 Use prediction methods
Example:
Missing exam score → replace with class average
b) Handling Noise
Noise = random errors or outliers
Methods:
 Smoothing
 Outlier detection
 Binning
c) Handling Duplicate Data
 Remove repeated records
3.2 Data Integration
Combines data from multiple sources.
Example:
Student academic records + attendance system
3.3 Data Transformation
Converts data into a suitable format.
a) Normalization / Scaling
Ensures features are on the same scale.
Example Methods:
 Min–Max Scaling
 Z-score Normalization
b) Encoding Categorical Data
Converts non-numeric data into numbers.
Examples:
 Label Encoding
 One-Hot Encoding
Gender: Male → 1, Female → 0
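Both encodings can be sketched as (pure Python; function names are illustrative, with categories ordered alphabetically):

```python
def label_encode(values):
    # map each category to an integer (alphabetical order here)
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    # one binary column per category
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

genders = ["Male", "Female", "Male"]
codes, mapping = label_encode(genders)
print(codes)                    # [1, 0, 1]  (Female -> 0, Male -> 1)
print(one_hot_encode(genders))  # [[0, 1], [1, 0], [0, 1]]
```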
c) Feature Construction
Creating new features from existing ones.
Example:
Total marks = test1 + test2 + final
3.4 Data Reduction
Reduces data size while preserving important information.
Methods:
 Feature selection
 Dimensionality reduction (PCA)
 Sampling
3.5 Data Discretization
Converts continuous values into categories.
Example:
Score → Low, Medium, High
3.6 Data Splitting
Divide data into:
 Training set (70–80%)
 Testing set (20–30%)
Sometimes also:
 Validation set
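A simple random split can be sketched as (pure Python; names are illustrative, and real projects typically use a library routine):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    # shuffle indices, carve off the test portion, keep the rest for training
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test_idx = set(indices[:n_test])
    train = [d for i, d in enumerate(data) if i not in test_idx]
    test  = [d for i, d in enumerate(data) if i in test_idx]
    return train, test

data = list(range(100))
train, test = train_test_split(data, test_ratio=0.2)
print(len(train), len(test))  # 80 20
```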
4. Importance of Data Preprocessing
✔ Improves model accuracy
✔ Reduces training time
✔ Avoids bias and errors
✔ Makes data ML-ready
5. Example (Student Performance Dataset)
Raw Data:
Study Hours = 3, Attendance = ?, Score = 75
After Preprocessing:
Study Hours = 3, Attendance = 80, Score = 75
6. Summary (Exam-Oriented)
 Data is raw information used in ML
 Preprocessing prepares data for learning
 Key steps: cleaning, transformation, reduction
 Good preprocessing improves ML performance
Concept Learning in Machine Learning
Concept Learning is a fundamental topic in Machine Learning (ML) that deals with learning general
concepts or patterns from specific examples. It is one of the earliest and simplest forms of
supervised learning.
Concept Learning is the task of inferring a Boolean-valued function from labeled training
examples.
 A concept is a rule or function that classifies objects as positive (True) or negative (False).
 Example:
“A day is a good day for playing football if the weather is sunny and temperature is
moderate.”
2. Key Components of Concept Learning
2.1 Instance Space (X)
The set of all possible examples.
Example:
Each day is described by attributes:
 Sky ∈ {Sunny, Rainy, Cloudy}
 Temperature ∈ {Hot, Mild, Cold}
 Humidity ∈ {High, Normal}
 Wind ∈ {Strong, Weak}
Each combination forms an instance.
2.2 Concept (C)
A subset of the instance space that satisfies certain conditions.
Example Concept:
Days suitable for playing football.
2.3 Hypothesis Space (H)
The set of all possible hypotheses the learner can choose from.
Example Hypothesis:
Sky = Sunny AND Temperature = Mild
2.4 Target Concept
The true concept we want the learning algorithm to discover.
3. Training Examples
Each training example consists of:
 An instance
 A label:
o Positive (+) → belongs to the concept
o Negative (–) → does not belong to the concept
Sky Temp Humidity Wind Play?
Sunny Mild Normal Weak Yes
Rainy Cold High Strong No
4. Goal of Concept Learning
To find a hypothesis h ∈ H that:
 Correctly classifies all training examples
 Approximates the target concept as closely as possible
5. General-to-Specific Ordering
Hypotheses can be ordered by generality:
 Most General Hypothesis:
Accepts all instances
 < ?, ?, ?, ? >
 Most Specific Hypothesis:
Accepts no instances
 < Ø, Ø, Ø, Ø >
6. Concept Learning Algorithms
6.1 Find-S Algorithm
 Starts with the most specific hypothesis
 Generalizes it only when needed
 Considers only positive examples
Limitation:
Ignores negative examples ❌
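The Find-S procedure above can be sketched directly, using '?' for "any value" and a small hypothetical weather dataset:

```python
def find_s(examples):
    """Find-S: start most specific, generalize only on positive examples."""
    h = None
    for instance, label in examples:
        if label != "Yes":                 # Find-S ignores negative examples
            continue
        if h is None:
            h = list(instance)             # first positive example, copied as-is
        else:
            # Generalize every attribute that disagrees to '?'.
            h = [a if a == b else "?" for a, b in zip(h, instance)]
    return h

# Hypothetical training data: (Sky, Temp, Humidity, Wind) -> Play?
data = [
    (("Sunny", "Mild", "Normal", "Weak"), "Yes"),
    (("Rainy", "Cold", "High", "Strong"), "No"),
    (("Sunny", "Hot", "Normal", "Weak"), "Yes"),
]
print(find_s(data))  # ['Sunny', '?', 'Normal', 'Weak']
```

The negative example is skipped entirely, which is exactly the limitation noted above.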
6.2 Candidate Elimination Algorithm
 Maintains Version Space
 Uses both positive and negative examples
 Maintains:
o S: Set of most specific hypotheses
o G: Set of most general hypotheses
Advantage:
More accurate and robust than Find-S ✅
7. Version Space
The Version Space is the set of all hypotheses consistent with the training data:
VS(H, D) = { h ∈ H | h is consistent with D }
8. Example (Simple)
If all positive examples are:
 Sunny
 Mild temperature
Then the learned concept might be:
Sky = Sunny AND Temperature = Mild
9. Applications of Concept Learning
 Email spam classification
 Medical diagnosis (disease vs no disease)
 Student performance prediction
 Fault detection systems
10. Advantages and Limitations
Advantages
✔ Simple and interpretable
✔ Good for teaching ML fundamentals
Limitations
✖ Works mainly with Boolean concepts
✖ Sensitive to noise
✖ Not scalable for large datasets
11. Summary
 Concept learning is about learning rules from examples
 It is a form of supervised learning
 Key ideas: instance space, hypothesis space, version space
 Algorithms: Find-S and Candidate Elimination
 Forms the foundation for more advanced ML techniques
Probabilistic vs Statistical Reasoning in Machine Learning
In Machine Learning (ML), probabilistic reasoning and statistical reasoning are closely related but
serve different purposes. Understanding the distinction is important for exams, assignments, and real-
world ML system design.
1. Overview
Aspect Probabilistic Reasoning Statistical Reasoning
Focus Handling uncertainty Learning from data samples
Main Question What is the likelihood of an event? What can we infer from data?
Basis Probability theory Statistics
Output Probability distributions Estimates, models, tests
2. Probabilistic Reasoning
Probabilistic reasoning uses probability theory to represent and reason under uncertainty. It
models uncertainty explicitly using probabilities.
2.2 Key Concepts
 Random variables
 Prior probability
 Conditional probability
 Bayes’ Theorem
 Joint and marginal distributions
Bayes’ Theorem:
P(H|D) = [P(D|H) × P(H)] / P(D)
2.3 Example
Spam Classification
 P(Spam) = 0.3
 P("free" | Spam) = 0.7
 P("free" | Not Spam) = 0.1
Compute:
P(Spam | "free")
The model reasons probabilistically to decide whether an email is spam.
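Plugging the numbers above into Bayes' Theorem, with the denominator expanded by the law of total probability:

```python
# Numbers from the spam example above.
p_spam = 0.3
p_free_given_spam = 0.7
p_free_given_ham = 0.1

# P("free") = P("free"|Spam)P(Spam) + P("free"|Not Spam)P(Not Spam)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)  # 0.28

# Bayes' Theorem: P(Spam | "free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))  # 0.75
```

So seeing the word "free" raises the spam probability from 0.3 to 0.75.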
2.4 ML Algorithms Using Probabilistic Reasoning
 Naïve Bayes Classifier
 Bayesian Networks
 Hidden Markov Models (HMM)
 Probabilistic Graphical Models
2.5 Strengths and Limitations
Strengths
✔ Explicit uncertainty modeling
✔ Works well with missing data
✔ Strong theoretical foundation
Limitations
✖ Requires correct probability assumptions
✖ Computationally expensive for large models
3. Statistical Reasoning
Statistical reasoning focuses on drawing conclusions from data, often using samples to make
inferences about a population.
3.2 Key Concepts
 Sampling
 Estimation (mean, variance)
 Hypothesis testing
 Confidence intervals
 Regression analysis
3.3 Example
Student Performance Prediction
Using past student scores:
 Compute average score
 Fit a regression model
 Test whether study hours significantly affect performance
This is statistical inference.
3.4 ML Algorithms Using Statistical Reasoning
 Linear Regression
 Logistic Regression
 k-Nearest Neighbors (k-NN)
 Support Vector Machines (SVM)
3.5 Strengths and Limitations
Strengths
✔ Data-driven
✔ Scalable to large datasets
✔ Widely used in practice
Limitations
✖ Assumes data represent the population
✖ Often does not model uncertainty explicitly
4. Key Differences (Exam-Oriented)
Feature Probabilistic Statistical
Uncertainty Explicitly modeled Often implicit
Prior knowledge Uses priors Rarely uses priors
Output Probability of outcomes Parameter estimates
Decision making Bayesian Frequentist
5. Relationship between the Two
 Probabilistic reasoning is about modeling uncertainty
 Statistical reasoning is about learning parameters from data
 Modern ML often combines both
📌 Example:
Bayesian Linear Regression
 Probability → Model uncertainty
 Statistics → Estimate parameters
6. Simple Analogy
 Probabilistic reasoning:
“There is a 70% chance it will rain tomorrow.”
 Statistical reasoning:
“Based on 10 years of data, rainfall increases in July.”
7. Summary
 Probabilistic reasoning answers “how likely?”
 Statistical reasoning answers “what can we infer from data?”
 Both are essential for ML
 Bayesian ML bridges the two approaches
How machine learning works
1. Basic Idea
👉 Data → Learning → Prediction/Decision
Instead of writing rules by hand, we:
1. Give the machine data
2. The machine learns patterns
3. It uses those patterns to predict or decide on new data
2. Main Steps of Machine Learning
Step 1: Data Collection
Data is gathered from different sources.
Examples:
 Student scores and attendance
 Emails (spam or not spam)
 Images and text
 Sensor or log data
Step 2: Data Preparation (Preprocessing)
Raw data is cleaned and prepared.
Includes:
 Removing missing or incorrect values
 Encoding text or categories into numbers
 Normalizing or scaling data
 Splitting data into:
o Training set
o Testing set
Step 3: Choosing a Model
A model is a mathematical representation of patterns.
Examples of models:
 Linear Regression
 Decision Tree
 Naïve Bayes
 Neural Networks
Step 4: Training the Model
The model learns by:
 Making predictions on training data
 Comparing predictions with actual answers
 Adjusting parameters to reduce error
This process is called optimization.
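This predict–compare–adjust loop can be sketched as one-variable gradient descent on squared error; the data (study hours → score, roughly y = 10x), learning rate, and iteration count are illustrative assumptions:

```python
# Toy training loop: fit y = w * x by gradient descent on squared error.
xs = [1.0, 2.0, 3.0, 4.0]       # hypothetical study hours
ys = [10.0, 20.0, 30.0, 40.0]   # hypothetical scores (true relation: y = 10x)

w = 0.0                         # initial parameter
lr = 0.01                       # learning rate
for _ in range(1000):
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad              # adjust the parameter to reduce error
print(round(w, 2))  # 10.0
```

Each iteration makes a prediction (w * x), compares it with the true answer (y), and nudges w in the direction that lowers the error — the same three steps listed above.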
Step 5: Evaluation
The trained model is tested on unseen data.
Evaluation metrics:
 Accuracy
 Precision
 Recall
 Mean Squared Error (MSE)
Step 6: Prediction / Deployment
The model is used in real life to:
 Predict outcomes
 Classify new data
 Support decision-making
3. Types of Machine Learning
3.1 Supervised Learning
Uses labeled data (input + correct output).
Examples:
 Student performance prediction
 Spam detection
Algorithms:
 Linear Regression
 Logistic Regression
 SVM
 k-NN
3.2 Unsupervised Learning
Uses unlabeled data.
Examples:
 Customer clustering
 Anomaly detection
Algorithms:
 K-Means
 Hierarchical Clustering
 PCA
3.3 Reinforcement Learning
Learns by trial and error using rewards.
Examples:
 Game playing (Chess, Go)
 Robotics
4. Simple Example (Student Performance)
1. Input data:
o Study hours
o Attendance
o Assignment scores
2. Output:
o Final grade
3. Model learns the relationship:
More study hours → Higher grade
4. Predicts grade for a new student.
5. Why Machine Learning Works
✔ Large amount of data
✔ Powerful algorithms
✔ Increased computing power
6. Real-World Applications
 Healthcare (disease prediction)
 Education (student performance analysis)
 Banking (fraud detection)
 Security (intrusion detection)
 Recommendation systems (YouTube, Netflix)
7. Limitations of Machine Learning
 Needs quality data
 Can be biased
 Not always explainable
 Requires expertise
8. Summary
 Machine Learning lets systems learn from data
 Works through data → model → learning → prediction
 Used in many real-world applications
 Foundation of Artificial Intelligence (AI)
Traditional Programming vs Machine Learning Approach
Traditional Programming and Machine Learning (ML) are two different ways of solving problems
using computers. The key difference lies in how rules and decisions are created.
1. Basic Idea
Traditional Programming
👉 Rules + Data → Output
 Humans write explicit rules.
 The computer follows those rules exactly.
Machine Learning
👉 Data + Output → Rules (Model)
 The machine learns rules automatically from data.
 Humans provide data and learning algorithms.
2. Working Process Comparison
Aspect Traditional Programming Machine Learning
Rule creation Written by programmer Learned from data
Flexibility Low High
Adaptability Needs reprogramming Learns automatically
Handling complexity Difficult Efficient
Data dependency Low High
3. Example: Spam Email Detection
Traditional Programming
Rules written manually:
 If email contains “free” → spam
 If email has many links → spam
❌ Hard to maintain
❌ Fails for new spam patterns
Machine Learning Approach
 Input: Thousands of labeled emails
 ML model learns patterns automatically
 Classifies new emails as spam or not spam
✅ Adapts to new spam
✅ Higher accuracy
4. Flow Diagram (Conceptual)
Traditional Programming
Data + Rules → Program → Output
Machine Learning
Training Data + Answers → ML Algorithm → Model
New Data → Model → Output
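The contrast can be made concrete: a hand-written rule versus a deliberately trivial "learner" that extracts its keyword from labeled examples. The messages and the keyword-frequency heuristic are toy assumptions, not a real spam filter:

```python
from collections import Counter

# Traditional programming: the rule is written by a human.
def is_spam_rule(text):
    return "free" in text.lower()

# Machine learning (toy version): the rule is derived from labeled data.
spam = ["free offer", "free money now"]
ham = ["meeting at noon", "project update"]

spam_words = Counter(w for msg in spam for w in msg.split())
learned_keyword = spam_words.most_common(1)[0][0]   # most frequent spam word

def is_spam_learned(text):
    return learned_keyword in text.lower()

print(is_spam_learned("claim your free prize"))  # True
```

If the spam examples change, the learned keyword changes with them — the rule-based version would have to be edited by hand.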
5. Key Differences (Exam-Oriented Table)
Feature Traditional Programming Machine Learning
Decision logic Fixed Dynamic
Performance improvement Manual Automatic (with more data)
Suitable for Well-defined problems Complex & data-driven problems
Example tasks Calculator, payroll Face recognition, prediction
6. Advantages and Limitations
Traditional Programming
Advantages
 Simple and predictable
 Easy to debug
 No training data needed
Limitations
 Not scalable for complex problems
 Cannot learn or improve
Machine Learning
Advantages
 Learns from experience
 Handles large and complex data
 Improves over time
Limitations
 Requires large datasets
 Can be hard to explain
 Needs computational resources
7. When to Use Which?
✔ Use Traditional Programming when:
 Rules are clear and stable
 Logic is simple (e.g., tax calculation)
✔ Use Machine Learning when:
 Rules are unknown or complex
 Patterns must be learned from data
 Problem involves prediction or classification
8. Summary
 Traditional programming depends on human-written rules
 ML learns rules automatically from data
 ML is better for complex, real-world problems
 Both approaches are important and often combined
Classes of Machine Learning Problems
Machine Learning (ML) problems are commonly classified based on how learning is done and the
type of output produced. Understanding these classes is essential for exams and practical ML
applications.
1. Main Classes of Machine Learning Problems
1.1 Supervised Learning
The model learns from labeled data (input + known output).
Key Characteristics
 Data has correct answers
 Model learns input–output mapping
 Used for prediction
Types
a) Classification
 Output is a category or class
Examples:
 Spam vs Not Spam
 Pass vs Fail
 Disease: Yes / No
Algorithms:
 Decision Tree
 Naïve Bayes
 SVM
 k-NN
b) Regression
 Output is a continuous value
Examples:
 Student final score
 House price prediction
 Temperature forecasting
Algorithms:
 Linear Regression
 Polynomial Regression
 SVR
1.2 Unsupervised Learning
The model learns from unlabeled data.
Key Characteristics
 No predefined output
 Discovers hidden patterns
Types
a) Clustering
 Groups similar data points
Examples:
 Student grouping by performance
 Customer segmentation
Algorithms:
 K-Means
 Hierarchical Clustering
 DBSCAN
b) Association Rule Learning
 Finds relationships between variables
Examples:
 Market basket analysis
 Course selection patterns
Algorithms:
 Apriori
 FP-Growth
1.3 Semi-Supervised Learning
Uses small labeled data + large unlabeled data.
Examples
 Image classification with few labeled images
 Text categorization
Algorithms
 Self-training
 Co-training
1.4 Reinforcement Learning
Learns through interaction with environment using rewards and penalties.
Key Components
 Agent
 Environment
 Actions
 Rewards
Examples
 Game playing (Chess, Go)
 Robot navigation
 Traffic signal control
Algorithms
 Q-Learning
 SARSA
 Deep Q Networks (DQN)
2. Classification Based on Output Type
Output Type ML Problem Class Example
Categorical Classification Spam detection
Numerical Regression Score prediction
Groups Clustering Student grouping
Patterns Association Course combinations
Sequential actions Reinforcement Game AI
3. Classification Based on Learning Method
Learning Method Description
Batch Learning Trained on full dataset at once
Online Learning Learns incrementally
Instance-Based Learns from stored examples
Model-Based Learns a generalized model
4. Real-Life Example (Education)
Problem ML Class
Predict student grade Regression
Identify weak students Classification
Group students by skill Clustering
Recommend courses Association
Adaptive learning system Reinforcement
5. Summary
 ML problems are mainly:
o Supervised
o Unsupervised
o Semi-supervised
o Reinforcement
 Each class solves different types of problems
 Choosing the right class is crucial for model success
Areas of Influence of Machine Learning
Machine Learning (ML) has a wide area of influence across many sectors because it enables systems
to learn from data, make predictions, and improve over time. Below are the major domains where
ML plays a significant role, explained clearly and with examples (exam-oriented).
1. Education 🎓
ML improves teaching, learning, and administration.
Applications
 Student performance prediction
 Dropout risk analysis
 Personalized learning systems
 Automated grading
Example:
Predicting at-risk students in private colleges using historical academic data.
2. Healthcare 🏥
ML assists in diagnosis, treatment, and patient management.
Applications
 Disease prediction (diabetes, cancer)
 Medical image analysis (X-ray, MRI)
 Drug discovery
 Patient monitoring systems
3. Banking and Finance 💰
ML helps in risk management and automation.
Applications
 Fraud detection
 Credit scoring
 Stock price prediction
 Customer behavior analysis
4. Business and Marketing 📊
ML enhances decision-making and customer engagement.
Applications
 Sales forecasting
 Recommendation systems
 Customer segmentation
 Demand prediction
Example:
Product recommendations on Amazon or Netflix.
5. Cybersecurity and Networking 🔐
ML detects and prevents security threats.
Applications
 Intrusion Detection Systems (IDS)
 Malware detection
 Anomaly detection in networks
 Spam filtering
6. Agriculture 🌾
ML supports smart farming and food security.
Applications
 Crop yield prediction
 Disease detection in plants
 Weather forecasting
 Smart irrigation systems
7. Transportation and Smart Cities 🚦
ML improves efficiency and safety.
Applications
 Traffic prediction
 Self-driving vehicles
 Route optimization
 Smart parking systems
8. Manufacturing and Industry 🏭
ML enhances productivity and quality.
Applications
 Predictive maintenance
 Defect detection
 Supply chain optimization
 Robotics automation
9. Natural Language Processing (NLP)
ML enables machines to understand human language.
Applications
 Speech recognition
 Machine translation
 Chatbots and virtual assistants
 Sentiment analysis
10. Computer Vision
ML allows machines to interpret images and videos.
Applications
 Face recognition
 Surveillance systems
 Object detection
 Medical image diagnosis
11. E-Commerce and Retail 🛒
ML improves user experience and operations.
Applications
 Price optimization
 Recommendation engines
 Customer churn prediction
 Inventory management
12. Government and Public Services
ML supports policy and service delivery.
Applications
 Crime prediction
 Tax fraud detection
 Population analysis
 Disaster management
13. Entertainment and Media 🎬
ML personalizes content and improves production.
Applications
 Music and video recommendation
 Game AI
 Content moderation
14. Research and Science 🔬
ML accelerates discovery and innovation.
Applications
 Climate modeling
 Astronomical data analysis
 Scientific simulations
15. Summary Table (Exam-Ready)
Area Influence of ML
Education Performance prediction, personalization
Healthcare Diagnosis, imaging
Finance Fraud detection, forecasting
Security Intrusion & anomaly detection
Agriculture Yield & disease prediction
Transport Autonomous systems
Industry Predictive maintenance
NLP & Vision Language & image understanding
16. Conclusion
Machine Learning influences almost every field where data is available. Its ability to learn patterns,
make predictions, and improve decisions makes it a core technology of modern society.
Supervised Learning
Classification & Regression: Rule-Based and Instance-Based Learning
Supervised learning is a major class of machine learning where the model learns from labeled data.
Two important problem types under supervised learning are classification and regression, and two
important learning approaches are rule-based and instance-based learning.
1. Supervised Learning
Supervised learning is a learning process where the training dataset contains input features and
their corresponding correct outputs (labels).
👉 Goal: Learn a function
f(X)→Y
2. Classification and Regression
2.1 Classification
 Output: Discrete / categorical
 Assigns inputs to predefined classes
Examples
 Email → Spam / Not Spam
 Student → Pass / Fail
 Disease → Yes / No
Common Algorithms
 Decision Tree
 Naïve Bayes
 k-NN
 SVM
2.2 Regression
 Output: Continuous / numerical
 Predicts numeric values
Examples
 Student final score
 House price
 Temperature
Common Algorithms
 Linear Regression
 Polynomial Regression
 k-NN Regression
3. Rule-Based Learning
Rule-based learning creates explicit IF–THEN rules from training data to make predictions.
Example Rule (Classification)
IF Attendance > 80% AND StudyHours > 3
THEN Result = Pass
How Rule-Based Learning Works
1. Analyze labeled training data
2. Extract decision rules
3. Apply rules to classify or predict new data
Algorithms Using Rule-Based Learning
 Decision Trees
 Rule Induction Algorithms (e.g., RIPPER, CN2)
Advantages
✔ Easy to understand and interpret
✔ Human-readable rules
Limitations
✖ Struggles with noisy data
✖ Rules may become complex
4. Instance-Based Learning (Lazy Learning)
Instance-based learning stores training instances and makes predictions by comparing new data
with stored examples.
Also called lazy learning because no explicit model is built during training.
Example (k-Nearest Neighbor – k-NN)
 Store all student records
 For a new student:
o Find the k most similar students
o Predict class/value based on neighbors
Distance Measure (Example)
Distance = √[(x₁ − y₁)² + (x₂ − y₂)²]
Algorithms Using Instance-Based Learning
 k-Nearest Neighbor (k-NN)
 Case-Based Reasoning
Advantages
✔ Simple and flexible
✔ Adapts easily to new data
Limitations
✖ High memory usage
✖ Slow prediction time
5. Rule-Based vs Instance-Based Learning
Feature Rule-Based Learning Instance-Based Learning
Model Explicit rules No explicit model
Training time Higher Very low
Prediction time Fast Slower
Interpretability High Low
Memory usage Low High
6. Classification & Regression with Learning Types
Learning Type Classification Regression
Rule-Based Decision Trees Regression Trees
Instance-Based k-NN Classification k-NN Regression
7. Educational Example
Problem: Predict student result
 Classification: Pass / Fail
 Regression: Final score
 Rule-Based:
IF attendance ≥ 75% → Pass
 Instance-Based:
Compare with similar past students
8. Summary (Exam-Oriented)
 Supervised learning uses labeled data
 Classification → categorical output
 Regression → numerical output
 Rule-based learning uses IF–THEN rules
 Instance-based learning uses similarity between data points
Supervised Learning: Classification & Regression
Rule-Based and Instance-Based Learning
K-Nearest Neighbor, Decision Tree, Bayesian Classification, and Support Vector Machine
Supervised learning uses labeled data to learn a mapping between input features and output labels. It
mainly solves classification and regression problems using different learning approaches.
1. Supervised Learning
Supervised Learning is a machine learning approach where each training example consists of an
input and a known output (label). The model learns from these examples to predict outputs for new,
unseen data.
Example:
Predicting whether a student will Pass or Fail based on attendance, study hours, and assignment
scores.
2. Classification and Regression
2.1 Classification
 Output: Discrete / categorical
 Assigns data points to predefined classes
Example:
 Email → Spam / Not Spam
 Student → Pass / Fail
2.2 Regression
 Output: Continuous / numerical
 Predicts real-valued outputs
Example:
 Student final score
 House price
3. Rule-Based Learning
Rule-based learning generates explicit IF–THEN rules from training data to perform classification
or regression.
Example
IF Attendance ≥ 80% AND StudyHours ≥ 3
THEN Result = Pass
Characteristics
 Human-readable rules
 Knowledge is easy to interpret
Example Algorithm
 Decision Tree (converted into rules)
4. Instance-Based Learning
Instance-based learning stores training data and makes predictions by comparing new instances with
stored examples using a similarity or distance measure.
📌 Also called lazy learning.
Example
A new student’s result is predicted by comparing them with similar past students.
Example Algorithm
 k-Nearest Neighbor (k-NN)
5. k-Nearest Neighbor (k-NN)
K-Nearest Neighbor (k-NN) is an instance-based supervised learning algorithm that classifies or
predicts a value based on the k most similar data points.
How It Works
1. Choose a value for k
2. Compute distance between new data and all training data
3. Select k nearest neighbors
4. Predict:
o Majority class (classification)
o Average value (regression)
Example
If k = 3 and among nearest neighbors:
 2 students passed
 1 student failed
👉 Prediction = Pass
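The four steps above can be sketched from scratch; the student records, feature choice, and value of k are hypothetical:

```python
# Hypothetical student records: (study_hours, attendance %) -> result.
train = [
    ((1.0, 60.0), "Fail"),
    ((2.0, 70.0), "Fail"),
    ((4.0, 85.0), "Pass"),
    ((5.0, 90.0), "Pass"),
    ((3.5, 80.0), "Pass"),
]

def euclidean(a, b):
    """Step 2: distance between the new point and a training point."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(query, k=3):
    # Step 3: pick the k nearest neighbors; Step 4: majority vote.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)

print(knn_predict((4.5, 88.0)))  # Pass
```

For regression, the majority vote would be replaced by the average of the neighbors' values.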
Advantages & Limitations
✔ Simple and effective
✖ Slow for large datasets
6. Decision Tree
A Decision Tree is a rule-based supervised learning algorithm that uses a tree structure to make
decisions based on feature values.
Example
Is Attendance ≥ 75%?
├─ Yes → Pass
└─ No → Fail
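This one-node tree is equivalent to a single IF–THEN rule in code (the 75% threshold comes from the example above):

```python
def predict(attendance):
    # Root split of the example tree: attendance >= 75% -> Pass.
    if attendance >= 75:
        return "Pass"
    return "Fail"

print(predict(80), predict(60))  # Pass Fail
```

A real decision tree learner (e.g. ID3 or CART) chooses such splits automatically from the training data.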
Characteristics
 Easy to interpret
 Can handle classification and regression
Example Use Case
Student performance prediction, medical diagnosis.
7. Bayesian-Based Classification
Bayesian Classification is a probabilistic approach based on Bayes’ Theorem to predict the class of
an instance.
Bayes’ Theorem
P(C|X) = [P(X|C) × P(C)] / P(X)
Naïve Bayes Classifier (Most Common)
Assumes features are independent.
Example
Classifying emails as spam:
 P(Spam | “free”) is calculated using probabilities
 Email classified into the class with the highest probability
Advantages & Limitations
✔ Fast and efficient
✖ Independence assumption may be unrealistic
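A minimal Naïve Bayes sketch with Laplace smoothing, using a tiny hypothetical email corpus and equal class priors:

```python
import math
from collections import Counter

# Hypothetical labeled corpus.
spam = ["free money", "free offer now"]
ham = ["project meeting", "lunch at noon"]
vocab = {w for d in spam + ham for w in d.split()}

def log_score(text, docs, prior):
    """Log P(class) + sum of log P(word | class), with Laplace smoothing."""
    words = Counter(w for d in docs for w in d.split())
    total = sum(words.values())
    score = math.log(prior)
    for w in text.split():
        # +1 smoothing avoids zero probability for unseen words.
        score += math.log((words[w] + 1) / (total + len(vocab)))
    return score

def classify(text):
    s = log_score(text, spam, 0.5)
    h = log_score(text, ham, 0.5)
    return "Spam" if s > h else "Not Spam"

print(classify("free offer"))  # Spam
```

Working in log-probabilities keeps the product of many small probabilities numerically stable; the independence assumption shows up as the simple per-word sum.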
8. Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised learning algorithm that finds an optimal
hyperplane that best separates data points of different classes.
Key Idea
 Maximize the margin between classes
 Uses support vectors (critical data points)
Example
Separating students into Pass and Fail groups using study hours and attendance.
Characteristics
 Effective in high-dimensional spaces
 Can perform classification and regression (SVR)
Advantages & Limitations
✔ High accuracy
✖ Computationally expensive
9. Summary Table (Exam-Oriented)
Algorithm Learning Type Problem Type Example
Rule-Based Rule-based Classification IF–THEN rules
k-NN Instance-based Class/Reg Similar students
Decision Tree Rule-based Class/Reg Pass/Fail tree
Naïve Bayes Probabilistic Classification Spam detection
SVM Model-based Class/Reg Optimal separation
10. Conclusion
 Supervised learning solves classification and regression
 Rule-based learning is interpretable
 Instance-based learning relies on similarity
 k-NN, Decision Trees, Bayesian classifiers, and SVM are widely used supervised learning
algorithms