Machine Learning Models
Akashdeep
Assistant Professor, CSE
UIET, Panjab University, Chandigarh
• Descriptive Model
A model is said to be descriptive if its output does not involve the target variable.
Its focus is to identify interesting structure in the data.
Machine Learning Settings
Often, predictive models are learned in a supervised setting while descriptive models are obtained by unsupervised learning methods, but this is not a hard-and-fast distinction:
- there are examples of supervised learning of descriptive models (e.g., subgroup discovery, which aims at identifying regions with an unusual class distribution), and
- of unsupervised learning of predictive models (e.g., predictive clustering, where the identified clusters are interpreted as classes).
Overview of different machine learning settings. The rows refer to whether the training data is labelled with a
target variable, while the columns indicate whether the models learned are used to predict a target variable or
rather describe the given data.
Probabilistic Model
Probabilistic models incorporate random variables and probability distributions into the model
of an event or phenomenon.
The approach is to assume that there is some underlying random process that generates the
values for variables, according to a well-defined but unknown probability distribution. We want
to use the data to find out more about this distribution.
Logical Model
Logical models make use of rules to iteratively partition the instance space into segments.
These models are more algorithmic in nature.
Machine Learning Models – Geometric Model
Geometric Model
A geometric model is constructed directly in instance space, using geometric concepts such as
lines, planes and distances. Usually, the set of instances has some geometric structure. For
example, if all features are numerical, then we can use each feature as a coordinate in a
Cartesian coordinate system.
One main advantage of geometric classifiers is that they are easy to visualize, as long as we
keep to two or three dimensions.
It is important to keep in mind, though, that a Cartesian instance space has as many
coordinates as there are features, which can be tens, hundreds, thousands, or even more. Such
high-dimensional spaces are hard to imagine but are nevertheless very common in machine
learning.
Geometric concepts that potentially apply to high-dimensional spaces are usually prefixed with
‘hyper-’: for instance, a decision boundary in an unspecified number of dimensions is called a
hyperplane.
A linear classifier separates the two classes by a decision boundary defined by a weight vector w and a decision threshold t: points x with w · x > t are classified as positive.
A good way to think of the vector w is as pointing from the 'centre of mass' of the negative examples, n, to the centre of mass of the positives, p.
In other words, w is proportional (or equal) to (p − n). One way to calculate these centres of mass is by averaging.
By setting the decision threshold appropriately, we can make the decision boundary intersect the line from n to p half-way.
We call this the basic linear classifier.
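A minimal sketch (Python with NumPy, not from the slides) of the basic linear classifier just described: the weight vector w is the difference between the class means p and n, and the threshold t is chosen so that the boundary crosses the line from n to p half-way.

```python
import numpy as np

def basic_linear_classifier(X_pos, X_neg):
    """Build the basic linear classifier from positive and negative training examples."""
    p = X_pos.mean(axis=0)                   # centre of mass of the positive examples
    n = X_neg.mean(axis=0)                   # centre of mass of the negative examples
    w = p - n                                # weight vector points from n to p
    t = (np.dot(p, p) - np.dot(n, n)) / 2    # threshold so the boundary passes through (p + n) / 2
    return w, t

def predict(x, w, t):
    """Classify x as positive (+1) if it falls on the positive side of the boundary w.x = t."""
    return +1 if np.dot(w, x) > t else -1

# toy usage with two examples per class
X_pos = np.array([[2.0, 2.0], [3.0, 3.0]])
X_neg = np.array([[0.0, 0.0], [1.0, 0.0]])
w, t = basic_linear_classifier(X_pos, X_neg)
print(predict(np.array([2.5, 2.0]), w, t))   # +1
```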
Machine Learning Models – Geometric Model
Geometric Model – Basic Linear Classifier
Simple Classifier
• A very useful geometric concept in machine learning is the notion of
distance.
• If the distance between two instances is small then the instances are
similar in terms of their feature values, and so nearby instances would be
expected to receive the same classification or belong to the same cluster.
• In a Cartesian coordinate system, distance can be measured by Euclidean
distance, which is the square root of the sum of the squared distances
along each coordinate: Dis(x, y) = (Σi=1..d (xi − yi)²)^1/2
• A very simple distance-based classifier:
• to classify a new instance, we retrieve from memory the most similar training
instance (i.e., the training instance with smallest Euclidean distance from the
instance to be classified), and
• simply assign that training instance’s class.
• This classifier is known as the nearest-neighbour classifier.
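A minimal sketch (Python with NumPy, not from the slides) of this nearest-neighbour classifier: compute the Euclidean distance from the query to every training instance and return the class of the closest one.

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x):
    """Return the class of the training instance closest to x (1-NN, Euclidean distance)."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distance to every training instance
    return y_train[np.argmin(dists)]                   # class of the most similar instance

# toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["ham", "ham", "spam"])
print(nearest_neighbour_predict(X_train, y_train, np.array([4.0, 4.5])))   # spam
```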
Machine Learning Models – Probabilistic Model
Probabilistic Model
An example posterior distribution. 'Viagra' and 'lottery' are two Boolean features; Y is the class variable, with values 'spam' and 'ham'. In each row the most likely class is indicated in bold.
For a particular e-mail we know the feature values and so we might write P(Y |Viagra =1, lottery = 0) if the e-mail
contains the word ‘Viagra’ but not the word ‘lottery’. This is called a posterior probability because it is used after the
features X are observed.
Assuming that X and Y are the only variables we know and care about, the posterior distribution P(Y |X) helps us to
answer many questions of interest.
- For instance, to classify a new e-mail we determine whether the words ‘Viagra’ and ‘lottery’ occur in it, look up the
corresponding probability P(Y = spam|Viagra, lottery), and predict spam if this probability exceeds 0.5 and ham
otherwise.
- This approach to predict a value of Y on the basis of the values of X and the posterior distribution P(Y |X) is called a
decision rule.
Suppose we skimmed an e-mail and noticed that it contains the word ‘lottery’ but we haven’t looked
closely enough to determine whether it uses the word ‘Viagra’. This means that we don’t know
whether to use the second or the fourth row in Table to make a prediction. This is a problem, as we
would predict spam if the e-mail contained the word ‘Viagra’ (second row) and ham if it didn’t (fourth
row). The solution is to average these two rows, using the probability of ‘Viagra’ occurring in any e-
mail (spam or not):
P(Y | lottery) = P(Y | Viagra = 0, lottery) P(Viagra = 0) + P(Y | Viagra = 1, lottery) P(Viagra = 1)
For instance, suppose for the sake of argument that one in ten e-mails contain the word ‘Viagra’, then
P(Viagra = 1) = 0.10 and P(Viagra = 0) = 0.90.
Using the above formula, we obtain P(Y = spam|lottery = 1) = 0.65·0.90+0.40·0.10 = 0.625 and P(Y =
ham|lottery = 1) = 0.35 · 0.90+0.60 · 0.10 = 0.375.
Because the occurrence of ‘Viagra’ in any e-mail is relatively rare, the resulting distribution deviates
only a little from the second row in Table.
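A minimal sketch (Python, not from the slides) of this calculation: the two posterior rows for lottery = 1 are taken from the worked example above (0.65/0.35 and 0.40/0.60), P(Viagra = 1) = 0.10 is the assumed marginal, and the decision rule predicts whichever class has probability above 0.5.

```python
# Posterior probabilities P(Y = spam | Viagra, lottery) for the two rows with lottery = 1,
# taken from the worked example above; keys are (Viagra, lottery).
p_spam = {(0, 1): 0.65, (1, 1): 0.40}
p_viagra = 0.10                      # assumed marginal P(Viagra = 1)

def decide(prob_spam):
    """Decision rule: predict spam iff P(Y = spam | ...) exceeds 0.5."""
    return "spam" if prob_spam > 0.5 else "ham"

# 'Viagra' unobserved, lottery = 1: average the two rows, weighted by P(Viagra).
p_spam_lottery = p_spam[(0, 1)] * (1 - p_viagra) + p_spam[(1, 1)] * p_viagra
print(p_spam_lottery, 1 - p_spam_lottery)   # ~0.625 ~0.375
print(decide(p_spam_lottery))               # spam
```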
Machine Learning Models – Probabilistic Model
Bayes’ Rule
Bayes' rule lets us turn likelihoods into posterior probabilities: P(Y | X) = P(X | Y) P(Y) / P(X). Predicting the value of Y with the largest posterior probability is the maximum a posteriori (MAP) decision rule; since P(X) does not depend on Y, this amounts to maximising P(X | Y) P(Y).
Now, if we assume a uniform prior distribution (i.e., P(Y) the same for every value of Y), this reduces to the maximum likelihood (ML) decision rule: predict the value of Y that maximises the likelihood P(X | Y).
A useful rule of thumb is: use likelihoods if you want to ignore the prior distribution or assume it uniform,
and posterior probabilities otherwise.
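A minimal sketch (Python, illustrative numbers rather than values from the slides) contrasting the MAP and ML decision rules: with a non-uniform prior the two can disagree.

```python
# Illustrative likelihoods and priors for one particular observation X.
likelihood = {"spam": 0.3, "ham": 0.1}   # P(X | Y)
prior      = {"spam": 0.2, "ham": 0.8}   # P(Y)

# MAP: maximise P(X | Y) * P(Y); the evidence P(X) is the same for every class.
map_class = max(likelihood, key=lambda y: likelihood[y] * prior[y])

# ML: ignore the prior (or assume it uniform) and maximise the likelihood alone.
ml_class = max(likelihood, key=lambda y: likelihood[y])

print(map_class, ml_class)   # ham spam -- the prior can change the decision
```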
Machine Learning Models – Probabilistic Model
• It is clear from the above analysis that the likelihood function plays an
important role in statistical machine learning. It establishes what is
called a generative model: a probabilistic model from which we can
sample values of all variables involved.
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
• We now have some idea what a probabilistic model looks like, but how do we
learn such a model?
• In many cases this will be a matter of estimating the model parameters from
data, which is usually achieved by straightforward counting.
• For example, in the coin-toss model of spam recognition we had two coins for every word wi in our vocabulary, one of which is to be tossed if we are generating a spam e-mail and the other for ham e-mails.
• Let's say that the spam coin comes up heads with probability θi⊕ and the ham coin with probability θi⊖.
In order to estimate the parameters θi± we need a training set of e-mails labelled spam or ham.
We take the spam e-mails and count how many of them wi occurs in: dividing by the total number of spam e-mails gives us an estimate of θi⊕.
Repeating this for the ham e-mails results in an estimate of θi⊖. And that's all there is to it.
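A minimal sketch (Python, hypothetical data, not from the slides) of this estimation by counting: for each word, θi⊕ is the fraction of spam e-mails containing it and θi⊖ the fraction of ham e-mails containing it.

```python
# Hypothetical labelled training e-mails: each e-mail is represented by the set of words it contains.
emails = [
    ({"viagra", "lottery"}, "spam"),
    ({"lottery"},            "spam"),
    ({"meeting", "lottery"}, "ham"),
    ({"meeting"},            "ham"),
]
vocabulary = {"viagra", "lottery", "meeting"}

def estimate_thetas(emails, vocabulary, label):
    """Estimate theta_i for one class: the fraction of e-mails of that class in which word i occurs."""
    docs = [words for words, y in emails if y == label]
    return {w: sum(w in d for d in docs) / len(docs) for w in vocabulary}

theta_spam = estimate_thetas(emails, vocabulary, "spam")   # estimates of theta_i-plus
theta_ham  = estimate_thetas(emails, vocabulary, "ham")    # estimates of theta_i-minus
print(theta_spam["lottery"], theta_spam["viagra"])         # 1.0 0.5
```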
Machine Learning Models - Types
Logical Model
• Logical models make use of rules to iteratively partition the instance space into segments. These
models are more algorithmic in nature.
• The rules can be easily organized in a tree structure, called a feature tree.
• The idea of such a tree is that features are used to iteratively partition the instance space. The
leaves of the tree therefore correspond to rectangular areas in the instance space (or hyper-
rectangles, more generally) which we will call instance space segments, or segments for short.
• Depending on the task we are solving, we can then label the leaves with a class, a probability, a
real value, and so on.
• Feature trees whose leaves are labelled with classes are commonly called decision trees.
(Left) A feature tree combining two Boolean features. Each internal node or split is labelled with a feature, and each edge emanating from a split is labelled with a feature value. Each leaf therefore corresponds to a unique combination of feature values. Also indicated in each leaf is the class distribution derived from the training set.
(Right) A feature tree partitions the instance space into rectangular regions, one for each leaf. We can clearly see that the majority of ham lives in the lower left-hand corner.
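A minimal sketch (Python, not from the slides) of such a feature tree as nested tests on the two Boolean features; the split order (Viagra at the root, then lottery) and the leaf labels are assumptions chosen for illustration.

```python
def feature_tree_predict(viagra: int, lottery: int) -> str:
    """Classify an e-mail by filtering it down a small feature tree.
    The split order and leaf labels are illustrative assumptions."""
    if viagra == 1:
        return "spam"   # leaf reached by Viagra = 1
    if lottery == 1:
        return "spam"   # leaf reached by Viagra = 0, lottery = 1
    return "ham"        # leaf reached by Viagra = 0, lottery = 0

print(feature_tree_predict(viagra=0, lottery=1))   # spam
```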
Machine Learning Models - Types
Logical Model - Example
The leaves of the tree could be labelled, from left to right, as ham – spam – spam, employing a simple decision rule
called majority class. Alternatively, we could label them with the proportion of spam e-mail occurring in each leaf: from
left to right, 1/3, 2/3, and 4/5. Or, if our task was a regression task, we could label the leaves with predicted real values
or even linear functions of some other, real-valued features.
If we label the leaves in the first figure by majority class, we obtain the following decision list:
Logical models often have different, equivalent formulations. For instance, two alternative formulations for
this model are
The first of these alternative formulations combines the two rules in the original decision list by means of
disjunction (‘or’), denoted by ∨ . This selects a single nonrectangular area in instance space. The second
model formulates a conjunctive condition (‘and’, denoted by ∧ ) for the opposite class (ham) and declares
everything else as spam.
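A minimal sketch (Python, not from the slides) of the three equivalent formulations, assuming the leaves correspond to the model "spam if Viagra or lottery occurs, ham otherwise": the decision list, the disjunctive rule for spam, and the conjunctive rule for ham.

```python
def decision_list(viagra: int, lottery: int) -> str:
    """Decision list: rules are tried in order; the first that fires predicts the class."""
    if viagra == 1:
        return "spam"
    if lottery == 1:
        return "spam"
    return "ham"

def disjunctive_rule(viagra: int, lottery: int) -> str:
    """Single rule for spam using disjunction (selects a non-rectangular region of instance space)."""
    return "spam" if (viagra == 1) or (lottery == 1) else "ham"

def conjunctive_rule(viagra: int, lottery: int) -> str:
    """Conjunctive condition for ham; everything else is declared spam."""
    return "ham" if (viagra == 0) and (lottery == 0) else "spam"

# All three formulations agree on every combination of feature values.
for v in (0, 1):
    for l in (0, 1):
        assert decision_list(v, l) == disjunctive_rule(v, l) == conjunctive_rule(v, l)
```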
Machine Learning Models - Types
Logical Model - Example
We can also represent the same model as un-nested rules, as follows:
Example
Consider the following rules:
As can be seen in the figure, these rules overlap for lottery = 1 ∧ Peter = 1, for which they make
contradictory predictions. Furthermore, they fail to make any predictions for lottery = 0 ∧ Peter = 0.
Such rules are inconsistent and incomplete.
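A minimal sketch (Python, not from the slides) of this situation, assuming the two rules are "if lottery = 1 then spam" and "if Peter = 1 then ham" (the exact rules are not shown here, so these are illustrative): they contradict each other for lottery = 1 ∧ Peter = 1 and say nothing for lottery = 0 ∧ Peter = 0.

```python
def rule_predictions(lottery: int, peter: int) -> list:
    """Collect the prediction of every rule that fires (illustrative rules, see lead-in)."""
    preds = []
    if lottery == 1:
        preds.append("spam")
    if peter == 1:
        preds.append("ham")
    return preds

print(rule_predictions(lottery=1, peter=1))   # ['spam', 'ham'] -> contradictory (inconsistent)
print(rule_predictions(lottery=0, peter=0))   # []              -> no prediction (incomplete)
```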
When learning such a tree, the aim is to find splits that result in improved purity of the nodes
on the next level, where the purity of a node refers to the degree to which the training
examples belonging to that node are of the same class.
Once the algorithm has found such a feature, the training set is
partitioned into subsets, one for each node resulting from the split.
For each of these subsets, we again find a good feature to split on, and
so on.
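A minimal sketch (Python, not from the slides) of this idea: measure the purity of a node by the proportion of its majority class, and pick the Boolean feature whose split gives the purest children, weighted by their size; the same procedure is then applied recursively to each child.

```python
from collections import Counter

def purity(labels):
    """Purity of a node: proportion of examples belonging to the majority class."""
    if not labels:
        return 1.0
    return Counter(labels).most_common(1)[0][1] / len(labels)

def best_split(examples, features):
    """Pick the Boolean feature whose split yields the highest weighted purity of the children."""
    def score(f):
        left  = [y for x, y in examples if x[f] == 0]
        right = [y for x, y in examples if x[f] == 1]
        n = len(examples)
        return (len(left) * purity(left) + len(right) * purity(right)) / n
    return max(features, key=score)

# toy usage: 'viagra' separates the classes better than 'long'
examples = [({"viagra": 1, "long": 0}, "spam"), ({"viagra": 1, "long": 1}, "spam"),
            ({"viagra": 0, "long": 1}, "ham"),  ({"viagra": 0, "long": 0}, "ham")]
print(best_split(examples, ["viagra", "long"]))   # viagra
```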
Machine Learning Models - Types
Logical Model
An algorithm that works by repeatedly splitting a problem into smaller sub-problems is called a divide-and-conquer algorithm.
Rule learners take a different route: they find a rule that covers examples of a single class, then remove the covered examples of that class, and repeat the process on the remaining examples. This is sometimes called a separate-and-conquer approach.
The models themselves can also easily be inspected by humans, which is why they are sometimes called declarative.
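A minimal sketch (Python, not from the slides) of the separate-and-conquer idea: repeatedly find a single-feature condition that covers examples of only one class, turn it into a rule, remove the covered examples, and continue on the rest.

```python
def learn_rule_list(examples, features):
    """Greedy separate-and-conquer rule learner using single-feature conditions (illustrative sketch)."""
    rules, remaining = [], list(examples)
    while remaining:
        found = False
        for f in features:
            for v in (0, 1):
                covered = [(x, y) for x, y in remaining if x[f] == v]
                classes = {y for _, y in covered}
                if covered and len(classes) == 1:        # pure: all covered examples share one class
                    rules.append((f, v, classes.pop()))  # rule: if f == v then predict that class
                    remaining = [(x, y) for x, y in remaining if x[f] != v]
                    found = True
                    break
            if found:
                break
        if not found:                                    # no pure condition left: stop
            break
    return rules

examples = [({"viagra": 1, "lottery": 0}, "spam"), ({"viagra": 0, "lottery": 1}, "spam"),
            ({"viagra": 0, "lottery": 0}, "ham")]
print(learn_rule_list(examples, ["viagra", "lottery"]))
# [('viagra', 1, 'spam'), ('lottery', 0, 'ham'), ('viagra', 0, 'spam')]
```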
Grouping and Grading Models
Grouping models divide the instance space into segments and fit a very simple, local model in each segment; for instance, they often assign the majority class to all instances that fall into a segment.
Grading models, rather than applying very simple, local models, form one global model over the instance space.
Grouping and Grading Models
Tree-based models are a good example of grouping models. They work by
repeatedly splitting the instance space into smaller subsets. Because trees are
usually of limited depth and don’t contain all the available features, the subsets at
the leaves of the tree partition the instance space with some finite resolution.
Instances filtered into the same leaf of the tree are treated the same, regardless of
any features not in the tree that might be able to distinguish them.
Support vector machines and other geometric classifiers are examples of grading
models. Because they work in a Cartesian instance space, they are able to
represent and exploit the minutest differences between instances. As a
consequence, it is always possible to come up with a new test instance that
receives a score that has not been given to any previous test instance.
The distinction between grouping and grading models is relative rather than
absolute, and some models combine both features.
A taxonomy describing machine learning methods in terms of the extent to which they are grading or grouping models, logical, geometric or a combination, and supervised or unsupervised.