Machine Learning
Algorithms that are able to perform certain tasks without being
explicitly programmed.
• Machine learning combines statistics and computing to enable
computers to perform a given task without being explicitly programmed
to do so.
• Machine learning algorithms tend to improve at a task as they are
given more data or experience.
• Machine learning can be divided into categories such as Supervised
and Unsupervised learning.
Supervised Learning
• Supervised learning algorithms are trained on labelled data.
• Labelled data – Data for which the target answer is known. For
example, if you are shown a picture of a cat and told that it is a cat,
that is labelled data.
• Unlabelled data – Data for which the target answer is not known. For
example, if you are shown an image but given no description of what
it shows.
Input and Output:
The input is the data we want to learn from (e.g., pictures of
animals).
The output is the correct answer or label (e.g., "dog", "cat").
Training Process:
The algorithm tries to learn the relationship between inputs and
outputs by adjusting its internal parameters.
It gets feedback on how well it’s doing by comparing its
predicted outputs with the actual labels.
Goal:
The goal is to generalize well, so it can make accurate
predictions on new, unseen data.
Example: Suppose you want to teach a computer to recognize spam
emails:
You give it a dataset of emails (inputs) that are labeled as
"spam" or "not spam" (outputs).
The model learns patterns in the data that distinguish spam
from non-spam.
Once trained, it can predict whether a new email is spam.
Supervised learning like this is used for classification. In the spam
example, the model learns which attributes distinguish spam from normal
emails, so that it can later recognise which new emails are spam.
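The spam example can be sketched in a few lines of Python. This is only a minimal illustration: the notes do not name a specific algorithm, so a simple naive Bayes word-count model is assumed here, and the six training emails are made up for the example.

```python
import math
from collections import Counter

# Hypothetical toy training set of (email text, label) pairs. Not real data.
TRAIN = [
    ("win money now", "spam"),
    ("free prize click now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday", "not spam"),
    ("project report attached", "not spam"),
]

def train(examples):
    """Count words per label and emails per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts = model
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_emails = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the prior: how common this label is in the training set.
        score = math.log(label_counts[label] / total_emails)
        total_words = sum(counts.values())
        for word in text.split():
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict(model, "free money now"))
print(predict(model, "monday project meeting"))
```

The labelled emails are the inputs and outputs, `train` is the learning step, and `predict` is applied to new, unseen emails.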
As another example, suppose we want to classify vertebrates, for
instance to identify reptiles.
The table shows sample data for classifying vertebrates into
mammals, reptiles, birds, fish, and amphibians. The attribute set
includes characteristics of a vertebrate such as its body temperature,
skin cover and ability to fly. The data set can also be used for a binary
classification task such as mammal classification, by grouping the
reptiles, birds, fish, and amphibians into a single category called
nonmammals.
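Turning the five-class problem into a binary one is just a relabelling step. A small sketch, assuming the class names from the text; the sample list of labels is illustrative only and is not taken from the table.

```python
def to_binary(label):
    """Map a vertebrate class to 'mammal' or 'nonmammal'."""
    return "mammal" if label == "mammal" else "nonmammal"

# Illustrative labels only; the real table is not reproduced here.
labels = ["mammal", "reptile", "fish", "bird", "mammal", "amphibian"]
binary_labels = [to_binary(label) for label in labels]
print(binary_labels)
```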
Unsupervised Learning
Unsupervised learning is a type of machine learning where the model
is not given any labels. It only sees the input data, and it tries to
find patterns, structures, or groupings on its own.
Key Concepts:
1. No Labeled Output:
o The data has no predefined labels.
o The model must explore the data and find hidden
patterns.
2. Goal:
o The goal is to discover structure in the data—like
grouping similar items together or reducing the
complexity of the data.
Common Tasks:
1. Clustering – Grouping similar items.
o Example: Grouping customers into segments based on
buying behavior.
o Algorithms: K-Means, Hierarchical Clustering, DBSCAN
2. Dimensionality Reduction – Simplifying data by reducing the
number of features.
o Example: Reducing a 100-feature dataset to 2 or 3
features for visualization.
o Algorithms: PCA (Principal Component Analysis), t-SNE
Example:
Imagine you have a big pile of customer data (age, purchase history,
website visits), but you don’t know anything about them. You want to
group similar customers together to send them tailored marketing
emails.
You give this data to an unsupervised learning algorithm, and it finds
3 natural customer groups:
Group 1: Young, low spenders
Group 2: Middle-aged, frequent buyers
Group 3: Older, high-value customers
You didn’t tell the algorithm what to look for—it found those
patterns by itself.
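The customer grouping above could be sketched with K-Means. This is a minimal pure-Python version under stated assumptions: the nine customers, the two features (age, purchases per month) and the choice of starting centroids are all made up for illustration.

```python
import math

# Hypothetical customers as (age, purchases per month); values are made up.
CUSTOMERS = [
    (22, 1), (25, 2), (24, 1),
    (41, 9), (45, 10), (43, 8),
    (66, 4), (70, 5), (68, 3),
]

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return [closest(p, centroids) for p in points]

# Assumption: start from three arbitrary customers as initial centroids.
assignments = k_means(CUSTOMERS, [CUSTOMERS[0], CUSTOMERS[3], CUSTOMERS[6]])
print(assignments)
```

No labels are passed in anywhere: the group numbers come only from how close the customers are to each other.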
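When to use supervised vs. unsupervised learning: use supervised
learning when labelled data is available and you want to predict a
known target; use unsupervised learning when there are no labels and
you want to discover structure in the data.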
Decision Tree
Decision tree learning example:
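If all examples in a branch are negative, the branch becomes a leaf
classified as N; if all are positive, it becomes a leaf classified as Y.
A branch with mixed examples has to be split further.
Which attribute is best to choose: we want to pick the best attribute at
each split because we want the tree to be as short as possible, so it
doesn't get too large.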
Values from a branch:
For example, the Sunny branch has [2+, 3-] because:
Yes (positive): D9, D11 → 2
No (negative): D1, D2, D8 → 3
So Sunny is [2+, 3-].
Next we measure how uncertain each branch is.
For example, the Overcast branch is [4+, 0-], so it is completely
certain (100% positive). Pure branches like this are the ones we want.
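The branch counts can be reproduced by counting labels per attribute value. The sketch below assumes the example table is the standard 14-day PlayTennis data set; only the Sunny and Overcast rows are confirmed by the day numbers and counts in these notes, and the Rain rows are an assumption.

```python
from collections import Counter

# (Outlook, PlayTennis) per day. Assumption: standard PlayTennis table;
# only the Sunny and Overcast rows are confirmed by the notes.
DAYS = {
    "D1": ("Sunny", "No"),      "D2": ("Sunny", "No"),
    "D3": ("Overcast", "Yes"),  "D4": ("Rain", "Yes"),
    "D5": ("Rain", "Yes"),      "D6": ("Rain", "No"),
    "D7": ("Overcast", "Yes"),  "D8": ("Sunny", "No"),
    "D9": ("Sunny", "Yes"),     "D10": ("Rain", "Yes"),
    "D11": ("Sunny", "Yes"),    "D12": ("Overcast", "Yes"),
    "D13": ("Overcast", "Yes"), "D14": ("Rain", "No"),
}

def branch_counts(days, value):
    """Return (positives, negatives) for the branch Outlook == value."""
    labels = Counter(label for outlook, label in days.values() if outlook == value)
    return labels["Yes"], labels["No"]

print(branch_counts(DAYS, "Sunny"))     # (2, 3)
print(branch_counts(DAYS, "Overcast"))  # (4, 0)
```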
Entropy formula
Entropy tells us how mixed a set of examples is. If a set is pure (all yes or all no,
e.g. 10Y, 0N), entropy is 0. If it is split 50/50 (e.g. 5Y, 5N), entropy is 1, the
maximum uncertainty.
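The formula itself is not reproduced in these notes. Assuming the standard two-class definition, where p+ and p- are the proportions of positive and negative examples in the set S and 0 log 0 is treated as 0, it reads:

```latex
\[
\mathrm{Entropy}(S) = -\,p_{+}\log_2 p_{+} \;-\; p_{-}\log_2 p_{-}
\]
% Pure set (10Y, 0N):  p_+ = 1, p_- = 0  =>  Entropy = 0
% Even split (5Y, 5N): p_+ = p_- = 1/2   =>  Entropy = 1
```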
Worked example: for a branch with counts 6 and 2, the total is 6 + 2 = 8, so the
proportions are 6/8 and 2/8. Plug these proportions into the entropy formula.
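The same calculation can be checked in code. A minimal sketch, assuming the standard two-class entropy (minus the sum of p * log2(p) over both classes); the function name `entropy` is just a label chosen for this example.

```python
import math

def entropy(positives, negatives):
    """Two-class entropy in bits for a set with the given counts."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:  # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(10, 0))           # 0.0   (pure set)
print(entropy(5, 5))            # 1.0   (50/50 split)
print(round(entropy(6, 2), 3))  # 0.811 (counts 6 and 2, total 8)
```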
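What if you get 3 branches? The same approach applies: compute the entropy of each
branch and weight it by the fraction of examples that fall into that branch.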
Once you have the Gain values for all attributes, the best attribute to split on is
the one with the highest Gain, and the worst is the one with the lowest.
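A sketch of the Gain calculation for a three-branch split, assuming the usual definition: Gain = entropy of the parent set minus the weighted entropy of the branches. The Sunny [2+, 3-] and Overcast [4+, 0-] counts come from these notes; the Rain [3+, 2-] and parent [9+, 5-] counts are an assumption based on the standard PlayTennis table.

```python
import math

def entropy(positives, negatives):
    """Two-class entropy in bits; 0 * log2(0) is treated as 0."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(parent, branches):
    """parent and each branch are (positives, negatives) count pairs."""
    total = sum(parent)
    # Weight each branch's entropy by its share of the examples.
    weighted = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - weighted

# Outlook split: Sunny, Overcast, Rain (Rain and parent counts are assumed).
gain = information_gain((9, 5), [(2, 3), (4, 0), (3, 2)])
print(round(gain, 3))  # 0.247
```

Computing this for every candidate attribute and taking the largest value gives the attribute to split on.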
Gini Index
The Gini Index is another impurity measure, like Entropy, that is used to decide
which attribute to split on in a decision tree.
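The notes do not give the Gini formula. A minimal sketch, assuming the standard Gini impurity (1 minus the sum of squared class proportions):

```python
def gini(counts):
    """Gini impurity for a list of class counts, e.g. [positives, negatives]."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([4, 0]))            # 0.0  (pure branch)
print(gini([5, 5]))            # 0.5  (50/50 split, the two-class maximum)
print(round(gini([2, 3]), 2))  # 0.48
```

As with Entropy, 0 means a pure branch, but the maximum for two classes is 0.5 rather than 1.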