CS L03 MachineLearning Basics 01
Session : 03
Title : Basics for Machine Learning - I
Agenda
• Primary Data Types
• Structured, Unstructured, Labeled and Unlabeled data
• Data Selection
• Data Sampling
• Machine Learning Fundamentals
• Supervised
• Unsupervised
• Reinforcement
• Labelled Data
• Any data which has a characteristic,
category, or attributes assigned to it
can be referred to as labelled data.
• Examples: photo of a cat, height of a
person, price of a product etc.
• Unlabelled Data
• Any data that does not have any labels
specifying its characteristics, identity,
classification, or properties can be
considered unlabelled data.
• Examples: photos, videos, or text that
do not have any category or
classification assigned to them.
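To make the distinction concrete, here is a minimal sketch. The file names and labels are hypothetical and only illustrate the shape of the data.

```python
# Labelled data: each example pairs the raw input with an assigned category.
# (File names and labels are made up for illustration.)
labelled_photos = [
    ("img_001.jpg", "cat"),
    ("img_002.jpg", "dog"),
    ("img_003.jpg", "cat"),
]

# Unlabelled data: the same kind of input, with no category attached.
unlabelled_photos = ["img_004.jpg", "img_005.jpg"]

# Labelled data can be split into inputs X and targets y;
# unlabelled data only provides inputs.
X = [path for path, _ in labelled_photos]
y = [label for _, label in labelled_photos]
```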
• Filter methods
• Wrapper methods
• Embedded methods
• Ref: https://www.kdnuggets.com/2023/06/advanced-feature-selection-techniques-machine-learning-models.html
Technique: Description
Chi-Square Test
• Statistical test used to assess the relationship between two categorical
variables.
• Analyzes the relationship between a categorical feature and the target
variable.
• A greater Chi-square score shows a stronger link between the feature and
the target, i.e. the feature is more important for the classification job.
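As a rough illustration of how a Chi-square score ranks categorical features, the sketch below computes the statistic directly from observed and expected counts. The function name and the toy email data are assumptions made for this example. In practice a library routine would normally be used instead.

```python
from collections import Counter


def chi_square_score(feature, target):
    """Chi-square statistic for two categorical sequences of equal length."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            # Count expected in this cell if feature and target were independent.
            expected = f_count * t_count / n
            score += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return score


# Hypothetical toy data: eight emails, two candidate features.
is_spam = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
contains_free = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
sent_on_weekend = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]

free_score = chi_square_score(contains_free, is_spam)
weekend_score = chi_square_score(sent_on_weekend, is_spam)
```

Here `free_score` comes out higher than `weekend_score`, so a filter method would keep `contains_free` ahead of `sent_on_weekend`.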
Technique: Description
Forward Selection
• Start with an empty feature set and iteratively add features to the set.
Backward Selection
• Opposite of the forward selection method.
• Start with the entire feature set and iteratively remove features.
Exhaustive Feature Selection
• Compares the performance of all possible feature subsets.
• Chooses the best performing feature subset.
Recursive Feature Elimination
• Starts with the whole feature set.
• Eliminates features repeatedly depending on their relevance as determined
by the learning algorithm.
Recursive Feature Elimination with Cross Validation
• Selects the best subset of features for the estimator by removing 0 to N
features iteratively using recursive feature elimination.
• Selects the best subset based on the accuracy or cross-validation score of
the model.
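The wrapper idea can be sketched with forward selection: begin with no features and greedily add whichever one improves a score the most. This is a minimal sketch, not a reference implementation. The `score` callable stands in for training and validating a model on the given subset, and `toy_score` with its feature names is purely hypothetical.

```python
def forward_selection(features, score):
    """Greedy forward selection; stops when no remaining feature improves the score."""
    selected = []
    remaining = list(features)
    best_score = score(selected)
    while remaining:
        # Score every candidate subset formed by adding one more feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # No improvement, so stop adding features.
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical stand-in for a model's validation accuracy."""
    usefulness = {"age": 0.20, "income": 0.15}   # assumed informative features
    gain = sum(usefulness.get(f, 0.0) for f in subset)
    return 0.5 + gain - 0.01 * len(subset)       # small cost per extra feature


chosen, final_score = forward_selection(["age", "income", "shoe_size"], toy_score)
```

Backward selection follows the same loop in reverse: start from the full set and drop the feature whose removal hurts the score least.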
Minimize: $\mathrm{RSS} + \lambda \sum_{j} |\beta_j|$ (residual sum of squares plus an L1 penalty on the coefficients)
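A small sketch of the quantity being minimized, assuming a linear model without an intercept. The names `lasso_objective`, `weights`, and `lam` are illustrative and do not come from the slides.

```python
def lasso_objective(X, y, weights, lam):
    """Return RSS + lam * (sum of absolute weights) for a linear model."""
    rss = 0.0
    for row, target in zip(X, y):
        prediction = sum(w * x for w, x in zip(weights, row))
        rss += (target - prediction) ** 2
    l1_penalty = lam * sum(abs(w) for w in weights)
    return rss + l1_penalty


# Hypothetical toy data in which only the first feature matters.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 0.0, 2.0]

dense_cost = lasso_objective(X, y, [2.0, 0.5], lam=1.0)   # uses both features
sparse_cost = lasso_objective(X, y, [2.0, 0.0], lam=1.0)  # second weight is zero
```

The L1 term charges for every non-zero weight, so the minimizer tends to set unhelpful weights to exactly zero. That is why this objective works as an embedded feature selection method.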
• Selection bias
• Bias introduced when the individuals in the sample are not selected at random.
• The resulting sample may not be representative of the population that is to be analyzed.
• Sampling error
• Statistical error that occurs when the researcher doesn’t select a sample
that represents the entire population of data.
• Results based on the sample don’t represent the results that would have
been obtained from the entire population.
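The difference between a random sample and a biased one can be illustrated with a toy population. The numbers, the fixed seed, and the "largest values only" selection rule are assumptions for this sketch.

```python
import random
from statistics import mean

# Hypothetical population: the integers 1..100.
population = list(range(1, 101))
population_mean = mean(population)

# Simple random sample: every individual is equally likely to be chosen.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_sample = rng.sample(population, 20)

# Selection bias: a non-random choice, here only the 20 largest values.
biased_sample = sorted(population)[-20:]

# Sampling error: gap between the sample estimate and the population value.
random_error = abs(mean(random_sample) - population_mean)
biased_error = abs(mean(biased_sample) - population_mean)
```

Even a random sample carries some sampling error, but the biased sample misses the population mean by a much larger margin.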
Why Machine Learning?
Email Spam: Traditional Approach
• Analyze what spam typically looks like.
– Some words or phrases (such as “4U,” “credit card,” “free,” and “amazing”)
tend to come up a lot in the subject.
– Other similar patterns appear in the sender’s name, the email’s body, and so on.
• Write a detection algorithm for each of the patterns observed, so that the
program flags emails as spam if a number of these patterns are detected.
• Test the program, and repeat the steps above until it is good enough.
Figure 1-1. The traditional approach
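The traditional approach described above amounts to a hand-written rule list. The following is a minimal sketch. The phrases come from the slide, while the function names and the threshold of two matches are assumptions.

```python
# Hand-picked patterns, taken from the examples on the slide.
SPAM_PHRASES = ["4u", "credit card", "free", "amazing"]


def count_spam_patterns(subject):
    """Count how many of the hand-written patterns occur in the subject line."""
    text = subject.lower()
    return sum(1 for phrase in SPAM_PHRASES if phrase in text)


def is_spam(subject, threshold=2):
    """Flag the email when at least `threshold` patterns are detected."""
    return count_spam_patterns(subject) >= threshold


flagged = is_spam("Amazing offer: FREE credit card 4U")
not_flagged = is_spam("Meeting notes for Monday")
```

Every new trick used by spammers needs a new entry in `SPAM_PHRASES`, which is why such rule lists become long and hard to maintain.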
• A spam filter based on Machine Learning techniques automatically notices that
“For U” has become unusually frequent in spam flagged by users, and it starts
flagging such emails without programmatic intervention.
Another area where Machine Learning shines is for problems that either are too
complex for traditional approaches or have no known algorithm. For example,
consider speech recognition: say you want to start simple and write a program
capable of distinguishing the words “one” and “two.” You might notice that the
word “two” starts with a high-pitch sound (“T”), so you could hardcode an
algorithm that measures high-pitch sound intensity and use that to distinguish
ones and twos. Obviously this technique will not scale to thousands of words
spoken by millions of very different people.
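A learned spam filter that picks up “For U” on its own can be sketched by comparing how often words and word pairs occur in user-flagged spam versus normal mail. This is a deliberately simplified illustration, not a real spam classifier. The function names, the ratio threshold, and the example subjects are all assumptions.

```python
from collections import Counter


def tokens(text):
    """Lower-cased words plus adjacent word pairs, so "for u" counts as one token."""
    words = text.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs


def learn_spam_tokens(flagged_spam, normal_mail, min_ratio=3.0):
    """Return tokens that are much more frequent in flagged spam than in normal mail."""
    spam_counts = Counter(t for subject in flagged_spam for t in tokens(subject))
    ham_counts = Counter(t for subject in normal_mail for t in tokens(subject))
    learned = set()
    for token, count in spam_counts.items():
        # Adding 1 avoids division by zero for tokens never seen in normal mail.
        if count / (ham_counts[token] + 1) >= min_ratio:
            learned.add(token)
    return learned


def looks_like_spam(subject, learned):
    return any(t in learned for t in tokens(subject))


# Hypothetical training data: subjects flagged by users vs. normal mail.
flagged_spam = ["Deal For U today", "Prize For U inside", "Loan For U now", "Gift For U"]
normal_mail = ["Agenda for Monday", "Notes for the meeting"]

learned = learn_spam_tokens(flagged_spam, normal_mail)
```

No rule for “For U” was written by hand. The token appears in `learned` only because it is frequent in the flagged examples, and retraining on newly flagged mail would pick up the next variation the same way.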
Why Machine Learning?
• ML can solve problems that are too complex for the traditional approach,
such as speech recognition.
– The traditional approach can never scale to the required level.
– A Machine Learning algorithm can learn by itself using many examples of audio.
• ML can help humans learn, as algorithms can be inspected to see what they
have learned.
– For instance, a spam filter can be inspected to reveal the list of words and
combinations of words that it believes are the best predictors of spam.
– Sometimes this may reveal unsuspected correlations or new trends, leading to
a better understanding of the problem.
• ML techniques can dig into large amounts of data and help discover patterns
that may not be immediately apparent. This is data mining.
Figure 1-4. Machine Learning can help humans learn
To summarize, Machine Learning is great for:
• Problems for which existing solutions require a lot of hand-tuning or long
lists of rules: one Machine Learning algorithm can often simplify code and
perform better.
• Complex problems for which there is no good solution at all using a
traditional approach: the best Machine Learning techniques can find a solution.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Algorithm 1. Decision tree induction. BITS Pilani, Pilani Campus
Machine Learning in a Nutshell
Image recognition: Machine learning can be used for face detection in an image. For recognition,
each person in a database of several people is treated as a separate category.
Speech recognition: The translation of spoken words into text. It is used in voice searches
and more. Voice user interfaces include voice dialling, call routing, and appliance control. It can also
be used for simple data entry and the preparation of structured documents.
Financial industry and trading: companies use ML in fraud investigations and credit checks.
Examples: wearable fitness trackers like Fitbit, or intelligent home assistants like Google Home.