m3 final-1
• Why Linear Regression and k-NN are poor choices for filtering spam,
Naive Bayes and why it works for filtering spam, Data Wrangling: APIs
and other tools for scraping the Web.
Differences between supervised learning and
unsupervised learning:
the difference between k-means and k-NN
[Figure: data points divided into Group 1–Group 4; a new data point arrives and is assigned to Group 2]
• There will be a total of 4 groups.
• New data has arrived, and we will assign it to Group 2.
The intuition behind k-NN (Idea of k-NN)
1. The main idea of k-NN is to find other items that are most similar to
the new item (based on their characteristics) and look at their labels. The
label that most similar items have is likely the label for the new item too.
2. Majority Vote : For the new item, take a “majority vote” from the
similar items’ labels. The label with the most votes is chosen. If there’s a
tie, pick one label randomly from the tied labels.
The intuition behind k-NN (Idea of k-NN)
3. Automating k-NN:
To do this automatically, you need to decide two things:
• Similarity Measure : First, define how you’ll measure similarity or closeness
between items (like age or income differences).
• Number of Neighbors (k): Then, decide how many neighbors (similar items) should
“vote” on the label.
• This number is called “k.”
• Once you have these, you can find the closest items (neighbors) for any new item,
and let their labels decide the label of the new item.
Example with credit scores
• Imagine you have information about people's ages, incomes, and
whether their credit rating is "high" or "low." You want to use their
age and income to predict the credit rating for a new person.
Plot each person as a point on a graph, labeling each point with its
credit rating (for example, a circle for "low" credit).
Identifying the credit rating of a new person (?)
• Now, say a new person shows up who is 57 years old and makes
$37,000. What credit rating should they get (low or high)?
• By looking at people with similar ages and incomes nearby on the
graph, you can make a good guess.
• To do this more precisely, you can use an algorithm called k-nearest
neighbors (k-NN).
Consider the neighbouring data points
[Figure: the new point plotted among nearby points; label shown: "high"]
Process for using k-NN:
1. Choose a Similarity Measure: Decide how you’ll measure how close or similar
items are to each other.
2. Split the Data: Divide your labeled data into two parts: one for training (to teach
the model) and one for testing (to check how well it works).
3. Choose an Evaluation Metric: Pick a way to measure how well the model is
working. A common choice is the "misclassification rate," which tells you how
often the model makes mistakes.
4. Test Different k Values: Run k-NN multiple times with different values for k (the
number of neighbors) and see how each performs using the evaluation metric.
5. Find the Best k: Choose the k value that gives the best results based on the
evaluation metric.
6. Make Predictions: Now that you have the best k, use the training data to predict
labels for new data that only has ages and incomes, but no labels yet.
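As a rough illustration of steps 3–6, below is a minimal R sketch using the knn() function from the class package; the objects train, test, cl (training labels) and true.labels (held-out test labels) are assumed to have been prepared as in step 2.

library(class)
# Try several k values and compare their misclassification rates
for (k in c(1, 3, 5, 7)) {
  pred <- knn(train, test, cl, k = k)
  misclassification.rate <- mean(pred != true.labels)
  cat("k =", k, " misclassification rate =", misclassification.rate, "\n")
}
# Pick the k with the lowest rate, then reuse it to label genuinely new data points.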
Other metrics to calculate distance
1.Euclidean Distance
Measures the "straight-line" distance between two points in space, like the shortest path between
two locations on a map.
2.Cosine Similarity
Measures how similar two things are by looking at the angle between them, rather than the
distance. Useful for comparing direction rather than size (e.g., how similar two sentences are).
3.Jaccard Distance or Similarity
Compares two sets by looking at what they have in common versus what they don’t. It’s the ratio of
shared items to total unique items.
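For illustration, here are tiny R helpers for cosine similarity and Jaccard similarity (written for this example, not taken from a particular package):

cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
jaccard_similarity <- function(set1, set2) {
  length(intersect(set1, set2)) / length(union(set1, set2))
}
cosine_similarity(c(1, 2, 3), c(2, 4, 6))                 # 1: same direction
jaccard_similarity(c("a", "b", "c"), c("b", "c", "d"))    # 2 shared / 4 unique = 0.5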
4.Mahalanobis Distance
A distance measure that accounts for the spread and relationships between data, making it more accurate
when data isn’t evenly distributed.
5.Hamming Distance
•It measures how many positions are different between two strings or sequences of the same length.
•Simply count the number of mismatched characters.
Example: compare "olive" and "ocean":
• Both strings have 5 characters.
• The letters at each position:
  o = o (match, no difference)
  l ≠ c (1 difference)
  i ≠ e (1 difference)
  v ≠ a (1 difference)
  e ≠ n (1 difference)
• Total differences = 4.
• Use Cases:
• DNA sequence comparison.
• Detecting errors in transmitted data (e.g., bit differences in binary
code).
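A small illustrative R function for Hamming distance between two equal-length strings (the function name is ours, not from a library):

hamming_distance <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))   # defined only for equal-length strings
  sum(a != b)                         # count mismatched positions
}
hamming_distance("olive", "ocean")    # 4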
6.Manhattan Distance
Measures the "grid-like" distance between two points, like walking blocks in a city rather than going
straight through buildings.
6. Manhattan Distance :
•Imagine you are walking in a city where you can only move along the streets (no cutting through buildings or diagonally).
•Manhattan Distance is the total number of blocks you'd walk to get from one point to another.
• How It Works:
1.Take two points (for example, two houses on a grid).
2.For each direction (up-down or left-right), find the difference between their positions.
3.Add up all these differences.
Formula: Manhattan distance = |x1 − x2| + |y1 − y2|
Example:
If one house is at (2, 4) and another is at (5, 1):
•Difference in x: |2 - 5| = 3
•Difference in y: |4 - 1| = 3
•Total Distance = 3 + 3 = 6
Why It's Useful:
It’s great for finding distances in places where you can’t move diagonally, like city streets, or when comparing data in a grid-like way.
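A quick R check of the example above, comparing Manhattan and straight-line (Euclidean) distance for the two houses:

p <- c(2, 4)
q <- c(5, 1)
sum(abs(p - q))        # Manhattan: |2 - 5| + |4 - 1| = 6
sqrt(sum((p - q)^2))   # Euclidean: sqrt(9 + 9) ≈ 4.24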
k-Nearest Neighbors (k-NN): supervised
algorithm
• Used to identify or classify groups
• Nearest neighbour
• Example 2: we can ask the neighbours for their opinion, such as
determining whether an apple is a fruit or a vegetable.
• We need to measure the distance to identify the nearest neighbours.
Algorithms
1. k-NN
2. k-means
k-NN (supervised algorithm): steps
Example: to predict a movie's genre using the k-NN algorithm, we
follow a series of steps.
1. Data collection: gather a dataset of movies with features such as duration,
average rating, and genre (already available, labelled data).
2. We have to predict the new movie's genre based on this previous information.
3. Find neighbours by calculating distance, e.g. using Euclidean distance.
4. Identify the nearest neighbours based on the k value.
Note: calculate the Euclidean distance d between two points P = (x1, y1) and Q = (x2, y2):
d(P, Q) = sqrt((x2 − x1)² + (y2 − y1)²)
[Figure: points P and Q plotted on x–y axes, with d the straight-line distance between them]
k-NN: supervised algorithm
Predicting movie genre
Predict the genre of the movie "summer", which has an IMDB rating of 7.4 and
a duration of 114 minutes.
[Table: Movies 1–4 with their IMDB ratings, durations, and genres]
• Identify the nearest neighbours based on the k value
• We can check a single neighbour (k = 1)
• or take a vote among three neighbours (k = 3)
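Below is a hedged, self-contained R sketch with made-up movie data (the ratings, durations, and genres are illustrative only) showing how knn() from the class package would label the new movie:

library(class)
movies <- data.frame(
  rating   = c(8.0, 6.2, 7.9, 5.8),
  duration = c(120, 95, 130, 90),
  genre    = c("Action", "Comedy", "Action", "Comedy")
)
new_movie <- data.frame(rating = 7.4, duration = 114)
# k = 1: label of the single nearest movie; k = 3: majority vote of the three nearest
knn(movies[, c("rating", "duration")], new_movie, cl = factor(movies$genre), k = 1)
knn(movies[, c("rating", "duration")], new_movie, cl = factor(movies$genre), k = 3)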
•Purpose of k-means:
•K-means is used for grouping or "clustering" data points based on their similarities. It helps us identify
groups of similar users (or data points) automatically.
•For example, marketing can use clustering to target specific groups with different ads or offers.
•How It Works:
•Each data point (user) is defined by features, like age, income, gender, etc.
•Instead of manually grouping users based on these features, k-means automatically finds these groups based
on how close data points are to each other.
•Why Use k-means:
•It’s useful when you want to offer personalized experiences, like different
ads for different user groups.
•Models can work better if tailored to specific groups.
•In complex data with many features, an algorithm can help find patterns
that would be hard to see manually.
•In R, k-means can be run with simple code, such as the command:
•kmeans(x, centers), where x is your dataset and centers is the number of
clusters.
Key arguments of kmeans():
•centers:
•Specifies the number of clusters (k).
•Example: If you expect 3 clusters, set centers = 3.
•iter.max:
•The maximum number of iterations the algorithm will run to find the best clusters.
•Default is 10, but you can increase it if needed.
•nstart:
•Determines how many random starting positions to try for the centroids.
•Increasing this can improve results by finding better clusters.
•algorithm:
•Specifies the method for calculating the clusters. Options include:
•Hartigan-Wong (default): Fast and accurate for many cases.
•Lloyd: A simpler method, useful in some scenarios.
•Forgy: A basic approach, less commonly used.
•MacQueen: Often used for online clustering.
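A minimal sketch of calling kmeans() with the arguments described above, on a small illustrative data frame (the numbers are made up):

set.seed(42)
x <- data.frame(age    = c(20, 18, 30, 35, 40, 22, 38),
                amount = c(500, 300, 800, 900, 1000, 450, 950))
fit <- kmeans(x, centers = 2, iter.max = 10, nstart = 5,
              algorithm = "Hartigan-Wong")
fit$cluster   # which cluster each customer was assigned to
fit$centers   # the final centroid of each cluster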
k-means: unsupervised algorithm
• For example :
• c1 might be assigned to cluster k1, and c2 might be assigned to cluster
k2.
Step 2: randomly assign centroid values
Cluster K1 gets centroid C1 = (20, 500) and cluster K2 gets centroid C2 = (40, 1000),
where X = age and Y = amount.
C1 and C2 are initialized as the centroids; we still have to assign C3 to C7.
Step 3: identify the nearest centroid for the other customers (does each customer join
cluster K1 with centroid C1, or cluster K2 with centroid C2?)
For C3 = (30, 800): distance to K2 = sqrt((40 − 30)² + (1000 − 800)²) = sqrt(40100) ≈ 200.25
• So C3's nearest centroid is K2 (the distance to K2 is smaller); add C3 to K2.
K2 then contains C2 = (40, 1000) and C3 = (30, 800), and its centroid is updated to (35, 900).
Step 4: updated centroid values
K1: centroid C1 = (20, 500); K2: centroid (35, 900), containing C2 and C3.
Step 5: find the nearest cluster for C4
Case 1: take K1 (centroid C1) as reference
• C4 = (18, 300) (age, amount)
• C1 = (20, 500) (age, amount)
• x2 − x1 = 18 − 20 = −2, (−2)^2 = 4
• y2 − y1 = 300 − 500 = −200, (−200)^2 = 40,000
• 4 + 40,000 = 40,004
• distance to K1 = sqrt(40,004) ≈ 200.01
Step 5: find the nearest cluster for C4
Case 2: take K2 (its centroid) as reference
• C4 = (18, 300) (age, amount)
• K2 centroid = (35, 900) (age, amount)
• distance to K2 = sqrt((35 − 18)^2 + (900 − 300)^2) = sqrt(360,289) ≈ 600.24
• So K1 is nearer (200.01 < 600.24)
• Assign C4 to K1
Step 5 (result): the nearest cluster for C4 is K1, so add C4 to K1
K1: centroid C1 = (20, 500), now containing C1 and C4; K2: centroid (35, 900), containing C2 and C3.
Step 6: update the centroid value
• After assigning a customer to its nearest cluster, update that cluster's centroid.
K1's centroid moves from C1 = (20, 500) to the mean of C1 = (20, 500) and C4 = (18, 300),
which is (19, 400).
Updated centroid values after Step 6:
K1 = {C1, C4}, centroid (19, 400); K2 = {C2, C3}, centroid (35, 900).
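As a quick check, a small R sketch (the helper function and point names are ours) reproducing the distances used in the walkthrough above:

euclid <- function(p, q) sqrt(sum((p - q)^2))
c1 <- c(20, 500); c2 <- c(40, 1000)   # initial centroids of K1 and K2
c3 <- c(30, 800); c4 <- c(18, 300)
euclid(c3, c2)                        # ≈ 200.25, so C3 joins K2
euclid(c3, c1)                        # ≈ 300.17, farther away
euclid(c4, c1)                        # ≈ 200.01, so C4 joins K1
euclid(c4, colMeans(rbind(c2, c3)))   # ≈ 600.24, distance to K2's updated centroid (35, 900)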
Explanation of the k-NN R Code Example
• The code demonstrates how to use the k-NN (k-Nearest Neighbors)
algorithm in R to classify data points into categories like "high credit"
or "low credit." Below is a detailed step-by-step explanation of the
process:
1.Prepare the Dataset
• The dataset includes columns like Age, Income, and Credit (target labels: "high" or "low").
> head(data)
  age income credit
1  69      3    low
2  66      3    low
3  49      4    low
4  49     26   high
5  58     57   high
6  44     71   high
2.Split the Data into Training and Testing Sets:
• Define the size of the dataset:
• Use 80% of the data for training, leaving 20% for testing.
3.Calculate the number of test labels:
• This gives the size of the test set (20% of 1000 = 200 points).
• 2. sampling.rate * n.points:
• This calculates how many rows to pick for training.
• For instance, if sampling.rate = 0.8 (80%) and n.points = 10, it calculates 0.8 * 10 = 8,
so we need to pick 8 rows.
• sample() function:The sample() function randomly picks numbers
from the list (1:n.points).
• size (how many numbers to pick): This is sampling.rate * n.points (8
in our example).
• replace = FALSE: Ensures no number is picked more than once.
• Example: If 1:n.points = 1, 2, 3, ..., 10, sample() might pick:2, 5, 7, 1,
9, 3, 6, 8.
• Result:training will store these randomly selected numbers (row
indices).Example: training = c(2, 5, 7, 1, 9, 3, 6, 8).
4. Create the training set:
• train <- subset(data[training, ], select = c(Age, Income))
• Extract training labels: cl <- data$Credit[training]
• data$Credit: data is your full dataset, and Credit is a column in the dataset (the target
variable you're trying to predict, i.e. the credit label).
• [training]: training is the list of row numbers you selected earlier for your training data.
• This selects the Credit values (target labels) for the rows that are in the training set.
• cl <- ...: this stores the selected Credit values into a variable called cl, so cl now holds
the labels (target values) for the training set.
• Example: if training = c(2, 4, 5, 6), cl will contain the Credit values for rows 2, 4, 5,
and 6 from the Credit column in the dataset.
Extract testing labels: true.labels <- data$Credit[testing]
• data$Credit: refers to the Credit column in the dataset, which contains the target values
(the credit labels).
• [testing]: testing is the list of row numbers selected for the test set (the rows not included in
the training set).
• This selects the Credit values (target labels) for the rows that are in the testing set.
• true.labels <- ...: this stores the selected Credit values into a variable called true.labels.
• true.labels now holds the actual labels (target values) for the testing set.
• Example:
• If testing = c(1, 3, 7, 8), true.labels will contain the Credit values for rows 1, 3, 7, and 8
from the Credit column in the dataset.
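Pulling the fragments described above into one place, a hedged reconstruction of the data-preparation code might look like this (object names follow the text; the original code may differ in details):

n.points      <- nrow(data)        # e.g. 1000 rows
sampling.rate <- 0.8               # 80% of the data for training
# Randomly pick row indices for training; the remaining rows form the test set
training <- sample(1:n.points, size = sampling.rate * n.points, replace = FALSE)
testing  <- setdiff(1:n.points, training)
num.test.set.labels <- (1 - sampling.rate) * n.points   # 20% of 1000 = 200
train <- subset(data[training, ], select = c(Age, Income))
test  <- subset(data[testing, ],  select = c(Age, Income))
cl          <- data$Credit[training]   # training labels
true.labels <- data$Credit[testing]    # held-out labels used for evaluation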
Part 2
• Run the k-NN algorithm:
  k    misclassification.rate
  1    0.28
  3    0.26
  5    0.23   <-- lowest misclassification rate
• Final Prediction:
• With the best k (e.g., k = 5), the final classification for a new data point is made using
the training data.
• Output: low
Note:
• To calculate the misclassification rate, we need to classify the new data point (57, 37)
using different values of k (e.g., k = 1, 3, 5) and compare the predictions with the actual
labels of the test data.
• New Data Point (Test Point): the new point is (57, 37), and we need to classify it. To
calculate the misclassification rate, we assume there are multiple test points with known
true labels; for simplicity, we'll compute the results assuming this point is part of a larger
dataset.
Step 1: Calculate Distances. We calculate the Euclidean distance between the new point
(57, 37) and each point in the training data.
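For illustration, using only the six rows shown earlier in head(data), the Euclidean distances from the new point (57, 37) could be computed like this:

pts <- data.frame(age    = c(69, 66, 49, 49, 58, 44),
                  income = c(3, 3, 4, 26, 57, 71),
                  credit = c("low", "low", "low", "high", "high", "high"))
new <- c(57, 37)
sqrt((pts$age - new[1])^2 + (pts$income - new[2])^2)
# ≈ 36.1, 35.2, 34.0, 13.6, 20.0, 36.4 — the closest of these six rows are labelled "high"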
Outline
• Spam Filters
• Naive Bayes
• Wrangling
Cont….
• You might see that some of the text looks like spam.
• How did you realize this?
• Can you write code to automatically filter out the spam like your
brain does?
A few ideas about what might be clear signs of spam:
• Keywords: an email is spam if it contains irrelevant or suspicious references (e.g., "Ad
Credits," "You have Won").
• Problem: spammers change the spelling to bypass this rule.
• Subject Length: the length of the subject line might indicate spam.
• Punctuation: excessive use of exclamation points or other punctuation might
indicate spam.
• Problem: some words like "Yahoo!" are authentic, so the rule shouldn't be too simplistic.
• Probabilistic Model: instead of simple rules, use many rules that work together to calculate
the probability of an email being spam. This is a great idea.
• Advanced Techniques: techniques like k-nearest neighbors or linear regression, which you
learned about earlier, do not apply well to this problem.
Why Won’t Linear Regression Work for Filtering Spam?
• Dataset: Imagine a table where each row represents a different email message,
identified by an email ID.
• Features as Words: Each word in the email becomes a feature (a column in the
table).
• Example Feature: Create a column called "You_Have_Won."
• Binary Feature: If an email has the word "You_Have_Won" at least once, put a 1
in that column; otherwise, put a 0.
• Frequency Feature: Alternatively, you can put the number of times the word
"You_Have_Won" appears in the email.
• Columns as Words: Each column in the table represents the appearance of a
different word.
• Training Set: To use linear regression, we need a set of emails
where each email is labeled as spam or not spam.
• Human Labeling: One way to get these labels is by having people
manually mark each email as spam or not. This works but takes a lot
of time.
• Existing Spam Filter: Another way is to use labels from an existing
spam filter, like Gmail's spam filter, which already marks emails as
spam or not.
• Binary Target: Our target is binary (0 for not spam, 1 for spam). Linear
regression gives a number, not just 0 or 1.
• Model Issue: Linear regression is for continuous output, not binary. This isn’t
ideal.
• Model Fit: We should use a model appropriate for our data, though in theory we could
still try to fit a linear regression in R.
• Variable Problem: We have too many variables (100,000 words) compared to
observations (10,000 emails). This won’t work.
• Feature Selection: With expert help, we could limit it to 100 important words.
But still, linear regression isn’t the right model for a binary outcome.
How About k-nearest Neighbors?
• Clustering Technique: K-means is a clustering technique used to group similar
data points together.
• Unsupervised Learning: It doesn't work well for spam filtering because it's an
unsupervised learning method, meaning it doesn't use labeled data (spam or not
spam) to train.
• No Clear Labels: Since spam filtering needs clear labels (spam or not spam) for
training, k-means doesn't fit because it doesn't have these labels.
• Centroid-based: K-means works by finding centroids (centers) of clusters, but
in spam filtering, there's no clear way to define centroids for spam and non-
spam emails.
Cont ……
• High Dimensionality: With 10,000 emails and 100,000 words, we have a lot of
dimensions to consider.
• Computational Burden: Working in a 100,000-dimensional space requires a ton
of computational work, especially when computing distances between points.
• Curse of Dimensionality: Even more fundamentally, in such high-dimensional
spaces, our nearest neighbors end up being very far away from each other.
• K-NN Problem: This distance issue makes the k-nearest neighbors (k-NN)
algorithm perform poorly in this scenario.
Note: Case 1: If we pick a ball from one of two bags, what's the chance (probability) it's red?
Step 1:
• Bag 1: 2 red, 3 black (5 balls). Bag 2: 4 red, 3 black (7 balls).
• Probability of choosing a bag: we must select 1 of the 2 bags, so the probability of picking
the first bag (and likewise the second) is 1/2.
• Probability: if we pick a ball from the two bags, what's the chance (probability) it's red?
• P(R) = (contribution of bag 1) + (contribution of bag 2) = 17/35
• P(R) = 1/2 × 2/5 + 1/2 × 4/7 = 1/5 + 2/7 = 17/35
• Bayes' theorem:
• We need to determine the probability P(B1|R), which represents the likelihood that the
red ball came from bag 1, given that a red ball was drawn (a kind of reverse engineering:
we have to identify which bag the red ball came from).
Problem:
• Use Bayes' Theorem to find the probability that the red ball was drawn from Bag 1.
• Understanding the Problem:
• We have two bags:
• Bag 1: 2 Red balls, 3 Black balls = 5 balls
• Bag 2: 4 Red balls, 3 Black balls = 7 balls
• One ball is randomly drawn from one of these bags.
• The drawn ball is red.
• We need to find the probability that the red ball came from Bag 1.
The general formula for Bayes' Theorem in probability is given below, for a single variable
and for multiple variables, both for the "yes" class (Y) and the "no" class (N).

Yes formula:
• Single variable: P(Y|X) = P(X|Y) × P(Y) / P(X)
• For the bag example, with Y = Bag 1 and X = red ball:
  P(B1|R) = P(R|B1) × P(B1) / P(R) = (2/5 × 1/2) / (17/35) = (1/5) / (17/35) = 7/17 ≈ 0.41
• If X consists of multiple variables X1, X2, X3, …, Xn (assumed independent given Y),
  the formula becomes:
  P(Y|X) = P(X1|Y) × P(X2|Y) × P(X3|Y) × … × P(Xn|Y) × P(Y) / P(X)

No formula:
• Single variable: P(N|X) = P(X|N) × P(N) / P(X)
• If X consists of multiple variables, the formula becomes:
  P(N|X) = P(X1|N) × P(X2|N) × P(X3|N) × … × P(Xn|N) × P(N) / P(X)
ID   COVID   FLU   FEVER
1    YES     NO    YES
2    NO      YES   YES
3    YES     YES   YES
4    NO      NO    NO
5    YES     NO    YES
6    NO      NO    YES
7    YES     NO    YES
8    YES     NO    NO
9    NO      YES   YES
10   NO      YES   NO
• Condition: "The information provided about COVID-19 and the flu helps determine
whether fever is present or not."
Step 1: prior probabilities of fever
• Yes formula: P(Y|X) = P(X1|Y) × P(X2|Y) × … × P(Xn|Y) × P(Y)
• P(fever = yes) = 7/10
• P(fever = no) = 3/10
Step 2: conditional probability (COVID = yes given fever = yes)
• P(COVID = yes | fever = yes) = 4/7
• COVID = yes appears 4 times among the 7 rows of the table where fever = yes.
Step 3: conditional probability (FLU = yes given fever = yes)
• P(FLU = yes | fever = yes) = 3/7
• FLU = yes appears 3 times among the 7 rows of the table where fever = yes.
Step 4: conditional probability (FLU = no given fever = no)
• P(FLU = no | fever = no) = 2/3
• FLU = no appears 2 times among the 3 rows of the table where fever = no.
Summary of conditional probabilities (columns: given fever = YES / given fever = NO)
         YES    NO
COVID    4/7    2/3
FLU      3/7    2/3
Step 5: calculation
Case 1: Yes formula
X1 = FLU, X2 = COVID, Y = fever = YES
P(fever = yes) = 7/10
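Completing the calculation with the conditional probabilities summarized above (assuming the query is a new patient with both COVID = yes and FLU = yes):
• Yes: P(fever = yes | COVID = yes, FLU = yes) ∝ P(COVID = yes | yes) × P(FLU = yes | yes) × P(fever = yes) = 4/7 × 3/7 × 7/10 = 12/70 ≈ 0.17
• No: P(fever = no | COVID = yes, FLU = yes) ∝ (1 − 2/3) × (1 − 2/3) × 3/10 = 1/3 × 1/3 × 3/10 = 1/30 ≈ 0.03
• Since 0.17 > 0.03, the Naive Bayes prediction is fever = yes.

The same kind of result can be reproduced with the naiveBayes() function from the e1071 package; a hedged R sketch built on the table above:

library(e1071)
d <- data.frame(
  covid = factor(c("YES","NO","YES","NO","YES","NO","YES","YES","NO","NO")),
  flu   = factor(c("NO","YES","YES","NO","NO","NO","NO","NO","YES","YES")),
  fever = factor(c("YES","YES","YES","NO","YES","YES","YES","NO","YES","NO"))
)
model <- naiveBayes(fever ~ covid + flu, data = d)
predict(model, data.frame(covid = "YES", flu = "YES"))               # predicted class: YES
predict(model, data.frame(covid = "YES", flu = "YES"), type = "raw") # posterior probabilities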
• Basic Statistics:
• Total Emails: 5,172
• Spam Emails: 1,500
• Ham (Not Spam) Emails: 3,672
Step 1: probabilities of spam or not spam
• Probabilities:
• P(spam) = 1500 / 5172 ≈ 0.29
• P(not spam) = 3672 / 5172 ≈ 0.70
Step 2: word analysis
[Table: counts of the word "meeting" in spam vs. not-spam emails]
Note: Bayes formula
P(Y|X) = P(X|Y) × P(Y) / P(X)
Y = spam, X = "meeting"
P(spam | "meeting") = P("meeting" | spam) × P(spam) / P("meeting")
Using Bayes' theorem
Step 3: probability of an email being spam when it contains the specific word "meeting"
P(spam | "meeting") = P("meeting" | spam) × P(spam) / P("meeting")
Note: Case 2: No formula
P(N|X) = P(X|N) × P(N) / P(X)
• Number of Ham Emails containing "meeting": 153
• P("meeting" | not spam) = 153 / 3672 ≈ 0.041
• P(not spam) = not-spam emails / total emails = 3672 / 5172 ≈ 0.70
• P(not spam | "meeting") = P("meeting" | not spam) × P(not spam) / P("meeting")
•                         = (0.041 × 0.70) / P("meeting") ≈ 0.03 / P("meeting")
• Purpose: APIs let you access a website’s data directly without scraping the
actual web pages.
• How They Work: You send a request to the API, and it returns the data in a
structured format (like JSON).
• Benefits: Easier and more reliable than scraping web pages. The data is already in
a clean format.
• Examples: OpenWeather API for weather data.
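For instance, in R (keeping to the language used elsewhere in these notes), an API can be called with the httr and jsonlite packages; the endpoint and parameters below are made-up placeholders, not a real service or key:

library(httr)
library(jsonlite)
resp <- GET("https://api.example.com/weather",
            query = list(city = "London", appid = "YOUR_API_KEY"))
stop_for_status(resp)                          # stop with an error if the request failed
data <- fromJSON(content(resp, as = "text"))   # parse the JSON body into R objects
str(data)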
Web Scraping Tools and Libraries
• When to Use: When no API is available, or you need data from a page’s
content.
• Popular Tools:
• Beautiful Soup: A Python library for parsing HTML and XML documents.
Great for beginners.
• Scrapy: A more powerful Python framework for large-scale scraping projects.
• Selenium: Automates web browsers to interact with web pages and extract
data, useful for pages that need JavaScript to load.
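The tools above are Python libraries; as a rough R analogue (consistent with the R code used elsewhere in these notes), the rvest package can parse a page in a similar way. The URL and CSS selector below are illustrative only:

library(rvest)
page   <- read_html("https://www.example.com")    # placeholder URL
titles <- html_text2(html_elements(page, "h2"))   # extract the text of all <h2> headings
head(titles)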
• Summary:
• Web Scraping: Extracts data directly from web pages.
• APIs: Provide structured data directly from the source.
• Tools: Various libraries and tools make scraping easier.
• Ethics: Always follow website guidelines and scrape responsibly.
Example of web scraping via an API query
1.select * from flickr.photos.search
This means you want to get all the information (columns) from the flickr.photos.search table, which holds
details about photos on Flickr.
2.where text="Cat"
This is a filter that ensures only photos with the tag "Cat" will be included in the results.
3.and api_key="lksdjflskjdfsldkfj"
This part specifies the API key for authentication. It's needed to connect securely to the API and use it. The
provided key should be replaced with your valid key.
4.limit 10
This limits the number of results you get to 10. It's helpful if you don't need too many results for a test or a
preview.
Summary:
This query will return the first 10 photos on Flickr that have the word "Cat" tagged on them. The API key is
required for authentication.
Remember, replace the API key with your actual Flickr key to make the query work.
Laplace Smoothing (also known as Additive
Smoothing)
• Laplace smoothing is a technique used to handle the problem of zero
probability in statistical models, especially when calculating probabilities
for unseen events (like words in text). It helps to smooth out probabilities,
making them less extreme, and ensures that no event has zero probability,
even if it hasn't been observed in the data.
Laplace Smoothing (also known as Additive
Smoothing)
• Why is Laplace Smoothing Needed?
• In tasks like text classification (e.g., email spam detection), we may
encounter words in the test data (new data ) that did not appear in the
training data(old data ) .
• Without smoothing, these words would have a probability of zero, which
would make the whole model fail to classify emails correctly.
• Laplace smoothing adjusts the probabilities to avoid zeroes and gives
unseen events (words) a small, non-zero probability.
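The usual additive-smoothing formula (stated here for reference; the notation is ours) for the probability of a word w in a class c is:
P(w | c) = (count of w in class c + 1) / (total words in class c + V)
where V is the number of distinct words in the vocabulary. With this adjustment, a word that never appeared in class c gets the small probability 1 / (total words in class c + V) instead of zero.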
Steps of Laplace Smoothing: