
Module 3

• k-Nearest Neighbors (k-NN), k-means. One More Machine Learning Algorithm and Usage in Applications - Motivating application: Filtering Spam.

• Why Linear Regression and k-NN are poor choices for Filtering Spam, Naive Bayes and why it works for Filtering Spam, Data Wrangling: APIs and other tools for scraping the Web.
Differences between supervised and unsupervised learning:
the difference between k-means and k-NN
(Figure: data points fall into four groups - Group 1, Group 2, Group 3 and Group 4 - and a new data point arrives near Group 2.)
• There will be a total of 4 groups.
• New data has arrived, and we will assign it to Group 2.
The intuition behind k-NN (Idea of k-NN)
1. The main idea of k-NN is to find other items that are most similar to
the new item (based on their characteristics) and look at their labels. The
label that most similar items have is likely the label for the new item too.

2. Majority Vote : For the new item, take a “majority vote” from the
similar items’ labels. The label with the most votes is chosen. If there’s a
tie, pick one label randomly from the tied labels.
The intuition behind k-NN (Idea of k-NN)
3. Automating k-NN:
To do this automatically, you need to decide two things:
• Similarity Measure : First, define how you’ll measure similarity or closeness
between items (like age or income differences).

• Number of Neighbors (k): Then, decide how many neighbors (similar items) should
“vote” on the label.
• This number is called “k.”
• Once you have these, you can find the closest items (neighbors) for any new item,
and let their labels decide the label of the new item.
Example with credit scores
• Imagine you have information about people's ages, incomes, and whether their credit rating is "high" or "low." You want to use their age and income to predict the credit rating for a new person.
• Plot each person as a point on a graph, labeling each point with its credit rating (for example, a circle for "low" credit).
Identify the credit score of a new person
• Now, say a new person shows up who is 57 years old and makes
$37,000. What credit rating should they get?(low or high )
• By looking at people with similar ages and incomes nearby on the
graph, you can make a good guess.
• To do this more precisely, you can use an algorithm called k-nearest
neighbors (k-NN).
Consider the neighbouring data points (k = 1 or k = 3)

• Choose an odd value for k; with an even number of neighbours the vote can end in a tie.
• Take the majority vote of the k nearest neighbours' labels.
• Example with k = 3: two of the nearest neighbours are labelled "low" and one is labelled "high", so the majority vote is "low" and the new person's credit score is predicted as low.
process for using k-NN:
1. Choose a Similarity Measure: Decide how you’ll measure how close or similar
items are to each other.
2. Split the Data: Divide your labeled data into two parts: one for training (to teach
the model) and one for testing (to check how well it works).
3. Choose an Evaluation Metric: Pick a way to measure how well the model is
working. A common choice is the "misclassification rate," which tells you how
often the model makes mistakes.
4. Test Different k Values: Run k-NN multiple times with different values for k (the
number of neighbors) and see how each performs using the evaluation metric.
5. Find the Best k: Choose the k value that gives the best results based on the
evaluation metric.
6. Make Predictions: Now that you have the best k, use the training data to predict
labels for new data that only has ages and incomes, but no labels yet.
Other metrics to calculate distance
1.Euclidean Distance
Measures the "straight-line" distance between two points in space, like the shortest path between
two locations on a map.

2.Cosine Similarity
Measures how similar two things are by looking at the angle between them, rather than the
distance. Useful for comparing direction rather than size (e.g., how similar two sentences are).
3.Jaccard Distance or Similarity
Compares two sets by looking at what they have in common versus what they don’t. It’s the ratio of
shared items to total unique items.
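
A minimal R sketch of these three measures, using two made-up points and two made-up word sets (the values are illustrative only, not from the slides):

# Two example points (hypothetical values)
p <- c(2, 4)
q <- c(5, 1)

# Euclidean distance: straight-line distance between the points
euclidean <- sqrt(sum((p - q)^2))

# Cosine similarity: compares direction (angle), not magnitude
cosine <- sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))

# Jaccard similarity between two sets: shared items / total unique items
a <- c("ad", "credits", "won")
b <- c("ad", "meeting")
jaccard <- length(intersect(a, b)) / length(union(a, b))

print(c(euclidean = euclidean, cosine = cosine, jaccard = jaccard))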
4.Mahalanobis Distance
A distance measure that accounts for the spread and relationships between data, making it more accurate
when data isn’t evenly distributed.
5.Hamming Distance
•It measures how many positions are different between two strings or sequences of the same length.
•Simply count the number of mismatched characters.
Example:
Compare "olive" and "ocean":
• Both strings have 5 characters.
• The letters at each position:
  o = o (match, no difference)
  l ≠ c (1 difference)
  i ≠ e (1 difference)
  v ≠ a (1 difference)
  e ≠ n (1 difference)
• Total differences = 4.
• Use Cases:
• DNA sequence comparison.
• Detecting errors in transmitted data (e.g., bit differences in binary
code).
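
A small R sketch of this character-by-character count for the "olive"/"ocean" example (the helper name hamming is just for illustration):

# Hamming distance: number of mismatched positions between two equal-length strings
hamming <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))  # only defined for equal lengths
  sum(a != b)
}

hamming("olive", "ocean")  # 4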
6.Manhattan Distance
Measures the "grid-like" distance between two points, like walking blocks in a city rather than going
straight through buildings.
•Imagine you are walking in a city where you can only move along the streets (no cutting through buildings or diagonally).
•Manhattan Distance is the total number of blocks you'd walk to get from one point to another.
• How It Works:
1. Take two points (for example, two houses on a grid).
2. For each direction (up-down or left-right), find the difference between their positions.
3. Add up all these differences.

Formula: Manhattan distance = |x2 - x1| + |y2 - y1|

Example:
If one house is at (2, 4) and another is at (5, 1):
•Difference in x: |2 - 5| = 3
•Difference in y: |4 - 1| = 3
•Total Distance = 3 + 3 = 6
Why It's Useful:
It’s great for finding distances in places where you can’t move diagonally, like city streets, or when comparing data in a grid-like way.
k-Nearest Neighbors (k-NN): supervised algorithm
• Used to identify or classify groups

• Example 1 : Determining if an apple is a fruit or a vegetable

• Nearest neighbor
• Example 2 : We can ask the neighbors for their opinion, such as
determining if an apple is a fruit or a vegetable.
• We need to measure the distance to identify the nearest neighbors.
Algorithms

1.KNN
2.K – means
k-NN (supervised algorithm): steps
Example: to predict a movie's genre using the k-NN algorithm, we follow a series of steps.
1. Data collection: gather a dataset of movies with features such as duration, average rating and genre (already available, labeled data).
2. Predict the new movie's genre based on this previous information.
3. Find neighbours by calculating distances using the Euclidean distance.
4. Identify the nearest neighbours based on the k value.
Note: calculate the Euclidean distance d between points P = (x1, y1) and Q = (x2, y2):

d(P, Q) = √((x2 - x1)^2 + (y2 - y1)^2)
KNN : supervised algorithm
Predicting movie genre
predict the genre of “summer” movie with IMDB rating 7.4 and
duration 114 minutes

• Assume x axis value will be rating x2 = 7.4


• y axis value will be duration y2 = 114 min
Find neighbors by calculating distance by using Euclidean
distance and also predict new movie genre
• 1. Assume the first movie has rating x1 = 8.0 and duration y1 = 160 min
• 2. Assume the second movie has rating x1 = 6.2 and duration y1 = 170 min
• 3. Likewise, x1 = 7.2, y1 = 168 min
• 4. Likewise, x1 = 8.2, y1 = 155 min

• Distance to movie 1: d = √((7.4 - 8.0)^2 + (114 - 160)^2) ≈ 46.0


3.Find neighbors by calculating distance by using
Euclidean distance and also predict new movie
genre

Movie      Rating (x1)   Duration (y1)   Distance to "summer" (7.4, 114)
Movie 1    8.0           160 min         ≈ 46.0
Movie 2    6.2           170 min         ≈ 56.0
Movie 3    7.2           168 min         ≈ 54.0
Movie 4    8.2           155 min         ≈ 41.0
• Identify the nearest neighbour based on the k value.
• We check a single neighbour (k = 1) and three neighbours (k = 3).

• k = 1 (single neighbour): the lowest distance is 41.00, which belongs to the last movie, OMG 2, whose genre is comedy. So the "summer" movie's genre is predicted as comedy.
• k = 3: we check the 3 lowest distances: 41.00, 46.00 and 54.00.
• 41.00 corresponds to OMG 2, genre comedy.
• 46.00 corresponds to Mission Impossible, genre Action.
• 54.00 corresponds to Rocky, genre comedy.

• The majority vote is comedy, so the "summer" movie's genre is predicted as comedy.
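
A minimal R sketch of this worked example. The four training movies, their genres and the "summer" query point are taken from the slides above; Movie 2's genre is not given there, so it is left as "unknown" (a sketch of the majority vote, not a general k-NN implementation):

# Training movies from the slides: rating, duration, genre
movies <- data.frame(
  name     = c("Movie 1 (Mission Impossible)", "Movie 2", "Movie 3 (Rocky)", "Movie 4 (OMG 2)"),
  rating   = c(8.0, 6.2, 7.2, 8.2),
  duration = c(160, 170, 168, 155),
  genre    = c("Action", "unknown", "comedy", "comedy")   # Movie 2's genre not shown in the slides
)

# New movie "summer": rating 7.4, duration 114 min
new <- c(7.4, 114)

# Euclidean distance from "summer" to every training movie
movies$distance <- sqrt((movies$rating - new[1])^2 + (movies$duration - new[2])^2)

# k = 3: take the three nearest movies and do a majority vote on genre
k <- 3
nearest <- movies[order(movies$distance), ][1:k, ]
print(nearest[, c("name", "distance", "genre")])
names(which.max(table(nearest$genre)))   # "comedy"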


k-means (unsupervised)
•Supervised vs. Unsupervised Learning: In supervised learning, we know the "right answer" or label for
each data point and train our model to predict it.
•In unsupervised learning, like k-means, we don’t know the labels. Instead, the algorithm finds patterns or
groups within the data.

•Purpose of k-means:
•K-means is used for grouping or "clustering" data points based on their similarities. It helps us identify
groups of similar users (or data points) automatically.
•For example, marketing can use clustering to target specific groups with different ads or offers.

•How It Works:
•Each data point (user) is defined by features, like age, income, gender, etc.
•Instead of manually grouping users based on these features, k-means automatically finds these groups based
on how close data points are to each other.
•Why Use k-means:
•It’s useful when you want to offer personalized experiences, like different
ads for different user groups.
•Models can work better if tailored to specific groups.
•In complex data with many features, an algorithm can help find patterns
that would be hard to see manually.
•It can be run with simple code, such as in R with the command:
•kmeans(x, centers), where x is your dataset and centers is the number of clusters.
Basic Steps of k-means:

1. Pick a number, k, which is the number of clusters or groups you want.


2. Randomly place k "centroids" (center points for clusters) in the data space.
3. Assign each data point to the closest centroid.
4. Move each centroid to the average position of the data points assigned to
it.
5. Repeat the process of re-assigning data points and moving centroids until
the groupings are stable.
Challenges and strengths of k-means:
•Challenges with k-means:
•Choosing k: Deciding the number of clusters, k, can be tricky. It’s often based on trial
and error.
•Convergence Issues: Sometimes, the algorithm can get "stuck" without a clear solution,
repeating steps without finding stable groupings.
•Interpretability: Not all clusters found by k-means may make sense or be useful.

•Why it’s Popular:


•It’s a fast algorithm.
•Useful in many fields, like marketing and image processing.
•Can be easily implemented with simple code, such as in R with the command: kmeans(x, centers), where x is your dataset and centers is the number of clusters.
Explanation of Figure 3-9:

• Figure 3-9 shows an example of clustering in two dimensions.


1.Visual Representation: It highlights how data points naturally form
clusters when plotted.
2.Clustering Need: While identifying clusters is easy visually in two
dimensions, it becomes challenging in higher dimensions or with
larger datasets.
3.k-means Algorithm: This algorithm finds clusters automatically by
grouping similar data points based on their features.
Explanation of the R Code:
• The R function kmeans() is used to apply the k-means clustering
algorithm.

• kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
•x:
•Represents the dataset.
•It must be a matrix, where each column is a feature, and each row is a data point.

•centers:
•Specifies the number of clusters (k).
•Example: If you expect 3 clusters, set centers = 3.

•iter.max:
•The maximum number of iterations the algorithm will run to find the best clusters.
•Default is 10, but you can increase it if needed.
•nstart:
•Determines how many random starting positions to try for the centroids.
•Increasing this can improve results by finding better clusters.

•algorithm:
•Specifies the method for calculating the clusters. Options include:
•Hartigan-Wong (default): Fast and accurate for many cases.
•Lloyd: A simpler method, useful in some scenarios.
•Forgy: A basic approach, less commonly used.
•MacQueen: Often used for online clustering.
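
A short usage sketch with kmeans(): two obvious groups of synthetic 2-D points are generated and clustered with centers = 2 (the data are made up purely for illustration):

set.seed(42)   # reproducible random numbers

# Synthetic data: 20 points around (0, 0) and 20 points around (5, 5)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
colnames(x) <- c("feature1", "feature2")

fit <- kmeans(x, centers = 2, iter.max = 10, nstart = 5)

fit$centers   # centroid (mean of each feature) for each cluster
fit$cluster   # cluster assignment (1 or 2) for every row of x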
k-means :unsupervised algorithm

1. Unsupervised Learning: K-means is an unsupervised learning algorithm,


meaning it finds patterns in data without being given labeled examples.
2. Clusters: The algorithm groups data into clusters (groups) based on
their similarities.
3. K Value: You choose a number 'K' which represents the number of
clusters you want to find.
4. Initialization: It starts by randomly placing 'K' points (called
centroids) in the data space.
5. Assignment: Each data point is assigned to the nearest centroid,
forming clusters.
6. Update Centroids: The centroids are then moved to the average
position of all the points in their cluster.
7) Repeat: Steps 5 and 6 are repeated until the centroids no longer move much,
meaning the clusters are stable.
8) Result: The final positions of the centroids represent the center of the
clusters, and each data point belongs to the cluster of its nearest centroid.
9) Purpose: K-means helps to find natural groupings in data, which can be used
for various applications like market segmentation, image compression, and
anomaly detection.
Example: clustering customer data
• The age variable is denoted as X and the amount variable is denoted as Y.
Step 1: identify how many clusters we want to create, k = 2.

• Example: we want to create two clusters, K1 and K2.

• We have to assign the data to the clusters based on similarity.

Step 2 : find out the centroid and initialization
• Help to identify the clusters
• Initially, randomly assign each customer (such as c1, c2) to a cluster.

• For example :
• c1 might be assigned to cluster k1, and c2 might be assigned to cluster
k2.
Step 2: randomly assign centroid values

• Cluster K1: centroid C1 = (20, 500)   (X = age 20, Y = amount 500)
• Cluster K2: centroid C2 = (40, 1000)  (X = age 40, Y = amount 1000)

• C1 and C2 are initialized as the centroids; the remaining customers C3 to C7 must now be assigned.
STEP 3: identify the nearest centroid for the other customers (does each customer join cluster K1, centroid C1, or cluster K2, centroid C2?)

• Using the Euclidean distance formula, calculate the distance to each centroid and assign the customer to the nearer cluster.

• (x2, y2) = observed value (a customer, C3 to C7)
• (x1, y1) = centroid value (C1 or C2)
Step 3: find the nearest cluster for C3
Case 1: take K1 (centroid C1) as reference
• C3 = (30, 800)  (age x2, amount y2)
• C1 = (20, 500)  (age x1, amount y1)

• x2 - x1 = 30 - 20 = 10, 10^2 = 100
• y2 - y1 = 800 - 500 = 300, 300^2 = 90 000
• 100 + 90 000 = 90 100

K1 distance = √90 100 ≈ 300.17


Step 3: find the nearest cluster for C3 (K2 as reference)
Case 2: take K2 (centroid C2) as reference
• C3 = (30, 800)  (age, amount)
• C2 = (40, 1000) (age, amount)

• x2 - x1 = 30 - 40 = -10, (-10)^2 = 100
• y2 - y1 = 800 - 1000 = -200, (-200)^2 = 40 000
• 100 + 40 000 = 40 100

K2 distance = √40 100 ≈ 200.25
• So C3's nearest centroid is K2's (its distance is smaller), so add C3 to K2.

• K1 distance > K2 distance (300.17 > 200.25), so C3 joins K2.
Step 4: update the centroid value
• After assigning the customer to its nearest cluster, update that cluster's centroid.

• Earlier K2 contained only C2; now C3 has been added to K2, so calculate the mean of X and Y.

• x of C2 is 40 and x of C3 is 30
• y of C2 is 1000 and y of C3 is 800

• Mean of age variable X = (40 + 30) / 2 = 70 / 2 = 35
• Mean of amount variable Y = (1000 + 800) / 2 = 1800 / 2 = 900
• So the new K2 centroid is (35, 900).

(K2's centroid moves from (40, 1000) to (35, 900), with members C2 (40, 1000) and C3 (30, 800).)
Updated centroid values:

• K1: members {C1}, centroid (20, 500)
• K2: members {C2, C3}, centroid (35, 900)
Step 5: find the nearest cluster for C4
Case 1: take K1 (centroid C1) as reference
• C4 = (18, 300)  (age, amount)
• C1 = (20, 500)  (age, amount)

• x2 - x1 = 18 - 20 = -2, (-2)^2 = 4
• y2 - y1 = 300 - 500 = -200, (-200)^2 = 40 000
• 4 + 40 000 = 40 004

K1 distance = √40 004 ≈ 200.01
Step 5: find the nearest cluster for C4
Case 2: take K2 (its updated centroid) as reference
• C4 = (18, 300)  (age, amount)
• K2 centroid = (35, 900)  (age, amount)

• x2 - x1 = 18 - 35 = -17, (-17)^2 = 289
• y2 - y1 = 300 - 900 = -600, (-600)^2 = 360 000
• 289 + 360 000 = 360 289

K2 distance = √360 289 ≈ 600.24

• Distance from C4 to the K1 centroid is about 200
• Distance from C4 to the K2 centroid is about 600

• So K1 is nearer
• Assign C4 to K1
Step 5: the nearest cluster for C4 is K1, so add C4 to K1.

• K1: members {C1, C4}, centroid still (20, 500) until it is updated
• K2: members {C2, C3}, centroid (35, 900)
Step 6: update the centroid value
• After assigning the customer to its nearest cluster, update that cluster's centroid.

• Earlier K1 contained only C1; now C4 has been added to K1.

• C1 has age x = 20 and amount y = 500
• The newly added C4 has x = 18 and y = 300

• Mean of age variable X = (20 + 18) / 2 = 38 / 2 = 19
• Mean of amount variable Y = (500 + 300) / 2 = 800 / 2 = 400
• So the new K1 centroid is (19, 400).

(K1's centroid moves from (20, 500) to (19, 400), with members C1 (20, 500) and C4 (18, 300).)
Step 7: updated centroid values

• K1: members {C1, C4}, centroid (19, 400)
• K2: members {C2, C3}, centroid (35, 900)

• The same assign-and-update process continues for the remaining customers (C5 to C7) until the centroids stop moving.
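
A hedged R sketch of the same idea on made-up customer data. The first four customers match C1 (20, 500), C2 (40, 1000), C3 (30, 800) and C4 (18, 300) from the slides; C5 to C7 are invented so kmeans() has something more to cluster. One call to kmeans() will not reproduce the hand calculation step by step, but it finds the same kind of two-cluster structure:

# Hypothetical customer data: age (X) and spend amount (Y)
customers <- data.frame(
  age    = c(20, 40, 30, 18, 45, 22, 38),
  amount = c(500, 1000, 800, 300, 950, 450, 900)
)

set.seed(1)   # reproducible centroid initialisation
fit <- kmeans(as.matrix(customers), centers = 2, nstart = 10)

fit$centers   # final centroid (mean age, mean amount) of each cluster
fit$cluster   # which cluster each customer (C1 to C7) ends up in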
Explanation of the k-NN R Code Example
• The code demonstrates how to use the k-NN (k-Nearest Neighbors)
algorithm in R to classify data points into categories like "high credit"
or "low credit." Below is a detailed step-by-step explanation of the
process:
1.Prepare the Dataset

• 1:The dataset includes columns like Age, Income, and Credit (target labels:
"high" or "low").

> head(data)
  age income credit
1  69      3    low
2  66      3    low
3  49      4    low
4  49     26   high
5  58     57   high
6  44     71   high
2.Split the Data into Training and Testing Sets:
• Define the size of the dataset:

• n.points <- 1000

• The dataset has 1000 rows (data points).


3. Specify the sampling rate for training data:

• sampling.rate <- 0.8

• Use 80% of the data for training, leaving 20% for testing.
3.Calculate the number of test labels:

• num.test.set.labels <- n.points * (1 - sampling.rate)

• This gives the size of the test set (20% of 1000 = 200 points).

• num.test.set.labels <- 1000 * (1 - 0.8)


• # num.test.set.labels = 1000 * 0.2
• # num.test.set.labels = 200
3.Randomly select rows for training data
• training <- sample(1:n.points, sampling.rate * n.points, replace = FALSE)

• 1:n.points:This creates a list of numbers starting from 1 to the total number of


rows (n.points).
• Example: If n.points = 10, this would be: 1, 2, 3, 4, ..., 10.

• 2. sampling.rate * n.points:
• This calculates how many rows to pick for training.
• For instance, if sampling.rate = 0.8 (80%) and n.points = 10, it calculates:0.8 * 10
= 8.So, we need to pick 8 rows.
• sample() function:The sample() function randomly picks numbers
from the list (1:n.points).
• size (how many numbers to pick): This is sampling.rate * n.points (8
in our example).
• replace = FALSE: Ensures no number is picked more than once.
• Example: If 1:n.points = 1, 2, 3, ..., 10, sample() might pick:2, 5, 7, 1,
9, 3, 6, 8.
• Result:training will store these randomly selected numbers (row
indices).Example: training = c(2, 5, 7, 1, 9, 3, 6, 8).
4. Create the training set:
• train <- subset(data[training, ], select = c(Age, Income))

• data[training, ]:data is your entire dataset.


• training contains the row indices that you selected for training.
• data[training, ]: This selects all the rows from the dataset that are listed in the
training set.
• Example: If training = c(2, 4, 5, 6), this will select rows 2, 4, 5, and 6 from the
data dataset.
• select = c(Age, Income):This part tells R to only keep the Age and Income
columns from the rows selected in the previous step.It discards any other columns
in the dataset.
• So, you only get the Age and Income features for the rows that are part of the
training set.
4.Create the testing set:

• testing <- setdiff(1:n.points, training)


• test <- subset(data[testing, ], select = c(Age, Income))
• setdiff(1:n.points, training):1:n.points is a list of all row numbers
(from 1 to n.points).training contains the row numbers that were
selected for training.
• setdiff() finds the difference between the two lists: it returns all rows
that are not in training.
• These rows are for the testing set.
• Example: If n.points = 10 and training = c(2, 4, 5, 6),
• setdiff(1:10, c(2, 4, 5, 6)) will give: 1, 3, 7, 8, 9, 10.
• The testing set includes the remaining 20% of rows, with the same
features.
Extract training labels
• cl <- data$Credit[training]

• data$Credit:data is your full dataset, and Credit is a column in the dataset (usually
the target variable, which you're trying to predict, like credit score or label).
• [training]:training is a list of row numbers you selected earlier for your training
data.
• This selects the Credit values (target labels) for the rows that are in the training
set.
• cl <- ...:This stores the selected Credit values into a variable called cl.So, cl now
holds the labels (target values) for the training set.
• Example:If training = c(2, 4, 5, 6), cl will contain the Credit values for rows 2, 4,
5, and 6 from the Credit column in the dataset.
Extract testing labels

• true.labels <- data$Credit[testing]

• data$Credit: it refers to the Credit column in the dataset, which contains the target values
(like credit scores or labels).
• [testing]:testing is a list of row numbers selected for the test set (the rows not included in
the training set).
• This selects the Credit values (target labels) for the rows that are in the testing set.
• true.labels <- ...:This stores the selected Credit values into a variable called
true.labels.true.
• labels now holds the actual labels (target values) for the testing set.
• Example:
• If testing = c(1, 3, 7, 8), true.labels will contain the Credit values for rows 1, 3, 7, and 8
from the Credit column in the dataset.
Part 2
• RUN KNN algorithm
Run the k-NN Algorithm:

• The k-NN function in R is used to classify test points based on their


nearest neighbors.

• knn(train, test, cl, k)


• train: The training set (features like Age and Income).
• test: The test set or new data point(s) to classify.
• cl: The labels (e.g., "high" or "low") for the training set.
• k: The number of nearest neighbors to consider for classification.
Example - Test a New Data Point:
• test <- c(57, 37)
• knn(train, test, cl, k = 5)

• Input: A person aged 57 with an income of $37,000.


• Output: Based on the majority vote of 5 nearest neighbors, the person
is classified as low.
Loop to Find the Best k:
• for (k in 1:20)
•{
• predicted.labels <- knn(train, test, cl, k)
• num.incorrect.labels <- sum(predicted.labels != true.labels)
• misclassification.rate <- num.incorrect.labels / num.test.set.labels
• print(misclassification.rate)
•}
What is a Misclassification Rate?

• The misclassification rate tells us how many predictions are wrong


compared to the total number of test data points.
• It is calculated as: misclassification rate = (number of incorrect predictions) / (total number of test points).
• Dataset Setup:
• Training Data: This contains the features (e.g., Age, Income) and the
labels (e.g., "Low" or "High").
• Test Data: New data points where the true labels are known.
• This is used to evaluate how well the k-NN model works.
• When k = 1:
• For each test point, the k-NN algorithm looks at the 1 nearest
neighbor and uses its label as the prediction.
• Count how many predictions are wrong.
Choose the Best k:
• Based on the output, the best value of k is chosen as the one with the
lowest misclassification rate.

• k misclassification.rate
• 1, 0.28
• 3, 0.26
• 5, 0.23 <-- Lowest misclassification rate
• Final Prediction:
• With the best k (e.g., k = 5), the final classification for a new data point is made using knn(train, test, cl, k = 5).

• Output: low
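
Putting the pieces above together, a self-contained, hedged R sketch. The data here are simulated as a stand-in for the course's real credit dataset (the ages, incomes and labelling rule are invented), and column names follow the head(data) output shown earlier:

library(class)   # provides knn()

set.seed(7)
n.points <- 1000

# Simulated stand-in for the credit dataset: age, income, credit label
data <- data.frame(
  age    = sample(18:80, n.points, replace = TRUE),
  income = sample(1:100, n.points, replace = TRUE)
)
data$credit <- ifelse(data$income > 25, "high", "low")   # toy labelling rule

# 80/20 train/test split
sampling.rate <- 0.8
num.test.set.labels <- n.points * (1 - sampling.rate)
training <- sample(1:n.points, sampling.rate * n.points, replace = FALSE)
testing  <- setdiff(1:n.points, training)

train <- data[training, c("age", "income")]
test  <- data[testing,  c("age", "income")]
cl          <- factor(data$credit[training])
true.labels <- data$credit[testing]

# Try k = 1..20 and print the misclassification rate for each
for (k in 1:20) {
  predicted.labels <- knn(train, test, cl, k)
  misclassification.rate <- sum(predicted.labels != true.labels) / num.test.set.labels
  print(c(k = k, rate = misclassification.rate))
}

# Classify one new person (age 57, income 37) with the chosen k, e.g. k = 5
knn(train, data.frame(age = 57, income = 37), cl, k = 5)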
Note:
• To calculate the misclassification rate, we need to classify the new data point (57, 37) using different values of k (e.g., k = 1, 3, 5) and compare the predictions with the actual labels of the test data.
• New Data Point (Test Point): the new point is (57, 37), and we need to classify it. To calculate the misclassification rate, we assume there are multiple test points with known true labels. However, for simplicity, we'll compute the results assuming this is part of a larger dataset.
Step 1: Calculate Distances. We calculate the Euclidean distance between the new point (57, 37) and each point in the training data.
Outline
• Spam Filters
• Naive Bayes
• Wrangling
Cont….
• You might see that some of the text looks like spam.
• How did you realize this?
• Can you write code to automatically filter out the spam like your
brain does?
A few ideas about what might be clear signs of spam
• Keywords: Any email is spam if it contains not relevant references (e.g., "Ad
Credits," "You have Won").
• Problem: Spammers change the spelling to bypass this rule.
• Subject Length: The length of the subject line might indicate spam.
• Punctuation: Excessive use of exclamation points or other punctuation might
indicate spam.
• Problem: Some words like "Yahoo!" are authentic, so the rule shouldn't be too simplistic.
• Probabilistic Model: Instead of simple rules, use many rules that work together to calculate the probability of an email being spam. This is a great idea.
• Advanced Techniques: Techniques like k-nearest neighbors or linear regression,
which you learned about earlier, do not apply well to this problem.
Why Won’t Linear Regression Work for Filtering Spam?
• Dataset: Imagine a table where each row represents a different email message,
identified by an email ID.
• Features as Words: Each word in the email becomes a feature (a column in the
table).
• Example Feature: Create a column called "You_Have_Won".
• Binary Feature: If an email has the word "You_Have_Won" at least once, put a 1
in that column; otherwise, put a 0.
• Frequency Feature: Alternatively, you can put the number of times the word
"You_Have_Won" appears in the email.
• Columns as Words: Each column in the table represents the appearance of a
different word.
• Training Set: To use linear regression, we need a set of emails
where each email is labeled as spam or not spam.
• Human Labeling: One way to get these labels is by having people
manually mark each email as spam or not. This works but takes a lot
of time.
• Existing Spam Filter: Another way is to use labels from an existing
spam filter, like Gmail's spam filter, which already marks emails as
spam or not.
• Binary Target: Our target is binary (0 for not spam, 1 for spam). Linear
regression gives a number, not just 0 or 1.
• Model Issue: Linear regression is for continuous output, not binary. This isn’t
ideal.
• Model Fit: We should use a model appropriate for our data. But theoretically, we could still try to fit it in R.
• Variable Problem: We have too many variables (100,000 words) compared to
observations (10,000 emails). This won’t work.
• Feature Selection: With expert help, we could limit it to 100 important words.
But still, linear regression isn’t the right model for a binary outcome.
How About k-nearest Neighbors?
• Clustering Technique: K-means is a clustering technique used to group similar
data points together.
• Unsupervised Learning: It doesn't work well for spam filtering because it's an
unsupervised learning method, meaning it doesn't use labeled data (spam or not
spam) to train.
• No Clear Labels: Since spam filtering needs clear labels (spam or not spam) for
training, k-means doesn't fit because it doesn't have these labels.
• Centroid-based: K-means works by finding centroids (centers) of clusters, but
in spam filtering, there's no clear way to define centroids for spam and non-
spam emails.
Cont ……
• High Dimensionality: With 10,000 emails and 100,000 words, we have a lot of
dimensions to consider.
• Computational Burden: Working in a 100,000-dimensional space requires a ton
of computational work, especially when computing distances between points.
• Curse of Dimensionality: Even more fundamentally, in such high-dimensional
spaces, our nearest neighbors end up being very far away from each other.
• K-NN Problem: This distance issue makes the k-nearest neighbors (k-NN)
algorithm perform poorly in this scenario.
Note: Case 1: if we pick a ball from one of two bags, what's the chance (probability) it's red?

• Bag 1: 2 red, 3 black
• Bag 2: 4 red, 3 black

Step 1: bag probability
• We must first select one of the two bags:
• Event 1 (pick Bag 1): 1/2 chance
• Event 2 (pick Bag 2): 1/2 chance
Step 2: probability of red within each bag
• Probability of getting a red ball from Bag 1 (2 red out of 5 balls): 2/5
• Probability of getting a red ball from Bag 2 (4 red out of 7 balls): 4/7
• Total probability: if we pick a ball from one of the two bags, the chance it's red is:
• P(R) = P(Bag 1) × P(R | Bag 1) + P(Bag 2) × P(R | Bag 2)
• = 1/2 × 2/5 + 1/2 × 4/7 = 1/5 + 2/7 = 17/35

• Bayes' theorem:
• We need to determine the probability P(B1 | R), the likelihood that the red ball came from Bag 1 given that a red ball was drawn (a kind of reverse engineering: we identify which bag the red ball came from).
Problem :
• Bayes' Theorem to find the probability that the red ball was drawn
from Bag 1.
• Understanding the Problem:
• We have two bags:
• Bag 1: 2 red balls, 3 black balls = 5 balls
• Bag 2: 4 red balls, 3 black balls = 7 balls
• One ball is randomly drawn from one of these bags.
• The drawn ball is red.
• We need to find the probability that the red ball came from Bag 1.
The general formula for Bayes' theorem in probability is:

• P(Y | X) = P(X | Y) * P(Y) / P(X)

Applied to the bag example:
• Here Y = Bag 1 and X = red ball.
• P(B1 | R) = P(R | B1) × P(B1) / P(R)
• P(R | B1) = 2/5, P(B1) = 1/2, and P(R) = 17/35 (from the total probability above)
• P(B1 | R) = (2/5 × 1/2) / (17/35) = (1/5) × (35/17) = 7/17 ≈ 0.41
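
A quick R check of this bag example (plain arithmetic, no packages needed):

# Bag contents: Bag 1 has 2 red / 3 black, Bag 2 has 4 red / 3 black
p.bag1 <- 1/2;  p.red.given.bag1 <- 2/5
p.bag2 <- 1/2;  p.red.given.bag2 <- 4/7

# Total probability of drawing a red ball
p.red <- p.bag1 * p.red.given.bag1 + p.bag2 * p.red.given.bag2   # 17/35, about 0.486

# Bayes' theorem: probability the red ball came from Bag 1
p.bag1.given.red <- p.red.given.bag1 * p.bag1 / p.red            # 7/17, about 0.41
print(c(p.red = p.red, p.bag1.given.red = p.bag1.given.red))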
Multi-variable formula

• "Naive" indicates that the variables are assumed to be independent.

• The multi-variable formula is used to determine whether an email is spam (yes) or not (no).
• Case 1: YES
• To score the email as spam, the class is 'Yes' and the formula is applied as follows:

P(Y | X) = P(X | Y) * P(Y) / P(X)

• If there are multiple variables X1 ... Xn, the formula becomes:
• P(Y | X) = P(X1 | Y) * P(X2 | Y) * P(X3 | Y) * ... * P(Xn | Y) * P(Y) / (P(X1) * P(X2) * P(X3) * ... * P(Xn))


• Case 2: NO
• To score the email as not spam, the class is 'No' and the formula is applied as follows:

P(N | X) = P(X | N) * P(N) / P(X)

• If there are multiple variables X1 ... Xn, the formula becomes:
• P(N | X) = P(X1 | N) * P(X2 | N) * P(X3 | N) * ... * P(Xn | N) * P(N) / (P(X1) * P(X2) * P(X3) * ... * P(Xn))


• P(Y | X) ∝ P(X1 | Y) * P(X2 | Y) * ... * P(Xn | Y) * P(Y)
• P(N | X) ∝ P(X1 | N) * P(X2 | N) * ... * P(Xn | N) * P(N)

• We focus solely on the numerator and disregard the denominator, since it is the same constant for both classes.
• Based on which class has the higher value, classify the email as 'Yes' (spam) or 'No' (not spam) and assign it to the corresponding group.
Naive Bayes and why it works for Filtering Spam
• Supervised learning algorithm
• Based on Bayes' theorem
• Decision:
• The email is classified as spam ('Yes') or not spam ('No') based on which class has the highest calculated probability.

• Another example, hospital fever management:
• Helps hospitals manage patients with fever efficiently.
• Predicts whether a patient's fever is due to a specific cause.
• Uses symptoms like temperature, duration, and other factors to make predictions.
Example: based on the following information, predict whether a person has fever

Person   COVID (YES/NO)   FLU (YES/NO)   FEVER (YES/NO)
1        YES              NO             YES
2        NO               YES            YES
3        YES              YES            YES
4        NO               NO             NO
5        YES              NO             YES
6        NO               NO             YES
7        YES              NO             YES
8        YES              NO             NO
9        NO               YES            YES
10       NO               YES            NO
• Condition: the information provided about COVID-19 and the flu helps determine whether fever is present or not.
• YES formula (keeping only the numerator, as explained above):
• X1 = FLU, X2 = COVID, Y = fever YES

P(YES | FLU, COVID) ∝ P(FLU | YES) * P(COVID | YES) * P(YES)
• NO formula (again keeping only the numerator):
• X1 = FLU, X2 = COVID, N = fever NO

P(NO | FLU, COVID) ∝ P(FLU | NO) * P(COVID | NO) * P(NO)
STEP 1: prior probability
• Count how many rows have FEVER = YES and how many have FEVER = NO.

• P(fever = YES) = 7/10
• P(fever = NO) = 3/10
STEP 2: conditional probability (COVID = YES given FEVER = YES)
• Among the 7 rows with FEVER = YES, 4 have COVID = YES.
• P(COVID = YES | fever = YES) = 4/7
STEP 2 (continued): conditional probability (FLU = YES given FEVER = YES)
• Among the 7 rows with FEVER = YES, 3 have FLU = YES.
• P(FLU = YES | fever = YES) = 3/7
STEP 3: conditional probability (COVID = NO given FEVER = NO)
• Among the 3 rows with FEVER = NO, 2 have COVID = NO.
• P(COVID = NO | fever = NO) = 2/3
STEP 4: conditional probability (FLU = NO given FEVER = NO)
• Among the 3 rows with FEVER = NO, 2 have FLU = NO.
• P(FLU = NO | fever = NO) = 2/3
SUMMARY of conditional probabilities

         fever = YES   fever = NO
COVID    4/7           2/3
FLU      3/7           2/3
Step 5: calculation
Case 1: YES formula
X1 = FLU, X2 = COVID, Y = fever YES
P(fever = YES) = 7/10

P(YES | FLU, COVID) ∝ P(FLU | YES) * P(COVID | YES) * P(YES)
= 3/7 * 4/7 * 7/10
= 12/70 ≈ 0.17
Step 6: calculation
Case 2: NO formula
X1 = FLU, X2 = COVID, N = fever NO
P(fever = NO) = 3/10

P(NO | FLU, COVID) ∝ P(FLU | NO) * P(COVID | NO) * P(NO)
= 2/3 * 2/3 * 3/10
= 2/15 ≈ 0.13
• Probability of FEVER = YES > probability of FEVER = NO
• 0.17 > 0.13

• So classify this person into the "fever = YES" group.
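
A hedged R sketch of standard Naive Bayes scoring on the 10-row table above, for the observed evidence FLU = yes and COVID = yes. Note that for the fever = NO class it uses the likelihoods of that same evidence, P(FLU = yes | NO) and P(COVID = yes | NO), each 1/3 in the table, so its score comes out near 0.03 rather than the 0.13 computed above from the "NO given NO" probabilities; the final decision, fever = YES, is the same:

# The 10 rows of the table (TRUE = yes, FALSE = no)
fever.data <- data.frame(
  covid = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  flu   = c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE),
  fever = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Priors: P(fever = yes) = 7/10, P(fever = no) = 3/10
p.yes <- mean(fever.data$fever)
p.no  <- 1 - p.yes

# Likelihoods of the observed evidence (flu = yes, covid = yes) within each class
p.flu.given.yes   <- mean(fever.data$flu[fever.data$fever])     # 3/7
p.covid.given.yes <- mean(fever.data$covid[fever.data$fever])   # 4/7
p.flu.given.no    <- mean(fever.data$flu[!fever.data$fever])    # 1/3
p.covid.given.no  <- mean(fever.data$covid[!fever.data$fever])  # 1/3

# Unnormalised class scores (numerators only, as in the slides)
score.yes <- p.flu.given.yes * p.covid.given.yes * p.yes   # about 0.17
score.no  <- p.flu.given.no  * p.covid.given.no  * p.no    # about 0.03
if (score.yes > score.no) "fever = YES" else "fever = NO"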


Spam filtering by Naive Bayes
Explanation
• p(x|c) = probability of the word appearing in a group (spam or not spam)
• p(c) = probability of the group
• p(x) = probability of the word
• key Points:
• Naive Bayes is simple and effective for text classification.
• It assumes words are independent, which may not always be true in
real life, but it still works well.
• It's good for handling large datasets, like emails, with many features
(words).
Summary

• Goal: Predict whether an email is spam based on the words it contains.


• Key Idea: Use probabilities of words in spam and not spam emails.
• Simplification: Turn complex multiplications into simple additions using
logarithms.
• Weights: Assign importance to each word using precomputed values
(wj​).(priority )
• Bias: Use a constant (w0) to account for overall probabilities.
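
A small R sketch of that log trick under the usual Bernoulli Naive Bayes form, log p(x|c) = w0 + sum of xj * wj, with wj = log(theta_j / (1 - theta_j)) and w0 = sum of log(1 - theta_j). The three words, their per-class probabilities and the toy email are all assumed values, and the exact derivation in the course text may differ in detail:

# Hypothetical per-word probabilities of appearing in spam vs. not-spam emails
theta.spam <- c(won = 0.60, credits = 0.40, meeting = 0.01)   # P(word | spam), assumed
theta.ham  <- c(won = 0.05, credits = 0.03, meeting = 0.30)   # P(word | not spam), assumed
p.spam <- 0.29                                                # prior P(spam), from the later slides

# x_j = 1 if word j appears in the email, 0 otherwise (a toy email containing "won" and "credits")
x <- c(won = 1, credits = 1, meeting = 0)

# log p(x | c) = w0 + sum_j x_j * w_j : products become sums
log.lik <- function(x, theta) {
  w  <- log(theta / (1 - theta))   # per-word weight w_j (importance of word j)
  w0 <- sum(log(1 - theta))        # bias term w0
  w0 + sum(w * x)
}

score.spam <- log(p.spam)     + log.lik(x, theta.spam)
score.ham  <- log(1 - p.spam) + log.lik(x, theta.ham)
if (score.spam > score.ham) "spam" else "not spam"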
Spam filtering by Naive Bayes
• We will use a simple example to understand how to determine if an
email is spam based on the word "meeting.“

• Basic Statistics:
• Total Emails: 5,172
• Spam Emails: 1,500
• Ham (Not Spam) Emails: 3,672
Step 1: probabilities of spam and not spam

• P(spam) = 1500 / 5172 ≈ 0.29
• P(not spam) = 3672 / 5172 ≈ 0.70
Step 2: word analysis — count how many spam emails and how many not-spam emails contain the word "meeting".
Note: Bayes formula
P(Y | X) = P(X | Y) * P(Y) / P(X)

Y = spam, X = "meeting"
P(spam | "meeting") = P("meeting" | spam) × P(spam) / P("meeting")
Using Bayes' theorem
Step 3: probability of an email being spam when it contains the specific word "meeting"

(The slide works this out numerically; the result, used in the summary below, is P(spam | "meeting") ≈ 9.4%.)
Note: case 2: NOT-spam formula
P(N | X) = P(X | N) * P(N) / P(X)

N = not spam, X = "meeting"
P(not spam | "meeting") = P("meeting" | not spam) × P(not spam) / P("meeting")
• Number of ham emails containing "meeting": 153
• P("meeting" | not spam) = 153 / 3672 ≈ 0.041

• P(not spam) = (not-spam emails) / (total emails) = 3672 / 5172 ≈ 0.70

• P("meeting") ≈ 0.03 (the fraction of all emails containing "meeting")

• P(not spam | "meeting") = P("meeting" | not spam) × P(not spam) / P("meeting")
• = (0.041 × 0.70) / 0.03 ≈ 0.95 = 95%
Summary
• Case 1: P(spam | "meeting") ≈ 9.4%
• Case 2: P(not spam | "meeting") ≈ 95%

• So an email containing the word "meeting" is classified as not spam.
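
A short R sketch of this calculation. The total, spam, ham and ham-with-"meeting" counts are from the slides; the number of spam emails containing "meeting" is not visible in this extract, so an assumed value of 16 is used, which reproduces the roughly 9% figure above. With exact fractions the two posteriors come out near 9% and 91% (they sum to 1); the 95% above reflects rounding in the intermediate steps:

total.emails <- 5172
spam.emails  <- 1500
ham.emails   <- 3672

ham.meeting  <- 153   # ham emails containing "meeting" (from the slides)
spam.meeting <- 16    # spam emails containing "meeting" (ASSUMED; not shown in this extract)

p.spam <- spam.emails / total.emails                        # about 0.29
p.ham  <- ham.emails  / total.emails                        # about 0.71
p.meeting <- (spam.meeting + ham.meeting) / total.emails    # about 0.03

# Bayes' theorem for each class, given that the email contains "meeting"
p.spam.given.meeting <- (spam.meeting / spam.emails) * p.spam / p.meeting   # about 0.09
p.ham.given.meeting  <- (ham.meeting  / ham.emails)  * p.ham  / p.meeting   # about 0.91

c(spam = p.spam.given.meeting, not.spam = p.ham.given.meeting)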


Meaning of scraping:
• Extracting data from, or programmatically accessing, web pages
scraping the Web: APIs and Tools

• Web scraping is a way to gather information from websites. Here's a


simple guide to get started:
• 1. What is Web Scraping?
• Collecting Data: Automatically extracting information from web
pages.
• Uses: Gathering data for research, collecting content.
2.APIs (Application Programming Interfaces)

• Purpose: APIs let you access a website’s data directly without scraping the
actual web pages.
• How They Work: You send a request to the API, and it returns the data in a
structured format (like JSON).
• Benefits: Easier and more reliable than scraping web pages. The data is already in
a clean format.
• Examples: OpenWeather API for weather data.
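
A hedged R sketch of calling a JSON API. The URL, parameters and key below are placeholders (a real service such as the OpenWeather API has its own documented endpoint), but the request-then-parse pattern is the general one described above:

library(httr)       # for GET()
library(jsonlite)   # for fromJSON()

# Placeholder endpoint - replace with a real API's documented URL and your own key
url  <- "https://api.example.com/v1/weather"
resp <- GET(url, query = list(city = "Bengaluru", appid = "YOUR_API_KEY"))

if (status_code(resp) == 200) {
  data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  str(data)   # the API returns structured data (JSON), already in a clean format
}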
Web Scraping Tools and Libraries

• When to Use: When no API is available, or you need data from a page’s
content.
• Popular Tools:
• Beautiful Soup: A Python library for parsing HTML and XML documents.
Great for beginners.
• Scrapy: A more powerful Python framework for large-scale scraping projects.
• Selenium: Automates web browsers to interact with web pages and extract
data, useful for pages that need JavaScript to load.
• Summary:
• Web Scraping: Extracts data directly from web pages.
• APIs: Provide structured data directly from the source.
• Tools: Various libraries and tools make scraping easier.
• Ethics: Always follow website guidelines and scrape responsibly.
Example : of web scrapping
1.select * from flickr.photos.search
This means you want to get all the information (columns) from the flickr.photos.search table, which holds
details about photos on Flickr.
2.where text="Cat"
This is a filter that ensures only photos with the tag "Cat" will be included in the results.
3.and api_key="lksdjflskjdfsldkfj"
This part specifies the API key for authentication. It's needed to connect securely to the API and use it. The
provided key should be replaced with your valid key.
4.limit 10
This limits the number of results you get to 10. It's helpful if you don't need too many results for a test or a
preview.
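
Put together, the four pieces above form one query (the API key shown is the placeholder from the slides and must be replaced with your own):

select * from flickr.photos.search where text="Cat" and api_key="lksdjflskjdfsldkfj" limit 10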

Summary:
This query will return the first 10 photos on Flickr that have the word "Cat" tagged on them. The API key is
required for authentication.
Remember, replace the API key with your actual Flickr key to make the query work.
Laplace Smoothing (also known as Additive
Smoothing)
• Laplace smoothing is a technique used to handle the problem of zero
probability in statistical models, especially when calculating probabilities
for unseen events (like words in text). It helps to smooth out probabilities,
making them less extreme, and ensures that no event has zero probability,
even if it hasn't been observed in the data.
Laplace Smoothing (also known as Additive
Smoothing)
• Why is Laplace Smoothing Needed?
• In tasks like text classification (e.g., email spam detection), we may
encounter words in the test data (new data ) that did not appear in the
training data(old data ) .
• Without smoothing, these words would have a probability of zero, which
would make the whole model fail to classify emails correctly.
• Laplace smoothing adjusts the probabilities to avoid zeroes and gives
unseen events (words) a small, non-zero probability.
Steps of Laplace Smoothing:

1.Count the Frequency of Words:


1. In spam email classification, we first count how many times each word appears in
both spam and non-spam emails. These are called word frequencies.
2.Calculate Probabilities:
1. Normally, the probability of a word in spam or non-spam is calculated by dividing
the number of occurrences of that word by the total number of words in that class
(spam or non-spam).
3.Apply Laplace Smoothing:
1. To smooth the probabilities, we add a small constant (usually 1) to the count of each
word, then adjust the total count.
2. This way, even words that appear zero times in a class (group )will have a small
probability instead of zero.
Formula for Laplace Smoothing:

P(w | C) = (count(w, C) + 1) / (total words in C + V)

• P(w|C) = probability of word w given class C (spam or not spam).
• count(w, C) = count of how many times word w appears in class C.
• V = total number of unique words in the vocabulary (across all classes).
• total words in C = total number of words in class C (spam or not spam).
Example: Spam Email Classification
• Problem: we need to classify an email as spam or not spam using the Naive Bayes algorithm. A probability is calculated for each word in the email.
• Suppose the word "Hello" never appears in the spam training emails.
• Without smoothing, P(Hello | Spam) = 0, causing the email to never be classified as spam, even if other words strongly suggest it.
• With smoothing, the probabilities stay realistic, and the model works better for new or rare words.
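
A small R sketch of add-one smoothing with made-up word counts (the counts and vocabulary size are illustrative only):

# Hypothetical word counts inside the spam class
spam.counts <- c(won = 30, credits = 20, hello = 0)   # "hello" never seen in spam
total.spam.words <- 500                               # total words in the spam class (assumed)
V <- 1000                                             # vocabulary size across all classes (assumed)

# Unsmoothed probabilities: "hello" gets exactly 0
unsmoothed <- spam.counts / total.spam.words

# Laplace (add-one) smoothing: P(w | spam) = (count + 1) / (total + V)
smoothed <- (spam.counts + 1) / (total.spam.words + V)

rbind(unsmoothed, smoothed)   # "hello" now has a small non-zero probability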