Clustering

The document discusses clustering, particularly K-means clustering, as an unsupervised learning technique used to group unlabeled data into distinct clusters based on similarities. It highlights the challenges of evaluating clusters and the importance of domain knowledge in interpreting results. Additionally, it presents a consulting project scenario where clustering is applied to analyze hacker behavior based on various session data features.

Uploaded by

abhimanyu thakur

Clustering

Let’s learn something!


Python and Spark

● We’ve seen how to deal with labeled data, but what about unlabeled data?
● Often you’ll find yourself trying to create groups from data, instead of trying to predict classes or values.
● This sort of problem is known as clustering; you can think of it as an attempt to create labels.
● You input some unlabeled data, and the unsupervised learning algorithm returns possible clusters of the data.
● This means you have data that only contains features and you want to see if there are patterns in the data that would allow you to create groups or clusters.
● This is a key distinction from our previous supervised learning tasks, where we had historical labeled data.
● Now we will have unlabeled data, and attempt to “discover” possible labels, through clustering.
● By the nature of this problem, it can be difficult to evaluate the groups or clusters for “correctness”.
● A large part of being able to interpret the clusters assigned comes down to domain knowledge!
● Maybe you have some customer data and cluster the customers into distinct groups.
● It will be up to you to decide what the groups actually represent.
● Sometimes this is easy, sometimes it’s really hard!
● For example, you could cluster tumors into two groups, hoping to separate benign from malignant.
● But there is no guarantee that the clusters will fall along those lines; the algorithm will just split the data into the two most separable groups.
● Also, depending on the clustering algorithm, it may be up to you to decide beforehand how many clusters you expect to create!
● A lot of clustering problems have no 100% correct approach or answer; that is the nature of unsupervised learning!
● Let’s continue by discussing K-means clustering.
Reading Assignment

Chapter 10 of
An Introduction to Statistical Learning
by Gareth James, et al.
K Means Clustering

K Means Clustering is an unsupervised learning algorithm that attempts to group similar observations in your data into clusters.
So what does a typical clustering problem look like?
● Cluster Similar Documents
● Cluster Customers based on Features
● Market Segmentation
● Identify similar physical groups
● The overall goal is to divide data into distinct groups such that observations within each group are similar
The K Means Algorithm
● Choose a number of Clusters “K”
● Randomly assign each point to a cluster
● Until clusters stop changing, repeat the following:
○ For each cluster, compute the cluster centroid by taking the mean vector of points in the cluster
○ Assign each data point to the cluster for which the centroid is the closest
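The steps above can be sketched in plain Python. This is a minimal illustration rather than how PySpark implements K-means: the function names (`k_means`, `mean_vector`, `squared_distance`), the `seed` and `max_iter` parameters, and the rule of re-seeding an empty cluster with a random point are all assumptions made for this sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_vector(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) following the steps listed above."""
    rng = random.Random(seed)
    # Randomly assign each point to a cluster.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # For each cluster, compute the centroid as the mean vector.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(mean_vector(members) if members else rng.choice(points))
        # Assign each point to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # clusters stopped changing
            break
        labels = new_labels
    return labels, centroids


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

On this toy data the two groups end up in different clusters, although which group gets label 0 depends on the random start.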
Choosing a K Value
● There is no easy answer for choosing a “best” K value
● One way is the elbow method
First of all, compute the sum of squared error (SSE) for some values of k (for example 2, 4, 6, 8, etc.).
The SSE is defined as the sum of the squared distances between each member of a cluster and that cluster’s centroid, added up over all clusters.
If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because when the number of clusters increases, the clusters should be smaller, so distortion is also smaller.
The idea of the elbow method is to choose the k at which the SSE stops decreasing abruptly and begins to level off.
This produces an "elbow effect" in the graph, as you can see in the following picture:
[Figure: SSE plotted against k]

● PySpark by itself doesn’t support a plotting mechanism, but you could use collect() and then plot the results with matplotlib or other visualization libraries.
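As a rough sketch of the numbers you would collect and plot for the elbow method, the plain-Python snippet below computes the SSE for a few values of k. The helper `sse` and the toy points are assumptions for illustration, and the cluster labels are hand-written, hypothetical assignments standing in for the output of a real K-means run.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_vector(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = mean_vector(members)
        total += sum(squared_distance(p, centroid) for p in members)
    return total


# Toy data with two obvious groups.
points = [(1, 1), (2, 2), (3, 3), (11, 11), (12, 12), (13, 13)]

# Hypothetical assignments for k = 1, 2, 3 (not produced by a real model).
hypothetical_labels = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}

sse_by_k = [(k, sse(points, labels)) for k, labels in sorted(hypothetical_labels.items())]
for k, error in sse_by_k:
    print(k, error)
```

Here the SSE falls sharply from k=1 to k=2 and only slightly from k=2 to k=3, so the elbow sits at k=2. A list of (k, SSE) pairs like `sse_by_k` is the kind of small result you could hand to a plotting library.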
● But don’t take this as a strict rule when choosing a K value!
● A lot depends on the context of the exact situation (domain knowledge)
● We’ll try our best to get a feel for this with the examples and consulting projects!
K-Means Clustering
Documentation
Example
Let’s learn something!
● Let’s work through the documentation example for clustering.
● Pay close attention to how we don’t need the label column (which makes sense given clustering)
● The documentation’s example is a bit peculiar in its choice of data set, but we’ll explain it along the way.
● Hopefully our own custom code-along will clarify things further!
● Let’s get started!
K-Means Clustering
Code Along
● We’ll work through a real data set containing some data on three distinct seed types.
● Notebook: Clustering Code Along.ipynb
● For certain Machine Learning algorithms, it is a good idea to scale your data.
● Drops in model performance can occur with high-dimensional data, so we’ll practice scaling features using PySpark!
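One reason scaling matters for K-means is that it relies on distances, so a feature measured in large units can dominate the others. The sketch below shows the idea of standardizing each feature in plain Python; it is not PySpark's scaling API. The function name `standardize`, the use of the population standard deviation, and the handling of constant columns are assumptions for illustration.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    scaled_columns = []
    for column in columns:
        mu = mean(column)
        sigma = pstdev(column)  # assumption: population standard deviation
        if sigma == 0:
            # Assumption: a constant column carries no information, so map it to 0.
            scaled_columns.append([0.0 for _ in column])
        else:
            scaled_columns.append([(x - mu) / sigma for x in column])
    # Turn the scaled columns back into rows.
    return [tuple(values) for values in zip(*scaled_columns)]


# Toy rows: the second feature is 100 times larger than the first.
rows = [(1.0, 100.0), (2.0, 200.0), (3.0, 300.0)]
for row in standardize(rows):
    print(row)
```

After scaling, both toy features cover the same range, so neither dominates a distance calculation.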
● Remember, there won’t be any confusion matrix or classification test results.
● This is unsupervised learning, meaning we don’t have the original labels to actually perform some sort of test against!
● This is a common point of confusion for beginners: you can’t easily check how well your clustering algorithm performed. This is the difficulty of all unsupervised tasks!
● Let’s get started!
K-Means Clustering
Consulting Project
● You’re becoming world famous due to your machine learning skills!
● A technology start-up in California needs your help!
● It’s time for you to go to San Francisco to help out a tech startup!
● They’ve been recently hacked and need your help finding out about the hackers!
● Luckily their forensic engineers have grabbed valuable data about the hacks, including information like session time, locations, WPM typing speed, etc.
● The forensic engineer relates to you what she has been able to figure out so far. She has been able to grab metadata for each session that the hackers used to connect to their servers.
● These are the features of the data...
● 'Session_Connection_Time': How long the session lasted in minutes
● 'Bytes Transferred': Number of MB transferred during session
● 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
● 'Servers_Corrupted': Number of servers corrupted during the attack
● 'Pages_Corrupted': Number of pages illegally accessed
● 'Location': Location attack came from (Probably useless because the hackers used VPNs)
● 'WPM_Typing_Speed': Their estimated typing speed based on session logs.
● The technology firm has three potential hackers who may have perpetrated the attack.
● They are certain of the first two hackers, but they aren't very sure whether the third hacker was involved.
● They have requested your help!
● Can you help figure out whether or not the third suspect had anything to do with the attacks, or was it just two hackers?
● It's probably not possible to know for sure, but maybe what you've just learned about Clustering can help!
● One last key fact: the forensic engineer knows that the hackers trade off attacks.
● This means they should each have roughly the same number of attacks.
● For example, if there were 100 total attacks, then in a two-hacker situation each should have about 50 hacks; in a three-hacker situation each would have about 33 hacks.
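The even-split reasoning above can be turned into a simple check on cluster sizes. The sketch below is one possible way to do that in plain Python, not the project's solution: the function names `cluster_sizes` and `max_relative_deviation` are hypothetical, and the label lists are invented toy values that say nothing about the real data.

```python
from collections import Counter


def cluster_sizes(labels, k):
    """Number of points assigned to each of the k clusters."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(k)]


def max_relative_deviation(labels, k):
    """Largest gap between a cluster's size and an even split, as a fraction."""
    expected = len(labels) / k
    return max(abs(size - expected) for size in cluster_sizes(labels, k)) / expected


# Invented cluster assignments, for illustration only.
even_labels = [0, 1, 0, 1, 0, 1]
uneven_labels = [0, 0, 0, 1, 2, 0]

print(cluster_sizes(even_labels, 2), max_relative_deviation(even_labels, 2))
print(cluster_sizes(uneven_labels, 3), max_relative_deviation(uneven_labels, 3))
```

A value near 0 means the clusters are close to equal in size. Comparing this value across candidate values of K is one way to apply the engineer's hint.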
● The engineer believes this is the key element to solving this, but doesn't know how to separate this unlabeled data into groups of hackers.
● Best of luck with this project. It should be a fun one!
● If you get stuck, feel free to go straight to the solution lecture.
● Enjoy!
K-Means Clustering
Consulting Project
Solutions
