AppliedML-Chap1-Clustering
Deep Learning: An Intro
Convert Celsius to Fahrenheit?
Easy
But wait a minute — what if you don’t know what the function is?
https://colab.research.google.com/drive/1dClkcffoJBaFwRJ37wOvaDQCSeIiC61O
ML Tasks: Clustering
In this chapter we will look at applications of clustering, an unsupervised ML activity.
Identify the normative grouping of the data. What are the affinities in this dataset?
Example 1:
Cluster 1: == 4 years
Cluster 2: < 4 years
Cluster 3: 4–6 years
Example 2:
Cluster 1: people with prior ML exposure
Cluster 2: people with no prior ML exposure
Example 3, Marketing:
Demographics: urban, rural
Soccer moms who drive SUVs
Retired men living in urban high-rises
How can I group my data? In what meaningful ways can I group my data?
Introduction
Clustering is a fundamental technique in data science and machine learning, applicable to a
wide range of domains for uncovering hidden patterns in data. It's primarily used for exploratory
data analysis, pre-processing steps, and as a part of complex data processing pipelines. Below,
we explore various use cases for clustering, followed by tips, best practices, and a comparison
table to guide the selection of appropriate algorithms.
1. Customer Segmentation: Businesses can identify distinct groups within their customer base
to tailor marketing strategies, optimize product offerings, and improve customer service.
Clustering helps in understanding customer behavior, preferences, and demographic
characteristics.
2. Anomaly Detection: By clustering similar data points together, outliers or anomalies become
more apparent, aiding in fraud detection, system health monitoring, and detecting unusual
behavior in network traffic or transactions.
3. Image Segmentation: In computer vision, clustering algorithms can segment images into
constituent parts or objects, useful in medical imaging, autonomous driving, and image
compression.
4. Genomic Data Analysis: In bioinformatics, clustering is used to group genes with similar
expression patterns, which can indicate co-regulated genes or genes that contribute to similar
functions or diseases.
1. K-Means Clustering
Description: K-Means is a centroid-based algorithm that partitions the dataset into K distinct,
non-overlapping subsets (clusters). It assigns each data point to the nearest centroid while
keeping the clusters as compact as possible (i.e., minimizing the within-cluster sum of squared
distances to the centroids).
Real-world Example: Customer segmentation for targeted marketing. Businesses can use
K-Means to segment their customers based on purchase history, behavior, and preferences to
tailor marketing strategies.
import numpy as np
from sklearn.cluster import KMeans

# Sample dataset
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Applying KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
Detailed Discussion
K-means is an unsupervised machine learning algorithm that partitions a set of points into K
clusters based on their proximity to the centroids of the clusters. The algorithm starts with
randomly initialized centroids and iteratively refines the position of the centroids by computing
the mean of the points in each cluster and reassigning the points to the closest centroid. The
algorithm terminates when the centroids stop changing.
How do we determine k?
See: How to Determine the Optimal K for K-Means? | by Khyati Mahendru | Analytics Vidhya | Medium
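A common heuristic (not taken from the article above) is the elbow method: fit K-Means for a range of candidate K values, record the within-cluster sum of squares (the inertia_ attribute in scikit-learn), and pick the K at which the curve bends. A minimal sketch on random 2-D data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)

# Fit K-Means for K = 1..10 and record the within-cluster SSE (inertia).
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 11)]

# The "elbow" of this curve suggests a reasonable K.
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Within-cluster SSE (inertia)')
plt.show()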
2. Hierarchical Clustering
Description: Hierarchical clustering builds a nested hierarchy of clusters, either bottom-up
(agglomerative) or top-down (divisive), and does not require the number of clusters to be
specified in advance.
Real-world Example: Gene sequence analysis, where hierarchical clustering can be used to find
groups of genes with similar expression patterns.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample dataset
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Computing the linkage matrix (Ward linkage used here) and plotting the dendrogram
Z = linkage(X, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.show()
Detailed Discussion
Hierarchical clustering is a type of unsupervised machine learning method (recall: supervised
learning learns a function F from labeled pairs X, Y; unsupervised learning has no labels Y) used
for cluster analysis. It seeks to build a hierarchy of clusters by creating nested clusters that
successively divide the original data into subsets. It is mainly divided into two categories:
Agglomerative (bottom-up) and Divisive (top-down) Hierarchical Clustering.
Agglomerative Hierarchical Clustering starts with individual data points as separate clusters and
merges them into larger clusters, repeating this process until a stopping criterion is met. The
stopping criterion can be the desired number of clusters or a threshold for the similarity metric
used to determine whether two clusters should be merged.
Divisive Hierarchical Clustering, on the other hand, starts with all the data in one big cluster
and splits it into smaller clusters, repeating this process until a stopping criterion is met.
The similarity metric used in hierarchical clustering can be Euclidean distance, Manhattan
distance, cosine similarity, or any other suitable distance metric. The choice of the metric
depends on the nature of the data and the clustering problem at hand.
The results of hierarchical clustering are usually visualized using a dendrogram, which shows
the hierarchy of clusters and the distances between them. The dendrogram helps to choose the
appropriate number of clusters by looking at the height of the vertical lines connecting two
clusters. The number of clusters can be determined by cutting the dendrogram at a certain
height, such that all clusters below that height are merged into a single cluster.
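For example, a minimal sketch of cutting a dendrogram at a chosen height with SciPy (the cut height of 5.0 here is arbitrary and would normally be read off the dendrogram):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Agglomerative clustering (Ward linkage), then cut the dendrogram at height 5.0
Z = linkage(X, method='ward')
labels = fcluster(Z, t=5.0, criterion='distance')
print(labels)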
Hierarchical clustering is a flexible method that can handle non-linear relationships between
data points and does not require the number of clusters to be specified in advance. However, it
is computationally expensive for large datasets and can also be sensitive to the choice of
similarity metric.
Hierarchical clustering is a useful method for exploratory data analysis and can be applied in
various domains, including biology, marketing, and image analysis.
3. DBSCAN
Description: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups
together closely packed points, marking as outliers points that lie alone in low-density regions.
It can find arbitrarily shaped clusters and doesn't require the number of clusters to be specified
in advance.
import numpy as np
from sklearn.cluster import DBSCAN

# Sample dataset (completed; the original listing was truncated)
X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

# Applying DBSCAN
db = DBSCAN(eps=3, min_samples=2).fit(X)
[Source]
[Note on fractal dimensions: the box-counting method is only approximate; a GMM provides a more
exact and powerful means of measurement.]
Note: we will start with the distance between two images (intuition), but generalize to two
datasets (arrays) or clusters.
4. Gaussian Mixture Models (GMM)
Description: GMM is a probabilistic model that assumes all the data points are generated from a
mixture of a finite number of Gaussian distributions with unknown parameters.
Real-world Example: Image segmentation where GMM can be used to identify and separate
different objects in an image based on color intensity.
import numpy as np
from sklearn.mixture import GaussianMixture

# Sample dataset
X = np.random.rand(300, 2)

# Applying GMM
gmm = GaussianMixture(n_components=2).fit(X)
labels = gmm.predict(X)
5. Spectral Clustering
Description: Spectral clustering builds a similarity (affinity) graph over the data and uses the
eigenvectors of that graph to embed the points before clustering them, which makes it effective
for non-convex, graph-structured clusters.
Real-world Example: Social network analysis, where spectral clustering can be used to detect
communities based on the pattern of connections between individuals.
import numpy as np

# Sample dataset
X = np.random.rand(100, 2)
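The original snippet stops at the sample data; a minimal, self-contained sketch of actually applying scikit-learn's SpectralClustering to it (the parameter choices here are illustrative):

import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.rand(100, 2)

# Build a nearest-neighbor affinity graph and cluster its spectral embedding.
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, assign_labels='kmeans', random_state=0)
labels = sc.fit_predict(X)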
[Comparison table of clustering algorithms (data shape, scalability, noise handling, parameters to tune) not reproduced here.]
When selecting a clustering algorithm, consider the data structure, the scale of your dataset,
and the specific requirements of your application. For example, K-Means is efficient for large
datasets with well-separated clusters, while DBSCAN is better suited for datasets with noise and
clusters of varying shapes and densities. Hierarchical clustering is ideal when the number of
clusters is not known in advance, allowing for a detailed dendrogram to analyze cluster
formation. GMM provides flexibility with overlapping clusters through a probabilistic approach,
whereas Spectral Clustering excels with non-convex clusters that are interconnected, making it
a great choice for graph-based clustering.
Advanced Clustering
Recursive Clustering
See: Stock Picks using K-Means Clustering | by Timothy Ong | uptick-blog | Medium
Narrative: Clustering is often not used, or not done properly, because people may not know how
to use clustering to solve practical problems beyond basic segmentation.
Fractal Clustering
The objective is to find a hyper-personalized golden cluster. (Ask yourself what the reason for
the clustering is; often it is to pick an optimal subset of the data points.)
In this section we will gently move into the notion of Fractal Clustering (FC).
Let's define FC as the use of Objective Functions and Recursive Clustering [optionally using
Fractal Distance instead of the common Euclidean Distance] to find the Golden Cluster that best
fits the joint objective functions.
K-means → apply recursively based on some criteria (objective functions), e.g., return above a threshold and volatility below a cap; see the sketch below.
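A minimal sketch of this idea on synthetic stock-like data; the thresholds, the column layout (column 0 = return, column 1 = volatility), and the function name are all hypothetical:

import numpy as np
from sklearn.cluster import KMeans

def recursive_kmeans(X, min_return=0.05, max_vol=0.2, k=3, depth=0, max_depth=3):
    # Recursively re-cluster any cluster that fails the objective functions,
    # until the objectives are met or the recursion depth is exhausted.
    if depth >= max_depth or len(X) < k:
        return [X]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    clusters = []
    for j in range(k):
        C = X[labels == j]
        if len(C) == 0:
            continue
        # Objective functions: mean return high enough, mean volatility low enough.
        if C[:, 0].mean() >= min_return and C[:, 1].mean() <= max_vol:
            clusters.append(C)  # candidate "golden" cluster
        else:
            clusters.extend(recursive_kmeans(C, min_return, max_vol, k, depth + 1, max_depth))
    return clusters

# Toy data: (return, volatility) for 300 hypothetical stocks.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0.06, 0.05, 300), rng.normal(0.25, 0.10, 300)])
print(len(recursive_kmeans(X)), "clusters after recursive refinement")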
Fractal Distance
Clusters can come in various shapes [see image 1]. The fractal distance is a measure of the
similarity between two images, shapes, or distributions (and, by extension, between two clusters).
It is based on the idea that images/distributions with similar patterns will have a small fractal
distance (i.e., they are more similar), while images with dissimilar patterns will have a large
fractal distance.
[source] Which algorithm? → What is your data pattern? See the comparison table above.
Which is the best algorithm for my dataset? It will depend on the distribution and shape of my
data, and on how well the data can satisfy the objective functions that lead to the selection of a
golden cluster.
Recursive clustering → fractal clustering (adding the notion of shape/distribution similarity,
i.e., a distance, with fractal distance replacing Euclidean distance).
Let's first explore the notion of distributions to better understand fractal distance, which is
all about the similarity of distributions or shapes.
A first, block-based version of the fractal distance between two images can be sketched as follows:

import numpy as np

def fractal_distance(image1, image2):
    """
    Args:
        image1: A numpy array of the first image.
        image2: A numpy array of the second image.

    Returns:
        A float representing the fractal distance between the two images.
    """
    # Divide each image into small blocks (here, horizontal bands of ~32 rows).
    block_size = 32
    image1_blocks = np.array_split(image1, image1.shape[0] // block_size)
    image2_blocks = np.array_split(image2, image2.shape[0] // block_size)

    # For each block in the first image, find the block in the second image that is
    # most similar to it. `fractal_dimension` is the boundary fractal-dimension
    # estimator discussed in this chapter (assumed to be defined elsewhere).
    similarities = []
    for i in range(len(image1_blocks)):
        similarity = 0
        for j in range(len(image2_blocks)):
            similarity = max(similarity,
                             fractal_dimension(image1_blocks[i], image2_blocks[j]))
        similarities.append(similarity)

    # The fractal distance is then calculated by averaging the similarities between
    # all of the blocks in the first image and the blocks in the second image.
    return np.mean(similarities)
import numpy as np

def fractal_path_length(num_boxes_with_points, x1, y1, x2, y2):
    # (Hypothetical name and signature: the box-counting setup that computes
    # `num_boxes_with_points` did not survive in this copy.)
    # Calculate the fractal dimension and use it to estimate the length of the path.
    fractal_dim = np.log(num_boxes_with_points) / np.log(2)
    fractal_length = (2 ** fractal_dim) * ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
    return fractal_length
import numpy as np
import pandas as pd
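Only the imports of the original code cell survive here. The following is a minimal sketch consistent with the description below, with the file name and column names taken from that description; it uses scikit-learn's standard Euclidean K-Means, since KMeans does not accept a custom metric:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Load the real-estate data (assumed columns: latitude, longitude, price).
df = pd.read_csv("real_estate_data.csv")
features = df[["latitude", "longitude", "price"]].to_numpy()

# K-means with 3 clusters, random centroid initialization, full-batch (Lloyd) updates.
kmeans = KMeans(n_clusters=3, init="random", n_init=10, algorithm="lloyd", random_state=0)
df["cluster"] = kmeans.fit_predict(features)

# Summarize each cluster: mean, std, min, and max of the three columns.
summary = df.groupby("cluster")[["latitude", "longitude", "price"]].agg(
    ["mean", "std", "min", "max"])
print(summary)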
This code assumes that the CSV file is named real_estate_data.csv and is located in the same
directory as the Python script. It also assumes that the CSV file has columns named latitude,
longitude, and price.
In the code above, we used the KMeans class from the sklearn.cluster module to perform
k-means clustering. We specified the number of clusters as 3 and set the init parameter to
'random' to use random initialization of the cluster centroids. We also set the algorithm
parameter to the full-batch Lloyd K-means algorithm (historically spelled 'full', now 'lloyd').
The original text additionally mentions a metric parameter set to fractal_distance; scikit-learn's
KMeans does not accept a custom distance metric, however, so using the fractal distance requires
the hand-rolled K-means loop developed later in this chapter.
The output of the code is a summary of the clusters, showing the mean, standard deviation,
minimum, and maximum values of the latitude, longitude, and price columns for each cluster.
You can modify this code to output the results in a different format or to use a different number
of clusters.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances
Now, we need some data to exercise the function for fractal k-means:
import numpy as np
import pandas as pd
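The data cell itself did not survive in this copy; a purely hypothetical stand-in, using small 2-D arrays since the fractal distance compares arrays/images, could be:

import numpy as np

# Hypothetical stand-in data: 60 small "images" (64x64 arrays) from two textures.
rng = np.random.default_rng(0)
smooth_patches = [rng.normal(0.5, 0.05, size=(64, 64)) for _ in range(30)]
noisy_patches = [rng.uniform(0.0, 1.0, size=(64, 64)) for _ in range(30)]
data = smooth_patches + noisy_patches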
import numpy as np
import matplotlib.pyplot as plt

# An alternative, simplified fractal distance that works directly from a
# fractal-dimension estimate:
def fractal_distance(image1, image2):
    """
    Args:
        image1: A numpy array of the first image.
        image2: A numpy array of the second image.

    Returns:
        A float representing the fractal distance between the two images.
    """
    # Calculate the fractal dimension of the boundary between the two images.
    # (`image.fractal_dimension` is a placeholder for a fractal-dimension estimator,
    # e.g. the box-counting or Higuchi estimators discussed in this chapter.)
    fractal_dimension = image.fractal_dimension(image1, image2)
    # The fractal distance is then calculated as the log of that dimension
    # divided by the log of the mean of the first image.
    return np.log(fractal_dimension) / np.log(np.mean(image1))

if __name__ == "__main__":
    # Load the images (plt.imread handles PNG files; np.load only reads .npy arrays).
    image1 = plt.imread("image1.png")
    image2 = plt.imread("image2.png")
    print(fractal_distance(image1, image2))
Output: 0.99999994
Note that the fractal distance between the two images is very close to 1, which indicates that
they are very similar.
To compare the similarity of two images, we can use the fractal distance function to calculate the
distance between the two images. The fractal distance is a measure of the similarity between two
images. It is based on the idea that images with similar patterns will have a small fractal distance,
while images with dissimilar patterns will have a large fractal distance.
The fractal distance is calculated by first dividing each image into small blocks. Then, for each block
in the first image, the algorithm finds the block in the second image that is most similar to it. The
similarity between two blocks is measured by the fractal dimension of the boundary between them.
The fractal dimension is a measure of the complexity of the boundary between two blocks. A high
fractal dimension indicates a complex boundary, while a low fractal dimension indicates a simple
boundary.
The fractal distance is then calculated by averaging the similarities between all of the blocks in the
first image and the blocks in the second image.
This algorithm has been shown to be effective at measuring the similarity between images. It is
particularly useful for comparing images of natural scenes, which often have complex patterns.
To use the fractal distance function to compare the similarity of two images, we can simply pass the
two images to the function. The function will then calculate the distance between the two images
and return it.
For example, to compare the similarity of the two images in the previous example, we can use the
following code:
distance = fractal_distance(image1, image2)
print(distance)
This code will print the distance between the two images, which is 0.99999994. This indicates that
the two images are very similar.
Recall that Euclidean distance is a measure of the distance between two points in a Euclidean
space. It is calculated by taking the square root of the sum of the squares of the differences
between the coordinates of the two points.
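For instance, for two points p and q this is a one-liner in NumPy:

import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
euclidean = np.sqrt(np.sum((p - q) ** 2))  # same as np.linalg.norm(p - q) -> 5.0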
The fractal distance is a measure of the similarity between two images. It is based on the idea
that images with similar patterns will have a small fractal distance, while images with dissimilar
patterns will have a large fractal distance.
To use the fractal distance to replace the Euclidean distance in the computation of k-means
clustering, we can simply replace the Euclidean distance with the fractal distance in the k-means
algorithm.
The k-means algorithm is a clustering algorithm that clusters a set of data points into k
clusters. The algorithm works by first initializing k centroids, which are points that are used to
represent the clusters. Then, the algorithm iterates until the centroids converge. In each
iteration, the algorithm assigns each data point to the cluster that is closest to it.
To replace the Euclidean distance with the fractal distance in the k-means algorithm, we can
simply replace the Euclidean distance with the fractal distance in the distance function that is
used to calculate the distance between a data point and a centroid.
For example, the following is the code for the k-means algorithm that uses the Euclidean
distance:
"""
Args:
Returns:
A numpy array of labels, where each label is an integer in the range [0, k).
"""
for i in range(100):
# Calculate the distance of each data point from each cluster center.
# Assign each data point to the cluster with the nearest center.
# Update the cluster centers to be the mean of the data points in each cluster.
centroids = data[labels].mean(axis=0)
return labels
A fuller implementation, shown below, takes in a dataset (X), the number of clusters (k), and the
initial centroids (centroids). If the initial centroids are not provided, they are chosen at
random from the dataset. The algorithm then iterates over the dataset, calculating the distance between
each data point and the centroids, and assigning each data point to the cluster with the
closest centroid. The centroids are then updated to be the mean of the data points in
each cluster. This process is repeated until the centroids no longer change, or until the
maximum number of iterations is reached.
It then returns the cluster assignments and the final centroids. The cluster assignments
can be used to label the data points, and the centroids can be used to summarize the
clusters.
import numpy as np

def kmeans(X, k, centroids=None, iterations=100):
    # (Function name and signature reconstructed from the description above.)
    if centroids is None:
        # Choose k initial centroids at random from the dataset.
        centroids = X[np.random.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iterations):
        # Distance of every point to every centroid; assign to the nearest one.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = np.argmin(distances, axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # Stop early once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assignments, centroids
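A quick usage example of the kmeans function reconstructed above, on toy data (explicit initial centroids make the run deterministic; output values in the comments are what one would expect, not captured output):

import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

labels, centers = kmeans(X, k=2, centroids=np.array([[1.0, 2.0], [10.0, 2.0]]))
print(labels)   # [0 0 0 1 1 1]
print(centers)  # roughly [[1, 2], [10, 2]]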
import numpy as np

# Compute the fractal dimension of a 1-D series using the Higuchi method
def higuchi_fd(X, k_max=10):
    L = []
    x = np.array(X, dtype=float)
    N = len(x)
    for k in range(1, k_max):
        Lk = []
        for m in range(0, k):
            Lmk = 0.0
            for i in range(1, int(np.floor((N - m) / k))):
                Lmk += abs(x[m + i * k] - x[m + i * k - k])
            # Normalize the curve length for this offset m and scale k.
            Lmk = Lmk * (N - 1) / (k * np.floor((N - m) / k)) / k
            Lk.append(Lmk)
        # Average curve length at scale k, on a log scale.
        L.append(np.log(np.mean(Lk)))
    # The fractal dimension is the slope of log L(k) versus log(1/k).
    fd = np.polyfit(np.log(1.0 / np.arange(1, k_max)), L, 1)
    return fd[0]
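The cell that used this estimator for clustering did not survive extraction; a minimal reconstruction consistent with the description that follows (five hypothetical centroids in fractal-dimension space, viridis colormap) might look like:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 200 short series drawn from two regimes (random walk vs. noise).
rng = np.random.default_rng(0)
series = [rng.normal(0, 1, 128).cumsum() if i % 2 == 0 else rng.normal(0, 1, 128)
          for i in range(200)]

# Fractal dimension of each series, using higuchi_fd defined above.
fds = np.array([higuchi_fd(s) for s in series])

# Five hypothetical centroids in fractal-dimension space.
centroids = [1.0, 1.2, 1.4, 1.6, 1.8]

# Assign each series to the nearest centroid.
labels = np.array([int(np.argmin([abs(fd - c) for c in centroids])) for fd in fds])

# Scatter plot of the fractal dimensions, colored by cluster (viridis colormap).
plt.scatter(np.arange(len(fds)), fds, c=labels, cmap="viridis")
plt.xlabel("series index")
plt.ylabel("Higuchi fractal dimension")
plt.show()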
This code defines centroids as a list of five centroids and uses it to assign data points to clusters
based on their fractal dimension. The resulting clusters are plotted using a scatter plot, with
each cluster being assigned a different color according to the viridis colormap.
You can adjust the number and location of the centroids, as well as the parameters of the fractal
dimension calculation and the colormap, to suit your needs. You can also use other methods for
calculating the fractal dimension or other clustering algorithms, such as DBSCAN or KMeans, if
you prefer.
It is worth noting that the use of fractal dimensions for clustering is not as well-established as
other methods, such as distance-based methods or density-based methods. Fractal dimensions
can be sensitive to the choice of method and parameters used to calculate them, and their use
for clustering may not always yield satisfactory results.
It is important to carefully evaluate the performance of any clustering method, including the use
of fractal dimensions, to ensure that it is appropriate for your dataset and your goals.
To replace the Euclidean distance with the fractal distance, we can simply replace the Euclidean
distance with the fractal distance in the distance function:
"""
Args:
Returns:
A numpy array of labels, where each label is an integer in the range [0, k).
"""
for i in range(100):
# Calculate the distance of each data point from each cluster center.
# Assign each data point to the cluster with the nearest center.
# Update the cluster centers to be the mean of the data points in each cluster.
centroids = data[labels].mean(axis=0)
return labels
This code will produce the same clusters as the original k-means algorithm, but it will use the
fractal distance instead of the Euclidean distance.
You can also use other clustering methods, such as GMM, DBSCAN, etc.
Concepts
● Distance
○ Euclidean Distance, Manhattan Distance, Fractal Distance
● Similarity
○ Cosine Distance
● Fractal Clustering
○ Golden Cluster, Objective Functions
○ Within-Cluster Sum of Squared Errors (WSS/SSE, used by the elbow method)
○ Silhouette score
● Recursive Clustering, K-means, Hierarchical Clustering (Divisive and Agglomerative)