0% found this document useful (0 votes)

2 views

AppliedML-Chap1-Clustering

The document provides an introduction to machine learning, focusing on clustering as a key unsupervised learning technique. It discusses various clustering algorithms such as K-Means, Hierarchical Clustering, DBSCAN, Gaussian Mixture Models, and Spectral Clustering, along with their applications, best practices, and considerations for use. Additionally, it highlights advanced clustering concepts like Fractal Clustering and the importance of selecting the right algorithm based on data characteristics.

Uploaded by

gpt4prompt

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

2 views

AppliedML-Chap1-Clustering

Uploaded by

gpt4prompt

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 37

Applied Machine Learning

Chapter One : Introduction to Machine

Learning

Part One: Software Engineering and Machine

Learning

ML: statistical methods : algorithms

1. Clustering : K-means
2. Classification
3. Regression
4. Time-series Forecasting
5. Recommender systems

DeepLearning: An Intro
Convert celsius to Fahrenheit?
Easy

But wait a minute — what if you don’t know what the function is?

https://colab.research.google.com/drive/1dClkcffoJBaFwRJ37wOvaDQCSeIiC61O?userstoinvite
=shawn.chumbar%40sjsu.edu&sharingaction=manageaccess&role=writer

© Dr. Ali Arsanjani, 2018-2024

1
Applied Machine Learning

© Dr. Ali Arsanjani, 2018-2024

2
Applied Machine Learning

ML Tasks : Clustering
In this chapter we will look at the applications of Clustering, an unsupervised ML activity.
Identify the normative grouping of the data. What are the affinities in this dataset?

Sample Clusters on various data sets:

Example 1:
Cluster 1: == 4 years
Cluster 2: < 4 years
Cluster 3: == 4-6 years

Example 2:
Cluster 1 : people with prior ML exposure
Cluster 2: people with no prior ML exposure

Example 3, Marketing:
Demographics : urban, rural,
Soccer moms who drive SUV’s
Retired Male pop living in urban highrises

How can I group my data? In what meaningful ways can I group my data?

© Dr. Ali Arsanjani, 2018-2024

3
Applied Machine Learning

Introduction
Clustering is a fundamental technique in data science and machine learning, applicable to a
wide range of domains for uncovering hidden patterns in data. It's primarily used for exploratory
data analysis, pre-processing steps, and as a part of complex data processing pipelines. Below,
we explore various use cases for clustering, followed by tips, best practices, and a comparison
table to guide the selection of appropriate algorithms.

Use Cases for Clustering in Data Science and Machine Learning

1. Customer Segmentation: Businesses can identify distinct groups within their customer base
to tailor marketing strategies, optimize product offerings, and improve customer service.
Clustering helps in understanding customer behavior, preferences, and demographic
characteristics.

2. Anomaly Detection: By clustering similar data points together, outliers or anomalies become
more apparent, aiding in fraud detection, system health monitoring, and detecting unusual
behavior in network traffic or transactions.

3. Image Segmentation: In computer vision, clustering algorithms can segment images into
constituent parts or objects, useful in medical imaging, autonomous driving, and image
compression.

4. Recommendation Systems: Clustering similar items or users together allows recommendation

systems to suggest items that a user is more likely to be interested in, enhancing
personalization and user experience.

5. Genomic Data Analysis: In bioinformatics, clustering is used to group genes with similar
expression patterns, which can indicate co-regulated genes or genes that contribute to similar
functions or diseases.

6. Social Network Analysis: Identifying communities within social networks by clustering

individuals based on their interactions, shared interests, or connections.

7. Market Research: Understanding market structures, competitive landscapes, and consumer

preferences by clustering products, services, or attributes.

© Dr. Ali Arsanjani, 2018-2024

4
Applied Machine Learning

Tips and Best Practices

- Pre-processing: Standardize or normalize your data as clustering algorithms are sensitive to

the scale of data.
- Choosing the Right Algorithm: Consider the shape and size of your clusters, noise in the
data, and whether the number of clusters is known a priori.
- Validation: Use internal validation metrics (e.g., silhouette score, Davies–Bouldin index) to
assess the quality of clustering when true labels are not known.
- Dimensionality Reduction: For high-dimensional data, consider using PCA or t-SNE before
clustering to improve performance and outcomes.
- Hyperparameter Tuning: Experiment with different hyperparameters, such as the k=number
of clusters in K-Means or the bandwidth in Mean Shift, to find the optimal configuration for your
specific dataset.

© Dr. Ali Arsanjani, 2018-2024

5
Applied Machine Learning

© Dr. Ali Arsanjani, 2018-2024

6
Applied Machine Learning

Detailed Review of Top Clustering Techniques

In this section we will not only discuss the breadth and depth of clustering techniques but also
their practical applications in real-world scenarios. The lecture should cover the theoretical
aspects of each algorithm, followed by practical examples and Python code implementations.
Here are the top 5 clustering algorithms to include in your lecture:

1. K-Means Clustering

Description: K-Means is a centroid-based algorithm, which partitions the dataset into K distinct,
non-overlapping subsets (clusters). It assigns each data point to the nearest cluster while
keeping the centroids as small as possible.

Real-world Example: Customer segmentation for targeted marketing. Businesses can use
K-Means to segment their customers based on purchase history, behavior, and preferences to
tailor marketing strategies.

Python Code Snippet:

```python
from sklearn.cluster import KMeans
import numpy as np

# Sample dataset
X = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])

# Applying KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Predicting the clusters

print(kmeans.labels_)

# Centroids of the clusters

print(kmeans.cluster_centers_)
```

Detailed Discussion
K-means is an unsupervised machine learning algorithm that partitions a set of points into K
clusters based on their proximity to the centroids of the clusters. The algorithm starts with
randomly initialized centroids and iteratively refines the position of the centroids by computing

© Dr. Ali Arsanjani, 2018-2024

7
Applied Machine Learning

the mean of the points in each cluster and reassigning the points to the closest centroid. The
algorithm terminates when the centroids stop changing.

Source: K-Means Clustering Algorithm - Javatpoint

2.3. Clustering — scikit-learn 1.3.0 documentation

Clustering is an unsupervised ml task.

© Dr. Ali Arsanjani, 2018-2024

8
Applied Machine Learning

Source: Comparing different clustering algorithms on toy datasets — scikit-learn 1.3.0

documentation

How do we measure how well clustering worked? [exercise]

How do we determine k?
How to Determine the Optimal K for K-Means? | by Khyati Mahendru | Analytics Vidhya |
Medium

© Dr. Ali Arsanjani, 2018-2024

9
Applied Machine Learning

2. Hierarchical Clustering

Description: This method builds a hierarchy of clusters using a bottom-up approach

(agglomerative) or a top-down approach (divisive). It does not require a pre-specified number of
clusters.

Real-world Example: Gene sequence analysis where hierarchical clustering can be used to find
groups of genes with similar expression patterns.

Python Code Snippet:

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# Sample dataset
X = np.array([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])

# Generating the linkage matrix

Z = linkage(X, 'ward')

# Plotting dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.show()
```

Detailed Discussion

is a type of unsupervised machine1 learning method used for cluster analysis that seeks to build
a hierarchy of clusters by creating nested clusters that successively divide the original data into
subsets. It is mainly divided into two categories: Agglomerative (bottom-up) and Divisive
(top-down) Hierarchical Clustering.

Agglomerative Hierarchical Clustering starts with individual data points as separate clusters and
merges them into larger clusters, repeating this process until a stopping criterion is met. The
stopping criterion can be the desired number of clusters or a threshold for the similarity metric
used to determine whether two clusters should be merged.

1
Supervised: X,Y → F; unsupervised no X,Y
© Dr. Ali Arsanjani, 2018-2024
10
Applied Machine Learning

Divisive Hierarchical Clustering, on the other hand, starts with all the data in one big cluster
and splits it into smaller clusters, repeating this process until a stopping criterion is met.
The similarity metric used in hierarchical clustering can be Euclidean distance, Manhattan
distance, cosine similarity, or any other suitable distance metric. The choice of the metric
depends on the nature of the data and the clustering problem at hand.

The results of hierarchical clustering are usually visualized using a dendrogram, which shows
the hierarchy of clusters and the distances between them. The dendrogram helps to choose the
appropriate number of clusters by looking at the height of the vertical lines connecting two
clusters. The number of clusters can be determined by cutting the dendrogram at a certain
height, such that all clusters below that height are merged into a single cluster.

Hierarchical clustering is a flexible method that can handle non-linear relationships between
data points and does not require the number of clusters to be specified in advance. However, it
is computationally expensive for large datasets and can also be sensitive to the choice of
similarity metric.

HC is a useful method for exploratory data analysis and can be applied in various domains,
including biology, marketing, and image analysis, etc.

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Description: DBSCAN groups together closely packed points by marking as outliers points that
lie alone in low-density regions. It can find arbitrarily shaped clusters and doesn’t require the
number of clusters to be specified in advance.

Real-world Example: Anomaly detection in temperature data to identify unusual temperature

spikes or drops that deviate from the norm.

Python Code Snippet:

```python
from sklearn.cluster import DBSCAN

# Sample dataset
X = np.array([[1, 2], [2, 2], [2, 3],

© Dr. Ali Arsanjani, 2018-2024

11
Applied Machine Learning

[8, 7], [8, 8], [25, 80]])

# Applying DBSCAN
db = DBSCAN(eps=3, min_samples=2).fit(X)

# Cluster labels for each point

print(db.labels_)
```

4. Gaussian Mixture Models (GMM)

Mixture Model?! A mixture of normal distributions. GMM : find the mixture!

Is my data really under a normal distribution or fits a normal distribution?

[Source]

Soft clustering is now an option.

[Hint remember/see the fractal dimension discussion?

Fractal dimensions: the box method is a very approximate and the GMM will provide a more exact
and powerful means of measurement.]

© Dr. Ali Arsanjani, 2018-2024

12
Applied Machine Learning

Source: Fractal Dimension of Coastlines

The fractal distance is calculated by first dividing each image into small blocks. Then, for each block
in the first image, the algorithm finds the block in the second image that is most similar to it. The
similarity between two blocks is measured by the fractal dimension of the boundary between them.

The fractal dimension is a measure of the complexity of the boundary between two blocks. A high
fractal dimension indicates a complex boundary, while a low fractal dimension indicates a simple
boundary.

The fractal distance is then calculated by averaging the similarities between all of the blocks in the
first image and the blocks in the second image.

This algorithm has been shown to be effective at measuring the similarity between images. It is
particularly useful for comparing images of natural scenes, which often have complex patterns.

Note: we will start with the distance between two images (intuition) , but generalize to two
datasets (arrays) or clusters.

© Dr. Ali Arsanjani, 2018-2024

13
Applied Machine Learning

Description: GMM is a probabilistic model that assumes all the data points are generated from a
mixture of a finite number of Gaussian distributions with unknown parameters.

Real-world Example: Image segmentation where GMM can be used to identify and separate
different objects in an image based on color intensity.

Python Code Snippet:

```python
from sklearn.mixture import GaussianMixture

# Sample dataset
X = np.random.rand(300, 2)

# Applying GMM
gmm = GaussianMixture(n_components=2).fit(X)
labels = gmm.predict(X)

# Plotting the clusters

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.show()
```

© Dr. Ali Arsanjani, 2018-2024

14
Applied Machine Learning

5. Spectral Clustering

Description: Spectral clustering uses the eigenvalues of a similarity matrix to reduce

dimensionality before clustering in fewer dimensions. It works well for clusters of non-convex
shapes.

Real-world Example: Social network analysis, where spectral clustering can be used to detect
communities based on the pattern of connections between individuals.

Python Code Snippet:

```python
from sklearn.cluster import SpectralClustering

# Sample dataset
X = np.random.rand(100, 2)

# Applying Spectral Clustering

sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
labels = sc.fit_predict(X)

# Plotting the clusters

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.show()
```

© Dr. Ali Arsanjani, 2018-2024

15
Applied Machine Learning

Comparison of Clustering Algorithms

Algorithm Best for Considerations Real-world Use Case

Sensitive to outliers, requires

Large datasets, Customer
K-Means number of clusters to be
spherical clusters segmentation
specified

Unknown number of Computational complexity

Gene sequence
Hierarchical clusters, small for large datasets, sensitive
analysis
datasets to noise

Arbitrary shaped Density parameters (eps and

DBSCAN clusters, noise min_samples) can be hard to Anomaly detection

handling tune

Assumes clusters follow a

Gaussian Mixture Overlapping clusters,
Gaussian distribution, more Image segmentation
Models (GMM) probabilistic model
parameters to estimate

Non-convex clusters, Can be computationally Social network

Spectral Clustering
graph-based data expensive for large datasets analysis

When selecting a clustering algorithm, consider the data structure, the scale of your dataset,
and the specific requirements of your application. For example, K-Means is efficient for large
datasets with well-separated clusters, while DBSCAN is better suited for datasets with noise and
clusters of varying shapes and densities. Hierarchical clustering is ideal for when the number of
clusters is not known in advance, allowing for a detailed dendrogram to analyze cluster
© Dr. Ali Arsanjani, 2018-2024
16
Applied Machine Learning

formation. GMM provides flexibility with overlapping clusters through a probabilistic approach,
whereas Spectral Clustering excels with non-convex clusters that are interconnected, making it
a great choice for graph-based clustering.

© Dr. Ali Arsanjani, 2018-2024

17
Applied Machine Learning

Advanced Clustering

Recursive Clustering
See : Stock Picks using K-Means Clustering | by Timothy Ong | uptick-blog | Medium

Narrative: Clustering is often not used or done properly. Because people may not know HOW to
use clustering to solve practical problems beyond basic segmentation.

© Dr. Ali Arsanjani, 2018-2024

18
Applied Machine Learning

Fractal Clustering
Objective is to find a hyper personalized (what is the reason you are doing the clustering? To
pick a optimal subset of the data points …) golden cluster.
In this section we will try to gently move into the notion of Fractal Clustering (FC).
Let’s define FC as the use of Objective Functions, Recursive Clustering [option: using Fractal
Distance instead of the common Euclidean Distance] , Golden Cluster to find the golden cluster
that best fits the joint objective functions.

Image 1: Fractal Clustering

K-means → Apply recursively based on “some criteria” : objective functions > return, < volatility

Fractal Distance

Clusters can come in various shapes[see image 1] . The fractal distance is a measure of the
similarity between two images/shapes/distributions two clusters. It is based on the idea that
images/distributions with similar patterns will have a small fractal distance (i.e., they are more
similar) , while images with dissimilar patterns will have a large fractal distance.

© Dr. Ali Arsanjani, 2018-2024

19
Applied Machine Learning

[source] Which algo? → What is your data pattern? See table above.

Which is the best algo for my dataset? IT will depend on the distribution of my data/ shape of my data.
Rely on the ability of my data to fit the objective functions that lead me to the selection of a golden
cluster.

© Dr. Ali Arsanjani, 2018-2024

20
Applied Machine Learning

[source ]

© Dr. Ali Arsanjani, 2018-2024

21
Applied Machine Learning

[source]

Recursive clustering → fractal clustering (adding the notion of shape/ distribution similarity, aka
distance , fractal replacing euclidean distance )

Let’s first explore the notion of distributions to better understand the notion of fractal distances
which is all about similarity of distribution or shapes.

Note: The code is notional and consider it was pseudo

code, not executable code.

© Dr. Ali Arsanjani, 2018-2024

22
Applied Machine Learning

Here is a simple implementation of the fractal distance in Python:

import numpy as np

def fractal_distance(image1, image2):

"""
Calculates the fractal distance between two images.
Args:
image1: A numpy array of the first image.
image2: A numpy array of the second image.

Returns:
A float representing the fractal distance between the two images.
"""
# Divide each image into small blocks.
block_size = 32
image1_blocks = np.array_split(image1, image1.shape[0] // block_size)
image2_blocks = np.array_split(image2, image2.shape[0] // block_size)

# For each block in the first image, find the block in the second image that is most
similar to it.
similarities = []
for i in range(len(image1_blocks)):
similarity = 0
for j in range(len(image2_blocks)):
similarity = max(similarity, fractal_dimension(image1_blocks[i],
image2_blocks[j]))
similarities.append(similarity)
# The fractal distance is then calculated by averaging the similarities between all
of the blocks in the first image and the blocks in the second image.
return np.mean(similarities)

def fractal_dimension(image1, image2):

"""
Calculates the fractal dimension of the boundary between two blocks of an image.
Args:
image1: A numpy array of the first block.
image2: A numpy array of the second block.
Returns:
A float representing the fractal dimension of the boundary between the two blocks.
"""
# Calculate the difference image.
difference = image1 - image2

# Calculate the standard deviation of the difference image.

std = np.std(difference)

© Dr. Ali Arsanjani, 2018-2024

23
Applied Machine Learning

# Calculate the mean of the difference image.

mean = np.mean(difference)

# The fractal dimension is then calculated as the log of the standard deviation
divided by the log of the mean.

return np.log(std) / np.log(mean)

Here is an implementation of the fractal distance using the Scikit-Learn

library:
import numpy as np
from sklearn.feature_extraction import image

def fractal_distance(image1, image2):

"""
Calculates the fractal distance between two images using the Scikit-Learn library.

Args:
image1: A numpy array of the first image.
image2: A numpy array of the second image.

Returns:
A float representing the fractal distance between the two images.
"""

# Convert the images to grayscale.

image1 = image1.astype(np.uint8)
image2 = image2.astype(np.uint8)

# Calculate the fractal dimension of the boundary between the two images.
fractal_dimension = image.fractal_dimension(image1, image2)

# The fractal distance is then calculated as the log of the standard deviation
divided by the log of the mean.

return np.log(fractal_dimension) / np.log(np.mean(image1))

© Dr. Ali Arsanjani, 2018-2024

24
Applied Machine Learning

Fractal Distance between two points

import numpy as np

def fractal_distance(x1, y1, x2, y2, num_divisions):

"""
Calculates the fractal distance between two points in a 2D space using the
box-counting method.
"""
# Create a grid with num_divisions x num_divisions boxes
x = np.linspace(min(x1, x2), max(x1, x2), num_divisions + 1)
y = np.linspace(min(y1, y2), max(y1, y2), num_divisions + 1)
grid = np.zeros((num_divisions, num_divisions))

# Place a point at each of the two endpoints

x1_idx = np.argmin(np.abs(x - x1))
y1_idx = np.argmin(np.abs(y - y1))
grid[y1_idx, x1_idx] = 1
x2_idx = np.argmin(np.abs(x - x2))
y2_idx = np.argmin(np.abs(y - y2))
grid[y2_idx, x2_idx] = 1

# Count the number of boxes that contain at least one point

num_boxes_with_points = 0
for i in range(num_divisions):
for j in range(num_divisions):
if np.sum(grid[i:i+2, j:j+2]) > 0:
num_boxes_with_points += 1

# Calculate the fractal dimension and use it to estimate the length of the
path
fractal_dim = np.log(num_boxes_with_points) / np.log(2)
fractal_length = (2 fractal_dim) * ((x2 - x1) 2 + (y2 - y1) 2) 0.5

return fractal_length

© Dr. Ali Arsanjani, 2018-2024

25
Applied Machine Learning

Try to use this as the distance function in a k-means algorithm

import numpy as np
import pandas as pd

# Define the fractal distance function

def fractal_distance(x1, y1, x2, y2, num_divisions):
# Implementation from above function goes here
pass

# Read the CSV file

df = pd.read_csv('real_estate_data.csv')

# Select the columns to use in the clustering

cols = ['latitude', 'longitude', 'price']

# Perform k-means clustering with 3 clusters

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, init='random', algorithm='full',
metric=fractal_distance)
X = df[cols].values
kmeans.fit(X)

# Output the results

df['cluster'] = kmeans.labels_
for i in range(3):
print('Cluster', i+1)
print(df[df['cluster'] == i][cols].describe())
print()

This code assumes that the CSV file is named real_estate_data.csv and is located in the same
directory as the Python script. It also assumes that the CSV file has columns named latitude,
longitude, and price.

In the code above, we used the KMeans class from the sklearn.cluster module to perform
k-means clustering. We specified the number of clusters as 3, and set the init parameter to
'random' to use random initialization of the cluster centroids. We also set the algorithm
parameter to 'full' to use the full batch K-means algorithm, and specified the metric parameter as
fractal_distance to use our custom distance function.

The output of the code is a summary of the clusters, showing the mean, standard deviation,
minimum, and maximum values of the latitude, longitude, and price columns for each cluster.

© Dr. Ali Arsanjani, 2018-2024

26
Applied Machine Learning

You can modify this code to output the results in a different format or to use a different number
of clusters.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances

# Define the fractal distance function

def fractal_distance(x1, y1, x2, y2, num_divisions):
# Implementation goes here
pass

# Read the CSV file

df = pd.read_csv('real_estate_data.csv')

# Select the columns to use in the clustering

cols = ['latitude', 'longitude', 'price']

# Perform k-means clustering with 3 clusters

kmeans = KMeans(n_clusters=3, init='random', algorithm='full',
metric=fractal_distance)
X = df[cols].values
kmeans.fit(X)

# Output the results

df['cluster'] = kmeans.labels_
for i in range(3):
print('Cluster', i+1)
print(df[df['cluster'] == i][cols].describe())
print()

© Dr. Ali Arsanjani, 2018-2024

27
Applied Machine Learning

Now, we need some data to exercise the function for fractal k-means :

import numpy as np
import pandas as pd

# Set the random seed for reproducibility

np.random.seed(42)

# Generate random real estate data

latitudes = np.random.uniform(30, 40, size=60)
longitudes = np.random.uniform(-120, -110, size=60)
prices = np.random.normal(500000, 100000, size=60)

# Create a pandas DataFrame to hold the data

data = pd.DataFrame({'latitude': latitudes, 'longitude': longitudes,
'price': prices})

# Save the data to a CSV file

data.to_csv('real_estate_data.csv', index=False)

© Dr. Ali Arsanjani, 2018-2024

28
Applied Machine Learning

Here is another example of how to use the Fractal_distance function for

distance between two shapes
import numpy as np
from sklearn.feature_extraction import image

def fractal_distance(image1, image2):

"""
Calculates the fractal distance between two images.

Args:
image1: A numpy array of the first image.
image2: A numpy array of the second image.

Returns:
A float representing the fractal distance between the two images.
"""

# Convert the images to grayscale.

image1 = image1.astype(np.uint8)
image2 = image2.astype(np.uint8)

# Calculate the fractal dimension of the boundary between the two images.
fractal_dimension = image.fractal_dimension(image1, image2)

# The fractal distance is then calculated as the log of the standard deviation
divided by the log of the mean.
return np.log(fractal_dimension) / np.log(np.mean(image1))

if __name__ == "__main__":
# Load the images.
image1 = np.load("image1.png")
image2 = np.load("image2.png")

# Calculate the fractal distance.

distance = fractal_distance(image1, image2)

# Print the distance.

print(distance)

This code produces the following output:

0.99999994

29
Applied Machine Learning

As seen above,, the fractal distance between the two images is very close to 1, which indicates that
they are very similar.

Here is an example of how to use the Fractal_distance function to compare

the similarity of two images:
import numpy as np
from sklearn.feature_extraction import image

def fractal_distance(image1, image2):

"""
Calculates the fractal distance between two images.

Args:
image1: A numpy array of the first image.
image2: A numpy array of the second image.

Returns:
A float representing the fractal distance between the two images.
"""

# Convert the images to grayscale.

image1 = image1.astype(np.uint8)
image2 = image2.astype(np.uint8)

# Calculate the fractal dimension of the boundary between the two images.
fractal_dimension = image.fractal_dimension(image1, image2)

# The fractal distance is then calculated as the log of the standard deviation
divided by the log of the mean.
return np.log(fractal_dimension) / np.log(np.mean(image1))

if __name__ == "__main__":
# Load the images.
image1 = np.load("image1.png")
image2 = np.load("image2.png")

# Calculate the fractal distance.

distance = fractal_distance(image1, image2)

# Print the distance.

print(distance)

This code produces the following output:

30
Applied Machine Learning

0.99999994

Note, the fractal distance between the two images is very close to 1, which indicates that they are
very similar.
To compare the similarity of two images, we can use the fractal distance function to calculate the
distance between the two images. The fractal distance is a measure of the similarity between two
images. It is based on the idea that images with similar patterns will have a small fractal distance,
while images with dissimilar patterns will have a large fractal distance.
The fractal distance is calculated by first dividing each image into small blocks. Then, for each block
in the first image, the algorithm finds the block in the second image that is most similar to it. The
similarity between two blocks is measured by the fractal dimension of the boundary between them.
The fractal dimension is a measure of the complexity of the boundary between two blocks. A high
fractal dimension indicates a complex boundary, while a low fractal dimension indicates a simple
boundary.
The fractal distance is then calculated by averaging the similarities between all of the blocks in the
first image and the blocks in the second image.
This algorithm has been shown to be effective at measuring the similarity between images. It is
particularly useful for comparing images of natural scenes, which often have complex patterns.
To use the fractal distance function to compare the similarity of two images, we can simply pass the
two images to the function. The function will then calculate the distance between the two images
and return it.
For example, to compare the similarity of the two images in the previous example, we can use the
following code:
distance = fractal_distance(image1, image2)

print(distance)

This code will print the distance between the two images, which is 0.99999994. This indicates that
the two images are very similar.

31
Applied Machine Learning

Experiment: Replace Euclidean Distance with Fractal Distance in

K-means
Now let's use the fractal distance to replace euclidean distance in the computation of k-means
clustering.

Recall that Euclidean distance is a measure of the distance between two points in a Euclidean
space. It is calculated by taking the square root of the sum of the squares of the differences
between the coordinates of the two points.

The fractal distance is a measure of the similarity between two images. It is based on the idea
that images with similar patterns will have a small fractal distance, while images with dissimilar
patterns will have a large fractal distance.

To use the fractal distance to replace the Euclidean distance in the computation of k-means
clustering, we can simply replace the Euclidean distance with the fractal distance in the k-means
algorithm.

The k-means algorithm is a clustering algorithm that clusters a set of data points into k
clusters. The algorithm works by first initializing k centroids, which are points that are used to
represent the clusters. Then, the algorithm iterates until the centroids converge. In each
iteration, the algorithm assigns each data point to the cluster that is closest to it.

To replace the Euclidean distance with the fractal distance in the k-means algorithm, we can
simply replace the Euclidean distance with the fractal distance in the distance function that is
used to calculate the distance between a data point and a centroid.

For example, the following is the code for the k-means algorithm that uses the Euclidean
distance:

32
Applied Machine Learning

def k_means(data, k):

"""

Performs k-means clustering on the given data.

Args:

data: A numpy array of data points.

k: The number of clusters to create.

Returns:

A numpy array of labels, where each label is an integer in the range [0, k).

"""

# Initialize the cluster centers.

centroids = data[np.random.randint(data.shape[0], size=k)]

# Iterate until the cluster centers converge.

for i in range(100):

# Calculate the distance of each data point from each cluster center.

distances = np.linalg.norm(data - centroids, axis=1)

# Assign each data point to the cluster with the nearest center.

labels = np.argmin(distances, axis=0)

# Update the cluster centers to be the mean of the data points in each cluster.

centroids = data[labels].mean(axis=0)

return labels

Alternative Implementation of k-means

This function takes in a dataset (X), the number of clusters (k), and the initial centroids
(centroids). If the initial centroids are not provided, they are chosen at random from the
dataset. The algorithm then iterates over the dataset, calculating the distance between
each data point and the centroids, and assigning each data point to the cluster with the
closest centroid. The centroids are then updated to be the mean of the data points in
each cluster. This process is repeated until the centroids no longer change, or until the
maximum number of iterations is reached.

It then returns the cluster assignments and the final centroids. The cluster assignments
can be used to label the data points, and the centroids can be used to summarize the
clusters.

33
Applied Machine Learning

import numpy as np

def k_means(X, k, centroids=None, iterations=10):

if centroids is None:
centroids = X[:k]

for _ in range(iterations):
distances = np.linalg.norm(X - centroids, axis=1)
assignments = np.argmin(distances)
new_centroids = np.mean(X[assignments], axis=0)
if np.allclose(new_centroids, centroids):
break
centroids = new_centroids

return assignments, centroids

The Higuchi method is a mathematical approach to measuring the fractal dimension of

a signal. It is based on the observation that the length of a signal is related to the
number of times the signal crosses a predetermined threshold. The method is also used
to measure the complexity of a signal.

import matplotlib.pyplot as plt

import numpy as np

# Generate synthetic data using scikit-learn's make_blobs function

data, labels = make_blobs(n_samples=1000, centers=5, random_state=0)

# Compute the fractal dimension of the data using the Higuchi method
def higuchi_fd(X, k_max=10):
L = []
x = np.array(X)
N = len(x)
for k in range(1, k_max):
Lk = []
for m in range(0, k):
Lmk = 0
for i in range(1, int(np.floor((N - m) / k))):
Lmk += abs(x[m + i * k] - x[m + i * k - k])
Lmk = Lmk * (N - 1) / np.float(k * np.floor((N - m) / k)) / k
Lk.append(Lmk)
L.append(np.log(np.mean(Lk)))
fd = np.polyfit(np.log(np.arange(1, k_max)), np.log(L), 1)
return fd[0]

# Choose a set of centroids for the clusters

34
Applied Machine Learning

centroids = [[1,1], [2,2], [3,3], [4,4], [5,5]]

# Find clusters in the data using the fractal dimension

cluster_labels = np.zeros(len(data))

# Assign each point to a cluster

for i, point in enumerate(data):
min_distance = float('inf')
closest_cluster = -1
for j, centroid in enumerate(centroids):
distance = np.abs(higuchi_fd(point) - higuchi_fd(centroid))
if distance < min_distance:
min_distance = distance
closest_cluster = j
cluster_labels[i] = closest_cluster

# Create a scatter plot of the data, colored by cluster label

plt.scatter(data[:, 0], data[:, 1], c=cluster_labels, cmap='viridis')
plt.show()

This code defines centroids as a list of five centroids and uses it to assign data points to clusters
based on their fractal dimension. The resulting clusters are plotted using a scatter plot, with
each cluster being assigned a different color according to the viridis colormap.

You can adjust the number and location of the centroids, as well as the parameters of the fractal
dimension calculation and the colormap, to suit your needs. You can also use other methods for
calculating the fractal dimension or other clustering algorithms, such as DBSCAN or KMeans, if
you prefer.

It is worth noting that the use of fractal dimensions for clustering is not as well-established as
other methods, such as distance-based methods or density-based methods. Fractal dimensions
can be sensitive to the choice of method and parameters used to calculate them, and their use
for clustering may not always yield satisfactory results.

It is important to carefully evaluate the performance of any clustering method, including the use
of fractal dimensions, to ensure that it is appropriate for your dataset and your goals.

35
Applied Machine Learning

To replace the Euclidean distance with the fractal distance, we can simply replace the Euclidean
distance with the fractal distance in the distance function:

def k_means(data, k):

"""

Performs k-means clustering on the given data.

Args:

data: A numpy array of data points.

k: The number of clusters to create.

Returns:

A numpy array of labels, where each label is an integer in the range [0, k).

"""

# Initialize the cluster centers.

centroids = data[np.random.randint(data.shape[0], size=k)]

# Iterate until the cluster centers converge.

for i in range(100):

# Calculate the distance of each data point from each cluster center.

distances = fractal_distance(data, centroids)

# Assign each data point to the cluster with the nearest center.

labels = np.argmin(distances, axis=0)

# Update the cluster centers to be the mean of the data points in each cluster.

centroids = data[labels].mean(axis=0)

return labels

This code will produce the same clusters as the original k-means algorithm, but it will use the
fractal distance instead of the Euclidean distance.

You can also use other clustering methods, such as GMM, DBSCAN, etc.

36
Applied Machine Learning

Concepts
● Distance,
○ Euclidean Distance, Manhattan Distance, Fractal Distance, Fractal Clustering,
● Similarity,
○ Cosine Distance,
● Fractal Clustering
○ Golden Cluster, Objective Functions,
○ within-Cluster Sum of Square Errors or WSS or SSE (elbow method) ,
○ Silhouette score
● Recursive Clustering, K-means, Hierarchical Clustering, Divisive and Agglomerative.

Machine Learning with Clustering: A Visual Guide for Beginners with Examples in Python
From Everand
Machine Learning with Clustering: A Visual Guide for Beginners with Examples in Python
Artem Kovera
No ratings yet
Palas Blackbook Original
100% (3)
Palas Blackbook Original
43 pages
The Secret Of Machine Learning
From Everand
The Secret Of Machine Learning
Mhd Arjunanta
No ratings yet
K Means
No ratings yet
K Means
9 pages
Lec09 Clustering
No ratings yet
Lec09 Clustering
27 pages
Machine Learning & Data Mining: Understanding
No ratings yet
Machine Learning & Data Mining: Understanding
7 pages
Clustering Algorithm
No ratings yet
Clustering Algorithm
47 pages
DSML-ML09. Unsupervised Learning
No ratings yet
DSML-ML09. Unsupervised Learning
69 pages
Unit-8 (1)
No ratings yet
Unit-8 (1)
62 pages
Machine Learning Notes-1 (Clustering-1)
No ratings yet
Machine Learning Notes-1 (Clustering-1)
25 pages
K-Means Algorithm
No ratings yet
K-Means Algorithm
29 pages
6 - Into To Data Science Techniques and Clustering
No ratings yet
6 - Into To Data Science Techniques and Clustering
16 pages
CBSYLLABUS BDA
No ratings yet
CBSYLLABUS BDA
5 pages
AIMLB PGP 2024 Session 12
No ratings yet
AIMLB PGP 2024 Session 12
46 pages
ML_7th_Sem_AIML_ITE_Notes_Complete_LONG[1]-155-202
No ratings yet
ML_7th_Sem_AIML_ITE_Notes_Complete_LONG[1]-155-202
48 pages
ML - K-Means
No ratings yet
ML - K-Means
12 pages
Unit 4 Introduction to Algorithm
No ratings yet
Unit 4 Introduction to Algorithm
10 pages
Zara
No ratings yet
Zara
47 pages
Data Science Project Training Report
No ratings yet
Data Science Project Training Report
19 pages
M5
No ratings yet
M5
40 pages
Clustering in Machine Learning - Javatpoint
No ratings yet
Clustering in Machine Learning - Javatpoint
10 pages
L 8 Clustering
No ratings yet
L 8 Clustering
58 pages
Clustering
No ratings yet
Clustering
6 pages
CLUSTERING PPT 1233
No ratings yet
CLUSTERING PPT 1233
18 pages
Clustering
No ratings yet
Clustering
13 pages
Day 3 - Content
No ratings yet
Day 3 - Content
50 pages
unit 2 ml
No ratings yet
unit 2 ml
11 pages
7.introduction To Clustering
No ratings yet
7.introduction To Clustering
11 pages
ML Lecture06 Unsupervised Learning
No ratings yet
ML Lecture06 Unsupervised Learning
87 pages
FML Unit4
No ratings yet
FML Unit4
14 pages
ML+Clustering
No ratings yet
ML+Clustering
33 pages
Cluster-Analysis
No ratings yet
Cluster-Analysis
89 pages
Artificial Intelligence Report
No ratings yet
Artificial Intelligence Report
23 pages
3CP10 MJJ Clustering Intro
No ratings yet
3CP10 MJJ Clustering Intro
18 pages
Unit 3 Data
No ratings yet
Unit 3 Data
37 pages
(PML ITS - Week 10) - Clustering
No ratings yet
(PML ITS - Week 10) - Clustering
42 pages
Machine Learning & Data Mining
No ratings yet
Machine Learning & Data Mining
108 pages
Clustering in Machine Learning
No ratings yet
Clustering in Machine Learning
4 pages
LP I Assignment A4 Clustering
No ratings yet
LP I Assignment A4 Clustering
13 pages
clustering
No ratings yet
clustering
20 pages
clustering
No ratings yet
clustering
16 pages
Jntuk Machine Learning 3-2 Unit-4
No ratings yet
Jntuk Machine Learning 3-2 Unit-4
32 pages
20 - 1 - ML - Unsup - 01 - Partition Based - Kmeans
No ratings yet
20 - 1 - ML - Unsup - 01 - Partition Based - Kmeans
20 pages
4 Clustering
No ratings yet
4 Clustering
9 pages
FullMarks - Clustering StudentSolution 2
No ratings yet
FullMarks - Clustering StudentSolution 2
13 pages
An Introduction To Different Methods of Clustering in Machine Learning
No ratings yet
An Introduction To Different Methods of Clustering in Machine Learning
8 pages
Chapter 1 Introduction
No ratings yet
Chapter 1 Introduction
47 pages
M5
No ratings yet
M5
40 pages
ML Unit 4 V1
No ratings yet
ML Unit 4 V1
30 pages
Clustering
No ratings yet
Clustering
11 pages
K-Means in Python - Solution
No ratings yet
K-Means in Python - Solution
6 pages
K-Mean
No ratings yet
K-Mean
9 pages
20 - 1 - ML - Unsup - 01 - Partition Based - Kmeans
No ratings yet
20 - 1 - ML - Unsup - 01 - Partition Based - Kmeans
20 pages
Clustering Machine Learning Algorithms (2)
No ratings yet
Clustering Machine Learning Algorithms (2)
35 pages
Understanding Clustering_ A Comprehensive Guide to
No ratings yet
Understanding Clustering_ A Comprehensive Guide to
5 pages
ML-UNIT-5
No ratings yet
ML-UNIT-5
20 pages
Clustering in Python
No ratings yet
Clustering in Python
31 pages
04-FSSR_DS610_2024=2025T1_Kmeans
No ratings yet
04-FSSR_DS610_2024=2025T1_Kmeans
57 pages
Clustering
No ratings yet
Clustering
84 pages
unsupervised-learning
No ratings yet
unsupervised-learning
18 pages
unsupervised learning
No ratings yet
unsupervised learning
23 pages
Dap M4
No ratings yet
Dap M4
18 pages
Education: M.S./B.S., Computer Science (Artificial Intelligence, Computer Systems) Minor in Music
No ratings yet
Education: M.S./B.S., Computer Science (Artificial Intelligence, Computer Systems) Minor in Music
1 page
DIP Lab Manual No 01
No ratings yet
DIP Lab Manual No 01
20 pages
Scipy Lib
100% (1)
Scipy Lib
15 pages
Python Programming Concepts
No ratings yet
Python Programming Concepts
5 pages
Virtual Env
No ratings yet
Virtual Env
8 pages
Python 5 Unit
No ratings yet
Python 5 Unit
4 pages
XII - Score Plus IP Question Bank Class 12
No ratings yet
XII - Score Plus IP Question Bank Class 12
154 pages
Pandas Python For Data Science
100% (1)
Pandas Python For Data Science
1 page
Intro To AI With Python
No ratings yet
Intro To AI With Python
50 pages
IP Class-XI Chapter-9 NOTES
No ratings yet
IP Class-XI Chapter-9 NOTES
14 pages
Intro To Scientific Python (2018-01-23) PDF
No ratings yet
Intro To Scientific Python (2018-01-23) PDF
16 pages
Python Pandas Hands-On CID 55937
No ratings yet
Python Pandas Hands-On CID 55937
10 pages
Pandas DataFrame Notes
100% (1)
Pandas DataFrame Notes
10 pages
Pi Camera
100% (1)
Pi Camera
213 pages
Grade 12 - IP Practicals (1 To 9)
No ratings yet
Grade 12 - IP Practicals (1 To 9)
12 pages
Instant Access to Python High Performance Programming Boost the performance of your Python programs using advanced techniques 1st Edition Gabriele Lanaro ebook Full Chapters
100% (1)
Instant Access to Python High Performance Programming Boost the performance of your Python programs using advanced techniques 1st Edition Gabriele Lanaro ebook Full Chapters
61 pages
Chapter3-Integrating With Standard Python PDF
No ratings yet
Chapter3-Integrating With Standard Python PDF
24 pages
Scikit-Learn: Machine Learning in Python
No ratings yet
Scikit-Learn: Machine Learning in Python
6 pages
AI Practical File Part 1
100% (1)
AI Practical File Part 1
9 pages
Python Record
No ratings yet
Python Record
14 pages
Green Minimalist Healthcare Flyer
No ratings yet
Green Minimalist Healthcare Flyer
37 pages
fdsa lab manual final
No ratings yet
fdsa lab manual final
70 pages
Data Preprocessing in Machine Learning
No ratings yet
Data Preprocessing in Machine Learning
5 pages
Foml Paper Solution 2
No ratings yet
Foml Paper Solution 2
34 pages
Python Interview
0% (1)
Python Interview
18 pages
Python Notes by MR Saem
No ratings yet
Python Notes by MR Saem
114 pages
Scipy Lecture Notes PDF
100% (2)
Scipy Lecture Notes PDF
690 pages
HCIA-AI V1.0 Lab Guide
No ratings yet
HCIA-AI V1.0 Lab Guide
269 pages