WELCOME
to the presentation of “Group-2”

PRESENTATION TOPIC : Clustering
COURSE CODE : STAT-309
COURSE TITLE : Data Mining
COURSE TEACHER : AHSANUL HAQUE, LECTURER, DEPARTMENT OF STATISTICS, UNIVERSITY OF BARISHAL
Our Team
SMINA AHMED SHORMY
FARZANA TABASU
SHARMISTHA BISWAS SWARNA
ESHITA AKTER MIM
MONIRUL ISLAM RONI
Clustering
Grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities.

Huge Dataset → Clustering (examine the data to form clusters) → groups with common attributes: all data in the same group have similar attributes.
Entities in the real world are very complex:
• Products sold on an e-commerce site
• Users of a social media platform
• Readers of an online newspaper
Defining Characteristics Using Numbers

Products:
• Ratings
• Review sentiment (1 = positive, 0 = negative)
• Category (1 = electronics, 2 = fashion, ...)
• Dimensions (size, height, weight)
• Color

Users:
• Rate/score posts, comments, likes, shares
• Score every post by topic (music lovers, sports lovers)
• Activity score (100 = most active, 0 = not active at all)
• Number of connections
• % profile complete
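The numeric encodings above turn each entity into a feature vector, so similarity can be measured mathematically. A minimal sketch (the specific products and feature values below are illustrative assumptions, not data from the slides):

```python
import numpy as np

# Illustrative feature vectors for three products:
# [rating, review sentiment (1 = positive, 0 = negative),
#  category (1 = electronics, 2 = fashion), weight in kg]
products = np.array([
    [4.5, 1, 1, 0.3],   # a phone (hypothetical)
    [3.8, 0, 2, 0.2],   # a shirt (hypothetical)
    [4.7, 1, 1, 0.4],   # a tablet (hypothetical)
])

# Once entities are numbers, similarity between them can be
# computed, e.g. as the Euclidean distance between vectors.
dist = np.linalg.norm(products[0] - products[2])
print(round(dist, 3))
```

Entities with smaller distances between their feature vectors are more similar, which is exactly what a clustering algorithm exploits.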
REAL LIFE EXAMPLES

Sports Science: professional basketball teams may collect the following information about players:
• Points per game
• Assists per game
• Steals per game

Health Insurance: an actuary may collect the following information about households:
• Total number of doctor visits per year
• Total household size
• Total number of chronic conditions per household
• Average age of household members

Email Marketing: a business may collect the following information about consumers:
• Percentage of emails opened
• Number of clicks per email
• Time spent viewing email
Basic Features:
• The number of clusters is not known.
• There may not be any a priori knowledge concerning the clusters.
• Cluster results are dynamic.
HIERARCHICAL AGGLOMERATIVE CLUSTERING (HAC)

HAC can be represented using three techniques:
• Single: nearest distance or single linkage.
• Complete: farthest distance or complete linkage.
• Average: average distance or average linkage.
Linkage Method | Merits | Demerits
Single | Can separate non-elliptical shapes as long as the gap between two clusters is not small. | Cannot separate the clusters properly if there is noise between clusters.
Complete | Does well in separating clusters if there is noise between clusters. | Biased towards equal-variance clusters.
Average | Balances compactness and connectivity. | Computationally intensive.
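The three linkage techniques above differ only in how inter-cluster distance is defined, so in practice they are a single parameter. A minimal sketch using SciPy (the toy data points are my own; for well-separated groups like these, all three methods recover the same clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated groups of 2-D points (toy data).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# The three HAC techniques differ only in the "method" argument:
# 'single' (nearest), 'complete' (farthest), 'average' (average).
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```

On noisier or oddly shaped data the methods disagree, which is where the merits and demerits in the table above come into play.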
K-Means Clustering
• K-Means clustering is an unsupervised iterative clustering technique.
• It partitions the given data set into k predefined distinct clusters.
• A cluster is defined as a collection of data points exhibiting certain similarities.

IT PARTITIONS THE DATA SET SUCH THAT:
• Each data point belongs to the cluster with the nearest mean.
• Data points belonging to one cluster have a high degree of similarity.
• Data points belonging to different clusters have a high degree of dissimilarity.
K-Means Clustering Algorithm

The K-Means clustering algorithm involves the following steps:

Step-01: Choose the number of clusters K.

Step-02: Randomly select any K data points as cluster centers. Select cluster centers in such a way that they are as far as possible from each other.

Step-03: Calculate the distance between each data point and each cluster center. The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.

Step-04: Assign each data point to some cluster. A data point is assigned to the cluster whose center is nearest to that data point.

Step-05: Re-compute the centers of the newly formed clusters. The center of a cluster is computed by taking the mean of all the data points contained in that cluster.

Step-06: Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria is met:
• Centers of newly formed clusters do not change
• Data points remain present in the same cluster
• Maximum number of iterations is reached
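The six steps above can be sketched as a short NumPy implementation (a toy sketch for illustration, not a production routine; the function and variable names, the toy data, and the random initialization are my own assumptions):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-01/02: choose K and randomly pick K data points as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):                  # Step-06: repeat...
        # Step-03: Euclidean distance from every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to the nearest center.
        labels = dists.argmin(axis=1)
        # Step-05: re-compute each center as the mean of its points.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # centers stopped changing
            break
        centers = new_centers
    return labels, centers

# Two obvious groups of 1-D points.
X = np.array([[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]])
labels, centers = k_means(X, k=2)
print(labels)
```

Note the sketch uses plain random initialization (Step-02's first sentence); choosing centers far apart, as the second sentence of Step-02 suggests, is the idea behind the k-means++ refinement.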
Advantages of k-means:
• Relatively simple to implement.
• Scales to large data sets.
• Guarantees convergence.
• Can warm-start the positions of centroids.
• Easily adapts to new examples.
• Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

Disadvantages:
• It requires specifying the number of clusters (k) in advance.
• It cannot handle noisy data and outliers.
• It is not suitable for identifying clusters with non-convex shapes.