ML UCS-401
Topics: Distance-Based Models (Neighbours and Examples), Nearest Neighbours Classification, Distance-Based Clustering (K-Means Algorithm), Hierarchical Clustering

Distance-Based Models: Neighbours and Examples
Distance-based models in machine learning rely on the concept of measuring the distance or similarity between data points to make predictions, cluster data, or retrieve relevant examples. These models assume that similar data points are close to each other in the feature space.

Distance-Based Models:
1. k-Nearest Neighbors (k-NN): A non-parametric classification and regression algorithm.
• Predicts based on the labels or values of the nearest k points in the feature space.
• Common distance metrics (see the short sketch after the application examples below):
  • Euclidean distance
  • Manhattan distance
  • Minkowski distance
  • Cosine similarity
2. Radius Neighbors: Similar to k-NN, but uses all neighbors within a predefined radius for predictions instead of a fixed k.
3. Support Vector Machines (SVM) with RBF Kernel: Uses distance-based measures in its kernel functions (e.g., the Radial Basis Function) to find optimal decision boundaries.
4. Clustering Algorithms:
• K-Means: Clusters points by minimizing within-cluster distances to centroids.
• DBSCAN: Groups points based on density and distance.
• Agglomerative Clustering: Forms clusters by merging points or clusters based on linkage distance.
5. Self-Organizing Maps (SOM): An unsupervised learning method that uses distance-based mapping to reduce dimensions and visualize data.
6. Instance-Based Learning Algorithms: Learn by memorizing instances and using distance metrics for prediction (e.g., Locally Weighted Regression).

Examples of Applications:
1. k-NN in Image Recognition: Classifying images based on the majority label of the k nearest images in feature space.
2. Recommendation Systems: Using cosine similarity or Euclidean distance to recommend similar items.
3. Customer Segmentation with K-Means: Grouping customers based on purchasing behavior or demographics.
4. Anomaly Detection with DBSCAN: Identifying outliers as points with low-density neighborhoods.
5. RBF Kernel in SVM for Classification: Mapping non-linear data into a higher dimension using distance metrics for better separation.
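As referenced above, the following is a minimal sketch of the four distance/similarity measures, computed with plain NumPy for a pair of feature vectors; the vectors x and y and the order p = 3 are made-up illustrative values, not part of the original notes.

```python
import numpy as np

# Two illustrative feature vectors (made-up values).
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean distance: straight-line (L2) distance between the points.
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute coordinate differences (L1).
manhattan = np.sum(np.abs(x - y))

# Minkowski distance of order p; p = 1 gives Manhattan, p = 2 gives Euclidean.
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Cosine similarity: based on the angle between vectors, independent of length.
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, manhattan, minkowski, cosine_sim)
```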
Nearest Neighbours Classification

Nearest Neighbors Classification is a simple, instance-based learning algorithm that classifies a data point based on the majority label of its closest neighbors in the feature space. It assumes that similar data points are near each other.

How it Works:
1. Training Phase: There is no explicit training; the algorithm simply stores the dataset.
2. Prediction Phase: For a new data point (see the code sketch after the applications below):
• Compute the distance between the point and all points in the training data.
• Identify the k closest neighbors (k = a predefined number).
• Determine the majority class label among these neighbors.
• Assign the majority class label to the new point.

Advantages:
• Simple to understand and implement.
• No explicit training phase; efficient for small datasets.
• Non-parametric: no assumptions about the data distribution.

Disadvantages:
• Computationally expensive for large datasets (requires computing distances to all points).
• Sensitive to irrelevant or redundant features.
• The choice of k and the distance metric can greatly affect performance.
• Struggles with imbalanced datasets, as minority classes may get outvoted.

Applications:
1. Image Classification: Classifying images based on pixel-level features.
2. Text Categorization: Labeling text documents using distance metrics in vectorized text space.
3. Anomaly Detection: Identifying outliers by looking for points without enough neighbors in a radius.
4. Medical Diagnosis: Predicting diseases based on symptoms and historical cases.
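A minimal sketch of the k-NN prediction phase described above, using plain Python and NumPy; the function name knn_predict, the tiny training set, and k = 3 are illustrative assumptions rather than part of the original notes.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote of its k nearest neighbors."""
    # 1. Compute the Euclidean distance from x_new to every training point.
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # 2. Identify the indices of the k closest neighbors.
    nearest = np.argsort(distances)[:k]
    # 3. Determine the majority class label among these neighbors.
    labels = [y_train[i] for i in nearest]
    # 4. Assign that majority label to the new point.
    return Counter(labels).most_common(1)[0][0]

# Tiny made-up dataset: two classes in a 2-D feature space.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1])))  # expected: 1
```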
Distance-Based Clustering: K-Means Algorithm

K-Means is a popular distance-based clustering algorithm used for partitioning a dataset into k distinct clusters. It minimizes the variance within each cluster by iteratively updating cluster centroids and assigning data points to the nearest centroid.

How K-Means Works (a short code sketch of this loop follows the applications below):
1. Initialization:
• Select k, the number of clusters.
• Randomly initialize k centroids (cluster centers).
2. Iteration:
• Assign each data point to the nearest centroid using a distance metric, typically Euclidean distance.
• Compute new centroids by taking the mean of all points assigned to each cluster.
3. Convergence: Repeat the assignment and update steps until the centroids no longer change significantly or a maximum number of iterations is reached.

Advantages:
• Simple to implement and computationally efficient.
• Scales well to large datasets.
• Works well for spherical clusters of similar sizes.

Disadvantages:
• Requires predefining k, the number of clusters.
• Sensitive to initialization (different initial centroids can lead to different results).
• Assumes clusters are spherical and evenly sized.
• Struggles with non-convex clusters or datasets with varying densities.

Applications:
1. Customer Segmentation: Grouping customers based on purchasing behavior.
2. Image Compression: Quantizing colors in an image.
3. Document Clustering: Grouping documents with similar topics.
4. Anomaly Detection: Identifying data points that do not fit any cluster.
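A minimal NumPy sketch of the initialization, assignment, and update loop described above; the function name kmeans, the random seed, the stopping tolerance, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        # (A production implementation would also handle empty clusters.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop when the centroids no longer move significantly.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

# Usage with two made-up Gaussian blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```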
Hierarchical Clustering

Hierarchical clustering is an unsupervised machine learning algorithm that groups data points into clusters based on their similarity. It creates a hierarchy or tree-like structure (a dendrogram) to represent the nested grouping of data at different levels of granularity.

Types of Hierarchical Clustering (a short sketch follows the applications below):
1. Agglomerative (Bottom-Up):
• Starts with each data point as its own cluster.
• Iteratively merges the closest clusters until a single cluster remains.
2. Divisive (Top-Down):
• Starts with all data points in one cluster.
• Recursively splits clusters into smaller clusters until each data point is its own cluster.

Advantages:
• Does not require predefining the number of clusters (k).
• Can capture hierarchical relationships among data points.
• Works well for data with arbitrary shapes.

Disadvantages:
• Computationally expensive for large datasets (O(n^2)).
• Sensitive to noise and outliers.
• Requires interpretation of the dendrogram to decide the number of clusters.

Applications:
1. Document Clustering: Grouping text documents with similar content.
2. Gene Expression Analysis: Clustering genes with similar expression profiles.
3. Market Segmentation: Identifying customer segments based on purchasing behavior.
4. Social Network Analysis: Detecting communities or groups.
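A brief sketch of agglomerative (bottom-up) clustering using SciPy's hierarchical-clustering utilities; the synthetic data, the 'ward' linkage choice, and the cut into two flat clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# Two made-up groups of points in a 2-D feature space.
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

# Agglomerative clustering: start from single points and repeatedly merge the
# closest clusters; 'ward' linkage merges the pair that minimizes the increase
# in within-cluster variance.
Z = linkage(X, method="ward")

# Cut the resulting dendrogram to obtain a flat assignment into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) can be plotted with matplotlib to inspect the full hierarchy
# and choose a sensible number of clusters.
```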