Outlier Detection in High-Dimensional Data in Data Mining
Last Updated :
05 Dec, 2022
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
Outlier Detection in High-Dimensional Data:
We might need to find outliers in high-dimensional data for some applications. Effective outlier detection faces enormous obstacles because of the curse of dimensionality. As the dimensionality increases, the distance between objects may become heavily dominated by noise. This means that the distance and similarity between two points in a high-dimensional space might not accurately represent their actual relationship. As a result, conventional outlier detection techniques, which primarily use proximity or density to identify outliers, become less effective as dimensionality rises.
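This degradation of distance-based contrast can be illustrated with a small, self-contained sketch (the function name and the uniform random data are illustrative assumptions, not from a specific method): as the number of dimensions grows, the gap between a query point's nearest and farthest neighbors shrinks relative to the nearest distance, so proximity-based outlier scores lose their discriminating power.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """(max - min) / min over the distances from a random query
    point to n_points uniform random points in `dim` dimensions."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast collapses as dimensionality grows.
for dim in (2, 20, 200, 2000):
    print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the nearest neighbor is much closer than the farthest one; in high dimensions all distances concentrate around a common value, which is exactly why fixed distance thresholds stop working.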
Outlier detection methods for high-dimensional data ought to be able not only to identify outliers but also to explain what the outliers mean. A high-dimensional data set contains many features (or dimensions), so identifying outliers without explaining why they are outliers is not very helpful. The interpretation of outliers may be based on various factors, such as the particular subspaces in which the outliers are manifested or a measure of the "outlier-ness" of the objects. Such an interpretation can assist users in comprehending the potential significance and meaning of the outliers. The techniques should also handle sparsity in high-dimensional spaces: as the dimensionality increases, noise dominates the distance between objects more and more, and data in high-dimensional spaces are frequently sparse. Outliers should be properly modeled, for instance, by being adaptable to the subspaces representing the outliers and by capturing the local behavior of the data. Because the distance between two objects tends to increase monotonically with dimensionality, using a fixed distance threshold across all subspaces to find outliers is ineffective. Finally, the number of subspaces grows exponentially with dimensionality, so an exhaustive combinatorial exploration of the search space of all possible subspaces is not scalable.
Outlier detection methods for high-dimensional data can be divided into three main approaches. These include extending conventional outlier detection, finding outliers in subspaces, and modeling high-dimensional outliers.
Extending Conventional Outlier Detection:
One method extends traditional outlier detection techniques to high-dimensional data. It makes use of the common proximity-based outlier models. However, it employs alternative metrics, or creates subspaces and looks for outliers there, to compensate for the degradation of proximity measures in high-dimensional spaces. An illustration of this strategy is the HilOut algorithm. HilOut detects outliers based on distance, but instead of using the absolute distance, it uses the ranks of distances. Specifically, HilOut finds the k-nearest neighbors of each object o, denoted by nn₁(o), ..., nnₖ(o), where k is an application-dependent parameter. The weight of an object is defined as
w(o) = Σᵢ₌₁ᵏ dist(o, nnᵢ(o))
All objects are ranked in descending order of weight, and the top-l objects in weight are output as outliers, where l is another user-specified parameter.
When the database is large and the dimensionality is high, it is expensive to compute the k-nearest neighbors for each object. HilOut uses space-filling curves to achieve an approximation algorithm that is scalable in both running time and space with respect to database size and dimensionality to address the scalability issue.
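The weight-and-rank scheme above can be sketched in a few lines. This is a minimal, exact (brute-force) version of the weight computation only, not HilOut's space-filling-curve approximation; the function names are illustrative.

```python
import numpy as np

def hilout_weights(data, k):
    """Weight w(o) = sum of distances from o to its k nearest neighbors."""
    diffs = data[:, None, :] - data[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)      # full pairwise distance matrix
    dists.sort(axis=1)                         # row-sort; column 0 is the 0.0 self-distance
    return dists[:, 1:k + 1].sum(axis=1)       # skip self, sum the k nearest

def top_l_outliers(data, k, l):
    """Indices of the l objects with the largest weights."""
    w = hilout_weights(data, k)
    return np.argsort(w)[::-1][:l]

# Usage: a tight cluster of 20 points plus one distant point at index 20.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (20, 3)), [[5.0, 5.0, 5.0]]])
print(top_l_outliers(data, k=3, l=1))          # the distant point, index 20
```

The brute-force distance matrix is O(n²) in space and time, which is exactly the scalability problem the space-filling-curve approximation is designed to avoid.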
General feature selection and extraction methods for dimensionality reduction may be used in addition to, or in place of, extended outlier detection techniques. For instance, principal components analysis (PCA) can extract a lower-dimensional space. Heuristically, the principal components with low variance are preferred because, on such dimensions, normal objects are likely close to one another while outliers frequently deviate from the majority.
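A hedged sketch of this heuristic, using only NumPy's SVD rather than any particular PCA library (the data set, which places points near a line in 3-D with one point off the line, and the function name are illustrative assumptions): points are scored by how far they project onto the low-variance principal axes.

```python
import numpy as np

def low_variance_scores(data, n_components):
    """Norm of each point's projection onto the n_components
    LOWEST-variance principal axes; large norms flag deviants."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    weak_axes = vt[-n_components:]             # last rows = low variance
    projection = centered @ weak_axes.T
    return np.linalg.norm(projection, axis=1)

rng = np.random.default_rng(0)
# 50 points close to the line t * (1, 2, 3), plus one point off the line.
t = rng.normal(size=(50, 1))
data = np.hstack([t, 2 * t, 3 * t]) + rng.normal(0, 0.01, (50, 3))
data = np.vstack([data, [[0.0, 5.0, 0.0]]])

scores = low_variance_scores(data, n_components=2)
print(scores.argmax())                         # index 50, the off-line point
```

Normal points have almost no component outside the dominant axis, so only the deviating point gets a large score on the weak axes.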
Finding Outliers in Subspaces:
Searching for outliers in different subspaces is another method for outlier detection in high-dimensional data. This approach has a distinct advantage: when an object is discovered to be an outlier in a much lower-dimensional subspace, that subspace can be used to interpret why, and to what extent, the object is an outlier. Because of the overwhelming number of dimensions, this insight is extremely beneficial in applications with high-dimensional data.
An early method of this kind partitions each attribute into equi-depth ranges, each containing a fraction f of the objects, so that a cell C formed by ranges from k different dimensions is expected to contain fᵏ·n of the n objects in the data set. The sparsity coefficient of C compares the actual number of objects in the cell, n(C), with this expectation:

S(C) = [n(C) − fᵏn] / √[fᵏ(1 − fᵏ)n]
If S(C) < 0, then C contains fewer objects than anticipated. The smaller (i.e., the more negative) the value of S(C), the sparser C is, and the more likely it is that the objects in C are outliers in the subspace. By assuming that S(C) follows a normal distribution under an a priori assumption that the data follow a uniform distribution, we can use normal distribution tables to determine the probabilistic significance level at which a cell deviates from the average. The uniform-distribution assumption is generally untrue. The sparsity coefficient, however, continues to offer a perceptible indicator of a region's "outlier-ness."
Modeling High-Dimensional Outliers:
An alternative method for finding outliers in high-dimensional data tries to directly create new models for high-dimensional outliers. These models typically do not use proximity measures but rather new heuristics to find outliers that do not degrade in high-dimensional data.
It is worth noting that the angles formed at a point deep inside a cluster, by pairs of other points, vary greatly. For a point near the edge of a cluster, the angle variance is smaller. For an outlier point, the angle variance is significantly smaller still. This observation implies that the variance of the angles at a point can be used to determine whether that point is an outlier.
We can model outliers by combining angles and distance. Mathematically, we use the distance-weighted angle variance as the outlier score for each point o. In other words, given a set of points, D, for a point, o ∈ D, the angle-based outlier factor (ABOF) is defined as
ABOF(o) = VARx,y∈D, x≠o, y≠o [ ⟨ox, oy⟩ / (dist(o,x)² · dist(o,y)²) ],
where ⟨·,·⟩ represents the scalar product operator and dist(·,·) represents a norm distance. Clearly, the smaller the ABOF, the smaller the variance of the angles at a point and the farther the point is from clusters. The ABOD (angle-based outlier detection) method computes the ABOF for each point in the data set and returns a list of the points in ascending ABOF order. The time complexity of computing the exact ABOF for every point in a database is O(n³), where n is the number of points in the database, so this algorithm does not scale well for large data sets. To speed up the computation, approximation methods have been developed. The angle-based outlier detection concept has also been generalized to work with any data type.
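The exact O(n³) computation can be sketched directly from the ABOF definition above. This is a brute-force illustration, not a scalable implementation, and the function names and test data (a 2-D Gaussian cluster plus one far-away point) are assumptions for the example:

```python
import numpy as np
from itertools import combinations

def abof(data, o_idx):
    """Variance, over pairs (x, y), of <ox, oy> / (dist(o,x)^2 * dist(o,y)^2)."""
    o = data[o_idx]
    others = [i for i in range(len(data)) if i != o_idx]
    values = []
    for i, j in combinations(others, 2):
        ox, oy = data[i] - o, data[j] - o
        dx2, dy2 = np.dot(ox, ox), np.dot(oy, oy)   # squared norm distances
        values.append(np.dot(ox, oy) / (dx2 * dy2))
    return np.var(values)

def abod_ranking(data):
    """Points in ascending ABOF order: likely outliers come first."""
    scores = np.array([abof(data, i) for i in range(len(data))])
    return np.argsort(scores)

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (15, 2)), [[10.0, 10.0]]])
print(abod_ranking(data)[0])                        # index 15, the distant point
```

For the distant point, all vectors ox point in roughly the same direction and carry large distance weights in the denominator, so its ABOF is small; points inside the cluster see angles of widely varying sign and magnitude and therefore score higher.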
Conclusion:
Based on the information presented above, we conclude that outlier detection methods for high-dimensional data can be classified into three categories: extending conventional outlier detection, finding outliers in subspaces, and modeling high-dimensional outliers. Challenges in outlier detection include finding appropriate data models, the dependence of outlier detection systems on the application at hand, distinguishing outliers from noise, and justifying why identified outliers are outliers.