Outlier Analysis
Identifying Outliers
Assume that a given statistical process is generating a set of data objects.
An outlier is a data object that deviates significantly from the rest of the
objects, as if it were generated by a different mechanism.
Outliers are different from noisy data.
Noise is a random error or variance in a measured variable. In general, noise
is not interesting in data analysis.
Example: a customer may generate noisy data in their purchases (for instance, at a restaurant R).
Noise should be removed before outlier detection.
Outliers are interesting because they are suspected of not being
generated by the same mechanisms as the rest of the data.
Therefore, in outlier detection, it is important to justify why the
outliers detected are generated by some other mechanisms.
This is often achieved by making various assumptions on the rest of
the data and showing that the outliers detected violate those
assumptions significantly.
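The idea of stating an assumption about the data and then showing that a point violates it can be sketched as follows. This is a minimal illustration only: it assumes the data are roughly normally distributed and uses a z-score cutoff of 3, both of which are choices made for this sketch and not part of the slides. The function name and the sample readings are hypothetical.

```python
from statistics import mean, stdev

def violates_normal_assumption(values, threshold=3.0):
    """Return the values whose z-score exceeds `threshold`.

    Assumption (for this sketch only): the bulk of the data is roughly
    normal, so a point several standard deviations from the mean is
    unlikely to come from the same mechanism.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []          # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings: 19 values near 10 and one at 25.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.7, 10.3, 9.9,
            10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.8, 25.0]

suspects = violates_normal_assumption(readings)
print(suspects)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so this check can miss outliers in small samples.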
Outlier Types:
Global Outliers
Contextual Outliers
Collective Outliers
Global Outliers
◦ A data object that deviates significantly from the rest of the data set.
◦ Sometimes called point anomalies; the simplest type of outlier.
Examples:
◦ Global outlier detection is important in many applications. Consider intrusion detection
in computer networks, for example.
◦ If the communication behavior of a computer is very different from the normal patterns
(e.g., a large number of packets is broadcast in a short time), this behavior may be
considered a global outlier, and the corresponding computer is a suspected victim of
hacking.
◦ In trading transaction auditing systems, transactions that do not follow the regulations
are considered global outliers and should be held for further examination.
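The network example above can be sketched as a global outlier check on packets per minute. This is a minimal sketch, not a definitive detector: it swaps in the modified z-score based on the median and the median absolute deviation (MAD), with the conventional constant 0.6745 and cutoff 3.5, none of which appear in the slides. The function name and the traffic counts are hypothetical.

```python
from statistics import median

def global_outlier_indices(values, threshold=3.5):
    """Return indices of values that deviate strongly from the whole data set.

    Uses the modified z-score 0.6745 * |x - median| / MAD, which is less
    affected by the outliers themselves than a mean/standard-deviation score.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # at least half the values equal the median; score undefined
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical packets-per-minute counts for one computer.
packets_per_minute = [120, 132, 118, 125, 130, 127, 122, 9800, 129, 124]

flagged = global_outlier_indices(packets_per_minute)
print(flagged)
```

The burst of 9800 packets is compared against every other observation, with no notion of context, which is what makes it a global (point) outlier.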
Contextual Outliers
◦ A data object that deviates significantly with respect to a specific context of the object
◦ Also called conditional outliers because they are conditional on the selected
context
◦ Example: “The temperature today is 28 degrees Celsius. Is it exceptional?”
◦ It depends on the context: the date, location, and other factors
In a given data set, a data object is a contextual outlier if it deviates significantly with
respect to a specific context of the object. Contextual outliers are also known as conditional
outliers because they are conditional on the selected context.
Therefore, in contextual outlier detection, the context has to be specified as part of the
problem definition.
Generally, in contextual outlier detection, the attributes of the data objects in question are
divided into two groups:
◦ Contextual attributes: The contextual attributes of a data object define the object’s context.
In the temperature example, the contextual attributes may be date and location.
◦ Behavioral attributes: These define the object’s characteristics, and are used to evaluate
whether the object is an outlier in the context to which it belongs. In the temperature
example, the behavioral attributes may be the temperature, humidity, and pressure.
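The split into contextual and behavioral attributes can be sketched as follows: records are grouped by their contextual attributes, and the behavioral attribute is then scored only against records in the same group. This is a minimal sketch under stated assumptions: the scoring rule (modified z-score with cutoff 3.5), the function name, the location label `city_A`, and all temperatures are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import median

def contextual_outliers(records, threshold=3.5):
    """records: list of (context, value) pairs.

    `context` holds the contextual attributes (here: location and month);
    `value` is the behavioral attribute (here: temperature in Celsius).
    A record is flagged if it deviates strongly within its own context.
    """
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        values = by_context[context]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread in this context; score undefined
        if 0.6745 * abs(value - med) / mad > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical temperature records for one location in two months.
january = [(("city_A", "Jan"), t) for t in [-6, -4, -5, -7, -3, -5, 28]]
july = [(("city_A", "Jul"), t) for t in [26, 27, 28, 29, 27, 28, 26]]

result = contextual_outliers(january + july)
print(result)
```

The same reading of 28 degrees is flagged in the January context but not in the July context, which is why the context has to be part of the problem definition.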
Collective Outliers
◦ Given a data set, a subset of data objects forms a collective outlier, if the objects as a
whole deviate significantly from the entire data set.
◦ Importantly, the individual data objects may not be outliers.
Example: in intrusion detection, a denial-of-service packet from one node to
another is normal. But if several nodes keep sending denial-of-service packets to each
other, they as a whole should be considered a collective outlier.
Unlike the detection of other outlier types, collective outlier detection must consider not only the
behavior of individual objects but also that of groups of objects.
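The group-level view described above can be sketched by looking at events together instead of one at a time. This is a deliberately simplified sketch: the fixed 60-second windows, the threshold of three distinct senders, the function name, and the event log are all assumptions made for illustration, and real detectors model group structure far more carefully.

```python
from collections import defaultdict

def collective_outlier_windows(events, window=60, min_nodes=3):
    """events: list of (timestamp_seconds, source_node) for suspicious packets.

    A single packet is treated as normal. A time window is flagged only
    when at least `min_nodes` distinct nodes send such packets within it,
    so the group, not any individual event, is the outlier.
    """
    senders = defaultdict(set)
    for timestamp, source in events:
        senders[timestamp // window].add(source)
    return {w: sorted(nodes) for w, nodes in senders.items()
            if len(nodes) >= min_nodes}

# Hypothetical log: isolated packets, plus a burst from four nodes.
events = [(5, "A"), (130, "B"),
          (190, "A"), (195, "B"), (200, "C"), (210, "D"),
          (400, "C")]

groups = collective_outlier_windows(events)
print(groups)
```

Every event in the log looks the same individually; only the window in which nodes A, B, C, and D are all active is reported.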
Outlier detection (also known as anomaly detection) is the process of finding data
objects with behaviors that are very different from expectation.
Such objects are called outliers or anomalies.
Outlier detection is important in many applications in addition to fraud detection,
such as medical care, public safety and security, industry damage detection, image
processing, sensor/video network surveillance, and intrusion detection.
Challenges of Outlier Detection
Modeling normal objects and outliers effectively
Application-specific outlier detection
Handling noise in outlier detection
Understandability