Novelty Detection with Local Outlier Factor (LOF) in Scikit Learn
Novelty detection is the task of identifying previously unseen data points as being different from the "normal" data points in a dataset. It is used in a variety of applications, such as fraud detection and error detection.
There are several different approaches to novelty detection, including:
- One-class classification: This approach involves training a classifier on the normal data points in a dataset and then using it to predict whether a new data point is a normal data point or a novelty.
- Density-based methods: These methods calculate the local density of points around each data point and compare it to the densities of points around other data points. Data points with a low density relative to their neighbors are considered to be novelties.
- Distance-based methods: These methods calculate the distances between each data point and its nearest neighbors, and data points that are significantly far away from their nearest neighbors are considered to be novelties.
- Clustering-based methods: These methods use clustering algorithms to group the data points into clusters, and data points that do not belong to any of the clusters are considered to be novelties.
Novelty detection can be a useful tool for identifying previously unseen data points that are significantly different from the normal data points in a dataset. It can be used to detect fraud, errors, or other unusual patterns in a dataset.
Novelty Detection vs. Outlier Detection
Novelty detection and outlier detection are closely related but distinct concepts. Outlier detection refers to the task of identifying data points that are significantly different from the majority of the data points in a dataset. These data points are often referred to as "outliers."
On the other hand, novelty detection refers to the task of identifying previously unseen data points as being different from the "normal" data points in a dataset. In other words, novelty detection is about identifying data points that are different from what the model has seen before.
Both tasks involve identifying data points that deviate from the majority of the data. The key difference lies in the training data: in outlier detection, the training data itself contains outliers and the goal is to flag them, whereas in novelty detection the training data is assumed to be clean, and the goal is to decide whether new, previously unseen observations differ from it.
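To make the distinction concrete, here is a minimal sketch using scikit-learn's LocalOutlierFactor in both modes (the random training data and the single test point are illustrative assumptions, not part of any standard example):
Python3
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
rng = np.random.RandomState(42)
X_train = rng.randn(200, 2)      # assumed "normal" training data
X_new = np.array([[4.0, 4.0]])   # a previously unseen point
# Outlier detection: label the training samples themselves
labels = LocalOutlierFactor(novelty=False).fit_predict(X_train)
# Novelty detection: fit on clean data, then classify unseen points
model = LocalOutlierFactor(novelty=True).fit(X_train)
print(model.predict(X_new))      # -1 = novelty, 1 = normal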
LocalOutlierFactor and Reachability Distance
The Local Outlier Factor (LOF) is an algorithm for identifying anomalous data points in a dataset. It does this by measuring the local density of points around each data point and comparing it to the densities of points around other data points.
To calculate the local density of points around each data point, the LOF algorithm uses a measure called the reachability distance, which expresses how "far" a point is from a neighbor once the density of that neighbor's own neighborhood is taken into account.
To compute it, the algorithm first identifies the k nearest neighbors of each data point, where k is a user-specified parameter. The k-distance of a point is its distance to its k-th nearest neighbor. The reachability distance of a point A with respect to a neighbor B is then defined as the maximum of the k-distance of B and the actual distance between A and B; this smooths out statistical fluctuations for points that lie deep inside dense regions.
The reachability distance is used to calculate the local reachability density (lrd) of a data point, defined as the inverse of the average reachability distance from the point to its k nearest neighbors. The lrd is a measure of the local density of points around the data point.
Finally, the local outlier factor of a data point is calculated as the ratio of the average lrd of its k nearest neighbors to the lrd of the data point itself. A factor substantially greater than 1 indicates that the point lies in a sparser region than its neighbors and is likely an outlier, while a factor close to 1 indicates a normal (non-outlier) data point.
The reachability distance and local reachability density are used by the LOF algorithm to identify anomalous data points in a dataset. The algorithm is useful for identifying data points that are significantly different from their neighbors, such as fraud or errors in a dataset. It is often used as a preprocessing step for other machine learning algorithms, such as clustering or classification.
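Because the definitions above are short, they can be written out directly. The following NumPy sketch (a hedged illustration; the lof_scores helper and the choice k=5 are our own, not part of scikit-learn's API) computes LOF scores from scratch:
Python3
import numpy as np
from sklearn.neighbors import NearestNeighbors
def lof_scores(X, k=5):
    # k+1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]  # drop the self-distance column
    k_dist = dist[:, -1]                 # k-distance of each point
    # reach-dist_k(A, B) = max(k-distance(B), d(A, B))
    reach = np.maximum(dist, k_dist[idx])
    # local reachability density: inverse of the mean reachability distance
    lrd = 1.0 / reach.mean(axis=1)
    # LOF: average lrd of the neighbors divided by the point's own lrd
    return lrd[idx].mean(axis=1) / lrd
X = np.random.randn(100, 2)
print(lof_scores(X)[:5])
Scores near 1 indicate points whose density matches that of their neighbors; scores well above 1 indicate likely outliers. scikit-learn's negative_outlier_factor_ attribute, used below, is simply the negative of this quantity.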
Step-by-Step Implementation:
In scikit-learn, the LocalOutlierFactor class in the sklearn.neighbors module can be used to perform novelty detection using the local outlier factor (LOF) algorithm. The LOF algorithm is a density-based outlier detection method that calculates the local density of each sample in the dataset and identifies samples that have a significantly lower density than their neighbors. These samples are considered to be outliers or novelties.
To use the LocalOutlierFactor class, you need to create an instance of the class and fit it to the data using the fit() method. For example:
Python3
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
# Generate random data
X = np.random.randn(100, 10)
# Create a LocalOutlierFactor estimator
# and fit it to the data
estimator = LocalOutlierFactor()
estimator.fit(X)
Once the LocalOutlierFactor estimator is fitted to the data, you can use it to obtain the outlier scores for each sample in the dataset. The scores are based on the local density of each sample and are negative: values close to -1 indicate normal samples, while values much smaller than -1 (that is, more negative) indicate likely outliers.
To obtain the outlier scores, you can use the negative_outlier_factor_ attribute of the estimator. For example:
Python3
# Obtain the outlier scores for each sample
outlier_scores = estimator.negative_outlier_factor_
# Print the outlier scores for each sample
print(outlier_scores)
This code will print the outlier scores for each sample in the dataset. You can then use these scores to identify samples that are considered to be outliers or novelties.
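For example, one simple way to act on the scores (a sketch; the cutoff of five samples is an arbitrary assumption) is to sort them and inspect the most negative ones:
Python3
# Indices of the 5 samples with the most negative
# (most anomalous) outlier scores
suspect = np.argsort(outlier_scores)[:5]
print("Most anomalous samples:", suspect)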
You can also specify hyperparameters for the LocalOutlierFactor estimator, such as the number of neighbors to use for density estimation (the n_neighbors parameter) and the expected proportion of outliers in the data (the contamination parameter), which controls the threshold used to label samples. For example:
Python3
# Create a LocalOutlierFactor estimator with
# hyperparameters and fit it to the data
estimator = LocalOutlierFactor(n_neighbors=5,
                               contamination=0.1)
estimator.fit(X)
This code creates a LocalOutlierFactor estimator with the specified n_neighbors and contamination values and fits it to the data. The optimal values for these hyperparameters depend on the specific dataset and should be determined through experimentation.
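Since there is no universal rule for choosing n_neighbors, a small sweep such as the sketch below (the candidate values are arbitrary assumptions) can show how sensitive the scores are to this parameter:
Python3
# Illustrative sweep over a few n_neighbors values
for k in (5, 10, 20, 35):
    est = LocalOutlierFactor(n_neighbors=k)
    est.fit(X)
    print(k, est.negative_outlier_factor_.min())
Putting the earlier steps together, the complete example looks like this: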
Python3
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
# Generate random data
X = np.random.randn(100, 10)
# Create a LocalOutlierFactor estimator and fit it to the data
estimator = LocalOutlierFactor()
estimator.fit(X)
# Obtain the outlier scores for each sample
outlier_scores = estimator.negative_outlier_factor_
# Print the outlier scores for each sample
print(outlier_scores)
Output:
[-1.29336673 -0.98663101 -1.01328312 -0.98843551 -1.0340768 -1.00630881
-0.99046301 -1.01851411 -1.00941979 -1.02585983 -0.99454281 -1.03826622
-1.00920089 -1.08435498 -0.98485871 -0.99414 -1.02193122 -1.13255894
-0.98870854 -1.08340603 -1.03462261 -0.99815638 -1.06346218 -1.05982866
-1.15648965 -0.97513857 -0.99884846 -1.01392852 -1.00915394 -1.02404234
-1.02786408 -0.99580036 -1.03977835 -1.0856313 -1.0369034 -1.01757096
-0.98141263 -0.9666988 -0.99826695 -0.98593089 -1.02410345 -1.03045039
-1.01843609 -1.00225046 -0.99271876 -1.04562085 -1.04143942 -1.06242416
-1.24595953 -1.21899134 -1.06365838 -0.99014377 -1.00305435 -0.9863289
-0.96339396 -0.99409326 -1.0110496 -0.99468687 -0.99819612 -1.02407759
-1.05802008 -1.26005187 -1.00061505 -0.96921694 -0.97023558 -1.05295619
-1.01049517 -1.02283846 -0.985272 -0.99179016 -1.00560031 -1.0708834
-1.05491243 -1.00190921 -1.13925738 -1.04666919 -1.00216646 -0.99883435
-1.0091551 -0.98864925 -1.03776316 -1.12661428 -1.05180372 -1.20713398
-1.02207957 -1.00696503 -0.98899481 -1.04758736 -0.98664004 -0.97553829
-0.98835569 -1.19497038 -0.99148634 -1.00208273 -1.01195274 -1.06184659
-1.05820208 -0.99283114 -1.11214065 -0.97880798]
LOF for Outlier Detection
Here is a line-by-line explanation of the code example that demonstrates how to use the LocalOutlierFactor model for outlier detection and novelty detection in scikit-learn:
Python3
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
Importing Modules and Loading the Dataset
These three lines import the necessary modules and functions from scikit-learn. The load_breast_cancer function is used to load the breast cancer dataset, the StandardScaler transformer is used to standardize the data, and the LocalOutlierFactor class is used to create the outlier detection and novelty detection model.
The next line loads the breast cancer dataset and stores the features and targets in the variables X and y. The return_X_y parameter is set to True to return the data and the target values separately.
Python3
# Load the dataset
X, y = load_breast_cancer(return_X_y=True)
Normalization of the Data
These two lines create a StandardScaler transformer and use it to standardize the data. Standardization of data refers to the process of scaling the data so that it has zero mean and unit variance. This is often done as a preprocessing step before applying machine learning algorithms, as it can help to stabilize the variance of the features and improve the model's performance.
Python3
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
This line creates a LocalOutlierFactor model with n_neighbors=20 and the default value of novelty=False, which indicates that the model will be used for outlier detection rather than novelty detection.
Python3
# Create the LocalOutlierFactor
# model for outlier detection
lof_outlier = LocalOutlierFactor(n_neighbors=20)
This line fits the LocalOutlierFactor model to the standardized data and predicts a label for each data point. The fit_predict method returns an array of labels, with 1 representing inliers and -1 representing outliers.
Python3
# Fit the model to the data and predict
# the outlier scores for each data point
outlier_scores = lof_outlier.fit_predict(X_scaled)
# Identify the outlier data points
outlier_indices = outlier_scores == -1
print("Outlier indices:", outlier_indices)
These lines identify the outlier data points by building a Boolean mask that is True wherever the predicted label equals -1, and print the mask. The next snippet creates a new LocalOutlierFactor model with n_neighbors=20 and novelty=True, which indicates that the model will be used for novelty detection rather than outlier detection.
Python3
# Create the LocalOutlierFactor model for
# novelty detection (novelty=True is required
# to call predict() on new, unseen data)
lof_novelty = LocalOutlierFactor(n_neighbors=20,
                                 novelty=True)
lof_novelty.fit(X_scaled)
Complete Code Implementation:
In the code below, all the sub-steps are combined into one block so that we can see the novelty detection algorithm in action.
Python3
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor
# Load the dataset
X, y = load_breast_cancer(return_X_y=True)
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create the LocalOutlierFactor model for outlier detection
lof_outlier = LocalOutlierFactor(n_neighbors=20)
# Fit the model to the data and predict
# the outlier scores for each data point
outlier_scores = lof_outlier.fit_predict(X_scaled)
# Identify the outlier data points
outlier_indices = outlier_scores == -1
print("Outlier indices:", outlier_indices)
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_scaled)
# Use the model to predict whether new data points are novelties
new_data_point = [[2.0, 2.0, 2.0, 2.0, 2.0, 2.0,
                   2.0, 2.0, 2.0, 1.0, 3.0, 3.0, 3.0,
                   2.0, 2.0, 2.0, 1.0, 3.0, 3.0, 3.0,
                   2.0, 1.0, 3.0, 3.0, 3.0, 2.0, 1.0,
                   3.0, 3.0, 3.0]]
# Scale the new point with the same fitted scaler
# before predicting, since the model was fit on scaled data
new_data_point_scaled = scaler.transform(new_data_point)
prediction = lof_novelty.predict(new_data_point_scaled)
print("Novelty detection for new data point:", prediction)
Output:
Outlier indices: [False False False True False False False False False
True False False
True False False False False False False False False False False False
False False False False False False False False False False False False
False False True False False False True False False False False False
False False False False False False False False False False False False
True False False False False False False False True False False True
False False False False False False True False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False True False False True
False False True False False False False False False False False False
False False False False False False True False False False False False
False False True False False False False False True False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False True False
True False False False False False False False False False False False
False False False False False False False False True True False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
True False True False False False False False False False False False
False False False False False False False False False False False False
False False True False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False True False False False False False False
False False False False False True False False False False False False
False False False False False False False False False True False True
False False False False False False False False False False False False
True True False False False False False False False False False False
False False False False True False False False False False False False
False False False False False False False False False False False True
False False False False False False False False False False False False
False False False False False False False False False True False False
False False False False False]
Novelty detection for new data point: [-1]
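Besides the hard 1/-1 label, a model fitted with novelty=True also exposes continuous scores through its score_samples and decision_function methods; points with a negative decision_function value are classified as novelties. For instance:
Python3
# Continuous novelty scores for the same (scaled) point
print(lof_novelty.decision_function(new_data_point_scaled))
print(lof_novelty.score_samples(new_data_point_scaled))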