Abstract
This paper introduces a new methodology, based on the field of Topological Data
Analysis, for detecting structural anomalies in dependent sequences of complex data. A
motivating example is that of multivariate time series, for which our method allows the
detection of global changes in the dependence structure between channels. The proposed
approach is lean enough to handle large-scale data sets, and extensive numerical experiments
support the intuition that it is more suitable for detecting global changes of correlation
structures than existing methods. Some theoretical guarantees for quantization algorithms
based on dependent sequences are also provided.
Keywords: Topological Data Analysis, Unsupervised Learning, Anomaly Detection,
Multivariate Time Series, β-mixing coefficients.
1. Introduction
Monitoring the evolution of the global structure of time-dependent complex data, such as
multivariate time series or dynamic graphs, is a major task in real-world applications of
machine learning. The present work considers the case where the global structure of interest
may be encoded by a persistence diagram, with particular attention to the weighted
dynamic graph encoding the dependence structure between the different channels of a
multivariate time series. Such a situation may be encountered in various fields, such as
EEG signal analysis Mohammed et al. (2023) or the monitoring of industrial processes Li et al.
(2022), and has recently given rise to an abundant literature - see, e.g., Zheng et al. (2023);
Ho et al. (2023) and references therein.
The specific monitoring task addressed in this paper is unsupervised anomaly detection,
that is, detecting when the global structure is far enough from a so-called 'normal' regime
to be considered anomalous. In the multivariate time series framework, this amounts to
detecting when the dependence patterns between channels depart from the normal regime. From
a mathematical point of view, this problem, in its whole generality, is ill-posed: one has
access to unlabeled data, in which it is tacitly assumed that the normal regime is prominent;
the goal is then to label data points as normal or abnormal in a fully unsupervised way.
In this sense, anomaly detection shows clear connections with outlier detection in robust
machine learning (for instance robust clustering as in Brécheteau et al. 2021; Jana et al.
2024). For more insights and benchmarks on the specific problem of anomaly detection in
time series, the reader is referred to Paparrizos et al. (2022b) for the univariate case and to
Wenig et al. (2022) for the multivariate case.
We introduce a new framework, coming with mathematical guarantees, based on
Topological Data Analysis (TDA), a field that has attracted increasing interest for the study of
complex data - see, e.g., Chazal and Michel (2021) for a general introduction. Applications
of TDA to anomaly detection in time series have raised recent and growing interest: in
medicine (Dindin et al. 2019; G. et al. 2014; Chrétien et al. 2024), cyber security (Bruillard
et al. 2016), to name a few. Some general surveys on TDA applications to time series may
be found in Ravishanker and Chen (2019); Umeda et al. (2019).
In this paper, the proposed approach proceeds in three steps. First, the time-dependence
structure of a time series is encoded as a dynamic graph in which each vertex represents a
channel of the time series and each weighted edge encodes the dependence between the two
corresponding vertices over a time window. Persistent homology, a central theory in TDA,
is then used to robustly extract the global topological structure of the dynamic graph as
a sequence of so-called persistence diagrams. Second, we introduce a specific encoding of
persistence diagrams, that has been proven efficient and simple enough to face large-scale
problems in the independent case (Chazal et al. 2021). Finally, we produce a topological
anomaly score based on this encoding.
The last two steps may be applied to any dependent sequence of persistence diagrams,
encompassing diagrams generated from a sequence of filtered simplicial complexes in general.
Since monitoring the dependence structure of multivariate time series is the practical motivation
of this work, the details of the full process (from raw data to anomaly scores) are given in
this setting.
1.1 Contributions
Our main contributions are the following.
- We produce a new machine learning methodology for learning the normal topological
behavior of complex data. This methodology is unsupervised: it does not need to be
calibrated on uncorrupted data, as long as the amount of corrupted data remains limited
with respect to the uncorrupted one. The proposed pipeline is easy to implement, flexible,
and can be adapted to different specific applications and frameworks involving graph data
or more general topological data;
- In the multivariate time series case, the captured information is empirically shown to be
different and, in several cases, more informative than the one captured by other state-of-
the-art approaches. This methodology is lean by design, and enjoys novel interpretable
properties with regard to anomaly detection that, to the best of our knowledge, have never
appeared in the literature;
Topological Analysis for Detecting Anomalies in Dependent Data
- The resulting method can be deployed on architectures with limited computational and
memory resources: once the training phase has been completed, the anomaly detection
procedure relies on a few memorized parameters and simple persistent homology compu-
tations. Moreover, this procedure does not require any storage of previously processed
data, preventing privacy issues;
- Some convergence guarantees for quantization algorithms - used to vectorize topological
information - in the dependent case are proven. These results are not restricted to the specific
setting of the paper and may be extended to the general framework of M -estimation
with dependent observations;
- Extensive numerical investigation has been carried out in three different frameworks.
First, on new synthetic data directly inspired by brain modeling problems as exposed
in Bourakna et al. (2022), which are particularly suited to TDA-based methods and may
be used as a novel benchmark. They are added to the public The GUDHI Project (2015)
library at github.com/GUDHI/gudhi-data. Second, on the comprehensive benchmark
'TimeEval', which encompasses a large array of synthetic data sets Schmidl et al. (2022).
And third, on the real-case 'Exathlon' data set from Jacob et al. (2021). All of these
experiments assess the relevance of our approach compared with current state-of-the-art
methods. Our procedure, originating from concrete industrial problems, is implemented
and has been deployed within the Confiance.ai program, and an open-source release is
forthcoming. Its implementation involves only standard, tested machine learning tools.
2. Methodology
This section describes the process to build an anomaly score from a dependent sequence
of topologically structured data. Our method can be roughly decomposed into three steps
(depicted in Figure 1):
• Step 1 : From the raw data, build a sequence of topological descriptors of the structure via
persistent homology. The output of this step is a dependent sequence of persistence
diagrams (see Section 2.1.2).
• Step 2 : Convert the sequence of persistence diagrams into a vector-valued sequence, using
the approach described in Chazal et al. (2021) (see Section 2.2.2).
• Step 3 : Build from the vector-valued sequence an anomaly score function (based on a robust
Mahalanobis distance, see Section 2.3).
Chazal, Levrard and Royer
Figure 1: TADA general scheme for producing anomaly scores with topological information from the
original time series.
To be consistent with the applications exposed in Section 4, details of these steps will be
given in the case where the original data consist of a multivariate time series (Y_t)_{t∈[0,L]} with values in R^D
and the anomalies to be detected pertain to the dependence structure between channels.
This particular setting impacts Step 1 only. In other situations, Steps 2 and 3 may apply as
such once suitable persistence diagrams are built from raw data.
The two subsections that follow give details on the different steps of Algorithm 1. We
start with a brief description of the TDA tools that we use.
Definition 1 Let α_min ≤ min_{v,v′∈V} s_{v,v′} and α_max ≥ max_{v,v′∈V} s_{v,v′} be two real numbers.
The Vietoris-Rips filtration associated to G is the filtration (VR_α(G))_{α∈[α_min,α_max]} with vertex
set V defined by

σ = [v_0, · · · , v_k] ∈ VR_α(G) if and only if s_{v_i,v_j} ≤ α, for all i, j ∈ [[0, k]],

for k ≥ 1, and [v] ∈ VR_α(G) for any v ∈ V and any α ∈ [α_min, α_max].
The topology of VR_α(G) changes as α increases: existing connected components may
merge, loops and cavities may appear and be filled, etc. Persistent homology provides a
mathematical framework and efficient algorithms to encode this evolution of the topology
(homology) by recording the scale parameters at which topological features appear and
disappear. Each such feature is then represented as an interval [α_b, α_d] representing its life
span along the filtration. Its length α_d − α_b is called the persistence of the feature. The
set of all such intervals corresponding to topological features of a given dimension d - d = 0
for connected components, d = 1 for 1-dimensional loops, d = 2 for 2-dimensional cavities,
etc. - is called the persistence barcode of order d of G. It is also classically represented
as a discrete multiset D_d(G) ⊂ [α_min, α_max]² where each interval [α_b, α_d] is represented by
the point with coordinates (α_b, α_d) - a basic example is given in Figure 2. Adopting the
perspective of Chazal and Divol (2018); Royer et al. (2021); Chazal et al. (2021), in the
sequel of the paper, the persistence diagram D_d(G) will be considered as a discrete measure:
D_d(G) := Σ_{p∈D_d(G)} δ_p, where δ_p is the Dirac measure centered at p. To control the influence
of the possibly many low-persistence features, the atoms in the previous sum can be weighted:

D_d(G) := Σ_{(b,d)∈D_d(G)} ω(b, d) δ_{(b,d)},
Figure 2: The persistence diagrams of order 0 and 1 of a simple weighted graph G whose vertices are
4 points in the real plane and edge weights are given by the squared distances between them. Here
αmin and αmax are chosen to be 0 and 64 respectively. The first line represents G and VRα (G) for
different values of α. The persistence barcodes and diagrams of order 0 and 1 are represented in red
and blue respectively on the second line.
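To make the order-0 part of this construction concrete, here is a minimal self-contained sketch (the function name is ours, not the paper's) that computes the order-0 persistence barcode of a weighted graph with a union-find pass over edges sorted by weight; libraries such as GUDHI provide the general computation for all homology orders.

```python
def h0_persistence(n_vertices, edges, alpha_min=0.0):
    """Order-0 persistence of the Vietoris-Rips filtration of a weighted
    graph: every vertex is born at alpha_min, and two connected components
    merge (one bar dies) when the lightest edge joining them enters the
    filtration. `edges` is a list of (u, v, weight) triples."""
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    bars = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                 # a component dies at filtration value w
            parent[ru] = rv
            bars.append((alpha_min, w))
    bars.append((alpha_min, float("inf")))  # the surviving component
    return bars
```

On a 4-vertex weighted graph this returns three finite bars (one per merge) plus one infinite bar for the component that never dies.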
The relevance of the above construction relies on the persistence stability theorem Chazal
et al. (2016). It ensures that close weighted graphs have close persistence diagrams. More
precisely, if G, G′ are two weighted graphs with the same vertex set V and edge weight functions
s : V × V → R and s′ : V × V → R respectively, then for any order d, the so-called
bottleneck distance between the persistence diagrams D_d(G) and D_d(G′) is upper bounded
by ‖s − s′‖_∞ := sup_{v,v′∈V} |s_{v,v′} − s′_{v,v′}| - see Chazal et al. (2014) for formal persistence
stability statements for Vietoris-Rips complexes.
Then, for each sub-interval [st, st + ∆], a coherence graph G_t is built, starting from the
fully-connected graph ([[1, D]], E) and specifying edge values as s_{i,j,t} = 1 − Cor_t(Y_i, Y_j), that
is, 1 minus the correlation between channels i and j computed over the t-th window.
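As an illustration of this construction, a minimal sketch (assuming the series is stored as an array of shape (L, D); the function name is ours) of the sliding-window correlation graphs could read:

```python
import numpy as np

def correlation_graphs(Y, delta, stride):
    """Slide a window of length `delta` with step `stride` over a
    multivariate series Y of shape (L, D) and return, for each window,
    the edge-weight matrix s_{i,j} = 1 - Cor(Y_i, Y_j) computed on it."""
    L, D = Y.shape
    graphs = []
    for start in range(0, L - delta + 1, stride):
        window = Y[start:start + delta]           # shape (delta, D)
        corr = np.corrcoef(window, rowvar=False)  # (D, D) channel correlations
        s = 1.0 - corr                            # edge weight: 1 - correlation
        np.fill_diagonal(s, 0.0)
        graphs.append(s)
    return graphs
```

Each returned matrix is symmetric with zero diagonal, and with L = 200, ∆ = 50 and s = 25 one obtains ⌊(L − ∆)/s⌋ + 1 = 7 graphs, matching the count n + 1 given below.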
The size ∆ of the time bins is the time series equivalent of the resolution in images. In
practice, choosing a suitable resolution ∆ requires some prior information on the resolution
or scale at which the anomalies occur in the time series and might be detected. To the best of
our knowledge, methods dedicated to detecting changes in time series implicitly use some
form of this prerequisite, see Section 4. Another way to formulate this is that the notion of
anomalies in time series is better defined at a certain resolution ∆, but a proper and explicit
definition is outside the scope of this work.
Since the number n + 1 of coherence graphs satisfies n = ⌊(L − ∆)/s⌋, choosing a large
stride s reduces the sample size n available for the task that follows. In practice,
the stride s is chosen to be small, sometimes 1, as small as the computation time and power
allow, so as to feed the next step with as much data as possible. From a theoretical point of
view, it turns out that the choice of s has no impact on the convergence results (see Section
3.3 for details).
Lastly, the persistence diagrams of the Vietoris-Rips filtration are computed (one per
homology order), resulting in sequences of diagrams X_t^(d), with 0 ≤ t ≤ n, where d is the
homological order. An example of sequences of windows and corresponding persistence
diagrams is represented in Figure 3. In what follows, a fixed homology order is considered, so
that the index d is removed. In practice, the vectorization steps that follow are performed
order-wise, as is the anomaly detection procedure.
It is worth noting that other dependence measures such as the ones based on coherence
Ombao and Pinto (2021) may be chosen instead of correlation to build the weighted graphs.
Such alternative choices do not affect the overall methodology nor the theoretical results
provided below.
In the numerical experiments, we give results for the correlation weights, which have
the advantage of simplicity and carry a few insights: following Bourakna et al. (2022),
such weights are enough to detect structural differences in the case where the channels Y_j
are mixtures of independent components Z_p, the weights of the mixture being given by a
(hidden) graph on the p's whose structure drives the behavior of the observed persistence
diagrams.
Finally, it is worth recalling here that Vietoris-Rips filtrations may be built on top of
arbitrary metric spaces, so that the persistence diagram construction may be performed in
more general cases, encompassing valued graphs (with values on nodes or edges) for instance.
The vectorization and detection steps below take such persistence diagrams as inputs,
and can be applied verbatim to any situation where persistence diagrams can be
built from data.
Once a sequence of persistence diagrams (X_i)_{i=1,...,n} is built from data, the next step is
to convert these persistence diagrams into a vector-valued sequence. There exist several
data-driven methods to perform this vectorization step, such as Hofer et al. (2019); Carrière
et al. (2020), to name a few supervised ones.
The simplest methods Adams et al. (2017); Royer et al. (2021) consist in evaluating a
persistence diagram X, considered as a discrete measure, on several test functions of the form
u ↦ ψ(‖u − c‖/σ), where ψ is a fixed kernel, and c and σ are respectively a centroid and a scale
that vary to provide different test functions. Intuitively speaking, this amounts to encoding
how much mass a persistence diagram spreads around different centers at different scales.
Notice that such vectorizations fit within the general framework of linear representations of
persistence diagrams, see Wu et al. (2024); Divol and Chazal (2019).
In the case where the persistence diagrams (X_i)_{i=1,...,n} are i.i.d., it has been shown that
the centroid and scale selection procedure exposed in Royer et al. (2021) offers several
advantages compared to the fixed-grid approach of Persistence Image Adams et al. (2017).
For instance, for the same budget K of centroids, the approach of Royer et al. (2021)
experimentally outperforms the fixed-grid approach (see Royer et al. 2021, Table 3). In
most of the applications depicted in Royer et al. (2021); Chazal et al. (2021), as well as
in Section 4, a budget K = 10 is enough to capture most of the topological information.
Such a low-dimensional vectorization usually spares the user a dimension reduction step.
Furthermore, some theoretical guarantees are available Chazal et al. (2021) that assess the
relevance of this procedure in a clustering framework. We thus adopt the same strategy for
the dependent case.
Algorithm 2: Persistence diagram vectorization
Input: K: dimension of the output vector. T : stopping time.
Data: X_1, . . . , X_n discrete measures.
1 Use Algorithm 3 (with stopping time T) or 4 to get K centroids c^(T) = (c_1^(T), . . . , c_K^(T)).
2 for i = 1, . . . , n do
3   Use Algorithm 5 with parameter c^(T) on X_i to get v_i ∈ R^K.
Output: Vectorization v = (v_1, . . . , v_n).
Let us mention here that this vectorization algorithm may be applied to any dependent
sequence of measures (such as texts, point processes realizations, etc.), persistence diagrams
data being one example of such a framework. The following subsections give details on
the centroid computation algorithms (Algorithms 3 and 4) and the vectorization algorithm
(Algorithm 5) in the persistence diagram case.
Recall here from Section 2.1.1 that the persistence diagrams X_i are thought of as discrete
measures on R², that is,

X_i = Σ_{(b,d)∈D_i} ω_{(b,d)} δ_{(b,d)},

where D_i is the i-th persistence diagram considered as a multiset of points (see Section
2.1.1), and the ω_{(b,d)} are weights given to points in the persistence diagram (usually given as a
function of the distance from the diagonal, see e.g. Adams et al. 2017).
The goal is to find K centroids (c_1, . . . , c_K) ∈ (R²)^K and scales (σ_1, . . . , σ_K) such that the
vectorization X_i ↦ (X_i(du) ψ(‖u − c_1‖/σ_1), . . . , X_i(du) ψ(‖u − c_K‖/σ_K)) is an accurate enough
representation of the original persistence diagrams (where X(du) f(u) means integration of
f with respect to X).
The key idea behind the ATOL procedure Chazal et al. (2021) is to choose as centroids
nearly optimal minimizers of the empirical least-squares criterion

c = (c_1, . . . , c_K) ↦ X̄_n(du) min_{j=1,...,K} ‖u − c_j‖², (1)

where X̄_n denotes the empirical mean measure, X̄_n = (1/n) Σ_{i=1}^n X_i.
As mentioned above, this method offers several advantages over fixed-grid strategies such
as Persistence Image in the i.i.d. case (see Royer et al. 2021; Chazal et al. 2021 for a more
in-depth discussion). The two algorithms that follow expose two methods to approximately
minimize (1); their theoretical justification is given in Section 3. Let us begin with the
batch algorithm.
Algorithm 3: Centroids computation - ATOL - Batch algorithm
Input: K: number of centroids. T : stopping time.
Data: X_1, . . . , X_n discrete measures.
1 Initialization: c^(0) = (c_1^(0), . . . , c_K^(0)) randomly chosen from X̄_n.
2 for t = 1, . . . , T do
3   for j = 1, . . . , K do
4     W_{j,t−1} ← {x ∈ R² | ∀ i ≠ j, ‖x − c_j^(t−1)‖ ≤ ‖x − c_i^(t−1)‖} (ties arbitrarily broken).
5     if X̄_n(W_{j,t−1}) ≠ 0 then
6       c_j^(t) ← X̄_n(du) u 1_{W_{j,t−1}}(u) / X̄_n(W_{j,t−1}).
7     else
8       c_j^(t) ← random sample from X̄_n.
Output: Centroids c^(T) = (c_1^(T), . . . , c_K^(T)).
Algorithm 3 is the same as in the i.i.d. case (Chazal et al., 2021, Algorithm 1). Moreover,
almost the same convergence guarantees as in the i.i.d. case may be proven: for a good-
enough initialization, only 2 log(n) iterations are needed to achieve a statistically optimal
convergence (see Theorem 5 below). Therefore, a practical implementation of Algorithm 3
should run several threads based on different initializations (possibly in parallel), each of
them being stopped after 2 log(n) steps, yielding a complexity in time of O(n log(n) × n_start),
where n_start is the number of threads.
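A bare-bones sketch of one such batch run (a single thread, with the atoms of all diagrams pooled into `points`/`weights`; the names are ours and the implementation is deliberately simplified) might look like:

```python
import numpy as np

def atol_batch_centroids(points, weights, k, n_iter, seed=0):
    """Lloyd-type iterations on the empirical mean measure, in the spirit
    of Algorithm 3: `points` pools the atoms of all diagrams X_1, ..., X_n
    and `weights` their masses. Each step replaces c_j by the weighted
    barycenter of its Voronoi cell, or re-samples c_j from the atoms if
    the cell carries no mass."""
    rng = np.random.default_rng(seed)
    c = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each atom to its nearest centroid (ties broken by argmin)
        d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        cell = d2.argmin(axis=1)
        for j in range(k):
            w = weights[cell == j]
            if w.sum() > 0:
                c[j] = np.average(points[cell == j], axis=0, weights=w)
            else:  # empty cell: re-sample from the mean measure
                c[j] = points[rng.integers(len(points))]
    return c

def quantization_risk(points, weights, c):
    """Criterion (1) up to the 1/n factor: total weighted squared distance
    of each atom to its nearest centroid."""
    d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
    return float((weights * d2.min(axis=1)).sum())
```

Each Lloyd step can only decrease the criterion, which is why a logarithmic number of iterations per thread suffices in the regime described above.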
As in the i.i.d. case, an online version of Algorithm 3 may be conceived, based on
mini-batches. In what follows, for a convex set C ⊂ R^d, we let π_C denote the Euclidean
projection onto C.
Algorithm 4: Centroids computation - ATOL - Minibatch algorithm
Input: K: number of centroids. q: size of mini-batches. R: maximal radius.
Data: X_1, . . . , X_n discrete measures.
1 Initialization: c^(0) = (c_1^(0), . . . , c_K^(0)) randomly chosen from X̄_n. Split X_1, . . . , X_n into n/q mini-batches of size q:
B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4}, . . . , B_{t,1}, B_{t,2}, B_{t,3}, B_{t,4}, . . . , B_{T,1}, B_{T,2}, B_{T,3}, B_{T,4}, with T = n/(4q).
2 for t = 1, . . . , T do
3   for j = 1, . . . , K do
4     W_{j,t−1} ← {x ∈ R² | ∀ i ≠ j, ‖x − c_j^(t−1)‖ ≤ ‖x − c_i^(t−1)‖} (ties arbitrarily broken).
5     if X̄_{B_{t,1}}(W_{j,t−1}) ≠ 0 then
6       c_j^(t) ← π_{B(0,R)}( X̄_{B_{t,3}}(du) u 1_{W_{j,t−1}}(u) / X̄_{B_{t,1}}(W_{j,t−1}) ).
7     else
8       c_j^(t) ← c_j^(t−1).
Output: Centroids c^(T) = (c_1^(T), . . . , c_K^(T)).
Contrary to Algorithm 3, Algorithm 4 differs from its i.i.d. counterpart given in Chazal
et al. (2021). First, the theoretically optimal size of the batches is now driven by the decay of
the β-mixing coefficients of the sequence of persistence diagrams, as will be made clear by
Theorem 6 below.
Second, half of the sample is discarded (the B_{t,j}'s with even j). This is due to theoretical
constraints ensuring that the mini-batches that are used are spaced enough to guarantee a
prescribed amount of independence. Of course, the even B_{t,j}'s could be used to compute a
parallel set of centroids. However, in the numerical experiments, all of the sample is used (no
space is left between mini-batches) with no noticeable side effect.
From a computational viewpoint, Algorithm 4 is single-pass, so that, if n_start threads
are run, the global complexity is O(n × n_start).
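A simplified single-thread sketch of this mini-batch scheme (our own illustrative implementation, with diagrams given as (points, weights) pairs; initialization from the first diagram instead of X̄_n for brevity) could be:

```python
import numpy as np

def atol_minibatch_centroids(diagrams, k, q, radius, seed=0):
    """Single-pass mini-batch variant in the spirit of Algorithm 4: step t
    uses mini-batch B_{t,1} to measure cell masses and B_{t,3} for the
    update, leaving gaps (B_{t,2}, B_{t,4}) so that the batches used are
    spaced in time; updates are projected onto the ball B(0, radius)."""
    rng = np.random.default_rng(seed)
    first_pts = diagrams[0][0]
    c = first_pts[rng.choice(len(first_pts), size=k, replace=False)].astype(float)

    def pooled(batch):  # pool the atoms of a mini-batch of discrete measures
        pts = np.concatenate([p for p, _ in batch])
        ws = np.concatenate([w for _, w in batch])
        return pts, ws

    n_steps = len(diagrams) // (4 * q)   # T = n / (4q)
    for t in range(n_steps):
        base = 4 * q * t
        p1, w1 = pooled(diagrams[base : base + q])              # B_{t,1}
        p3, w3 = pooled(diagrams[base + 2 * q : base + 3 * q])  # B_{t,3}
        cell1 = ((p1[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
        cell3 = ((p3[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            mass = w1[cell1 == j].sum()
            if mass > 0:
                num = (w3[cell3 == j, None] * p3[cell3 == j]).sum(axis=0)
                cj = num / mass
                nrm = np.linalg.norm(cj)
                if nrm > radius:           # projection onto B(0, R)
                    cj = cj * (radius / nrm)
                c[j] = cj
            # else: keep the previous centroid c_j^{(t-1)}
    return c
```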
Figure 4 below depicts an instance of centroids computed by these algorithms.
Figure 4: Left: three representative topological descriptors in the form of persistence diagrams with
homology dimensions 0, 1 and 2. Right: sum of topological descriptors and their centroids (stars in
purple, two per dimension) computed from them in dimensions 0, 1 and 2 according to Algorithm 3
or Algorithm 4.
Each centroid c_j^(T) comes with a scale σ_j, defined in (3), that roughly seizes the width of the
area corresponding to it. Other choices of kernel ψ are possible (see e.g. Chazal et al. 2021),
as well as other methods for choosing the scales. The proposed approach has the benefit of
not requiring a careful parameter tuning step, and seems to perform well in practice.
We encapsulate this vectorization method as follows, and an example vectorization is
shown in Figure 5.
Algorithm 5: Vectorization step
Input: Centroids c_1, . . . , c_K
Data: A persistence diagram X
1 for j = 1, . . . , K do
2   Compute σ_j as in (3);
3   v_j ← X(du) ψ_AT(‖u − c_j‖/σ_j)
Output: Vectorization v = (v_1, . . . , v_K).
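Assuming a Laplacian-type kernel ψ(x) = exp(−x) (one common choice; the exact ψ_AT and the scale rule (3) are not reproduced in this excerpt, so the scales are passed as inputs), Algorithm 5 reduces to a few lines:

```python
import numpy as np

def vectorize_diagram(points, weights, centroids, scales):
    """Sketch of Algorithm 5: v_j = X(du) psi(||u - c_j|| / sigma_j), i.e.
    the weighted sum, over the atoms (points, weights) of the diagram, of a
    kernel centered at c_j. Here psi(x) = exp(-x), a Laplacian kernel."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return (weights[:, None] * np.exp(-dists / scales[None, :])).sum(axis=0)
```

A diagram with a single atom of mass 2 placed exactly at c_1 thus gets v_1 = 2, while centroids far from all atoms receive nearly zero mass.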
Figure 5: Left: sum of topological descriptors and their centroids (stars in purple, two per dimension)
computed from them in dimensions 0, 1 and 2 by Algorithm 3 or 4. Right: the derived topological
vectorization of the entire time series computed relative to each center according to Algorithm 5.
In the case where the base sample can be corrupted, robust strategies for mean and
covariance estimation such as Rousseeuw and Driessen (1999); Hubert et al. (2018) may be
employed. More precisely, for a fraction parameter h ∈ [0, 1], we use the Minimum Covariance
Determinant estimator (MCD) defined by

Î ∈ argmin_{I⊂{1,...,n}, |I|=⌈nh⌉} Det( (1/|I|) Σ_{i∈I} (v_i − v̄_I)(v_i − v̄_I)^T ),

μ̂ = v̄_Î,

Σ̂ = c_0 (1/|Î|) Σ_{i∈Î} (v_i − μ̂)(v_i − μ̂)^T, (5)
where v̄_I denotes the empirical mean on the subset I, and c_0 is a normalization constant that can
be found in Hubert et al. (2018). In the anomaly detection setting, assuming that at least
half of the sample points are not anomalies leads to the conservative choice h = 1/2. In practice,
for K-dimensional vectorizations v_1, . . . , v_n, we adopt the default value h = (n + K + 1)/2n
prescribed in Rousseeuw and Driessen (1999); Lopuhaa and Rousseeuw (1991), which maximizes
the finite-sample breakdown point of the resulting covariance estimator. In all the experiments
exposed in Section 4, we use the approximation of (5) provided in Rousseeuw and Driessen
(1999).
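The C-step iteration at the heart of the FAST-MCD approximation of (5) can be sketched as follows (a bare-bones illustration that omits the multiple random restarts and the consistency factor c_0 of the actual algorithm):

```python
import numpy as np

def fast_mcd_cstep(v, h, n_steps=20, seed=0):
    """Sketch of the concentration step behind FAST-MCD (Rousseeuw and
    Driessen, 1999): starting from a random h-subset, alternately
    re-estimate (mean, covariance) on the subset and re-select the h
    points with smallest Mahalanobis distance, until the subset is stable."""
    rng = np.random.default_rng(seed)
    n, K = v.shape
    idx = rng.choice(n, size=h, replace=False)
    for _ in range(n_steps):
        mu = v[idx].mean(axis=0)
        cov = np.cov(v[idx], rowvar=False)
        inv = np.linalg.inv(cov + 1e-9 * np.eye(K))  # small ridge for safety
        d2 = np.einsum("ij,jk,ik->i", v - mu, inv, v - mu)
        new_idx = np.argsort(d2)[:h]                 # h most central points
        if np.array_equal(np.sort(new_idx), np.sort(idx)):
            break                                    # subset is stable
        idx = new_idx
    return mu, cov
```

On data containing a small fraction of gross outliers, the selected subset quickly concentrates on the inliers, so the returned mean and covariance are essentially unaffected by the corruption.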
Now, for a new vector v, a detection score is built via

s(v) = ( (v − μ̂)^T Σ̂^{-1} (v − μ̂) )^{1/2}, (6)

that expresses the normalized distance to the mean behavior of the base regime. We refer to
an illustrative example in Figure 6.
Figure 6: Left: the derived topological vectorization of the entire time series computed relative
to each center according to Algorithm 5. Right: (top, in blue) the binary anomalous timestamps
y of the original signal, matched by (bottom, in orange) the topological anomaly score based on the
dimension 0 and 1 features of Algorithm 6.
If we let ŝ denote the score function based on the data (Y_t)_{t∈[0,L]}, then anomaly detection
tests of the form

T_α(v) = 1_{ŝ(v) > t_α}

may be built. To assess the relevance of this family of tests, the ROC AUC and RANGE PR AUC
metrics are used in the application Section 4.
Should a test with a specific type I error α be needed, a calibration of t_α as the 1 − α
quantile of scores on a sample from the normal regime could be performed. Section 3.2
theoretically grounds this strategy.
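Such a calibration is essentially a one-liner on a held-out sample of scores from the normal regime (a sketch with hypothetical names):

```python
import numpy as np

def calibrate_threshold(normal_scores, alpha):
    """Calibrate t_alpha as the empirical (1 - alpha)-quantile of scores
    computed on a sample from the normal regime, so that the test
    1_{s(v) > t_alpha} has approximate type I error alpha."""
    return float(np.quantile(normal_scores, 1.0 - alpha))
```

For instance, with 100 normal-regime scores and α = 0.05, about 5% of the calibration scores exceed the returned threshold.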
2.4 Summary
We can now summarize the whole procedure into the following algorithm, with complementary
descriptive scheme in Figure 1.
Algorithm 6: TADA: Detection score from base regime time series
Input:
- TDA parameter: an integer p, the maximal homology order;
Data: A multivariate time series (Y_t)_{t∈[0,L]} with values in R^D (base regime, possibly corrupted)
1 Convert (Y_t)_{t∈[0,L]} into n persistence diagrams (X_i)_{i=1,...,n} via Algorithm 1;
2 Convert (X_i)_{i=1,...,n} into v_1, . . . , v_n using Algorithm 2;
3 Compute an anomaly score function s : R^K → R_+ defined by (6) from the vector-valued sequence.
Output: An anomaly score function s : R^K → R_+.
As such, Algorithm 6 produces an anomaly score for all persistence diagrams (X_i)_{i=1,...,n}.
For practical uses, one often needs anomaly scores on the initial timestamps t ∈ [0, L].
In this case a standard time-sequence remapping is performed; we defer to Section 4 for
explicit details.
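One natural remapping (our illustrative convention; the exact rule is deferred to Section 4) assigns to each timestamp the maximum score among the windows covering it:

```python
import numpy as np

def remap_scores(window_scores, L, delta, stride):
    """Map per-window anomaly scores back to the original timestamps:
    each timestamp t in [0, L) receives the maximum score among the
    windows [i*stride, i*stride + delta) that contain it."""
    ts = np.full(L, -np.inf)
    for i, s in enumerate(window_scores):
        start = i * stride
        ts[start : start + delta] = np.maximum(ts[start : start + delta], s)
    return ts
```

Taking the maximum is a conservative choice: a timestamp is flagged as soon as any window containing it is; averaging over covering windows is an equally simple alternative.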
We summarize here the impact and default choices of the parameters:
• Time series parameters: the window size ∆ must be chosen according to the length of the
anomalies the user wants to detect. To be conservative, the stride s should be chosen as
small as computation time allows, and we use the heuristic default s = ∆/10, although in
theory any reasonable choice of s yields the same result.
• p, the maximal homology order: this depends on the complexity of the objects analyzed, as each
homology order encodes topological features of the corresponding dimension (connected
components for homology order 0, cycles and holes for homology order 1, 2-dimensional
voids for homology order 2, etc.). The computational cost of persistence and the difficulty of
interpreting persistence diagrams increase with the homology order. As a consequence, in
most practical applications the maximal homology order is set to 1 or 2. In our time
series application case, we restrict to p = 1 as the produced dependence structures generally
do not exhibit relevant higher-dimensional features.
• Vectorization parameters: the dimension K can be chosen small, as one of the virtues
of this algorithm is that it is very efficient even at very small scales. The default value
K = 5 per homology dimension yields good results in practice. Since Steps 2 and 3 are
computationally fast, the user may try several increasing values of K and
stop when satisfied. The number of threads n_start corresponds to the number of parallel
runs of Algorithm 3 or 4, based on different initializations. The default value n_start = 10
performs well in practice. The choice of T is relevant only when Algorithm 3 is used, in
which case the default value T = 2 log(n) works well in theory and practice. The choice of
q is relevant only when Algorithm 4 is used. In this case several increasing values of q
may be tried, or q may be chosen based on prior knowledge of the mixing coefficients of
the sequence of persistence diagrams (see Section 3.1 for details).
Overall, only two parameters play an important role in our procedure: the window
size ∆ that corresponds to the resolution of the anomalies to be detected, and K the
size of the vectorizations of persistence diagrams. Importantly, no prior knowledge on the
temporal dependence structure (the mixing coefficients for instance) of the original time
series (Yt )t∈[0,L] is needed. All the experiments described in Section 4 use the default values.
Our proposed anomaly detection procedure, Algorithm 6, has a lean design for the following
reasons. First, it has few parameters, all of them coming with default values, with the exception
of the time series resolution ∆, which must be adjusted by reflecting on the type of anomalies
one is looking for. Second, very little tuning is needed. In the entire application sections
to come, the only parameter to change will be that resolution parameter ∆, a parameter
shared with other methods. All other parameters are set to default values. Third, once
trained, TADA does not require a lot of memory: only the results of Algorithms 5
(centroids) and 6 (training vectorization mean and variance) are needed in order to produce
topological anomaly scores. This implies that our methodology is easy to deploy, and requires
no memory of training data, which is often welcome in contexts involving privacy for instance. It
also means that the methodology compares very favorably to memory-heavy methods such
as tree-based methods, neural networks, etc.
3. Theoretical results
In this section we assess the relevance of our methodology from a theoretical point
of view. Sections 3.1 and 3.2 give results in the general case where the sample is a stationary
sequence of random measures. Section 3.3 provides some details on how persistence diagrams
built from multivariate time series, as exposed in Section 2.1.1, can be cast into this general
framework.
Definition 2 For R, M > 0 and N_max ∈ N*, we let M_{N_max}(R, M) denote the set of discrete
measures µ on R^d that satisfy
1. Supp(µ) ⊂ B(0, R),
2. µ(R^d) ≤ M,
3. |Supp(µ)| ≤ N_max.
Accordingly, we let M(R, M) denote the set of measures such that 1 and 2 hold.
where E(X) is the so-called mean measure E(X) : A ∈ B(R²) ↦ E(X(A)). Note here that
X ∈ M(R, M) ensures that C_opt is non-empty (see, e.g., Chazal et al. 2021, Section 3). For
the aforementioned result to hold, a structural condition on E(X) is also needed.
For a vector of centroids c = (c_1, . . . , c_k) ∈ B(0, R)^k, we let W_j(c) denote the j-th Voronoi cell,
so that (W_1(c), . . . , W_k(c)) forms a partition of R² and N(c) represents the skeleton of the
Voronoi diagram associated with c. The margin condition below requires that the mass of
E(X) around N(c*) is controlled, for every possible optimal c* ∈ C_opt. To this aim, let us
denote by B(A, t) the t-neighborhood of A, that is {y ∈ R^d | d(y, A) ≤ t}, for any A ⊂ R^d
and t > 0. The margin condition then writes as follows.
Definition 3 E(X) ∈ M(R, M ) satisfies a margin condition with radius r0 > 0 if and only
if, for all 0 6 t 6 r0 ,
Bpmin
sup E(X) (B(N (c∗ ), t)) 6 t,
c∗ ∈C opt 128R2
According to (Chazal et al., 2021, Proposition 7), B and p_min are positive quantities whenever
E(X) ∈ M(R, M). In a nutshell, a margin condition ensures that the mean distribution
E(X) is well-concentrated around k poles. For instance, finitely-supported distributions
satisfy a margin condition. To the best of our knowledge, margin-like conditions are always required
to guarantee convergence of Lloyd-type algorithms Tang and Monteleoni (2016); Levrard
(2018) in the i.i.d. case.
For our motivating example of persistence diagrams built from a sequence of correlation
matrices between the channels of a time series, we can no longer assume independence between
observations. To adapt the arguments of Chazal et al. (2021) to this framework, a quantification
of the dependence between discrete measures is needed. We choose here to seize the dependence
between observations via β-mixing coefficients, whose definition is recalled below.
Definition 4 For t ∈ Z we denote by σ(−∞, t) (resp. σ(t, +∞)) the sigma-fields generated
by . . . , Xt−1, Xt (resp. Xt, Xt+1, . . .). The β-mixing coefficient of order q is then defined
by
β(q) = sup_{t ∈ Z} E[ sup_{B ∈ σ(t+q, +∞)} |P(B | σ(−∞, t)) − P(B)| ].
Recalling that the sequence of persistence diagrams is assumed to be stationary, its β-mixing
coefficient of order q may subsequently be written as
β(q) = E[ dTV( P_{(Xs)s≥q | σ(−∞,0)} , P_{(Xs)s≥q} ) ],
where dTV denotes the total variation distance and PZ denotes the distribution of Z, for
a generic random variable Z. As detailed in Section 3.3, mixing coefficients of persistence
diagrams built from a multivariate time series may be bounded in terms of mixing coefficients
of the base time series. Whenever these coefficients are controlled, results from the i.i.d.
case may be adjusted to the dependent one.
We begin with an adaptation of (Chazal et al., 2021, Theorem 9) to the dependent case.
Theorem 5 Assume that E(X) satisfies a margin condition with radius r0 > 0, and let c(T)
denote the output of Algorithm 3.
Chazal, Levrard and Royer
If q is such that β(q)²/q³ ≤ n^−3 and c(0) ∈ B(Copt, R0), then, for n large enough, with
probability larger than 1 − c kM²q/(n κ0² pmin²) − 2e^−x, we have
inf_{c∗ ∈ Copt} ‖c(T) − c∗‖² ≤ (B² r0²/(512 R²)) (q/n) + C (M² R² k² d log(k)/pmin²) (q/n) (1 + x),
for all x > 0, where C is a constant.
Moreover, if q is such that β(q)/q ≤ n^−1 and c(0) ∈ B(c∗, R0), it holds
E[ inf_{c∗ ∈ Copt} ‖c(T) − c∗‖² ] ≤ C (dk² R² M² log(k)/(κ0² pmin²)) (q/n).
Intuitively speaking, Theorem 5 provides the same guarantees as in the i.i.d. case,
but for a ’useful’ sample size n/q. This ’useful’ sample size corresponds to the number of
sample measures that are spaced enough (in fact q-spaced) so that they may be considered
independent enough (with respect to the targeted convergence rate in q/n). This point of
view seems ubiquitous in machine learning results based on dependent samples (see, e.g.,
Agarwal and Duchi 2013, Theorem 1, or Mohri and Rostamizadeh 2010, Lemma 7).
Assessing the optimality of the requirements on β(q) is difficult. Following (Mohri
and Rostamizadeh, 2010, Corollary 20) and the comments below it, the condition β(q) ≤ q/n
that we require to obtain a convergence rate in expectation seems optimal for polynomial decays
(β(q) = O(q^−a), a > 0) in an empirical risk minimization framework. However, this choice
leads to a convergence rate in (q/n)^((a−1)/(4a)) for Mohri and Rostamizadeh (2010), slower
than our (q/n) rate. Though the output of Algorithm 3 is not an empirical risk minimizer,
it is likely to achieve the same convergence rate as if it were (based on a similar behavior
in the plain k-means case, see, e.g., Levrard 2018). The difference between the convergence
rates of Mohri and Rostamizadeh (2010) and Theorem 5 might be due to the fact that
Mohri and Rostamizadeh (2010) work in a ’slow rate’ framework, where the convexity of
the excess risk function is not leveraged, whereas a local convexity result is a key argument
in our proof (made explicit in Chazal et al. 2021, Lemma 21).
In a fast rate setting (i.e., when the risk function is strictly convex), (Agarwal and
Duchi, 2013, Theorem 5) also suggests that the milder requirement β(q)/q ≤ n^−1 might
be enough to obtain a O(q/n) convergence rate in expectation, for online algorithms under
some assumptions that will be discussed below Theorem 6 (convergence rates for an online
version of Algorithm 3). To the best of our knowledge, there is no lower bound in the case of
stationary sequences with controlled β coefficients that could back the theoretical optimality of
such procedures.
Finally, the sub-exponential rate we obtain in the deviation bound under the stronger condition
β(q)²/q³ ≤ n^−3 seems better, in terms of large deviations, than the results proposed in (Mohri
and Rostamizadeh, 2010, Corollary 20) or (Agarwal and Duchi, 2013, Theorem 5) (here a
deviation term in (q/n)x yields an exponential decay). Determining whether the same kind of
result may hold under the condition β(q) ≤ q/n remains, as far as we know, an open question.
Nonetheless, Theorem 5 provides some convergence rates (in expectation) for several
decay scenarios on β(q):
• if β(q) ≤ Cρ^q, for ρ < 1, then an optimal choice of q is q = c log(n), providing the same
convergence rate as in the i.i.d. case (Chazal et al., 2021, Theorem 9), up to a log(n)
factor.
• if β(q) = Cq^−a, for a > 0, then an optimal choice of q is q = Cn^(1/(a+1)), which yields a
convergence rate in n^(−1+1/(a+1)) = n^(−a/(a+1)).
In the last case, letting a → +∞ allows to retrieve the i.i.d. case, whereas the limit a → 0
corresponds to the framework where only one sample is observed (thus leading to a non-learning
situation).
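The polynomial-decay choice of q above follows from balancing the mixing requirement against the targeted rate; absorbing constants, the computation reads:

```latex
% Balancing \beta(q) \le q/n for the polynomial decay \beta(q) = C q^{-a}:
C q^{-a} \asymp \frac{q}{n}
\quad\Longleftrightarrow\quad
q^{a+1} \asymp n
\quad\Longleftrightarrow\quad
q \asymp n^{\frac{1}{a+1}},
% so that the resulting convergence rate is
\frac{q}{n} \asymp n^{\frac{1}{a+1}-1} = n^{-\frac{a}{a+1}}.
```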
Whatever the situation, a benefit of Algorithm 3 is that a correct choice of q, and thus
prior knowledge of β(q), is not required to obtain an at least consistent set of centroids: it
suffices to choose T = ⌈log(n)/ log(4/3)⌉. This will not be the case for the convergence of
Algorithm 4, where the size q of the minibatches is driven by prior knowledge of β.
Theorem 6 Let q be large enough so that β(q/18)/q² ≤ n^−2 and q ≥ c0 (k²M²/(pmin² κ0²)) log(n), for
a constant c0 that only depends on ∫0^1 β^−1(u) du. Provided that E(X) satisfies a margin
condition, if the initialization satisfies the same requirements as in Theorem 5, then the
output of Algorithm 4 satisfies
E[ inf_{c∗ ∈ Copt} ‖c(T) − c∗‖² ] ≤ 128 kdM R²/(pmin (n/q)).
As in Rio (1993), the generalized inverse β^−1 is defined by β^−1(u) = |{k ∈ N∗ | β(k) > u}|. In
particular, for β(q) ∼ q^−a, ∫0^1 β^−1(u) du is finite only if a > 1 (which precludes the asymptotic
a → 0).
The requirement β(q)/q² = O(n^−2) is stronger than in Theorem 5, and thus stronger than
the β(q)/q = O(n^−1) suggested by (Agarwal and Duchi, 2013, Theorem 5) in a similar online
setting. Note however that for (Agarwal and Duchi, 2013, Theorem 5) to provide a O(q/n)
rate under the requirement β(q)/q = O(n^−1), two other terms have to be controlled:
1. a total step-size term Σ_{t=1}^T ‖c(t) − c(t−1)‖ that must be of order O(1). Controlling
this term would require a slight adaptation of Algorithm 4, for instance by clipping
gradients.
2. a regret term E[ Σ_{t=1}^T ∫ (d²(u, c(t)) − d²(u, c∗)) X̄_{Bt}(du) ] that must be of order O(q).
The behavior of this term remains unknown in our setting, so that determining whether
the milder condition β(q)/q = O(n−1 ) is sufficient remains an open question.
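The gradient-clipping adaptation evoked in point 1 above can be sketched as follows; this is a hypothetical modification of the update step (names and signature are ours, not part of Algorithm 4), which rescales each minibatch move so that the total step-size term stays bounded.

```python
import numpy as np

def clipped_update(c_prev, gradient, step_size, clip_norm):
    """One hypothetical minibatch centroid update with clipped displacement:
    the move step_size * gradient is rescaled so its norm is at most clip_norm,
    which keeps the total step-size term sum_t ||c(t) - c(t-1)|| under control."""
    move = step_size * np.asarray(gradient, dtype=float)
    norm = np.linalg.norm(move)
    if norm > clip_norm:
        move = move * (clip_norm / norm)
    return np.asarray(c_prev, dtype=float) - move
```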
Let us emphasize here that, to optimize the bound in Theorem 6, that is to choose
the smallest possible q, prior knowledge of β(q) is required. This can be the case when
the original multivariate time series Yt follows a recursive equation as in Bourakna et al.
(2022). Otherwise, these coefficients may be estimated, using histograms as in McDonald
et al. (2015) for instance.
As in the batch case, the required lower bound on q corresponds to the ’optimal’ choice
of minibatch spacings so that consecutive even minibatches may be considered as i.i.d. It is
then not a surprise that we recover the same rate as in the i.i.d. case, but with n/q samples
(see Chazal et al. 2021, Theorem 10). As in the batch situation, several decay scenarios may
be considered:
• for β(q) ≤ Cρ^q, ρ < 1, choosing q = c0 (k²M²/(pmin² κ0²)) log(n) for a large enough c0 is enough to
satisfy the requirements of Theorem 6, and yields the same result as in the i.i.d. case
(Chazal et al., 2021, Theorem 10).
• for β(q) = Cq^−a, β^−1(u) = (C/u)^(1/a). An optimal choice for q is then Cn^(2/(a+2)), leading to a
convergence rate in n^(−1+2/(a+2)) = n^(−a/(a+2)).
Let us mention here that the stronger condition in Theorem 6 leads to a slower convergence
bound in the polynomial decay case, compared to the output of Algorithm 3. Again, to the
best of our knowledge, assessing the optimality of the exposed convergence rates remains an
open question.
Tα : v ↦ 1_{s(v)>tn,α},
where s is the score function built in Section 2.3 and tn,α will be built from the sample to achieve
a type I error rate below α.
To keep it simple, we assume that Σ and µ in (4) are computed from a separate sample,
so that we observe
for a suitable δ < α. In what follows we denote by t̂ such an empirical choice of threshold. The
following result ensures that this natural strategy remains valid in a dependent framework.
In other words, Proposition 7 ensures that the anomaly detection test 1_{‖ṽ‖>t̂} has a type I
error below α, with high probability. Roughly, this bound ensures that, for confidence levels
α above the statistical uncertainty of order q/n, tests with the prescribed confidence level
may be achieved by increasing the threshold by a term of order √(αq/n).
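As an illustration of this thresholding strategy, a quantile-based empirical choice of the threshold on a normal-regime sample could look as follows; this is a sketch under our own naming conventions, not the paper's exact construction of t̂.

```python
import numpy as np

def empirical_threshold(scores, alpha, delta):
    """Illustrative empirical threshold: the (1 - (alpha - delta))-quantile of
    scores observed on a sample from the normal regime, so that a fraction of
    at most (alpha - delta) of normal scores exceeds it (sketch, not the
    paper's construction)."""
    level = 1.0 - (alpha - delta)
    return float(np.quantile(np.asarray(scores, dtype=float), level))
```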
As for Theorems 5 and 6, choosing the smallest q that achieves β(q)²/q³ ≤ α/n³ optimizes
the probability bound in Proposition 7:
• for β(q) ≤ Cρ^q, q of order C′ log(n) is enough to satisfy β(q)/q² ≤ n^−2, providing the
same results as in the i.i.d. case, up to a log(n) factor.
• for β(q) = Cq^−a, an optimal choice for q is of order n^(3/(2a+3)) α^(−1/(2a+3)), which leads to the
same bounds as in the i.i.d. case, but with useful sample size n/q = n^(2a/(2a+3)) α^(1/(2a+3)).
Although this bound might be sub-optimal, it can provide some pessimistic prescriptions
to select a threshold α − δ, provided that the useful sample size n/q is known. For instance,
assuming log(n) ≤ 6, for α = 5% the minimal δ is of order 2.7/√(n/q), which is negligible with
respect to α whenever n/q is large compared to roughly 3000.
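The figure of roughly 3000 follows from direct arithmetic, which the following snippet makes explicit:

```python
# Arithmetic behind the "roughly 3000" prescription: with alpha = 5% and
# delta of order 2.7 / sqrt(n/q), delta matches alpha exactly when
# n/q = (2.7 / 0.05)**2, and becomes negligible for much larger n/q.
alpha = 0.05
critical_useful_size = (2.7 / alpha) ** 2   # 54**2 = 2916, roughly 3000
delta_at_3000 = 2.7 / (3000 ** 0.5)         # about 0.049, comparable to alpha
```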
bounded by \binom{D}{d+1} × ‖ω‖∞. We deduce that Xt ∈ MNmax(R, M) (Definition 2), with R ≤ 4,
Nmax ≤ \binom{D}{d+1}, and M ≤ \binom{D}{d+1} × ‖ω‖∞. Note that in the experiments, we set ω ≡ 1.
Mixing coefficients: Here we detail how the mixing coefficients of (Xt)t=1,...,n (Definition
4) may be bounded in terms of those of (Yt)t∈[0;L]. Let us denote these coefficients by β
and β̃, respectively. If the stride s is larger than the window size ∆, then it is immediate that,
for all q ≥ 1, β(q) ≤ β̃(qs − ∆). If the stride s is smaller than or equal to ∆, then, denoting
q0 = ⌊∆/s⌋ + 1, we have, for q < q0, β(q) ≤ 1 (overlapping windows), and, for q ≥ q0,
β(q) ≤ β̃(qs − ∆). The mixing coefficients of Xt may thus be controlled in terms of those
of Yt. For fixed ∆ and s, this ensures that the mixing coefficients of Xt and Yt have the same
profile (and lead to the same convergence rates in Theorems 5 and 6).
• If β̃(q) ≤ CY q^−a, for CY, a > 0, then, for any q ≥ q0, β(q) ≤ CY (s − ∆/q0)^−a q^−a, so
that β(q) ≤ CX (qs)^−a, for some constant CX (depending on q0 and a).
• If β̃(q) ≤ CY ρ̃^q, for CY > 0 and ρ̃ < 1, then, for any q ≥ q0, β(q) ≤ CY (ρ̃^(s−∆/q0))^q, so
that β(q) ≤ CX ρ^(qs), for some CX > 0 and ρ < 1 depending on q0 and ρ̃.
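The case distinction above translates directly into an upper bound on the profile of β; a minimal sketch, assuming the profile β̃ of the base series is available as a function (names are illustrative):

```python
import math

def window_beta_bound(q, beta_tilde, stride, window):
    """Upper bound on the mixing coefficient beta(q) of the window process X_t
    in terms of the coefficient beta_tilde of the base series Y_t: overlapping
    windows (q < q0) only admit the trivial bound 1, while q-spaced windows
    are (q * stride - window)-spaced in the base series."""
    if stride > window:
        return beta_tilde(q * stride - window)
    q0 = math.floor(window / stride) + 1
    if q < q0:
        return 1.0   # windows may overlap: no non-trivial bound
    return beta_tilde(q * stride - window)
```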
In turn, mixing coefficients of Yt may be known or bounded, for instance in the case where
it follows a recursive equation (see, e.g., Pham and Tran 1985, Theorem 3.1), or inferred
(see, e.g., McDonald et al. 2015). Interestingly, the topological wheels example provided in
Section 4.1 (borrowed from Bourakna et al. 2022) falls into the sub-exponential decay case.
This relation between the mixing coefficients of (Yt)t∈[0;L] and those of (Xt)t=1,...,n allows
us to shed some light on the influence of the stride parameter s on the convergence results.
Assume for simplicity that the original time series is discrete with ñ points and that ∆ is fixed.
In the two examples above it holds, for q ≥ ⌊∆⌋ + 1, β(q) ≤ C β̃(qs), where C depends on ∆
and the parameters of β̃ only. For a choice of stride s, the resulting sample size is n = ñ/s,
so that an optimal choice of q ≥ q0 with respect to Theorem 5 should satisfy
β(q)/q = 1/n ⇔ C β̃(qs)/(qs) = 1/ñ,
to provide a convergence rate in q/n = (qs)/ñ. Let q̃n be such that β̃(q̃n ) = q̃n /ñ (optimal
choice of q w.r.t. β̃). Then, provided s is not too large so that q̃n /s ≥ q0 , an optimal choice
of q w.r.t. the above bounds on β is qn = q̃n /s, leading to a convergence rate in q̃n /ñ,
whatever the chosen s. This backs the intuition that any reasonable choice of stride s (not
too large) should lead to the same theoretical guarantees.
Margin condition: The only point that cannot be theoretically assessed in general for
the outputs of Algorithm 1 is whether E(X) satisfies the margin condition exposed in
Definition 3. As explained below Definition 3, a margin condition holds whenever E(X)
is concentrated enough around k poles. Thus, structural assumptions on 1 − Corr(Y[0;∆] )
(for instance k prominent loops) might entail E(X) to fulfill the desired assumptions (as in
Levrard 2015 for Gaussian mixtures). However, we strongly believe that the requirements of
Definition 3 are stronger than needed, and that convergence of Algorithms 3 and 4 may be
established under milder smoothness assumptions on E(X). This falls beyond the scope of this
paper and is left for future work. The experimental Section 4 below assesses the validity of our
algorithms in practice.
4. Applications
In order to make the case for the efficiency of our proposed anomaly detection procedure
TADA, we now present an assortment of both real-case and synthetic applications. The
first application (introduced as the Topological Wheels problem) is directly derived from
Bourakna et al. (2022). It consists of a synthetic data set designed to mimic dependence
patterns in brain signals, and allows us to demonstrate the relevance of a topology-based
anomaly detection procedure on such complex data. The second application is an up-to-date
replication of a benchmark with the TimeEval library from Schmidl et al. (2022) on a large
array of synthetic data sets to quantitatively demonstrate competitiveness of the proposed
procedure with current state-of-the-art methods. The third application is a real-case data set
from Jacob et al. (2021) consisting of data traces from repeated executions of large-scale
stream processing jobs on an Apache Spark cluster. Lastly we produce interpretability
elements for the anomaly detection procedure TADA.
Evaluation of an anomaly detection procedure in the context of time series data has
many pitfalls and can be hard to navigate; we refer to the survey of Sørbø and Ruocco
(2023). Here we mainly evaluate anomaly scores with the robustified version of the Area
Under the PR-Curve: the Range PR AUC metric of Paparrizos et al. (2022a) (later just
’RANGE PR AUC’), where a metric of 1 indicates a perfect anomaly score, and a metric
close to 0 indicates that the anomaly score simply does not point to the anomalies in
the data set. For the sake of comparison with the literature we also include the Area
Under the ROC-Curve metric (later just ’ROC AUC’), although this metric is less accurate
and powerful in the unbalanced context of anomaly detection (Sørbø and Ruocco 2023).
Each collection of anomaly detection problems therefore yields evaluation statistics, and
to summarize comparisons between algorithms we use a critical difference diagram, that is,
a statistical test between paired populations, computed with the package of Herbold (2020). Furthermore
we introduce two other statistical summaries of interest:
• the ’# > .9’ metric, that is the number of anomaly detection problems for which an
algorithm has a RANGE PR AUC over .9. A RANGE PR AUC over .9 roughly indicates
that the algorithm finds the anomalies well, or ’solves’ the problem;
• the ’#rank1’ metric, that is the number of problems for which an algorithm reaches the
best RANGE PR AUC score over other algorithms, ties being shared.
For the purpose of comparison with the state-of-the-art we draw methods from the recent
benchmark of Schmidl et al. (2022). We take the three best performing methods from the
unsupervised, multivariate category: the ’KMeansAD’ anomaly detection based on k-means
cluster centers distance using ideas from Yairi et al. (2001), the baseline density estimator
k-nearest-neighbors algorithm on time series subsequences ’SubKNN’, and ’TorskAD’ from
Heim and Avery (2019), a modified echo state network for anomaly detection.
Furthermore, in order to better understand the value of the introduced topological
methodology, we compare with a closely related method that couples the topological features
of Algorithm 5 to the isolation forest algorithm of Liu et al. (2008), resulting in an
unsupervised anomaly detection method denoted ’Atol-IF’ in reference to the Royer
et al. (2021) paper. We also couple those topological features to a random forest classifier
(Breiman, 2001), resulting in a supervised anomaly detection method denoted
’Atol-RF’, which gives an idea of what can be achieved in the supervised context. Lastly, to
investigate the differences between the proposed topological analysis and a more standard
spectral analysis, we compute spectral features on the correlation graphs coupled to either
an isolation forest or a random forest classifier, yielding an unsupervised anomaly detection
method denoted ’Spectral-IF’ and a supervised one denoted ’Spectral-RF’.
In practice all those methods involve a form of time-delay embedding, subsequence
analysis, or context window analysis (we use these terms synonymously in this work), which
requires computing a prediction from a window of ∆ past observations. ∆ is
a key value that acts as the equivalent of image resolution or scale in the domain of time
series. In a subsequence analysis, given a ∆-uplet of timestamps [t] := (t1, t2, ..., t∆),
once an anomaly score s[t] is produced it is related to that particular ∆-uplet but does not
refer to a specific time step. A window reversing step is needed to map the scores back to the
original timestamps. For fair comparison, we provide all methods with the same
last-step window reversing procedure: for every time step t, one computes the sum over
windows containing this time step, ŝt := Σ_{[t′] : t∈[t′]} s[t′]. Here we choose not to use the more
classical average s̃t := Σ_{[t′] : t∈[t′]} s[t′] / Σ_{[t′] : t∈[t′]} 1, since this average produces undesirable
border effects (the timestamps at the beginning and end of the signal are contained in fewer
windows, making them over-meaningful after averaging). Using the sum instead has no
effect on anomaly scoring (outside of borders) as the metrics are scale-invariant.
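The sum-based window reversing step can be sketched as follows, for windows taken with a given stride (the function name and signature are ours):

```python
import numpy as np

def reverse_windows(window_scores, n_timestamps, window, stride=1):
    """Map per-window anomaly scores back to per-timestamp scores by summing:
    s_hat[t] is the sum of the scores of all windows containing timestamp t
    (no averaging, to avoid the border effects described above)."""
    s_hat = np.zeros(n_timestamps)
    for i, score in enumerate(window_scores):
        start = i * stride
        s_hat[start:start + window] += score
    return s_hat
```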
For the specific use of TADA in this section, the centroid computation of Section 2.2.1
is performed with ω(b,d) = 1, using the batch version described in Algorithm 3.
Our implementation relies on The GUDHI Project (2015) for the topological data analysis
part, but also makes use of the Pedregosa et al. (2011) Scikit-learn library for the anomaly
detection part, minimum covariance determinant estimation and overall pipelining. The
code is published as part of the ConfianceAI program https://catalog.confiance.ai/
and can be found in the catalog: https://catalog.confiance.ai/records/4fx8n-6t612.
All computations were made on a standard laptop (i5-7440HQ 2.80 GHz CPU).
two pairs. The first mode of dependence is the prominent mode for the time series duration,
and is replaced for a short period at a random time by the second mode of dependence. The
total signal involves 64 time series sampled at 500 Hz for a duration of 20 seconds, see Figure
7. We produce ten such data sets and call them the Topological Wheels problem. The task
consists in detecting the change in underlying patterns without supervision. We note
that by design the two modes are similar in their spectral profile, so detecting anomalies
should be hard for methods that do not capture the overall topology of the dependence
structure. The data sets are available through the public The GUDHI Project (2015) library
at github.com/GUDHI/gudhi-data.
Figure 7: Left: Synthesized time series and the latent generating process (orange) indicating normal
connection (double circular wheel with middle connection, on top) or abnormal connection (double
circular wheel with two connections, on the bottom). Right: Anomaly scores of all tested methods
on one of the data sets, with their RANGE PR AUC metric against the ground truth (bottom
row) in parentheses.
the intuition that methods that do not rely on topology (that is, the spectral method, the
k-nearest-neighbor method and the modified echo state network method) all fail to capture
the anomaly. This is particularly striking for the spectral method, as it was trained with
supervision. On the other hand, all methods based on the topological features manage to
capture some indication that there is an anomaly in the signal. The isolation forest method,
even though it clearly separates the anomalous segment from the rest, is not reliable, as it
seems to indicate other anomalies where there are none. The random forest supervised method
perfectly discriminates the anomalous segment from the rest of the time series, and our method
does so almost as reliably.
Figure 8: Left: aggregated results for the Topological Wheels problem in the form of box plots for
the ROC AUC and RANGE PR AUC metrics, with below them the points the metrics were
computed from (each point represents a metric score comparing an anomaly score to
the underlying truth). Right: autorank summary (see Herbold 2020) ranking of methods on the
Topological Wheels problem, showing competitiveness between our unsupervised TADA method and
the equivalent topological supervised method.
We now look at the aggregated results for the entire problem, see Figure 8 and the timing
results in Table 1. We present both ROC AUC and RANGE PR AUC averages with their standard
deviations over experiments, as well as the computation times for the sake of completeness.
None of the spectral procedure, the echo state network, the subKNN or the k-nearest-neighbor
methods is able to capture any information from the Topological Wheels problem. Using
topological features with an isolation forest yields competitive results, but it is simply inferior
to our procedure. This demonstrates that the key information for this problem lies in the
topological embedding, which is not surprising by design. Our procedure solves this problem
almost perfectly, and although it is unsupervised it is as competitive as a comparable
supervised method. This experiment demonstrates the impact of topology-based methods
for anomaly detection, as the non-topology methods fail to capture any of the signal in the
Table 1: Summary statistics on the Topological Wheels problem for the algorithms evaluated. All
methods could produce scores for the 90 experiments, and unsurprisingly the methods relying on
topological analysis overwhelmingly dominate the other methods in 86 out of 90 experiments. Our
unsupervised method TADA is on par with a supervised learning method for the number of problems
where it has the best PR AUC score (#rank1 column). In seconds, the median computation time
(’med time’) and interquartile range (’iqr time’) are standard with respect to the data sizes; also
note that computations are not optimized, and in fact performed on a single laptop.
data sets. Our proposed TADA method is clearly the best suited for learning anomalies in
this setup.
Figure 9: Left: aggregated results for the TimeEval 1084 GutenTAG synthetic data sets in the
form of box plots for the ROC AUC and RANGE PR AUC metrics, with below them the points the
metrics were computed from (each point represents a metric score comparing an
anomaly score to the underlying truth). Right: autorank summary (see Herbold 2020) ranking of
methods, showing a fourth place ranking for our purely topological TADA method.
Table 2: Summary statistics on the GutenTAG problem set for the algorithms evaluated. Even
though it is ranked fourth by the statistical pairwise ranking in Figure 9, our unsupervised method
TADA is able to solve roughly half of the problems, and is competitive on 231 of them. The other
topological method Atol-IF ranks second and is able to solve 859 of the 1174 problems. In seconds,
the median computation time (’med time’) and interquartile range (’iqr time’) are standard with
respect to the data sizes involved; computations are performed on a single laptop. TorskAD failed
to return scores on a number of data sets due to unknown bugs.
Statistical summaries and results of the experiment on the synthetic data sets are shown
in Figure 9 and in Table 2. As a reminder, the SubKNN, KMeansAD and TorskAD methods
were the top three performing methods in the largest anomaly detection benchmark
to date (see Table 3 of Schmidl et al. 2022). Our TADA procedure manages to solve
roughly half of the problems and is a top contender among competitors for about a quarter
of them. Atol-IF performs better than TADA in this instance, which is not surprising as
Figure 10: Correlations between algorithm RANGE PR AUC scores on the GutenTAG data set
collections. The two topological methods TADA and Atol-IF correlate at .86, and the highest
correlation reaches .87 for Atol-IF and Spectral-IF. This indicates that the algorithms involved each
solve different GutenTAG problems although some overlap naturally occurs.
isolation forest retains much more information from training than TADA, which also implies
heavier memory loads. Overall, SubKNN performs best on those data sets, while TADA
and Atol-IF show good performance; in some instances only the topological methods manage
to solve the problem, see for instance Figure 11. These results demonstrate the
competitiveness of our methodology in the unsupervised anomaly detection context.
Figure 11: Left: zoom in on a GutenTAG problem instance (sensors selected for visualization) with
a variance anomaly on the last two sensors (purple and red), and the ground truth (last row). Right:
problem ground truth (top row) and the corresponding anomaly scores of each method with their
RANGE PR AUC metric in parentheses. The topological methods TADA and Atol-IF get a good
metric score and manage to find the anomalies, while no other method does.
Figure 12: Left: aggregated results for the 35 Exathlon real data sets in the form of box plots for
the ROC AUC and RANGE PR AUC metrics, with below them the points the metrics were
computed from (each point represents a metric score comparing an anomaly score to the
underlying truth). Right: autorank summary (see Herbold 2020) ranking of methods on the real
data sets, showing second and third place rankings for the topological methods Atol-IF and TADA.
methods are strong competitors for these data sets, with TADA most often being the number
one ranked method. Due to the real nature of the data sets, it is not surprising that
the studied methods do not ’solve’ them as well as they solve the GutenTAG data sets or the
Table 3: Summary statistics on the Exathlon real data problems for the algorithms evaluated. Even
though it is ranked third by the statistical ranking in Figure 12, our unsupervised method TADA
most often achieves the top RANGE PR AUC score (#rank1 column) over all problems, which hints
that it is able to solve different sorts of anomaly detection problems than the others. In seconds, the
median computation time (’med time’) and interquartile range (’iqr time’) are high for the topological
methods, see the commentary in the text. Computations are performed on a single laptop.
Topological Wheels data sets. We show in Figure 13 the one instance where TADA is able to
solve the problem completely, and highlight that no calibration was needed to do so.
Figure 13: Left: zoom in on data set 3 2 1000000 71. Right: ground truth (top row) and anomaly
scores for the six methods with their RANGE PR AUC score in parentheses. While all methods are
able to locate the beginning of the anomaly period, only TADA manages to catch it in its entirety.
One drawback of the topological methods appearing here is the high variance in execution
time, which originates from computing topological features on a great number of sensors.
Since our implementation of Algorithm 6 is naive, we point out that there are strategies for
optimizing computation times (e.g., the ripser library, subsampling, or suitable clustering);
those strategies are outside the scope of this paper.
where µ̂, Σ̂ are the estimated mean and covariance of the vectorization v of Algorithm 5
(time indices are implied and omitted for this discussion). These scores can be interpreted as
testing for anomalies with respect to a single embedding dimension, as if the vectorization
had independent components. These center-targeted scores allow one to analyze an
anomaly by looking at the score deviations of each vector component. Because the vector
components are integrated from a learned centroid, the scores can be traced back to a specific
region in R2, see for instance Figure 14.
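A minimal sketch of such center-targeted scores, under the diagonal-covariance reading described above (the exact normalization used for TADA is not reproduced here; names are ours):

```python
import numpy as np

def center_targeted_scores(v, mu_hat, sigma_hat):
    """Per-component anomaly scores of a vectorization v: each coordinate is
    standardized by the estimated mean and the diagonal of the estimated
    covariance, as if the components were independent (illustrative sketch)."""
    v = np.asarray(v, dtype=float)
    scale = np.sqrt(np.diag(np.asarray(sigma_hat, dtype=float)))
    return np.abs(v - np.asarray(mu_hat, dtype=float)) / scale
```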
Figure 14: Left: Overall score (top curve, in blue), scores for s̃i (middle curves, the three top ones
for features corresponding to homology dimension 0, the three bottom ones for features corresponding to
homology dimension 1) and ground truth (bottom curve) on a single Topological Wheels data set.
The last homology dimension 1 score has the strongest correspondence to the overall score, and matches
the underlying truth almost exactly. Right: s̃i scores of an abnormal persistence diagram on this
data set, next to the associated quantization centers (colored stars) of Algorithm 3. The scores are
written in a font size proportional to their value, so that the more abnormal scores appear bigger. In this
instance the quantization centers in dimensions 0 and 1 with the highest persistence react to this
diagram, hinting at a change in the latent data structure, as the highest-persistence diagram points
are usually associated with signal, in comparison with points nearer the diagonal.
This leads to valuable interpretation. For instance, if an abnormal score of TADA is
caused by a large deviation in a component related to a homology dimension 1 center, it is likely
that at that time an abnormal dependence cycle is created for a longer or shorter period of
time than for the rest of the time series, and therefore that the dependence pattern has globally
changed in that period of time. See for instance the illustration on the Topological Wheels
problem in Figure 14, where globally changing the dependence pattern between sensors is
exactly how the abnormal data was produced. In the case where the produced score is
deemed abnormal due to a shift in several components related to dimension 0 centers, this
indicates an anomaly in the connectivity dependence structure that does not affect the
higher-order homological features. In this case, the anomaly could for instance be attributed
to a fault (such as a breakdown) in one of the original sensors.
5. Conclusion
It is common knowledge that no anomaly detection method can help with identifying all kinds
of anomalies. The framework introduced in this paper is relevant for detecting abnormal
topological changes in the global structure of data.
For the motivating example of detecting changes in the dependence patterns between
channels of a multivariate time series, our method turns out to be competitive with
other state-of-the-art approaches on various benchmark data sets. Naturally, there are
many sorts of anomalies that the proposed method is not able to detect at all. For
instance, since the topological embedding is invariant to graph isomorphism, any anomaly
linked to node permutation (a change of node labeling) cannot be caught. The same is
true for homothetic jumps: if all signals were simultaneously multiplied by the same factor,
the correlation-based similarity matrix, and hence the topological embedding, would remain
unchanged. While such invariances are generally thought of as limitations, they
can also be a welcome feature when the considered problem exhibits the same invariance - for
instance when there are labeling uncertainties in the sensors collecting the data.
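The homothety invariance just described can be checked numerically; the snippet below is a minimal sketch on synthetic channels (the scaling factor 3 and the channel count are arbitrary choices, not taken from the benchmarks).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))               # 500 time steps, 4 channels

C = np.corrcoef(X, rowvar=False)                # channel-to-channel correlation
C_scaled = np.corrcoef(3.0 * X, rowvar=False)   # homothetic jump: all signals multiplied by 3

# The correlation matrix, hence any topological embedding built from it,
# is unchanged by the homothety.
assert np.allclose(C, C_scaled)
```

The same computation after a channel permutation would permute the rows and columns of C, which likewise leaves any isomorphism-invariant topological summary unchanged.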
Topological anomaly detection also finds anomalies that other methods do not seem to
discover. It is generally understood that topological information is a form of global information,
complementary to the information gathered by more traditional approaches, e.g.
spectral detectors. While confirming this, the numerical experiments above also suggest that
topological information is commonly present in various real or synthetic data sets. Therefore,
for practical purposes it is probably best to use our method in combination with
other dedicated methods, for instance one that focuses on ’local’ data shifts such as the
SubKNN method.
Focusing on the detection of anomalies in the dependence structure of multivariate
time series, it appears that the only parameter that requires careful tuning in our method
is the window size (or temporal resolution) ∆, as is probably also the case for most existing
procedures (see Section 4). Designing methods to empirically tune this window size, or to
combine the outputs of our method at different resolutions, would be a relevant addition to
our work, and is left for future research.
Finally, let us recall the flexibility of the framework we introduce. First, it is not tied
to detecting changes in correlation structures in time series: we may use Algorithm 1 with other
dissimilarity measures between channels, and more generally the last two steps apply
verbatim whenever a sequence of persistence diagrams can be built from data (for instance
arising from filtrations on meshed shapes, images, graphs, etc.). Second, the vectorization
we propose with Algorithms 3 and 5 does not necessarily take a sequence of persistence
Chazal, Levrard and Royer
diagrams as input: any sequence of measures may be vectorized in the same way. This may
find applications in monitoring sequences of point process realizations, for instance the evolution of
species distributions - see, e.g., Renner et al. (2015). Finally, one may process
the output of the vectorization procedure in other ways than building an anomaly score.
For instance, using these vectorizations as inputs to a neural network, or to change-point
detection procedures such as KCP (Arlot et al. 2019), could provide a dedicated method to
retrieve change points of a global structure.
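As an illustration of this flexibility, a quantization-style vectorization of an arbitrary discrete measure can be sketched in a few lines: component j records the mass of the Voronoi cell of center j. This is a simplified stand-in for Algorithms 3 and 5, not the exact procedure (the centers below are hypothetical, and the actual algorithms also use optimized centers and per-cell statistics):

```python
import numpy as np

def cell_vectorization(points, centers):
    """Vectorize a point cloud (an empirical measure) against fixed centers:
    component j is the fraction of mass in the Voronoi cell of center j."""
    # squared distance from every point to every center
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)                     # Voronoi cell of each point
    return np.bincount(labels, minlength=len(centers)) / len(points)

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])       # hypothetical quantization centers
cloud = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))

v = cell_vectorization(cloud, centers)             # almost all mass falls in the first cell
```

Running any sequence of measures (persistence diagrams, point process realizations) through such a map yields a sequence of fixed-length vectors, on which an anomaly score or a change-point procedure can then be computed.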
6. Acknowledgements
This work has been supported by the French government under the ’France 2030’ program,
as part of the SystemX Technological Research Institute within the Confiance.ai project.
The authors also thank the ANR TopAI chair in Artificial Intelligence (ANR–19–CHIA–0001)
for financial support.
The authors are grateful to Bernard Delyon for valuable comments and suggestions
concerning convergence rates in β-mixing cases.
7. Proofs
Most of the proposed results are adaptations of proofs from the independent case to the
dependent one. Of particular interest are concentration results in this framework; we list
the important ones in the following section.
Then there exists a sequence (X̃_i)_{i≥1} of independent random variables such that, for any
i ≥ 1, X̃_i and X_i have the same distribution and P(X_i ≠ X̃_i) ≤ b_i.
The above lemma allows one to translate standard concentration bounds from the i.i.d. framework
to the dependent case, where dependence is quantified in terms of β-mixing coefficients.
Let us recall here the general definition of β-mixing coefficients from Definition 4. For a
sequence of random variables (Z_t)_{t∈Z} (not assumed stationary), the β-mixing coefficient
of order q is

β(q) = sup_{t∈Z} E[ sup_{B∈σ(t+q,+∞)} |P(B | σ(−∞, t)) − P(B)| ],
where d_TV denotes the total variation distance. We will make use of the following adaptation
of Bernstein’s inequality to the dependent case.
Theorem 9 (Doukhan, 1994, Theorem 4) Let (X_t)_{t∈Z} be a sequence of (real) variables with
β-mixing coefficients (β(q))_{q∈N*} that satisfies

1. ∀t ∈ Z, E(X_t) = 0,

2. ∀(t, n) ∈ Z × N, E( Σ_{j=1}^n X_{t+j} )² ≤ nσ²,

3. ∀t, |X_t| ≤ M a.s.
To apply Theorem 9, a bound on the variance term is needed. Such bounds are available in
the stationary case under slightly milder assumptions (see, e.g., Rio 1993). For our purpose,
a straightforward application of (Rio, 1993, Theorem 1.2, a)), exposed below, will be
sufficient.
Lemma 10 Let (X_t)_{t∈Z} denote a centered and stationary sequence of real variables with β-mixing
coefficients (β(q))_{q∈N*}, such that |X_t| ≤ M a.s. Then it holds that

E[ (1/n) ( Σ_{j=1}^n X_j )² ] ≤ 4M² ∫_0^1 β^{−1}(u) du,

where β^{−1}(u) = Σ_{k∈N} 1_{β(k)>u}.
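The tail-counting function β^{−1} satisfies ∫_0^1 β^{−1}(u) du = Σ_{k∈N} min(β(k), 1), since ∫_0^1 1_{β(k)>u} du = min(β(k), 1) for each k. This identity can be checked numerically on a toy mixing profile (the geometric profile below is an arbitrary choice for illustration):

```python
import numpy as np

beta = 0.5 ** np.arange(20)                 # toy geometric mixing profile beta(k) = 2^{-k}

u = np.linspace(0.0, 1.0, 200001)[1:]       # grid on (0, 1]
# beta^{-1}(u) = #{k : beta(k) > u}, evaluated on the whole grid at once
beta_inv = (beta[None, :] > u[:, None]).sum(axis=1)

integral = beta_inv.mean()                  # Riemann approximation of the integral

# Here every beta(k) <= 1, so the identity reduces to sum_k beta(k).
assert abs(integral - beta.sum()) < 1e-2
```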
• For every k ≥ 1, Y_k has the same distribution as Ỹ_k, and P(Ỹ_k ≠ Y_k) ≤ β(q).
• The random variables (Ỹ_{2k})_{k≥1} are independent, and so are the variables (Ỹ_{2k−1})_{k≥1}.
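The bookkeeping behind these bullets is the classical blocking scheme: X_1, …, X_n is cut into n/q consecutive blocks of length q, and the coupled blocks with even (resp. odd) indices form an independent family. A minimal sketch of the block construction, assuming for simplicity that q divides n:

```python
import numpy as np

def make_blocks(x, q):
    """Split x_1, ..., x_n into consecutive blocks Y_k = (x_{(k-1)q+1}, ..., x_{kq})."""
    n = len(x)
    assert n % q == 0, "for simplicity, assume q divides n"
    return x.reshape(n // q, q)

x = np.arange(1, 25)                          # n = 24 observations
Y = make_blocks(x, q=4)                       # 6 blocks of length 4
odd_blocks, even_blocks = Y[0::2], Y[1::2]    # the two interleaved families

# Each family skips a whole block of length q between its members, which is
# what allows the coupled versions to be taken mutually independent, at a
# total-variation cost of beta(q) per block.
```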
For any c ∈ B(0, R)^k, we denote by m̂(c) (resp. m̃(c)) the vector of centroids defined,
for all j = 1, …, k, by

m̂(c)_j = X̄_n(du)[u 1_{W_j(c)}(u)] / p̂_j(c),   m̃(c)_j = X̃̄_n(du)[u 1_{W_j(c)}(u)] / p̃_j(c),

where p̂_j(c) (resp. p̃_j(c)) denotes X̄_n(W_j(c)) (resp. X̃̄_n(W_j(c))), adopting the convention
m̂_j(c) = m̃_j(c) = 0 when the corresponding cell weight is null.
The following lemma ensures that m̂(c) contracts toward c∗ , provided c ∈ B(c∗ , R0 ).
Lemma 11 With probability larger than 1 − 16e^{−c_1 n p_min/(qM)} − 2e^{−x}, it holds, for every
c ∈ B(c∗, R_0),

‖m̂(c) − c∗‖² ≤ (3/4) ‖c − c∗‖² + C_1 D²_{n/q}/p²_min + C_2 (kR²M²/p²_min) ( (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} )²,

where D_{n/q} = (RM√q/√n) k ( √(d log(k)) + √x ).
The proof of Lemma 11 is postponed to Section 7.2.3. Equipped with Lemma 11, we first
prove recursively that, if c^(0) ∈ B(c∗, R_0), then w.h.p., for all t ≥ 0, c^(t) ∈ B(c∗, R_0). We let
Ω_1 be defined as

Ω_1 = { C_2 (kR²M²/p²_min) ( (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} )² ≤ R_0²/8 }.
Noting that E[ ( (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} )² ] ≤ E[ (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} ] = β(q), Markov’s inequality yields

P(Ω_1^c) ≤ C (kM²/(κ_0² p²_min)) β(q).
Choosing x = c_1 (n/q) κ_0² p²_min/M² in Lemma 11, for c_1 small enough, yields, for (n/q) large
enough,

‖m̂(c) − c∗‖² ≤ (3/4) R_0² + R_0²/8 + R_0²/8 = R_0²,

with probability larger than 1 − 18e^{−c_1 n κ_0² p²_min/(qM²)} − C (kM²/(κ_0² p²_min)) β(q), provided c ∈ B(c∗, R_0).
Denoting by Ω_2 the probability event on which the above equation holds, a straightforward
recursion entails that, if c^(0) ∈ B(c∗, R_0), then, for all t ≥ 1, c^(t) = m̂(c^(t−1)) ∈ B(c∗, R_0), on
Ω_2.
Then, using Lemma 11 iteratively yields that, on Ω_2 ∩ Ω_x, where P(Ω_x^c) ≤ 2e^{−x}, for all
t ≥ 1, provided c^(0) ∈ B(c∗, R_0),

‖c^(t) − c∗‖² ≤ (3/4)^t ‖c^(0) − c∗‖² + C_1 D²_{n/q}/p²_min + C_2 (kR²M²/p²_min) ( (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} )².  (9)
Theorem 5 now easily follows. For the first inequality, let t ≥ 1; then, using Markov’s
inequality again gives

P( (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} ≥ √(q/n) ) ≤ √(n/q) β(q).
Letting

Z_t = [ ‖c^(t) − c∗‖² − (3/4)^t ‖c^(0) − c∗‖² − C_1 qR²M²k²d log(k)/(n p²_min) − C_2 (kR²M²/p²_min) ( (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} )² ]_+ 1_{Ω_2},

the bound (9) entails

P( Z_t ≥ C (R²M²q/n) x ) ≤ P(Ω_x^c) ≤ 2e^{−x}.
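The expectation bound that follows is obtained from this tail bound by integrating the tail, a standard step spelled out here for completeness:

```latex
\mathbb{E}(Z_t) = \int_0^{+\infty} \mathbb{P}(Z_t > s)\,\mathrm{d}s
= C\,\frac{R^2M^2q}{n}\int_0^{+\infty} \mathbb{P}\!\left(Z_t > C\,\frac{R^2M^2q}{n}\,x\right)\mathrm{d}x
\le C\,\frac{R^2M^2q}{n}\int_0^{+\infty} 2e^{-x}\,\mathrm{d}x
= 2C\,\frac{R^2M^2q}{n}.
```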
We deduce that

E(Z_t) ≤ C qR²M²/n,

which leads to the first inequality of Theorem 5.
Noting that

E[ ( (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} )² ] ≤ E[ (1/n) Σ_{i=1}^n 1_{X_i≠X̃_i} ] = β(q),

and using

P(Ω_2^c) ≤ e^{−c_1 n κ_0² p²_min/(qM²)} + C (kM²/(κ_0² p²_min)) β(q)
        ≤ C (kM²/(κ_0² p²_min)) (β(q) + q/n)
        ≤ C qkM²/(n κ_0² p²_min)

whenever β(q) ≤ q/n, leads to the result.
• For every k ≥ 1, Y_k has the same distribution as Ỹ_k, and P(Ỹ_k ≠ Y_k) ≤ β(q).
• The random variables (Ỹ_{2k})_{k≥1} are independent, and so are the variables (Ỹ_{2k−1})_{k≥1}.
Let A⊥⊥ denote the event

A⊥⊥ = { ∀j = 1, …, n/q, Y_j = Ỹ_j }.

A standard union bound yields that P(A⊥⊥^c) ≤ (n/q) β(q). On A⊥⊥, the minibatches used by
Algorithm 4 may be considered as independent, so that the main lines of (Chazal et al.,
2021, Proof of Lemma 18) readily apply, replacing the X_i’s by the X̃_i’s. In what follows we let c̃^(t)
denote the output of the t-th iteration of Algorithm 4 based on X̃_1, …, X̃_n.
Assume that n ≥ k, and q ≥ C (M²/p²_min) log(n), for a large enough constant C that only
depends on ∫_0^1 β^{−1}(u)du, to be fixed later. For t ≤ n/(4q) = T, let A_{t,1} and A_{t,3} denote the
events

A_{t,1} = { ∀j = 1, …, k, |p̃_j(t) − p_j(t)| ≤ p_min/128 },

A_{t,3} = { ∀j = 1, …, k, | ∫ (c̃_j^(t) − u) 1_{W_j(c̃^(t))}(u) ( X̃̄_{B_t^(3)} − E(X̄) )(du) | ≤ 8R √( Mkd p_min/C ) },

where p̃_j(t) = X̃̄_{B_t^(1)}(W_j(c̃^(t))). Then, according to Theorem 9 with x = 2 log(n) and Lemma
10 to bound the corresponding σ, for j ∈ {1, 3}, P(A_{t,j}^c) ≤ 4dk/n² + 2kd β(q/18), for n large
enough.
Further, define

A_{≤t} = ⋂_{j≤t} ( A_{j,1} ∩ A_{j,3} ).
Then, provided that q ≥ c_0 (k²dM²/(p²_min κ_0²)) log(n), where c_0 only depends on ∫_0^1 β^{−1}(u)du, we may
prove recursively that

∀p ≤ t, c̃^(p) ∈ B(c∗, R_0)

on A_{≤t} whenever c̃^(0) = c^(0) ∈ B(c∗, R_0) (first step of the proof of (Chazal et al., 2021,
Lemma 18)).
Next, denoting a_t = ‖c̃^(t) − c∗‖² 1_{A_{≤t}}, we may write

with

R_1 ≤ 4kR² P(A_{t+1,3}^c),

recalling that β(q/18)/q² ≤ n^{−2}. Proceeding as in (Chazal et al., 2021, Proof of Lemma 18),
we may further bound

E( ‖c̃^(t+1) − c∗‖² 1_{A_{t+1,1}} 1_{A_{≤t}} ) ≤ ( 1 − (2 − K_1)/(t+1) ) E(a_t) + 12kdMR²/(p_min (t+1)²),

for some K_1 ≤ 0.5. Noticing that k ≤ M/p_min and t+1 ≤ T = n/(4q) yields

E(a_{t+1}) ≤ ( 1 − (2 − K_1)/(t+1) ) E(a_t) + 14kdMR²/(p_min (t+1)²).

Following (Chazal et al., 2021, Proof of Theorem 10), a standard recursion entails

E(a_t) ≤ 28kdMR²/(p_min t),

for t ≤ n/(4q). At last, since ‖c̃^(T) − c∗‖² 1_{A⊥⊥} = ‖c^(T) − c∗‖² 1_{A⊥⊥}, we conclude that
The first term of the right hand side may be controlled using a slight adaptation of
(Chazal et al., 2021, Lemma 22).
Lemma 12 With probability larger than 1 − 16e^{−x}, for all c ∈ B(0, R)^k and j ∈ [[1, k]], it
holds

p̃_j(c) ≥ p_j(c) − √( ( 8Mc_0 q log(k) log(2nN_max)/n + 8Mqx/n ) p_j(c) ),

p̃_j(c) ≤ p_j(c) + 8Mc_0 q log(k) log(2nN_max)/n + 8Mqx/n
        + √( ( 8Mc_0 q log(k) log(2nN_max)/n + 8Mqx/n ) p_j(c) ),

where c_0 is an absolute constant. Moreover, with probability larger than 1 − 2e^{−x}, it holds

sup_{c∈B(0,R)^k, j=1,…,k} | ∫ (c_j − u) 1_{W_j(c)}(u) ( X̃̄_n − E(X̄) )(du) | ≤ C_0 (RM√q/√n) k ( √(d log(k)) + √x ),

where C_0 is an absolute constant.
Proof [Proof of Lemma 12] We intend here to recover the standard i.i.d. bounds given in
(Chazal et al., 2021, Lemma 22). To this aim, we let p̃_{j,0}(c) and p̃_{j,1}(c) be defined by

p̃_{j,r}(c) = (2q/n) Σ_{s=1}^{n/2q} Ỹ̄_{2s−r}(W_j(c)),

for r ∈ {0, 1}, where Ỹ̄_{2s−r} = (1/q) Σ_{t=(2s−r−1)q+1}^{(2s−r)q} X̃_t is a measure in M(R, M), with total
number of support points bounded by qN_max, and remark that

p̃_j(c) = ( p̃_{j,0}(c) + p̃_{j,1}(c) )/2.

Since E( Ỹ̄_{2s−r} )(W_j(c)) = p_j(c), and the p̃_{j,r}(c)’s are sums of n/2q independent measures
evaluated on W_j(c), we may readily apply (Chazal et al., 2021, Lemma 22), replacing n by
n/(2q), to each of them, leading to the deviation bounds on the p̃_j(c)’s.
For the third inequality of Lemma 12, denoting by

X̃̄_{n,j} = (2q/n) Σ_{s=1}^{n/2q} Ỹ̄_{2s−j},
The precise derivation of (11) may be found in (Chazal et al., 2021, Proof of Lemma 17,
pp. 34–35). Plugging (11) into (10) leads to, for a small enough K,

‖m̂(c) − c∗‖² ≤ (3/4) ‖c − c∗‖² + (C/p²_min) D²_{n/q} + C_2 ‖m̂(c) − m̃(c)‖²,

and

| ( X̄_n(du) − X̃̄_n(du) )[ u 1_{W_j(c)}(u) ] | ≤ (2RM/n) Σ_{i=1}^n 1_{X_i≠X̃_i}.
‖m̂_j(c) − m̃_j(c)‖ = ‖ X̄_n(du)[u 1_{W_j(c)}(u)]/p̂_j(c) − X̃̄_n(du)[u 1_{W_j(c)}(u)]/p̃_j(c) ‖

≤ ‖ X̄_n(du)[u 1_{W_j(c)}(u)] ‖ | 1/p̂_j(c) − 1/p̃_j(c) | + (1/p̃_j(c)) ‖ ( X̄_n(du) − X̃̄_n(du) )[u 1_{W_j(c)}(u)] ‖

≤ R |p̂_j(c) − p̃_j(c)|/p̃_j(c) + (2RM/(n p̃_j(c))) Σ_{i=1}^n 1_{X_i≠X̃_i}

≤ C (RM/(n p_min)) Σ_{i=1}^n 1_{X_i≠X̃_i}.

Squaring and taking the sum with respect to j gives the result.
Let

F_n(t) = (1/n) Σ_{i=1}^n 1_{Z_i ≥ t},

F(t) = P(Z ≥ t), and let t̂ be such that F_n(t̂) ≤ α − δ. In the i.i.d. case, we might bound
sup_t (F(t) − F_n(t))/√(F_n(t)) using a standard inequality such as in (Boucheron et al., 2005,
Section 5.1.2).
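In practice, such a t̂ can be taken as an empirical quantile of the observed scores; the sketch below is one simple choice (the level α and margin δ are hypothetical values, not the paper’s recommendations):

```python
import numpy as np

def empirical_threshold(scores, alpha, delta):
    """Smallest observed score t with F_n(t) = #{i : scores_i >= t}/n <= alpha - delta."""
    n = len(scores)
    s = np.sort(scores)
    surv = 1.0 - np.arange(n) / n        # F_n(s[i]): fraction of scores >= s[i]
    return s[np.argmax(surv <= alpha - delta)]

rng = np.random.default_rng(2)
scores = rng.random(1000)                # stand-in anomaly scores
alpha, delta = 0.05, 0.01                # hypothetical target level and margin
t_hat = empirical_threshold(scores, alpha, delta)

# the empirical survival function at t_hat respects the constraint
assert np.mean(scores >= t_hat) <= alpha - delta
```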
As in the proofs of Theorems 5 and 6, we compare with the i.i.d. case by introducing
auxiliary variables. We let Z̃_1, …, Z̃_n be such that, denoting by Y_k (resp. Ỹ_k) the vector
(Z_{(k−1)q+1}, …, Z_{kq}) (resp. (Z̃_{(k−1)q+1}, …, Z̃_{kq})), the coupling properties above are satisfied. In
particular, denoting by F̃_n the counterpart of F_n built on the Z̃_i’s, it holds

F̃_n(t) ≤ F_n(t) + (1/n) Σ_{i=1}^n 1_{Z_i≠Z̃_i}.
If Ω_1 is the event { (1/n) Σ_{i=1}^n 1_{Z_i≠Z̃_i} ≤ √(αq/n) }, which has probability larger than 1 − β(q)√(n/(αq))
(using Markov’s inequality as before), then on Ω_1 it readily holds

F̃_n(t̂) ≤ F_n(t̂) + √(αq/n) ≤ α − δ + √(αq/n),

so that we may write, on the same event,

F(t̂) ≤ ( F(t̂) − F̃_n(t̂) ) + α − δ + √(αq/n).
It remains to control the stochastic term (F(t̂) − F̃_n(t̂)). To do so, we denote by

X̃_{j,0}(t) = (1/q) Σ_{i=2(j−1)q+1}^{(2j−1)q} 1_{Z̃_i ≥ t},

X̃_{j,1}(t) = (1/q) Σ_{i=(2j−1)q+1}^{2jq} 1_{Z̃_i ≥ t},

for j ∈ [[1, n/(2q)]] and t ∈ R. Note that, for any σ ∈ {0, 1}, the X̃_{j,σ}’s are i.i.d., take values in
[0, 1], and have expectation F(t). Next, we define, for t ∈ R,

F̃_{n,0}(t) = (2q/n) Σ_{j=1}^{n/2q} X̃_{j,0}(t),

F̃_{n,1}(t) = (2q/n) Σ_{j=1}^{n/2q} X̃_{j,1}(t),

and we note that F̃_n(t) = ( F̃_{n,0}(t) + F̃_{n,1}(t) )/2. Since the F̃_{n,σ}’s are sums of i.i.d. random
variables, the following concentration bound follows.
Lemma 13 For j ∈ {0, 1}, and x such that (n/2q)x² ≥ 1, it holds

P( sup_{t∈R} ( F(t) − F̃_{n,j}(t) )/√(F(t)) ≥ 2x ) ≤ 2n e^{−(n/2q)x²}.
which leads to

F(t̂) ≤ 2√(q log(n)/n) + √( α − δ + √(αq/n) + 4√(q log(n)/n) ).

Choosing δ ≥ 4√(αq log(n)/n) + √(αq/n) ensures that the right-hand side is smaller than α.
Following (Chazal et al., 2021, Lemma 22), if S_F(y_1, …, y_{n/(2q)}) denotes the cardinality of
{ (f(y_1), …, f(y_{n/(2q)})) | f ∈ F }, we have to bound

where the Y′_i’s are i.i.d. copies of the Y_i’s. Since, for every y_1, …, y_{n/q}, recalling that
y_i = (z_{(i−1)q+1}, …, z_{iq}), it holds
References
Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick
Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier.
Persistence images: a stable vector representation of persistent homology. J. Mach. Learn.
Res., 18:Paper No. 8, 35, 2017. ISSN 1532-4435.
Alekh Agarwal and John C. Duchi. The generalization ability of online algorithms for depen-
dent data. IEEE Trans. Inform. Theory, 59(1):573–587, 2013. ISSN 0018-9448,1557-9654.
doi: 10.1109/TIT.2012.2212414. URL https://doi.org/10.1109/TIT.2012.2212414.
Sylvain Arlot, Alain Celisse, and Zaid Harchaoui. A kernel multiple change-point algorithm
via model selection. J. Mach. Learn. Res., 20:Paper No. 162, 56, 2019. ISSN 1532-
4435,1533-7928.
Jean-Daniel Boissonnat, Frédéric Chazal, and Mariette Yvinec. Geometric and Topological
Inference, volume 57. Cambridge University Press, 2018.
Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey
of some recent advances. ESAIM Probab. Stat., 9:323–375, 2005. ISSN 1292-8100,1262-3318.
doi: 10.1051/ps:2005018. URL https://doi.org/10.1051/ps:2005018.
Anass El Yaagoubi Bourakna, Moo K. Chung, and Hernando Ombao. Modeling and
simulating dependence in networks using topological data analysis, 2022.
Claire Brécheteau, Aurélie Fischer, and Clément Levrard. Robust Bregman clustering.
The Annals of Statistics, 49(3):1679 – 1701, 2021. doi: 10.1214/20-AOS2018. URL
https://doi.org/10.1214/20-AOS2018.
Mathieu Carrière, Frédéric Chazal, Yuichi Ike, Théo Lacombe, Martin Royer, and Yuhei
Umeda. Perslay: A neural network layer for persistence diagrams and new graph topological
signatures. In International Conference on Artificial Intelligence and Statistics, pages
2786–2796. PMLR, 2020.
Frédéric Chazal and Vincent Divol. The density of expected persistence diagrams and its
kernel based estimation. In 34th International Symposium on Computational Geometry,
volume 99 of LIPIcs. Leibniz Int. Proc. Inform., pages Art. No. 26, 15. Schloss Dagstuhl.
Leibniz-Zent. Inform., Wadern, 2018.
Frédéric Chazal and Bertrand Michel. An introduction to topological data analysis: funda-
mental and practical aspects for data scientists. Frontiers in artificial intelligence, 4:108,
2021.
Frédéric Chazal, Vin de Silva, and Steve Oudot. Persistence stability for geometric complexes.
Geometriae Dedicata, 173(1):193–214, 2014.
Frédéric Chazal, Vin de Silva, Marc Glisse, and Steve Oudot. The structure and stability of
persistence modules. Springer International Publishing, 2016.
Frédéric Chazal, Clément Levrard, and Martin Royer. Clustering of measures via mean
measure quantization. Electronic Journal of Statistics, 15(1):2060 – 2104, 2021. doi:
10.1214/21-EJS1834. URL https://doi.org/10.1214/21-EJS1834.
Stéphane Chrétien, Ben Gao, Astrid Thebault-Guiochon, and Rémi Vaucher. Time topologi-
cal analysis of eeg using signature theory, 2024.
Meryll Dindin, Yuhei Umeda, and Frederic Chazal. Topological data analysis for arrhythmia
detection through modular neural networks, 2019.
Vincent Divol and Frédéric Chazal. The density of expected persistence diagrams and its
kernel based estimation. Journal of Computational Geometry, 10(2):127–153, 2019.
P. Doukhan, P. Massart, and E. Rio. Invariance principles for absolutely regular empirical
processes. Ann. Inst. H. Poincaré Probab. Statist., 31(2):393–427, 1995. ISSN 0246-0203.
URL http://www.numdam.org/item?id=AIHPB_1995__31_2_393_0.
Paul Doukhan. Mixing, volume 85 of Lecture Notes in Statistics. Springer-Verlag, New York,
1994. ISBN 0-387-94214-9. doi: 10.1007/978-1-4612-2642-0. URL https://doi.org/10.
1007/978-1-4612-2642-0. Properties and examples.
Giovanni Petri, Paul Expert, Federico Turkheimer, Robin Carhart-Harris, David Nutt, Peter J.
Hellyer, and Francesco Vaccarino. Homological scaffolds of brain functional networks. Journal
of the Royal Society Interface, 11(101), 2014. doi: 10.1098/rsif.2014.0873.
Niklas Heim and James E. Avery. Adaptive anomaly detection in chaotic time series
with a spatially aware echo state network. ArXiv, abs/1909.01709, 2019. URL https:
//api.semanticscholar.org/CorpusID:202541761.
Steffen Herbold. Autorank: A python package for automated ranking of classifiers. Journal
of Open Source Software, 5(48):2173, 2020. doi: 10.21105/joss.02173. URL https:
//doi.org/10.21105/joss.02173.
Thi Kieu Khanh Ho, Ali Karami, and Narges Armanfard. Graph-based time-series anomaly
detection: A survey. arXiv preprint arXiv:2302.00058, 2023.
Mia Hubert, Michiel Debruyne, and Peter J. Rousseeuw. Minimum covariance determinant
and extensions. Wiley Interdiscip. Rev. Comput. Stat., 10(3):e1421, 11, 2018. ISSN 1939-
5108,1939-0068. doi: 10.1002/wics.1421. URL https://doi.org/10.1002/wics.1421.
Vincent Jacob, Fei Song, Arnaud Stiegler, Bijan Rad, Yanlei Diao, and Nesime Tatbul.
Exathlon: a benchmark for explainable anomaly detection over time series. Proc. VLDB
Endow., 14(11):2613–2626, 2021. ISSN 2150-8097. doi: 10.14778/3476249.3476307. URL
https://doi.org/10.14778/3476249.3476307.
Soham Jana, Jianqing Fan, and Sanjeev Kulkarni. A general theory for robust clustering via
trimmed mean, 2024.
Clément Levrard. Nonasymptotic bounds for vector quantization in hilbert spaces. Ann.
Statist., 43(2):592–619, 04 2015. doi: 10.1214/14-AOS1293. URL https://doi.org/10.
1214/14-AOS1293.
Clément Levrard. Quantization/clustering: when and why does k-means work. Journal de
la Société Française de Statistique, 159(1), 2018.
Shuya Li, Wenbin Song, Chao Zhao, Yifeng Zhang, Weiming Shen, Jing Hai, Jiawei Lu, and
Yingshi Xie. An anomaly detection method for multiple time series based on similarity
measurement and louvain algorithm. Procedia Computer Science, 200:1857–1866, 2022.
Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE
International Conference on Data Mining, pages 413–422, 2008. doi: 10.1109/ICDM.2008.
17.
Daniel J. McDonald, Cosma Rohilla Shalizi, and Mark Schervish. Estimating beta-mixing
coefficients via histograms. Electron. J. Stat., 9(2):2855–2883, 2015. ISSN 1935-7524. doi:
10.1214/15-EJS1094. URL https://doi.org/10.1214/15-EJS1094.
Ahmed Hossam Mohammed, Mercedes Cabrerizo, Alberto Pinzon, Ilker Yaylali, Prasanna
Jayakar, and Malek Adjouadi. Graph neural networks in eeg spike detection. Artificial
Intelligence in Medicine, 145:102663, 2023. ISSN 0933-3657. doi: https://doi.org/10.1016/
j.artmed.2023.102663. URL https://www.sciencedirect.com/science/article/pii/
S093336572300177X.
Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and
β-mixing processes. J. Mach. Learn. Res., 11:789–814, 2010. ISSN 1532-4435,1533-7928.
John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S. Tsay, Aaron Elmore, and Michael J.
Franklin. Volume under the surface: a new accuracy evaluation measure for time-series
anomaly detection. Proc. VLDB Endow., 15(11):2774–2787, jul 2022a. ISSN 2150-8097.
doi: 10.14778/3551793.3551830. URL https://doi.org/10.14778/3551793.3551830.
John Paparrizos, Yuhao Kang, Paul Boniol, Ruey S. Tsay, Themis Palpanas, and Michael J.
Franklin. Tsb-uad: an end-to-end benchmark suite for univariate time-series anomaly
detection. Proc. VLDB Endow., 15(8):1697–1711, apr 2022b. ISSN 2150-8097. doi:
10.14778/3529337.3529354. URL https://doi.org/10.14778/3529337.3529354.
Tuan D. Pham and Lanh T. Tran. Some mixing properties of time series models. Stochas-
tic Process. Appl., 19(2):297–303, 1985. ISSN 0304-4149,1879-209X. doi: 10.1016/
0304-4149(85)90031-6. URL https://doi.org/10.1016/0304-4149(85)90031-6.
Nalini Ravishanker and Renjie Chen. Topological data analysis (tda) for time series, 2019.
Ian W. Renner, Jane Elith, Adrian Baddeley, William Fithian, Trevor Hastie, Steven J.
Phillips, Gordana Popovic, and David I. Warton. Point process models for presence-
only analysis. Methods in Ecology and Evolution, 6(4):366–379, 2015. doi: 10.1111/
2041-210X.12352. URL https://besjournals.onlinelibrary.wiley.com/doi/abs/
10.1111/2041-210X.12352.
Emmanuel Rio. Covariance inequalities for strongly mixing processes. Ann. Inst. H. Poincaré
Probab. Statist., 29(4):587–597, 1993. ISSN 0246-0203. URL http://www.numdam.org/
item?id=AIHPB_1993__29_4_587_0.
Peter J. Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum
covariance determinant estimator. Technometrics, 41(3):212–223, 1999. doi: 10.
1080/00401706.1999.10485670. URL https://www.tandfonline.com/doi/abs/10.1080/
00401706.1999.10485670.
Martin Royer, Frederic Chazal, Clément Levrard, Yuhei Umeda, and Yuichi Ike. Atol:
Measure vectorization for automatic topologically-oriented learning. In Arindam Baner-
jee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on
Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning
Research, pages 1000–1008. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.
press/v130/royer21a.html.
Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly detection in time
series: A comprehensive evaluation. Proceedings of the VLDB Endowment (PVLDB), 15
(9):1779–1797, 2022. doi: 10.14778/3538598.3538602.
Sondre Sørbø and Massimiliano Ruocco. Navigating the metric maze: a taxonomy of
evaluation metrics for anomaly detection in time series. Data Mining and Knowledge
Discovery, pages 1–42, 11 2023. doi: 10.1007/s10618-023-00988-8.
Cheng Tang and Claire Monteleoni. On lloyd’s algorithm: New theoretical insights for
clustering in practice. In Arthur Gretton and Christian C. Robert, editors, Proceedings of
the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of
Proceedings of Machine Learning Research, pages 1280–1289, Cadiz, Spain, 09–11 May
2016. PMLR. URL http://proceedings.mlr.press/v51/tang16b.html.
The GUDHI Project. GUDHI User and Reference Manual. GUDHI Editorial Board, 2015.
URL http://gudhi.gforge.inria.fr/doc/latest/.
Yuhei Umeda, Junji Kaneko, and Hideyuki Kikuchi. Topological data analysis
and its application to time-series data analysis. 2019. URL https://api.
semanticscholar.org/CorpusID:225065707.
Weichen Wu, Jisu Kim, and Alessandro Rinaldo. On the estimation of persistence intensity
functions and linear representations of persistence diagrams. In International Conference
on Artificial Intelligence and Statistics, pages 3610–3618. PMLR, 2024.
Takehisa Yairi, Yoshikiyo Kato, and Koichi Hori. Fault detection by mining association
rules from house-keeping data. 2001.
Yu Zheng, Huan Yee Koh, Ming Jin, Lianhua Chi, Khoa T Phan, Shirui Pan, Yi-Ping Phoebe
Chen, and Wei Xiang. Correlation-aware spatial–temporal graph learning for multivariate
time-series anomaly detection. IEEE Transactions on Neural Networks and Learning
Systems, 2023.