
Journal of Machine Learning Research 25 (2024) 1-49 Submitted 6/24; Revised 9/24; Published 12/24

Topological Analysis for Detecting Anomalies (TADA) in dependent sequences: application to Time Series.

Frédéric Chazal [email protected]


Inria Saclay
91120, Palaiseau, France
Clément Levrard [email protected]
Université de Rennes
35000, Rennes, France
Martin Royer [email protected]
Inria Saclay, IRT SystemX
91120, Palaiseau, France

Editor: Sayan Mukherjee

Abstract
This paper introduces a new methodology, based on the field of Topological Data Analysis, for detecting structural anomalies in dependent sequences of complex data. A motivating example is that of multivariate time series, for which our method makes it possible to detect global changes in the dependence structure between channels. The proposed approach is lean enough to handle large-scale data sets, and extensive numerical experiments back the intuition that it is more suitable for detecting global changes of correlation structures than existing methods. Some theoretical guarantees for quantization algorithms based on dependent sequences are also provided.
Keywords: Topological Data Analysis, Unsupervised Learning, Anomaly Detection, Multivariate Time Series, β-mixing coefficients.

1. Introduction
Monitoring the evolution of the global structure of time-dependent complex data, such as multivariate time series or dynamic graphs, is a major task in real-world applications of machine learning. The present work considers the case where the global structure of interest may be encoded by a persistence diagram, with particular attention to the weighted dynamic graph encoding the dependence structure between the different channels of a multivariate time series. Such a situation may be encountered in various fields, such as EEG signal analysis (Mohammed et al. 2023) or the monitoring of industrial processes (Li et al. 2022), and has recently given rise to an abundant literature - see, e.g., Zheng et al. (2023); Ho et al. (2023) and references therein.
The specific monitoring task addressed in this paper is unsupervised anomaly detection, that is, detecting when the global structure is far enough from a so-called 'normal' regime to be considered anomalous. In the multivariate time series framework, this amounts to detecting when the dependence patterns between channels depart from the normal regime. From the mathematical point of view, this problem, in its whole generality, is ill-posed: one has

access to unlabeled data, in which it is tacitly assumed that the normal regime is prominent; the goal is then to label data points as normal or abnormal in a fully unsupervised way. In this sense, anomaly detection shows a clear connection with outlier detection in robust machine learning (for instance robust clustering, as in Brécheteau et al. 2021; Jana et al. 2024). For more insights and benchmarks on the specific problem of anomaly detection in time series, the reader is referred to Paparrizos et al. (2022b) for the univariate case and to Wenig et al. (2022) for the multivariate case.
We introduce a new framework, coming with mathematical guarantees, based on the use of Topological Data Analysis (TDA), a field that has attracted increasing interest for the study of complex data - see, e.g., Chazal and Michel (2021) for a general introduction. Applications of TDA to anomaly detection in time series have attracted recent and growing interest: in medicine (Dindin et al. 2019; G. et al. 2014; Chrétien et al. 2024) and cyber security (Bruillard et al. 2016), to name a few. General surveys on TDA applications to time series may be found in Ravishanker and Chen (2019); Umeda et al. (2019).
In this paper, the proposed approach proceeds in three steps. First, the time-dependence structure of a time series is encoded as a dynamic graph in which each vertex represents a channel of the time series and each weighted edge encodes the dependence between the two corresponding vertices over a time window. Persistent homology, a central theory in TDA, is then used to robustly extract the global topological structure of the dynamic graph as a sequence of so-called persistence diagrams. Second, we introduce a specific encoding of persistence diagrams, which has proven efficient and simple enough to face large-scale problems in the independent case (Chazal et al. 2021). Finally, we produce a topological anomaly score based on this encoding.
The last two steps may be applied to any dependent sequence of persistence diagrams, encompassing diagrams generated from a sequence of filtered simplicial complexes in general. Since monitoring the dependence structure of multivariate time series is the practical motivation of this work, the details of the full process (from raw data to anomaly scores) are given in this setting.

1.1 Contributions
Our main contributions are the following.

- We produce a new machine learning methodology for learning the normal topological behavior of complex data. This methodology is unsupervised and does not need to be calibrated on uncorrupted data, as long as the amount of corrupted data remains limited with respect to the uncorrupted data. The proposed pipeline is easy to implement, flexible, and can be adapted to different specific applications and frameworks involving graph data or more general topological data;

- In the multivariate time series case, the captured information is empirically shown to be different from and, in several cases, more informative than the information captured by other state-of-the-art approaches. This methodology is lean by design, and enjoys novel interpretability properties with regard to anomaly detection that, to the best of our knowledge, have not appeared in the literature before;


- The resulting method can be deployed on architectures with limited computational and memory resources: once the training phase has been completed, the anomaly detection procedure relies on a few memorized parameters and simple persistent homology computations. Moreover, this procedure does not require any storage of previously processed data, preventing privacy issues;
- Some convergence guarantees for quantization algorithms (used to vectorize topological information) in the dependent case are proven. These results are not restricted to the specific setting of this paper and may be generalized to the broader framework of M-estimation with dependent observations;
- Extensive numerical investigation has been carried out in three different frameworks. First, on new synthetic data directly inspired by brain modeling problems as exposed in Bourakna et al. (2022), which are particularly suited for TDA-based methods and may be used as a novel benchmark; they have been added to the public The GUDHI Project (2015) library at github.com/GUDHI/gudhi-data. Second, on the comprehensive 'TimeEval' benchmark, which encompasses a large array of synthetic data sets (Schmidl et al. 2022). Third, on the real-case 'Exathlon' data set from Jacob et al. (2021). All of these experiments assess the relevance of our approach compared with current state-of-the-art methods. Our procedure, originating from concrete industrial problems, is implemented and has been deployed within the Confiance.ai program, and an open-source release is forthcoming. Its implementation involves only standard, tested machine learning tools.

1.2 Organization of the paper

A complete description of the proposed methodology is provided in Section 2. Details on the construction of persistence diagrams are given for multivariate time series, providing a comprehensive pipeline to build an anomaly score from raw data in this case. Next, Section 3 theoretically grounds the centroid computation step as well as the anomaly test introduced at the end of the methodological section. Section 4 gathers the numerical experiments in the three different settings introduced above (synthetic TDA-friendly, TimeEval synthetic, real Exathlon data). Proofs of our results are postponed to Section 7.

2. Methodology
This section describes the process to build an anomaly score from a dependent sequence
of topologically structured data. Our method can be roughly decomposed into three steps
(depicted in Figure 1):

• Step 1 : From raw data build a sequence of topological descriptors of the structure via
persistent homology. The output of this step is a dependent sequence of persistence
diagrams (see Section 2.1.2).
• Step 2 : Convert the sequence of persistence diagrams into a vector-valued sequence, using
the approach depicted in Chazal et al. (2021) (see Section 2.2.2).
• Step 3 : Build from the vector-valued sequence an anomaly score function (based on robust
Mahalanobis distance, see Section 2.3).


Figure 1: TADA general scheme for producing anomaly scores with topological information from the
original time series.

To be consistent with the applications exposed in Section 4, details of these steps will be given in the case where the original data consists of a multivariate time series $(Y_t)_{t\in[0,L]} \in \mathbb{R}^D \times [0,L]$ and the anomalies to be detected pertain to the dependence structure between channels. This particular setting impacts Step 1 only. In other situations, Steps 2 and 3 may apply as such once suitable persistence diagrams are built from raw data.

2.1 Step 1: encode the dependence structure via persistence diagrams

Our first step may be summarized via the following Algorithm 1:

Algorithm 1: Persistence diagram computation from a multivariate time series
Input: $p$ maximal homology order, $\Delta$ window size, $s$ stride.
Data: A multivariate time series $(Y_t)_{t\in[0,L]} \in \mathbb{R}^D \times [0,L]$
1 for $t$ in $[[0, \lfloor(L-\Delta)/s\rfloor]]$ do
2   compute the similarity matrix on the slice $[st, st+\Delta]$: $S_t = 1 - \mathrm{Corr}(Y_{[st,st+\Delta]})$;
3   compute the Vietoris-Rips filtration of $([[1,D]], E, S_t)$;
4   for homology dimension $d$ in $[[0, p-1]]$ do
5     compute the order-$d$ persistence diagram $X_t^{(d)}$ of the Rips filtration;
Output: $p$ (discrete) time series of persistence diagrams $X_t^{(d)}$, $t \in [[0, \lfloor(L-\Delta)/s\rfloor]]$, $d \in [[0, p-1]]$.
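For concreteness, this first step can be implemented with a few lines of standard tooling. The sketch below uses NumPy for the windowed correlations and the GUDHI library (The GUDHI Project 2015) for the Rips persistence; the function name `sliding_diagrams` and the choice of unit point weights are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np
import gudhi  # The GUDHI Project (2015), https://gudhi.inria.fr

def sliding_diagrams(Y, delta, stride, d=1):
    """Sketch of Algorithm 1 for a single homology order d.

    Y is a (D, L) array; each window [s*t, s*t + delta] yields the order-d
    persistence diagram of the Rips filtration of S_t = 1 - Corr."""
    D, L = Y.shape
    diagrams = []
    for t in range((L - delta) // stride + 1):
        window = Y[:, t * stride:t * stride + delta]
        S = 1.0 - np.corrcoef(window)            # similarity matrix, entries in [0, 2]
        rips = gudhi.RipsComplex(distance_matrix=S, max_edge_length=2.0)
        st = rips.create_simplex_tree(max_dimension=d + 1)
        st.compute_persistence()
        # array of (birth, death) points; for d = 0 the essential bar has death = inf
        diagrams.append(st.persistence_intervals_in_dimension(d))
    return diagrams
```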

The two subsections that follow give details on the different steps of Algorithm 1. We
start with a brief description of the TDA tools that we use.


2.1.1 Vietoris-Rips persistent homology for weighted graphs


In this subsection we briefly explain how discrete measures are associated with weighted graphs, encoding their multiscale topological structure through persistent homology theory. We refer the reader to Edelsbrunner and Harer (2010); Chazal et al. (2016); Boissonnat et al. (2018) for a general and thorough introduction to persistent homology.
Recall that given a set $V$, an (abstract) simplicial complex is a set $K$ of finite subsets of $V$ such that $\sigma \in K$ and $\tau \subset \sigma$ implies $\tau \in K$. Each set $\sigma \in K$ is called a simplex of $K$. The dimension of a simplex $\sigma$ is defined as $|\sigma| - 1$ and the dimension of $K$ is the maximum dimension of any of its simplices. Note that a simplicial complex of dimension 1 is a graph. A simplicial complex classically inherits a canonical structure of topological space, obtained by representing each simplex by a geometric standard simplex (the convex hull of a finite set of affinely independent points in a Euclidean space) and "gluing" the simplices along common faces. A filtered simplicial complex $(K_\alpha)_{\alpha \in I}$, or filtration for short, is a nested family of complexes indexed by a set of real numbers $I \subset \mathbb{R}$: for any $\alpha, \beta \in I$, if $\alpha \le \beta$ then $K_\alpha \subseteq K_\beta$. The parameter $\alpha$ is often seen as a scale parameter.
Let $G$ be a complete non-oriented weighted graph with vertex set $V$ and real-valued edge weight function $s : V \times V \to \mathbb{R}$, $(v, v') \mapsto s_{v,v'}$, satisfying $s_{v,v'} = s_{v',v}$ for any pair of vertices $(v, v')$.

Definition 1 Let $\alpha_{\min} \le \min_{v,v' \in V} s_{v,v'}$ and $\alpha_{\max} \ge \max_{v,v' \in V} s_{v,v'}$ be two real numbers. The Vietoris-Rips filtration associated to $G$ is the filtration $(\mathrm{VR}_\alpha(G))_{\alpha \in [\alpha_{\min}, \alpha_{\max}]}$ with vertex set $V$ defined by
$$\sigma = [v_0, \cdots, v_k] \in \mathrm{VR}_\alpha(G) \text{ if and only if } s_{v_i,v_j} \le \alpha, \text{ for all } i, j \in [[0,k]],$$
for $k \ge 1$, and $[v] \in \mathrm{VR}_\alpha(G)$ for any $v \in V$ and any $\alpha \in [\alpha_{\min}, \alpha_{\max}]$.

The topology of $\mathrm{VR}_\alpha(G)$ changes as $\alpha$ increases: existing connected components may merge, loops and cavities may appear and be filled, etc. Persistent homology provides a mathematical framework and efficient algorithms to encode this evolution of the topology (homology) by recording the scale parameters at which topological features appear and disappear. Each such feature is then represented as an interval $[\alpha_b, \alpha_d]$ representing its life span along the filtration. Its length $\alpha_d - \alpha_b$ is called the persistence of the feature. The set of all such intervals corresponding to topological features of a given dimension $d$ ($d = 0$ for connected components, $d = 1$ for 1-dimensional loops, $d = 2$ for 2-dimensional cavities, etc.) is called the persistence barcode of order $d$ of $G$. It is also classically represented as a discrete multiset $D_d(G) \subset [\alpha_{\min}, \alpha_{\max}]^2$ where each interval $[\alpha_b, \alpha_d]$ is represented by the point with coordinates $(\alpha_b, \alpha_d)$; a basic example is given in Figure 2. Adopting the perspective of Chazal and Divol (2018); Royer et al. (2021); Chazal et al. (2021), in the sequel of the paper, the persistence diagram $D_d(G)$ will be considered as a discrete measure: $D_d(G) := \sum_{p \in D_d(G)} \delta_p$, where $\delta_p$ is the Dirac measure centered at $p$. To control the influence of the possibly many low-persistence features, the atoms in the previous sum can be weighted:
$$D_d(G) := \sum_{(b,d) \in D_d(G)} \omega(b,d)\, \delta_{(b,d)},$$


where $\omega : \mathbb{R}^2 \to \mathbb{R}_+$ may either be a continuous function equal to 0 along the diagonal, or just a constant renormalization factor equal to the total mass of the diagram. Notice that, in practice, there exist various libraries to efficiently compute persistence diagrams, such as, e.g., The GUDHI Project (2015).
[Figure 2 displays a weighted graph $G$ on four vertices $v_0, v_1, v_2, v_3$ with edge weights 18, 25, 25, 34, 37 and 64, together with $\mathrm{VR}_\alpha(G)$ for $\alpha = 30.0, 35.0, 40.0$, and the resulting diagrams $D_0(G) = \delta_{(0,18)} + 2\delta_{(0,25)} + \delta_{(0,64)}$ and $D_1(G) = \delta_{(34,37)}$.]

Figure 2: The persistence diagrams of order 0 and 1 of a simple weighted graph $G$ whose vertices are 4 points in the real plane and whose edge weights are given by the squared distances between them. Here $\alpha_{\min}$ and $\alpha_{\max}$ are chosen to be 0 and 64 respectively. The first line represents $G$ and $\mathrm{VR}_\alpha(G)$ for different values of $\alpha$. The persistence barcodes and diagrams of order 0 and 1 are represented in red and blue respectively on the second line.

The relevance of the above construction relies on the persistence stability theorem (Chazal et al. 2016). It ensures that close weighted graphs have close persistence diagrams. More precisely, if $G, G'$ are two weighted graphs with the same vertex set $V$ and edge weight functions $s : V \times V \to \mathbb{R}$ and $s' : V \times V \to \mathbb{R}$ respectively, then for any order $d$, the so-called bottleneck distance between the persistence diagrams $D_d(G)$ and $D_d(G')$ is upper bounded by $\|s - s'\|_\infty := \sup_{v,v' \in V} |s_{v,v'} - s'_{v,v'}|$ - see Chazal et al. (2014) for formal persistence stability statements for Vietoris-Rips complexes.

2.1.2 From similarity matrices to persistence diagrams


Recall here that the data is assumed to be a multivariate time series $(Y_t)_{t\in[0,L]} \in \mathbb{R}^D \times [0,L]$. We intend to extract the topological information pertaining to the dependence structure between the $D$ channels via persistent homology.
To do so, for a window size $\Delta > 0$ and a stride parameter $s > 0$, we begin by slicing the $D$-dimensional time series into $n+1$ sub-intervals of the form $[st, st+\Delta]$, where $0 \le t \le n$ and $n = \lfloor (L - \Delta)/s \rfloor$.


Then, for each sub-interval $[st, st+\Delta]$, a coherence graph $G_t$ is built, starting from the fully-connected graph $([[1, D]], E)$ and specifying edge values as $s_{i,j,t} = 1 - \mathrm{Cor}_t(Y_i, Y_j)$, that is, 1 minus the correlation between channels $i$ and $j$ computed over the $t$-th interval.
The size of the time bins $\Delta$ is the time series equivalent of the resolution in images. In practice, choosing a suitable resolution $\Delta$ requires some prior information on the resolution or scale at which the anomalies occur in the time series and might be detected. To the best of our knowledge, methods dedicated to detecting changes in time series implicitly use some form of this prerequisite, see Section 4. Another way to formulate this is that the notion of anomalies in time series is better defined at a certain resolution $\Delta$, but a proper and explicit definition is outside the scope of this work.
Since the number $n+1$ of coherence graphs satisfies $n = \lfloor (L - \Delta)/s \rfloor$, choosing a large stride $s$ reduces the sample size $n$ available for the task that follows. In practice, the stride $s$ is chosen to be small, sometimes 1, as small as computation time and power allow, so as to feed the next step with as much data as possible. From a theoretical point of view, it turns out that the choice of $s$ has no impact on the convergence results (see Section 3.3 for details).
Lastly, the persistence diagrams of the Vietoris-Rips filtration are computed (one per homology order), resulting in sequences of diagrams $X_t^{(d)}$, with $0 \le t \le n$, where $d$ denotes the homology order. An example of sequences of windows and corresponding persistence diagrams is represented in Figure 3. In what follows, a fixed homology order is considered, so that the index $d$ is dropped. In practice, the vectorization steps that follow, as well as the anomaly detection procedure, are performed order-wise.

Figure 3: Left: three sliding windows on an illustrative Ornstein–Uhlenbeck (AR1) synthetic process with additive, punctual anomalies. Right: the three topological descriptors corresponding to those sliding windows (persistence diagrams with homology dimension 0 (red), 1 (blue) and 2 (green) features), according to Algorithm 1.

It is worth noting that other dependence measures, such as the ones based on coherence (Ombao and Pinto 2021), may be chosen instead of correlation to build the weighted graphs. Such alternative choices affect neither the overall methodology nor the theoretical results provided below.

In the numerical experiments, we give results for the correlation weights, which have the advantage of simplicity and carry a few insights: following Bourakna et al. (2022), such weights are enough to detect structural differences in the case where the channels $Y_j$ are mixtures of independent components $Z_p$, the weights of the mixture being given by a (hidden) graph on the $p$'s whose structure drives the behavior of the observed persistence diagrams.

Finally, it is worth recalling here that Vietoris-Rips filtrations may be built on top of arbitrary metric spaces, so that the persistence diagram construction may be performed in more general cases, encompassing valued graphs (with values on nodes or edges) for instance. The vectorization and detection steps below take such persistence diagrams as inputs, and can be applied verbatim to any situation where persistence diagrams can be built from data.

2.2 Step 2: convert persistence diagrams into vectors

Once a sequence of persistence diagrams $(X_i)_{i=1,\ldots,n}$ is built from data, the next step is to convert these persistence diagrams into a vector-valued sequence. There exist several data-driven methods to perform this vectorization step, such as Hofer et al. (2019); Carrière et al. (2020), to name a few supervised ones.

The simplest methods (Adams et al. 2017; Royer et al. 2021) consist in evaluating a persistence diagram $X$, considered as a discrete measure, on several test functions of the form $u \mapsto \psi(\|u - c\|/\sigma)$, where $\psi$ is a fixed kernel, and $c$ and $\sigma$ are respectively a centroid and a scale that vary to provide different test functions. Intuitively speaking, this amounts to encoding how much mass a persistence diagram spreads around different centers at different scales. Notice that such vectorizations fit within the general framework of linear representations of persistence diagrams, see Wu et al. (2024); Divol and Chazal (2019).

In the case where the persistence diagrams $(X_i)_{i=1,\ldots,n}$ are i.i.d., it has been shown that the centroid and scale selection procedure exposed in Royer et al. (2021) offers several advantages over the fixed-grid approach of Persistence Images (Adams et al. 2017). For instance, for the same budget $K$ of centroids, the approach of Royer et al. (2021) experimentally outperforms the fixed-grid approach (see Royer et al. 2021, Table 3). In most of the applications depicted in Royer et al. (2021); Chazal et al. (2021), as well as in Section 4, a budget $K = 10$ is enough to capture most of the topological information. Such a low-dimensional vectorization usually spares the user a dimension-reduction step. Furthermore, some theoretical guarantees are available (Chazal et al. 2021) that assess the relevance of this procedure in a clustering framework. We thus adopt the same strategy for the dependent case.
Algorithm 2: Persistence diagram vectorization
Input: $K$: dimension of the output vector. $T$: stopping time.
Data: $X_1, \ldots, X_n$ discrete measures.
1 Use Algorithm 3 (with stopping time $T$) or Algorithm 4 to get $K$ centroids $c^{(T)} = (c_1^{(T)}, \ldots, c_K^{(T)})$.
2 for $i = 1, \ldots, n$ do
3   Use Algorithm 5 with parameter $c^{(T)}$ on $X_i$ to get $v_i \in \mathbb{R}^K$.
Output: Vectorization $v = (v_1, \ldots, v_n)$.
Let us mention here that this vectorization algorithm may be applied to any dependent sequence of measures (such as texts, point process realizations, etc.), persistence diagram data being one example of such a framework. The following subsections give details on the centroid computation algorithms (Algorithms 3 and 4) and the vectorization algorithm (Algorithm 5) in the persistence diagram case.

2.2.1 Centroids computation

Recall from Section 2.1.1 that the persistence diagrams $X_i$ are thought of as discrete measures on $\mathbb{R}^2$, that is
$$X_i = \sum_{(b,d) \in D_i} \omega_{(b,d)}\, \delta_{(b,d)},$$
where $D_i$ is the $i$-th persistence diagram considered as a multiset of points (see Section 2.1.1), and the $\omega_{(b,d)}$ are weights given to the points of the persistence diagram (usually given as a function of the distance from the diagonal, see e.g. Adams et al. 2017).
The goal is to find $K$ centroids $(c_1, \ldots, c_K) \in (\mathbb{R}^2)^K$ and scales $(\sigma_1, \ldots, \sigma_K)$ such that the vectorization $X_i \mapsto (X_i(du)\,\psi(\|u - c_1\|/\sigma_1), \ldots, X_i(du)\,\psi(\|u - c_K\|/\sigma_K))$ is an accurate enough representation of the original persistence diagrams (where $X(du)f(u)$ denotes the integration of $f$ with respect to $X$).
The key idea behind the ATOL procedure (Chazal et al. 2021) is to choose as centroids nearly optimal minimizers of the empirical least-squares criterion
$$c = (c_1, \ldots, c_K) \mapsto \bar{X}_n(du) \min_{j=1,\ldots,K} \|u - c_j\|^2, \qquad (1)$$
where $\bar{X}_n$ denotes the empirical mean measure, $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.
As mentioned above, this method offers several advantages over fixed-grid strategies such as Persistence Images in the i.i.d. case (see Royer et al. 2021; Chazal et al. 2021 for a more in-depth discussion). The two algorithms that follow expose two methods to approximately minimize (1); their theoretical justification is given in Section 3. Let us begin with the batch algorithm.
Algorithm 3: Centroids computation - ATOL - Batch algorithm
Input: $K$: number of centroids. $T$: stopping time.
Data: $X_1, \ldots, X_n$ discrete measures.
1 Initialization: $c^{(0)} = (c_1^{(0)}, \ldots, c_K^{(0)})$ randomly sampled from $\bar{X}_n$.
2 for $t = 1, \ldots, T$ do
3   for $j = 1, \ldots, K$ do
4     $W_{j,t-1} \leftarrow \{x \in \mathbb{R}^2 \mid \forall i \neq j,\ \|x - c_j^{(t-1)}\| \le \|x - c_i^{(t-1)}\|\}$ (ties arbitrarily broken).
5     if $\bar{X}_n(W_{j,t-1}) \neq 0$ then
6       $c_j^{(t)} \leftarrow \bar{X}_n(du)\left(u\,\mathbf{1}_{W_{j,t-1}}(u)\right) / \bar{X}_n(W_{j,t-1})$.
7     else
8       $c_j^{(t)} \leftarrow$ random sample from $\bar{X}_n$.
Output: Centroids $c^{(T)} = (c_1^{(T)}, \ldots, c_K^{(T)})$.
Algorithm 3 is the same as in the i.i.d. case (Chazal et al., 2021, Algorithm 1). Moreover, almost the same convergence guarantees as in the i.i.d. case may be proven: for a good-enough initialization, only $2\log(n)$ iterations are needed to achieve statistically optimal convergence (see Theorem 5 below). Therefore, a practical implementation of Algorithm 3 should run several threads based on different initializations (possibly in parallel), each of them being stopped after $2\log(n)$ steps, yielding a time complexity of $O(n\log(n) \times n_{start})$, where $n_{start}$ is the number of threads.
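With unit point weights, these Lloyd iterations on the empirical mean measure $\bar{X}_n$ reduce to $k$-means-style iterations on the pooled diagram points. A minimal sketch under that assumption (the helper name is ours):

```python
import numpy as np

def atol_batch_centroids(diagrams, K, T, seed=0):
    """Sketch of Algorithm 3 with unit point weights: T Lloyd iterations on the
    empirical mean measure, i.e. on the pooled (finite) diagram points."""
    rng = np.random.default_rng(seed)
    pts = np.vstack(diagrams)                       # support of the mean measure
    c = pts[rng.choice(len(pts), size=K, replace=False)].astype(float)
    for _ in range(T):
        cells = ((pts[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(K):
            members = pts[cells == j]               # points falling in W_{j,t-1}
            if len(members) > 0:
                c[j] = members.mean(0)              # cell barycenter
            else:
                c[j] = pts[rng.integers(len(pts))]  # resample empty cells
    return c
```

Running $n_{start}$ such threads from different initializations and keeping the output with the best empirical criterion (1) matches the practical recommendation above.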
As in the i.i.d. case, an online version of Algorithm 3 may be conceived, based on mini-batches. In what follows, for a convex set $C \subset \mathbb{R}^d$, we let $\pi_C$ denote the Euclidean projection onto $C$.
Algorithm 4: Centroids computation - ATOL - Minibatch algorithm
Input: $K$: number of centroids. $q$: size of mini-batches. $R$: maximal radius.
Data: $X_1, \ldots, X_n$ discrete measures.
1 Initialization: $c^{(0)} = (c_1^{(0)}, \ldots, c_K^{(0)})$ randomly sampled from $\bar{X}_n$. Split $X_1, \ldots, X_n$ into $n/q$ mini-batches of size $q$: $B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4}, \ldots, B_{t,1}, B_{t,2}, B_{t,3}, B_{t,4}, \ldots, B_{T,1}, B_{T,2}, B_{T,3}, B_{T,4}$, with $T = n/4q$.
2 for $t = 1, \ldots, T$ do
3   for $j = 1, \ldots, K$ do
4     $W_{j,t-1} \leftarrow \{x \in \mathbb{R}^2 \mid \forall i \neq j,\ \|x - c_j^{(t-1)}\| \le \|x - c_i^{(t-1)}\|\}$ (ties arbitrarily broken).
5     if $\bar{X}_{B_{t,1}}(W_{j,t-1}) \neq 0$ then
6       $c_j^{(t)} \leftarrow \pi_{B(0,R)}\left( \bar{X}_{B_{t,3}}(du)\left(u\,\mathbf{1}_{W_{j,t-1}}(u)\right) / \bar{X}_{B_{t,1}}(W_{j,t-1}) \right)$.
7     else
8       $c_j^{(t)} \leftarrow c_j^{(t-1)}$.
Output: Centroids $c^{(T)} = (c_1^{(T)}, \ldots, c_K^{(T)})$.
Contrary to Algorithm 3, Algorithm 4 differs from its i.i.d. counterpart given in Chazal et al. (2021). First, the theoretically optimal size of the batches is now driven by the decay of the β-mixing coefficients of the sequence of persistence diagrams, as will be made clear by Theorem 6 below.
Second, half of the samples are discarded (the $B_{t,j}$'s with even $j$). This is due to theoretical constraints: the mini-batches that are used must be spaced enough to guarantee a prescribed amount of independence. Of course, the even $B_{t,j}$'s could be used to compute a parallel set of centroids. In the numerical experiments, however, all the samples are used (no space is left between mini-batches) with no noticeable side effect.
From a computational viewpoint, Algorithm 4 is single-pass, so that, if $n_{start}$ threads are run, the global complexity is $O(n \times n_{start})$.
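A sketch of the corresponding single-pass update, again with unit point weights, following the theoretical scheme (cell masses from $B_{t,1}$, cell sums from $B_{t,3}$, projection onto $B(0, R)$); the helper name and batch indexing are our own illustrative choices:

```python
import numpy as np

def atol_minibatch_centroids(diagrams, K, q, R, seed=0):
    """Sketch of Algorithm 4 with unit point weights: one pass over groups of
    four mini-batches of q diagrams, skipping the even-indexed ones."""
    rng = np.random.default_rng(seed)
    pool = np.vstack(diagrams)
    c = pool[rng.choice(len(pool), size=K, replace=False)].astype(float)
    for t in range(len(diagrams) // (4 * q)):                      # T = n / 4q
        b1 = np.vstack(diagrams[4 * t * q:(4 * t + 1) * q])        # B_{t,1}
        b3 = np.vstack(diagrams[(4 * t + 2) * q:(4 * t + 3) * q])  # B_{t,3}
        lab1 = ((b1[:, None, :] - c) ** 2).sum(-1).argmin(1)       # cells W_{j,t-1}
        lab3 = ((b3[:, None, :] - c) ** 2).sum(-1).argmin(1)
        for j in range(K):
            mass = (lab1 == j).sum() / q            # empirical X̄_{B_{t,1}}(W_{j,t-1})
            if mass > 0:
                cj = b3[lab3 == j].sum(0) / q / mass
                norm = np.linalg.norm(cj)
                c[j] = cj if norm <= R else cj * (R / norm)   # projection onto B(0, R)
    return c
```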
Figure 4 below depicts an instance of centroids computed by these algorithms.

Figure 4: Left: three representative topological descriptors in the form of persistence diagrams with homology dimensions 0, 1 and 2. Right: the sum of the topological descriptors and the centroids (purple stars, two per dimension) computed from them in dimensions 0, 1 and 2 according to Algorithm 3 or Algorithm 4.

2.2.2 Conversion into a vector-valued sequence


Once the centroids $c^{(T)}$ are built, the next step is to convert the persistence diagrams $(X_i)_{i=1,\ldots,n}$ into vectors. The approach here is the same as in Royer et al. (2021). Denoting by $\psi_{AT} : u \mapsto \exp(-u^2)$, a persistence diagram $X_i$ is mapped onto
$$v_i = \left( X_i(du)\,\psi_{AT}(\|u - c_1^{(T)}\|/\sigma_1), \ldots, X_i(du)\,\psi_{AT}(\|u - c_K^{(T)}\|/\sigma_K) \right), \qquad (2)$$
where the scales $\sigma_j$ are defined by
$$\sigma_j = \min_{\ell \neq j} \|c_\ell^{(T)} - c_j^{(T)}\|/2, \qquad (3)$$
which roughly captures the width of the area corresponding to the centroid $c_j^{(T)}$. Other choices of kernel $\psi$ are possible (see e.g. Chazal et al. 2021), as well as other methods for choosing the scales. The proposed approach has the benefit of not requiring a careful parameter-tuning step, and seems to perform well in practice.
We encapsulate this vectorization method as follows; an example vectorization is shown in Figure 5.
Algorithm 5: Vectorization step
Input: Centroids $c_1, \ldots, c_K$
Data: A persistence diagram $X$
1 for $j = 1, \ldots, K$ do
2   Compute $\sigma_j$ as in (3);
3   $v_j \leftarrow X(du)\,\psi_{AT}(\|u - c_j\|/\sigma_j)$
Output: Vectorization $v = (v_1, \ldots, v_K)$.
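A minimal sketch of this step with unit point weights ($\sigma_j$ as in (3), $\psi_{AT}(u) = e^{-u^2}$); note that the GUDHI library ships a tested Atol vectorizer in its `gudhi.representations` module, so the helper below is purely illustrative:

```python
import numpy as np

def atol_vectorize(diagram, centroids):
    """Sketch of Algorithm 5 with unit point weights: map one persistence
    diagram, an (m, 2) array, to a K-dimensional vector via Eqs. (2)-(3)."""
    gaps = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(gaps, np.inf)
    sigma = gaps.min(axis=1) / 2.0                              # scales, Eq. (3)
    dist = np.linalg.norm(diagram[None, :, :] - centroids[:, None, :], axis=-1)
    return np.exp(-((dist / sigma[:, None]) ** 2)).sum(axis=1)  # Eq. (2)
```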

Figure 5: Left: the sum of the topological descriptors and their centroids (purple stars, two per dimension) computed from them in dimensions 0, 1 and 2 by Algorithm 3 or 4. Right: the derived topological vectorization of the entire time series, computed relative to each center according to Algorithm 5.

2.3 Step 3: build an anomaly score


We assume now that we observe the vector-valued sequence $v_1, \ldots, v_n$ of vectorized persistence diagrams, and intend to build a procedure to determine whether a new diagram (processed with Algorithm 2) may be considered an anomaly.
We first estimate the 'normal' behavior of the vectorizations $v_i$, which are thought of as originating from a base regime. Namely, we build the sample mean and covariance
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n v_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (v_i - \hat{\mu})(v_i - \hat{\mu})^T. \qquad (4)$$


In the case where the base sample may be corrupted, robust strategies for mean and covariance estimation such as Rousseeuw and Driessen (1999); Hubert et al. (2018) may be employed. More precisely, for a fraction parameter $h \in [0, 1]$ we use the Minimum Covariance Determinant (MCD) estimator defined by
$$\hat{I} \in \underset{I \subset \{1,\ldots,n\},\, |I| = \lceil nh \rceil}{\arg\min}\ \mathrm{Det}\left( \frac{1}{|I|}\sum_{i \in I} (v_i - \bar{v}_I)(v_i - \bar{v}_I)^T \right), \qquad \hat{\mu} = \bar{v}_{\hat{I}}, \qquad \hat{\Sigma} = c_0 \left( \frac{1}{|\hat{I}|}\sum_{i \in \hat{I}} (v_i - \hat{\mu})(v_i - \hat{\mu})^T \right), \qquad (5)$$
where $\bar{v}_I$ denotes the empirical mean over the subset $I$, and $c_0$ is a normalization constant that can be found in Hubert et al. (2018). In the anomaly detection setting, assuming that at least half of the sample points are not anomalies leads to the conservative choice $h = 1/2$. In practice, for $K$-dimensional vectorizations $v_1, \ldots, v_n$, we adopt the default value $h = (n + K + 1)/2n$ prescribed in Rousseeuw and Driessen (1999); Lopuhaa and Rousseeuw (1991), which maximizes the finite-sample breakdown point of the resulting covariance estimator. In all the experiments exposed in Section 4, we use the approximation of (5) provided in Rousseeuw and Driessen (1999).
Now, for a new vector $v$, a detection score is built via
$$s^2(v) = (v - \hat{\mu})^T \hat{\Sigma}^{-1} (v - \hat{\mu}), \qquad (6)$$
which expresses the normalized distance to the mean behavior of the base regime. We refer to the illustrative example in Figure 6.
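This robust estimation step is available off the shelf: scikit-learn's `MinCovDet` implements the FAST-MCD approximation of Rousseeuw and Driessen (1999), and its `mahalanobis` method returns precisely the squared score (6). A minimal sketch (the wrapper name is ours):

```python
import numpy as np
from sklearn.covariance import MinCovDet

def fit_anomaly_score(V, support_fraction=None):
    """Sketch of Step 3: fit the MCD estimator (5) on the (n, K) matrix of
    vectorizations V and return v -> s²(v) as in Eq. (6). With
    support_fraction=None, scikit-learn uses the default h = (n + K + 1)/(2n)."""
    mcd = MinCovDet(support_fraction=support_fraction).fit(V)
    return lambda v: mcd.mahalanobis(np.atleast_2d(v))
```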

Figure 6: Left: the derived topological vectorization of the entire time series, computed relative to each center according to Algorithm 5. Right: the binary anomalous timestamps y of the original signal (top, in blue), matched by the topological anomaly score based on the dimension 0 and 1 features of Algorithm 6 (bottom, in orange).


If we let $\hat{s}$ denote the score function based on the data $(Y_t)_{t\in[0,L]}$, then anomaly detection tests of the form
$$T_\alpha(v) = \mathbf{1}_{\hat{s}(v) > t_\alpha}$$
may be built. To assess the relevance of this family of tests, the ROC AUC and RANGE PR AUC metrics are used in the application Section 4.
Should a test with a specific type I error $\alpha$ be needed, a calibration of $t_\alpha$ as the $1 - \alpha$ quantile of scores on a sample from the normal regime could be performed. Section 3.2 proves that this strategy is theoretically grounded.

2.4 Summary
We can now summarize the whole procedure into the following algorithm, with a complementary descriptive scheme in Figure 1.
Algorithm 6: TADA: Detection score from base regime time series
Input:
- TDA parameter: an integer $p$, the maximal homology order;
- Time series parameters: a window size $\Delta$ and a stride $s$;
- Vectorization parameters: a dimension $K$, a number of threads $n_{start}$, and possibly a stopping time $T$ or a mini-batch size $q$;
- Anomaly score parameter: a support fraction parameter $h$.
Data: A multivariate time series $(Y_t)_{t\in[0,L]} \in \mathbb{R}^D \times [0,L]$ (base regime, possibly corrupted)
1 Convert $(Y_t)_{t\in[0,L]}$ into $n$ persistence diagrams $(X_i)_{i=1,\ldots,n}$ via Algorithm 1;
2 Convert $(X_i)_{i=1,\ldots,n}$ into $v_1, \ldots, v_n$ using Algorithm 2;
3 Compute an anomaly score function $s : \mathbb{R}^K \to \mathbb{R}_+$ defined by (6) from the vector-valued sequence.
Output: An anomaly score function $s : \mathbb{R}^K \to \mathbb{R}_+$.
As such, Algorithm 6 produces an anomaly score for all persistence diagrams $(X_i)_{i=1,\ldots,n}$. For practical uses one often needs anomaly scores on the initial timestamps $t \in [0, L]$; in this case a standard time-sequence remapping is performed, and we defer to Section 4 for explicit details.
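Putting the sketches of the previous sections together, a toy end-to-end run of Algorithm 6 could look as follows (this reuses the hypothetical helpers `sliding_diagrams`, `atol_batch_centroids`, `atol_vectorize` and `fit_anomaly_score` sketched above, with the default parameter values discussed below):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 20000))      # toy base regime: D = 8 channels

diags = sliding_diagrams(Y, delta=500, stride=50, d=1)                   # Step 1
c = atol_batch_centroids(diags, K=5, T=int(2 * np.log(len(diags))) + 1)  # Step 2
V = np.stack([atol_vectorize(X, c) for X in diags])
score = fit_anomaly_score(V)                                # Step 3, Eq. (6)

s2_new = score(V[-1])   # squared anomaly score of a (new) vectorized window
```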
We summarize here the impact and default choices of the parameters:

• Time series parameters: the window size $\Delta$ must be chosen according to the length of the anomalies the user wants to detect. To be conservative, the stride $s$ should be chosen as small as computation time allows, and we use a heuristic default $s = \Delta/10$, although in theory any reasonable choice of $s$ yields the same result.

• $p$, the maximal homology order: this depends on the complexity of the objects analyzed, as each homology order encodes topological features of the corresponding dimension (connected components for homology order 0, cycles and holes for homology order 1, 2-dimensional voids for homology order 2, ...). The computational cost of persistence and the difficulty of interpreting persistence diagrams increase with the homology order. As a consequence, in most practical applications the maximum homology order is set to 1 or 2. In our time series application case, we restrict to $p = 1$, as the produced dependence structures generally do not exhibit relevant higher-dimensional features.

• Vectorization parameters: the dimension $K$ can be chosen small, as one of the virtues of this algorithm is that it is very efficient even at very small scales. The default value $K = 5$ per homology dimension yields good results in practice. Since Steps 2 and 3 are fast from a computational point of view, the user may try several increasing values of $K$ and stop when satisfied. The number of threads $n_{start}$ corresponds to the number of parallel runs of Algorithm 3 or 4, based on different initializations. The default value $n_{start} = 10$ performs well in practice. The choice of $T$ is relevant only when Algorithm 3 is used, in which case the default value $T = 2\log(n)$ works well in theory and in practice. The choice of $q$ is relevant only when Algorithm 4 is used; in this case several increasing values of $q$ may be tried, or $q$ may be chosen based on prior knowledge of the mixing coefficients of the sequence of persistence diagrams (see Section 3.1 for details).

• Anomaly score parameter: the default choice $h = (n + K + 1)/2n$ is theoretically grounded (see Rousseeuw and Driessen 1999; Lopuhaa and Rousseeuw 1991) and works well in practice. There are two situations where $h$ could be chosen larger than the default. First, when the user has access to normal-regime data to calibrate the method; in this case robust estimators are not needed and $h$ should be set to 1. Second, in the small-sample case, $nh$ may be too small to yield an accurate enough covariance estimator in (5), so that larger values of $h$ may be tried.

Overall, only two parameters play an important role in our procedure: the window size $\Delta$, which corresponds to the resolution of the anomalies to be detected, and $K$, the size of the vectorizations of persistence diagrams. Importantly, no prior knowledge of the temporal dependence structure (the mixing coefficients for instance) of the original time series $(Y_t)_{t\in[0,L]}$ is needed. All the experiments described in Section 4 use the default values.
Our proposed anomaly detection procedure, Algorithm 6, has a lean design for the following reasons. First, it has few parameters, all of them coming with default values, with the exception of the time series resolution $\Delta$, which must be adjusted by reflecting on the type of anomalies one is looking for. Second, very little tuning is needed: in the entire application sections to come, the only parameter to change will be that resolution parameter $\Delta$, a parameter shared with other methods; all other parameters are set to their default values. Third, once trained on some data, TADA does not require much memory: only the results of Algorithm 5 (centroids) and Algorithm 6 (training vectorization mean and covariance) are needed in order to produce topological anomaly scores. This implies that our methodology is easy to deploy and requires no memory of the training data, which is often welcome in privacy-sensitive contexts, for instance. It also means that the methodology compares very favorably to memory-heavy methods such as tree-based methods, neural networks, etc.


3. Theoretical results
In this section we assess the relevance of our methodology from a theoretical point of view. Sections 3.1 and 3.2 give results in the general case where the sample is a stationary sequence of random measures. Section 3.3 provides some details on how persistence diagrams built from multivariate time series, as exposed in Section 2.1.1, can be cast into this general framework.

3.1 Convergence of Algorithms 3 and 4


In what follows we assume that $X_1, \ldots, X_n$ is a stationary sequence of random measures over $\mathbb{R}^d$, with common distribution $X$. Some assumptions on $X$ are needed to ensure convergence of Algorithms 3 and 4.
First, let us introduce $\mathcal{M}_{N_{max}}(R, M)$, the set of random measures that are bounded in space, mass and support size.

Definition 2 For $R, M > 0$ and $N_{max} \in \mathbb{N}^*$, we let $\mathcal{M}_{N_{max}}(R, M)$ denote the set of discrete measures $\mu$ on $\mathbb{R}^d$ that satisfy
1. $\mathrm{Supp}(\mu) \subset B(0, R)$,
2. $\mu(\mathbb{R}^2) \le M$,
3. $|\mathrm{Supp}(\mu)| \le N_{max}$.
Accordingly, we let $\mathcal{M}(R, M)$ denote the set of measures such that 1 and 2 hold.

With a slight abuse, if $X$ denotes a distribution of random measures, we will write $X \in \mathcal{M}_{N_{max}}(R, M)$ whenever $X \in \mathcal{M}_{N_{max}}(R, M)$ almost surely. As detailed in Section 3.3, persistence diagrams built from correlation matrices satisfy the requirements of Definition 2.
In the i.i.d. case, Chazal et al. (2021) proves that the output of Algorithm 3 with $T = 2\log(n)$ iterations returns a statistically optimal approximation of
$$c^* \in \mathcal{C}_{opt} = \underset{c \in (\mathbb{R}^2)^K}{\arg\min}\ E(X)(du) \min_{j=1,\ldots,K} \|u - c_j\|^2 := F(c), \qquad (7)$$
where $E(X)$ is the so-called mean measure $E(X) : A \in \mathcal{B}(\mathbb{R}^2) \mapsto E(X(A))$. Note here that $X \in \mathcal{M}(R, M)$ ensures that $\mathcal{C}_{opt}$ is non-empty (see, e.g., Chazal et al. 2021, Section 3). For the aforementioned result to hold, a structural condition on $E(X)$ is also needed.
For a vector of centroids $c = (c_1, \ldots, c_k) \in B(0, R)^k$, we let
$$W_j(c) = \{x \in \mathbb{R}^d \mid \forall i < j,\ \|x - c_j\| < \|x - c_i\| \text{ and } \forall i > j,\ \|x - c_j\| \le \|x - c_i\|\},$$
$$N(c) = \{x \mid \exists i < j,\ x \in W_i(c) \text{ and } \|x - c_i\| = \|x - c_j\|\},$$
so that $(W_1(c), \ldots, W_k(c))$ forms a partition of $\mathbb{R}^2$ and $N(c)$ represents the skeleton of the Voronoi diagram associated with $c$. The margin condition below requires that the mass of $E(X)$ around $N(c^*)$ be controlled, for every possible optimal $c^* \in \mathcal{C}_{opt}$. To this aim, let us denote by $B(A, t)$ the $t$-neighborhood of $A$, that is $\{y \in \mathbb{R}^d \mid d(y, A) \le t\}$, for any $A \subset \mathbb{R}^d$ and $t > 0$. The margin condition then reads as follows.


Definition 3 $E(X) \in \mathcal{M}(R, M)$ satisfies a margin condition with radius $r_0 > 0$ if and only if, for all $0 \le t \le r_0$,
$$\sup_{c^* \in \mathcal{C}_{opt}} E(X)\left(B(N(c^*), t)\right) \le \frac{B\, p_{min}}{128 R^2}\, t,$$
where $B(N(c^*), t)$ denotes the $t$-neighborhood of $N(c^*)$ and
1. $B = \inf_{c^* \in \mathcal{C}_{opt},\, i \neq j} \|c^*_i - c^*_j\|$,
2. $p_{min} = \inf_{c^* \in \mathcal{C}_{opt},\, j=1,\ldots,k} E(X)(W_j(c^*))$.

According to (Chazal et al., 2021, Proposition 7), $B$ and $p_{min}$ are positive quantities whenever $E(X) \in \mathcal{M}(R, M)$. In a nutshell, a margin condition ensures that the mean distribution $E(X)$ is well-concentrated around $k$ poles. For instance, finitely-supported distributions satisfy a margin condition. To the best of our knowledge, margin-like conditions are always required to guarantee convergence of Lloyd-type algorithms (Tang and Monteleoni 2016; Levrard 2018) in the i.i.d. case.
For our motivating example of persistence diagrams built from a sequence of correlation matrices between the channels of a time series, we cannot assume independence between observations anymore. To adapt the argument of Chazal et al. (2021) to this framework, a quantification of dependence between discrete measures is needed. We choose here to quantify dependence between observations via β-mixing coefficients, whose definition is recalled below.

Definition 4 For $t \in \mathbb{Z}$ we denote by $\sigma(-\infty, t)$ (resp. $\sigma(t, +\infty)$) the sigma-field generated by $\ldots, X_{t-1}, X_t$ (resp. $X_t, X_{t+1}, \ldots$). The β-mixing coefficient of order $q$ is then defined by
$$\beta(q) = \sup_{t \in \mathbb{Z}} E\left[ \sup_{B \in \sigma(t+q, +\infty)} |P(B \mid \sigma(-\infty, t)) - P(B)| \right].$$
Recalling that the sequence of persistence diagrams is assumed to be stationary, its β-mixing coefficient of order $q$ may equivalently be written as
$$\beta(q) = E\left( d_{TV}\left( P_{(X_q, X_{q+1}, \ldots) \mid \sigma(\ldots, X_0)},\ P_{(X_q, X_{q+1}, \ldots)} \right) \right),$$
where $d_{TV}$ denotes the total variation distance and $P_Z$ denotes the distribution of $Z$, for a generic random variable $Z$. As detailed in Section 3.3, mixing coefficients of persistence diagrams built from a multivariate time series may be bounded in terms of mixing coefficients of the base time series. Whenever these coefficients are controlled, results from the i.i.d. case may be adjusted to the dependent one.
We begin with an adaptation of (Chazal et al., 2021, Theorem 9) to the dependent case.

Theorem 5 Assume that $X_1, \ldots, X_n$ is stationary, with distribution $X \in \mathcal{M}_{N_{max}}(R, M)$, for some $N_{max} \in \mathbb{N}^*$. Assume that $E(X)$ satisfies a margin condition with radius $r_0$, and denote by $R_0 = \frac{B r_0}{16\sqrt{2} R}$, $\kappa_0 = \frac{R_0}{R}$. For $q \in \mathbb{N}^*$, choose $T \ge \left\lceil \frac{\log(n/q)}{\log(4/3)} \right\rceil$, and let $c^{(T)}$ denote the output of Algorithm 3.
If $q$ is such that $\beta(q)^2/q^3 \le n^{-3}$, and $c^{(0)} \in B(\mathcal{C}_{opt}, R_0)$, then, for $n$ large enough, with probability larger than $1 - c\,\frac{q k M^2}{n \kappa_0^2 p_{min}^2} - 2e^{-x}$, we have
$$\inf_{c^* \in \mathcal{C}_{opt}} \|c^{(T)} - c^*\|^2 \le \frac{B^2 r_0^2}{512 R^2}\left(\frac{q}{n}\right) + C\,\frac{M^2 R^2 k^2 d \log(k)}{p_{min}^2}\left(\frac{q}{n}\right)(1 + x),$$
for all $x > 0$, where $C$ is a constant.
Moreover, if $q$ is such that $\beta(q)/q \le n^{-1}$ and $c^{(0)} \in B(c^*, R_0)$, it holds
$$E\left[ \inf_{c^* \in \mathcal{C}_{opt}} \|c^{(T)} - c^*\|^2 \right] \le C\,\frac{d k^2 R^2 M^2 \log(k)}{\kappa_0^2 p_{min}^2}\left(\frac{q}{n}\right).$$
Intuitively speaking, Theorem 5 provides the same guarantees as in the i.i.d. case, but for a 'useful' sample size $n/q$. This 'useful' sample size corresponds to the number of sample measures that are spaced enough (in fact $q$-spaced) so that they may be considered independent enough (with respect to the targeted convergence rate in $q/n$). This point of view seems ubiquitous in machine learning results based on dependent samples (see, e.g., Agarwal and Duchi 2013, Theorem 1 or Mohri and Rostamizadeh 2010, Lemma 7).
Assessing the optimality of the requirements on $\beta(q)$ is difficult. Following (Mohri and Rostamizadeh, 2010, Corollary 20) and the comments below it, the $\beta(q) \le q/n$ condition we require to get a convergence rate in expectation seems optimal for polynomial decays ($\beta(q) = O(q^{-a})$, $a > 0$) in an empirical risk minimization framework. However, this choice leads to a convergence rate in $(q/n)^{(a-1)/(4a)}$ for Mohri and Rostamizadeh (2010), larger than our $(q/n)$ rate. Though the output of Algorithm 3 is not an empirical risk minimizer, it is likely that it has the same convergence rate as if it were (based on a similar behavior in the plain $k$-means case, see e.g. Levrard 2018). The difference between the convergence rates given in Mohri and Rostamizadeh (2010) and Theorem 5 might be due to the fact that Mohri and Rostamizadeh (2010) is set in a 'slow rate' framework, where the convexity of the excess risk function is not leveraged, whereas a local convexity result is a key argument in our result (made explicit in Chazal et al. 2021, Lemma 21).
In a fast rate setting (i.e. when the risk function is strictly convex), (Agarwal and Duchi, 2013, Theorem 5) also suggests that a milder requirement of $\beta(q)/q \le n^{-1}$ might be enough to get a $O(q/n)$ convergence rate in expectation, for online algorithms under some assumptions that will be discussed below Theorem 6 (convergence rates for an online version of Algorithm 3). To the best of our knowledge there is no lower bound in the case of stationary sequences with controlled β coefficients that could back the theoretical optimality of such procedures.
At last, the sub-exponential rate we obtain in the deviation bound under the stronger condition $\beta(q)^2/q^3 \le n^{-3}$ seems better than the results proposed in (Mohri and Rostamizadeh, 2010, Corollary 20) or (Agarwal and Duchi, 2013, Theorem 5) in terms of large deviations (in $(q/n)x$ here, to get an exponential decay). Determining whether the same kind of result may hold under the condition $\beta(q) \le q/n$ remains an open question, as far as we know.
Nonetheless, Theorem 5 provides some convergence rates (in expectation) for several decay scenarios for $\beta(q)$:
• if $\beta(q) \le C\rho^q$, for $\rho < 1$, then an optimal choice of $q$ is $q = c\log(n)$, providing the same convergence rate as in the i.i.d. case (Chazal et al., 2021, Theorem 9), up to a $\log(n)$ factor.
• if $\beta(q) = Cq^{-a}$, for $a > 0$, then an optimal choice of $q$ is $q = Cn^{\frac{1}{a+1}}$, which yields a convergence rate of $n^{-1+\frac{1}{a+1}}$.
In the last case, letting $a \to +\infty$ allows one to retrieve the i.i.d. case, whereas $a \to 0$ has as limiting case the framework where only one sample is observed (thus leading to a non-learning situation).
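For instance, the polynomial-decay choice follows from balancing $\beta(q)$ against $q/n$; a one-line check, assuming $\beta(q) = Cq^{-a}$ and the condition $\beta(q) \le q/n$ of the expectation bound:
$$Cq^{-a} \le \frac{q}{n} \iff q^{a+1} \ge Cn \iff q \ge (Cn)^{\frac{1}{a+1}}, \qquad \text{so that} \qquad \frac{q}{n} \asymp n^{-1+\frac{1}{a+1}}.$$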
Whatever the situation, a benefit of Algorithm 3 is that a correct choice of $q$ (and thus prior knowledge of $\beta(q)$) is not required to get an at least consistent set of centroids, by choosing $T = \left\lceil \frac{\log(n)}{\log(4/3)} \right\rceil$. This will not be the case for the convergence of Algorithm 4, where the size of the mini-batches $q$ is driven by prior knowledge of $\beta$.
Theorem 6 Let $q$ be large enough so that $\frac{\beta(q/18)^2}{q^2} \le n^{-2}$ and $q \ge c_0 \frac{k^2 M^2}{p_{min}^2 \kappa_0^2} \log(n)$, for a constant $c_0$ that only depends on $\int_0^1 \beta^{-1}(u)\,du$. Provided that $E(X)$ satisfies a margin condition, if the initialization satisfies the same requirements as in Theorem 5, then the output of Algorithm 4 satisfies
$$E\left[ \inf_{c^* \in \mathcal{C}_{opt}} \|c^{(T)} - c^*\|^2 \right] \le 128\, \frac{k d M R^2}{p_{min}(n/q)}.$$

As in Rio (1993), the generalized inverse $\beta^{-1}$ is defined by $\beta^{-1}(u) = |\{k \in \mathbb{N}^* \mid \beta(k) > u\}|$. In particular, for $\beta(q) \sim q^{-a}$, $\int_0^1 \beta^{-1}(u)\,du$ is finite only if $a > 1$ (which precludes the asymptotic $a \to 0$).
The requirement $\beta(q)/q^2 = O(n^{-2})$ is stronger than in Theorem 5, thus stronger than the $\beta(q)/q = O(n^{-1})$ suggested by (Agarwal and Duchi, 2013, Theorem 5) in a similar online setting. Note however that for (Agarwal and Duchi, 2013, Theorem 5) to provide a $O(q/n)$ rate under the requirement $\beta(q)/q = O(n^{-1})$, two other terms have to be controlled:
1. a total step size term $\sum_{t=1}^T \|c^{(t)} - c^{(t-1)}\|$ that must be of order $O(1)$. Controlling this term would require a slight adaptation of Algorithm 4, for instance by clipping gradients.
2. a regret term $E\left[ \sum_{t=1}^T \bar{X}_{B_t}(du)\left( d^2(u, c^{(t)}) - d^2(u, c^*) \right) \right]$ that must be of order $O(q)$. The behavior of this term remains unknown in our setting, so that determining whether the milder condition $\beta(q)/q = O(n^{-1})$ is sufficient remains an open question.

Let us emphasize here that, to optimize the bound in Theorem 6, that is, to choose the smallest possible $q$, prior knowledge of $\beta(q)$ is required. This can be the case when the original multivariate time series $Y_t$ follows a recursive equation as in Bourakna et al. (2022). Otherwise, these coefficients may be estimated, using histograms as in McDonald et al. (2015) for instance.
As in the batch case, the required lower bound on $q$ corresponds to the 'optimal' choice of mini-batch spacing so that consecutive even mini-batches may be considered i.i.d. It is then no surprise that we recover the same rate as in the i.i.d. case, but with $n/q$ samples (see Chazal et al. 2021, Theorem 10). As in the batch situation, several decay scenarios may be considered:

• for $\beta(q) \le C\rho^q$, $\rho < 1$, choosing $q = c_0 \frac{k^2 M^2}{p_{min}^2 \kappa_0^2} \log(n)$ for a large enough $c_0$ is enough to satisfy the requirements of Theorem 6, and yields the same result as in the i.i.d. case ((Chazal et al., 2021, Theorem 10)).
• for $\beta(q) = Cq^{-a}$, $\beta^{-1}(u) = (C/u)^{1/a}$. An optimal choice for $q$ is then $Cn^{\frac{2}{a+2}}$, leading to a convergence rate of $n^{-1+\frac{2}{a+2}}$.

Let us mention here that the stronger condition in Theorem 6 leads to a slower convergence bound in the polynomial decay case, compared to the output of Algorithm 3. Again, assessing the optimality of the exposed convergence rates remains, to the best of our knowledge, an open question.

3.2 Test with controlled type I error rate


In this section, we investigate the type I error of the test
$$T_\alpha : v \mapsto \mathbf{1}_{s(v) > t_{n,\alpha}},$$
where $s$ is the score function built in Section 2.3 and $t_{n,\alpha}$ will be chosen from the sample to achieve a type I error rate below $\alpha$.
To keep it simple, we assume that $\Sigma$ and $\mu$ in (4) are computed from a separate sample, so that we observe
$$\tilde{v}_i = \Sigma^{-1/2}(v_i - \mu)$$
from a stationary sequence of measures, resulting in a stationary sequence of vectors. Were $\Sigma$ and $\mu$ computed on the same sample, extra terms involving the concentration of $\Sigma$ and $\mu$ around their expectations would have to be added, as in the i.i.d. case.
We let $Z$ denote the common distribution of the $s(v_i) = \|\tilde{v}_i\|$'s, which represents the 'normal' behavior distribution of the time series structure. For the test $T_\alpha$ introduced above, its type I error is then
$$P_{\|\tilde{v}\| \sim Z}(\|\tilde{v}\| > t_{n,\alpha}).$$

A common strategy here is to choose $t_{n,\alpha}$ from the sample, such that
$$\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\|\tilde{v}_i\| > t_{n,\alpha}} \le \alpha - \delta,$$
for a suitable $\delta < \alpha$. In what follows we denote by $\hat{t}$ such an empirical choice of threshold. The following result ensures that this natural strategy remains valid in a dependent framework.

Proposition 7 Let $q \in [[1, n]]$, and let $\alpha, \delta$ be positive quantities that satisfy
$$5\sqrt{\alpha}\,\sqrt{\frac{\log(n)}{(n/q)}} \le \delta < \alpha.$$
If $\hat{t}$ is chosen such that
$$\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\|\tilde{v}_i\| > \hat{t}} \le \alpha - \delta,$$
then, with probability larger than $1 - \frac{4q}{n} - \beta(q)\sqrt{\frac{n}{\alpha q}}$, it holds
$$P_{\|\tilde{v}\| \sim Z}(\|\tilde{v}\| > \hat{t}) \le \alpha.$$

In other words, Proposition 7 ensures that the anomaly detection test $\mathbf{1}_{\|\tilde{v}\| > \hat{t}}$ has a type I error below $\alpha$, with high probability. Roughly, this bound ensures that, for confidence levels $\alpha$ above the statistical uncertainty of order $q/n$, tests with the prescribed confidence level may be achieved by increasing the threshold by a term of order $\sqrt{\alpha q/n}$.
As for Theorems 5 and 6, choosing the smallest $q$ that achieves $\beta(q)^2/q^3 \le \alpha/n^3$ optimizes the probability bound in Proposition 7:
• for $\beta(q) \le C\rho^q$, $q$ of order $C'\log(n)$ is enough to satisfy $\beta(q)/q^2 \le n^{-2}$, providing the same results as in the i.i.d. case, up to a $\log(n)$ factor.
• for $\beta(q) = Cq^{-a}$, an optimal choice for $q$ is of order $n^{\frac{3}{2a+3}}\alpha^{-1/(2a+3)}$, which leads to the same bounds as in the i.i.d. case, but with useful sample size $n/q = n^{\frac{2a}{2a+3}}\alpha^{1/(2a+3)}$.
Although this bound might be sub-optimal, it can provide some pessimistic prescriptions for selecting a threshold $\alpha - \delta$, provided that the useful sample size $n/q$ is known. For instance, assuming $\log(n) \le 6$, for $\alpha = 5\%$ the minimal $\delta$ is of order $2.7/\sqrt{n/q}$, which is negligible with respect to $\alpha$ whenever $n/q$ is large compared to roughly 3000.
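In practice, the empirical threshold $\hat{t}$ of Proposition 7 is a plain quantile of the base-regime scores; a minimal sketch (the function name is ours):

```python
import numpy as np

def calibrate_threshold(scores, alpha, delta):
    """Smallest threshold t with (1/n) * #{i : s_i > t} <= alpha - delta, i.e.
    the empirical (1 - (alpha - delta))-quantile of the base-regime scores."""
    return float(np.quantile(scores, 1.0 - (alpha - delta), method="higher"))
```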

3.3 Theoretical guarantees for persistence diagrams built from multivariate time series

We discuss here how the outputs $(X_t)_{t=1,\ldots,n}$ of Algorithm 1, based on a $D$-dimensional time series $(Y_t)_{t\in[0,L]}$, fall within the scope of the previous sections. Recall that a persistence diagram may be thought of as a discrete measure on $\mathbb{R}^2$ (see Section 2.2.1). In a nutshell, if $(Y_t)_{t\in[0,L]}$ is stationary with a certain profile of mixing coefficients, then so is $(X_t)_{t=1,\ldots,n}$.
Stationarity: Since, for $t = 1, \ldots, n$, $X_t$ may be expressed as $f((Y_u)_{u \in [s(t-1),\, s(t-1)+\Delta]})$ for some function $f$, the stationarity of $(X_t)_{t=1,\ldots,n}$ follows from the stationarity of $(Y_t)_{t\in[0,L]}$.
Boundedness: We intend to prove that the outputs of Algorithm 1 are within the scope of Definition 2. Let $d$ be a homology dimension, $t \in [[1, n]]$, and recall that $X_t$ is the order-$d$ persistence diagram built from the Vietoris-Rips filtration of $([[1, D]], E, S_t)$, where $E$ is the set of all edges and $S_t = 1 - \mathrm{Corr}(Y_{[s(t-1),\, s(t-1)+\Delta]})$ gives the weights that are filtered (see Section 2.1.1). First note that, for every $1 \le i, j \le D$, $S_{t,(i,j)} \in [0, 2]$, so that every point of the persistence diagram is in $[0, 2]^2$. Next, since the birth of a $d$-order feature is implied by the addition of a $d$-order simplex in the filtration (see for instance Boissonnat et al. (2018), Section 11.5, Algorithm 11), the total number of points in the diagram is bounded by $\binom{D}{d+1}$. At last, for a bounded weight function $\omega$, the total mass of $X_t$ may be bounded by $\binom{D}{d+1} \times \|\omega\|_\infty$. We deduce that $X_t \in \mathcal{M}_{N_{max}}(R, M)$ (Definition 2), with $R \le 4$, $N_{max} \le \binom{D}{d+1}$, and $M \le \binom{D}{d+1} \times \|\omega\|_\infty$. Note that in the experiments, we set $\omega \equiv 1$.
Mixing coefficients: Here we expose how the mixing coefficients of $(X_t)_{t=1,\ldots,n}$ (Definition 4) may be bounded in terms of those of $(Y_t)_{t\in[0,L]}$. Let us denote these coefficients by $\beta$ and $\tilde{\beta}$ respectively. If the stride $s$ is larger than the window size $\Delta$, then it is immediate that, for all $q \ge 1$, $\beta(q) \le \tilde{\beta}(qs - \Delta)$. If the stride $s$ is smaller than (or equal to) $\Delta$, then, denoting by $q_0 = \lfloor \Delta/s \rfloor + 1$, we have, for $q < q_0$, $\beta(q) \le 1$ (overlapping windows), and, for $q \ge q_0$, $\beta(q) \le \tilde{\beta}(qs - \Delta)$. The mixing coefficients of $X_t$ may thus be controlled in terms of those of $Y_t$. For fixed $\Delta$ and $s$, this ensures that the mixing coefficients of $X_t$ and $Y_t$ have the same profile (and lead to the same convergence rates in Theorems 5 and 6).
• If $\tilde{\beta}(q) \le C_Y q^{-a}$, for $C_Y, a > 0$, then, for any $q \ge q_0$, $\beta(q) \le C_Y (s - \Delta/q_0)^{-a} q^{-a}$, so that $\beta(q) \le C_X (qs)^{-a}$, for some constant $C_X$ (depending on $q_0$ and $a$).
• If $\tilde{\beta}(q) \le C_Y \tilde{\rho}^q$, for $C_Y > 0$ and $\tilde{\rho} < 1$, then, for any $q \ge q_0$, $\beta(q) \le C_Y \left( \tilde{\rho}^{(s - \Delta/q_0)} \right)^q$, so that $\beta(q) \le C_X \rho^{qs}$, for some $C_X > 0$ and $\rho < 1$ depending on $q_0$ and $\tilde{\rho}$.
In turn, the mixing coefficients of $Y_t$ may be known or bounded, for instance when it follows a recursive equation (see, e.g., Pham and Tran 1985, Theorem 3.1), or inferred (see, e.g., McDonald et al. 2015). Interestingly, the topological wheels example provided in Section 4.1 (borrowed from Bourakna et al. 2022) falls into the sub-exponential decay case.
This relation between the mixing coefficients of $(Y_t)_{t\in[0,L]}$ and those of $(X_t)_{t=1,\ldots,n}$ sheds some light on the influence of the stride parameter $s$ on the convergence results. Assume for simplicity that the original time series is discrete with $\tilde{n}$ points and that $\Delta$ is fixed. In the two examples above it holds, for $q \ge \lfloor \Delta \rfloor + 1$, $\beta(q) \le C\tilde{\beta}(qs)$, where $C$ depends on $\Delta$ and the parameters of $\tilde{\beta}$ only. For a choice of stride $s$, the resulting sample size is $n = \tilde{n}/s$, so that an optimal choice of $q \ge q_0$ with respect to Theorem 5 should satisfy
$$\frac{\beta(q)}{q} = \frac{1}{n} \iff C\,\frac{\tilde{\beta}(qs)}{(qs)} = \frac{1}{\tilde{n}},$$
to provide a convergence rate of $q/n = (qs)/\tilde{n}$. Let $\tilde{q}_n$ be such that $\tilde{\beta}(\tilde{q}_n) = \tilde{q}_n/\tilde{n}$ (optimal choice of $q$ w.r.t. $\tilde{\beta}$). Then, provided $s$ is not too large so that $\tilde{q}_n/s \ge q_0$, an optimal choice of $q$ w.r.t. the above bounds on $\beta$ is $q_n = \tilde{q}_n/s$, leading to a convergence rate of $\tilde{q}_n/\tilde{n}$, whatever the chosen $s$. This backs the intuition that any reasonable (not too large) choice of stride $s$ should lead to the same theoretical guarantees.
Margin condition: The only point that cannot be theoretically assessed in general for the outputs of Algorithm 1 is whether $E(X)$ satisfies the margin condition exposed in Definition 3. As explained below Definition 3, a margin condition holds whenever $E(X)$ is concentrated enough around $k$ poles. Thus, structural assumptions on $1 - \mathrm{Corr}(Y_{[0,\Delta]})$ (for instance $k$ prominent loops) might entail that $E(X)$ fulfills the desired assumptions (as in Levrard 2015 for Gaussian mixtures). However, we strongly believe that the requirements of Definition 3 are too strong, and that the convergence of Algorithms 3 and 4 may be assessed under milder smoothness assumptions on $E(X)$. This falls beyond the scope of this paper and is left for future work. The experimental Section 4 assesses the validity of our algorithms in practice.


4. Applications
In order to make the case for the efficiency of our proposed anomaly detection procedure
TADA, we now present an assortment of both real-case and synthetic applications. The
first application (introduced as the Topological Wheels problem) is directly derived from
Bourakna et al. (2022). It consists of a synthetic data set designed to mimic dependence patterns in brain signals, and demonstrates the relevance of a topology-based anomaly detection procedure on such complex data. The second application is an up-to-date
replication of a benchmark with the TimeEval library from Schmidl et al. (2022) on a large
array of synthetic data sets to quantitatively demonstrate competitiveness of the proposed
procedure with current state-of-the-art methods. The third application is a real-case data set
from Jacob et al. (2021) consisting of data traces from repeated executions of large-scale
stream processing jobs on an Apache Spark cluster. Lastly we produce interpretability
elements for the anomaly detection procedure TADA.
Evaluation of an anomaly detection procedure in the context of time series data has many pitfalls and can be hard to navigate; we refer to the survey of Sørbø and Ruocco (2023). Here we mainly evaluate anomaly scores with the robustified version of the Area
Under the PR-Curve: the Range PR AUC metric of Paparrizos et al. (2022a) (later just
’RANGE PR AUC’), where a metric of 1 indicates a perfect anomaly score, and a metric
close to 0 indicates that the anomaly score simply does not point to the anomalies in
the data set. For the sake of comparison with the literature we also include the Area
Under the ROC-Curve metric (later just ’ROC AUC’), although this metric is less accurate
and powerful in the unbalanced context of anomaly detection (Sørbø and Ruocco 2023).
Each collection of anomaly detection problems thus yields evaluation statistics, and to summarize comparisons between algorithms we use a critical difference diagram, that is, a statistical test between paired populations computed with the package of Herbold (2020). Furthermore
we introduce two other statistical summaries of interest:
• the ’# > .9’ metric, that is the number of anomaly detection problems for which an
algorithm has a RANGE PR AUC over .9. A RANGE PR AUC over .9 roughly indicates
that the algorithm ’finds well’ the anomalies in the data set or ’solves’ the problem;

• the ’#rank1’ metric, that is the number of problems for which an algorithm reaches the best RANGE PR AUC score among all algorithms, ties being shared (see the sketch after this list).
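As an illustration, here is a minimal sketch of how these two summaries can be computed from a matrix of RANGE PR AUC scores. The values below are toy data, and the tie-sharing rule (every algorithm attaining the row maximum receives a point) is our reading of 'ties being shared'.

import numpy as np

# scores[i, j]: RANGE PR AUC of algorithm j on problem i (toy values).
algorithms = ["Atol-IF", "TADA", "SubKNN"]
scores = np.array([[0.95, 0.97, 0.40],
                   [0.88, 0.92, 0.92],
                   [0.30, 0.85, 0.99]])

# '# > .9': number of problems an algorithm "solves" (RANGE PR AUC over .9).
solved = (scores > 0.9).sum(axis=0)

# '#rank1': number of problems where an algorithm attains the best score,
# ties being shared (every algorithm reaching the row maximum gets a point).
rank1 = np.isclose(scores, scores.max(axis=1, keepdims=True)).sum(axis=0)

for name, s, r in zip(algorithms, solved, rank1):
    print(f"{name}: # > .9 = {s}, #rank1 = {r}")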
For the purpose of comparison with the state-of-the-art we draw methods from the recent
benchmark of Schmidl et al. (2022). We take the three best performing methods from the
unsupervised, multivariate category: the ’KMeansAD’ anomaly detection based on k-means
cluster centers distance using ideas from Yairi et al. (2001), the baseline density estimator
k-nearest-neighbors algorithm on time series subsequences ’SubKNN’, and ’TorskAD’ from
Heim and Avery (2019), a modified echo state network for anomaly detection.
Furthermore, in order to better understand the value of the introduced topological
methodology, we compare to a close-resembling method that couples the topological features
of Algorithm 5 to the isolation forest algorithm from Liu et al. (2008), resulting in an
unsupervised anomaly detection method denominated as ’Atol-IF’ in reference to the Royer
et al. (2021) paper. We also couple those topological features to a random forest classifier Breiman (2001), resulting in a supervised anomaly detection method denominated as

’Atol-RF’ that gives an idea on what can be achieved in the supervised context. Lastly, to
investigate the differences between the proposed topological analysis and a more standard
spectral analysis, we compute spectral features on the correlation graphs coupled to either
an isolation forest or to a random forest classifier, in an unsupervised anomaly detection
method denominated as ’Spectral-IF’ and a supervised one named ’Spectral-RF’.
In practice all those methods involve a form of time-delay embedding, subsequence analysis, or context window analysis (we use these terms synonymously in this work), which requires computing a prediction from a window of ∆ past observations. ∆ is
a key value that acts as the equivalent of image resolution or scale in the domain of time
series. In using a subsequence analysis, given a ∆-uplet of timestamps [t] := (t1 , t2 , ..., t∆ ),
once an anomaly score s[t] is produced it is related to that particular ∆-uplet but does not
refer to a specific time step. A window reversing step is needed to map the scores to the
original timestamps. For fair comparison, we will provide all methods with the following
(same) last-step window reversing procedure: for every time step t, one computes the sum of the scores of the windows containing this time step, $\hat s_t := \sum_{[t']:\,t\in[t']} s_{[t']}$. Here we choose not to use the more classical average $\tilde s_t := \big(\sum_{[t']:\,t\in[t']} s_{[t']}\big)\big/\big(\sum_{[t']:\,t\in[t']} 1\big)$, since this average produces undesirable
border effects (the timestamps at the beginning and end of the signal are contained in less
windows, making them over-meaningful after averaging). Using the sum instead has no
effect on anomaly scoring (outside of borders) as the metrics are scale-invariant.
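For concreteness, here is a minimal sketch of this sum-based window-reversing step (the function and variable names are ours):

import numpy as np

def reverse_windows(window_scores, n_timestamps, delta, stride):
    # Map per-window anomaly scores back to per-timestamp scores by summing,
    # for every timestamp t, the scores of all windows containing t.
    point_scores = np.zeros(n_timestamps)
    for w, score in enumerate(window_scores):
        start = w * stride                      # first timestamp of window w
        end = min(start + delta, n_timestamps)  # window covers [start, end)
        point_scores[start:end] += score
    return point_scores

# Example: 10 windows of size 500 taken every 50 time steps.
rng = np.random.default_rng(0)
scores_per_timestamp = reverse_windows(rng.random(10), 950, delta=500, stride=50)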
For the specific use of TADA in this section, the centroid computation part of Section 2.2.1 uses ω(b,d) ≡ 1 and the batch version described in Algorithm 3.
Our implementation relies on The GUDHI Project (2015) for the topological data analysis
part, but also makes use of the Pedregosa et al. (2011) Scikit-learn library for the anomaly
detection part, minimum covariance determinant estimation and overall pipelining. The
code is published as part of the ConfianceAI program https://catalog.confiance.ai/
and can be found in the catalog: https://catalog.confiance.ai/records/4fx8n-6t612.
All computations were made on a standard laptop (i5-7440HQ 2.80 GHz CPU).
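To fix ideas, here is a minimal sketch of two of the building blocks mentioned above, using only GUDHI and Scikit-learn calls: a Rips persistence diagram computed from the correlation-based dissimilarity 1 − Corr of a single window, and robust Mahalanobis-type scores from the minimum covariance determinant estimator. This is not the exact TADA pipeline (in particular, the vectorization of Algorithms 3 and 5 is left out), and the parameter values are illustrative.

import numpy as np
import gudhi
from sklearn.covariance import MinCovDet

def window_diagram(window):
    # Rips persistence diagram built on the correlation-based dissimilarity
    # 1 - Corr between channels; `window` has shape (delta, n_channels).
    dissim = 1.0 - np.corrcoef(window.T)
    rips = gudhi.RipsComplex(distance_matrix=dissim, max_edge_length=2.0)
    st = rips.create_simplex_tree(max_dimension=2)
    st.persistence()
    diagrams = {}
    for dim in (0, 1):
        pairs = st.persistence_intervals_in_dimension(dim)
        if len(pairs):                      # drop the infinite dimension-0 bar
            pairs = pairs[np.isfinite(pairs[:, 1])]
        diagrams[dim] = pairs
    return diagrams

def anomaly_scores(vectors):
    # Robust Mahalanobis-type scores from the minimum covariance determinant
    # estimator, applied to any fixed-size vectorization of the diagrams.
    mcd = MinCovDet(random_state=0).fit(vectors)
    return mcd.mahalanobis(vectors)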

4.1 Introducing the Topological Wheels data set.


In this first application we introduce a hard unsupervised problem on multiple time series that emulates brain function, and compare our method with state-of-the-art anomaly detection methods as well as with competing supervised methods.
Bourakna et al. (2022) introduce ideas for evaluating methodologies relying on TDA such as ours. They allow one to produce a multiple time series with a given node dependence structure from a mixture of latent autoregressive processes of order two (AR(2)). One direct application of this type of data generation is to emulate the network structure of the brain, whose normal connectivity is affected by conditions such as ADHD or Alzheimer's disease. Therefore, in accordance with Bourakna et al. (2022), we design and introduce the Topological Wheels problem: a collection of multiple time series data sets with a latent dependence structure of 'type I' as a normal mode, and occasionally a 'type II' latent dependence structure as an abnormal mode. For the type I dependence structure we use a single wheel where every node is connected in a pair, and every pair is connected to two others, forming a wheel; then we connect a pair of pairs, forming an 8 shape or double wheel (see Figure 7).
For the type II structure we start from a double wheel and add another connection between

two pairs. The first mode of dependence is the prominent mode for the time series duration,
and is replaced for a short period at a random time by the second mode of dependence. The
total signal involves 64 time series sampled at 500 Hz for a duration of 20 seconds, see Figure
7. We produce ten such data sets and call them the Topological Wheels problem. The task consists in detecting the change in underlying patterns without supervision. We note
that by design the two modes are similar in their spectral profile, so detecting anomalies
should be hard for methods that do not capture the overall topology of the dependence
structure. The data sets are available through the public The GUDHI Project (2015) library
at github.com/GUDHI/gudhi-data.

Figure 7: Left: Synthesized time series and the latent generating process (orange) indicating normal connection (double circular wheel with middle connection, on top) or abnormal connection (double circular wheel with two connections, on the bottom). Right: Anomaly scores of all tested methods on one of the data sets, with their RANGE PR AUC metric from comparison with the truth (bottom row) in parentheses.

For assessing performance we use a cross-validation-like procedure with a focus on evaluation: we perform ten experiments and, for each experiment, every method is fitted on
one data set and evaluated on the other nine data sets. We then rotate the training data set
until all ten data sets have been used for training. We use this particular setup in order to
be able to compare supervised and unsupervised methods on comparable grounds. As for
method calibration we note the following: all methods are given (and use) the real length
of the anomalous segment of ∆ = 500 consecutive timestamps, and since all of them use
window subsequences we set the same stride for all to be s = ∆/10 = 50. Lastly for all
methods that use a fixed-size embedding (Atol-based methods and spectral-based methods),
we set the support size to K = 10.
We first show in Figure 7 the results of one iteration of learning, that is when all methods
are trained on a topological wheels data set and evaluated on another. The last row of the
figure with label ’truth’ shows the underlying signal value of the evaluated data set. The
other rows are the computed anomaly score of each method along the time x-axis, with
the convention that the lower the score, the more abnormal the signal. The corresponding
RANGE PR AUC score of each method is written in the label. This first example confirms

the intuition that methods that do not rely on topology, that is the spectral method, the
k-nearest-neighbor method and the modified echo state network method all fail to capture
the anomaly. This is particularly striking for the spectral method as it was trained with
supervision. On the other hand all methods based on the topological features manage to
capture some indication that there is an anomaly in the signal. The isolation forest method, even though it clearly separates the anomalous segment from the rest, is not reliable, as it seems to indicate other anomalies when there are none. The supervised random forest method
perfectly discriminates the anomalous segment from the rest of the time series, and so does
our method almost as reliably.

Figure 8: Left: aggregated results for the Topological Wheels problem in the form of box plots for
the ROC AUC and RANGE PR AUC metrics, and below them the points the metrics have been
computed from - where each point represents a metric score from comparing an anomaly score to
the underlying truth. Right: autorank summary (see Herbold (2020)) ranking of methods on the
Topological Wheels problem, showing competitiveness between our unsupervised TADA method and
the equivalent topological supervised method.

We now look at the aggregated results for the entire problem, see Figure 8 and Table 1. We present both ROC AUC and RANGE PR AUC averages with their standard deviations over experiments, as well as the computation times for the sake of completeness. Neither the spectral procedures nor the echo state network or the SubKNN k-nearest-neighbor method are able to capture any information from the Topological Wheels problem. Using topological features with an isolation forest yields competitive results, but remains inferior to our procedure. This demonstrates that the key information for this problem lies in the
topological embedding which is not surprising, by design. Our procedure solves this problem
almost perfectly, and although it is unsupervised it is as competitive as a comparable
supervised method. This experiment demonstrates the impact of topology-based methods
for anomaly detection, as the non-topology methods fail to capture any of the signal in the

algorithm      #xp   # > .9   #rank1   med time (s)   iqr time (s)
Atol-IF         90       14        8           18.2            5.4
Atol-RF         90       38       43           22.3            2.5
KMeansAD        90        0        0            0.6            0.0
Spectral-IF     90        0        0            1.2            0.1
Spectral-RF     90        3        2            1.1            0.1
SubKNN          90        0        0            1.3            0.1
TADA            90       51       37           18.5            3.7
TorskAD         90        1        2          112.8            2.0

Table 1: Summary statistics on the Topological Wheels problem for the algorithms evaluated. All
methods could produce scores for the 90 experiments, and, unsurprisingly, the methods relying on topological analysis overwhelmingly dominate the other methods, in 86 out of 90 experiments. Our unsupervised method TADA is on par with a supervised learning method for the number of problems where it has the best RANGE PR AUC score (#rank1 column). In seconds, the computation median time ('med time') and interquartile range ('iqr time') are standard with respect to the data sizes; also note that computations are not optimized, and are in fact performed on a single laptop.

data sets. Our proposed TADA method is clearly the best suited for learning anomalies in
this setup.

4.2 A benchmark using the TimeEval library.


We now look at a broader and more general array of problems, to evaluate the competitiveness
of our method with respect to state-of-the-art methods. For that purpose, we use the
GutenTAG multivariate data sets, drawn from Wenig et al. (2022).
We chose the GutenTAG data sets¹ for their ability to generate a great number (1174) of varied anomaly detection problems; they are mostly formed by inserting anomalies of various lengths into a multivariate time series of 10 000 timestamps. These anomalies can be abnormal in terms of frequency, variance, or extremum values.
We believe that the GutenTAG data sets are a collection of mostly easy anomaly detection
problems, but that they become challenging and interesting considered as a whole. Because
they have various types of anomalies encoded in them, running methods with little to no calibration on the entire collection is informative for evaluating performance and robustness in detecting anomalies in a somewhat general case. As the anomalies in these data sets vary
from one to a thousand consecutive timestamps, we set the window sizes of the anomaly
detectors to be a fixed ∆ = 100. Other than that, all other parameters from the previous
section are left unchanged (with the exception of the stride that depends on ∆, we keep
s = ∆/10 = 10).
1. After publication, the GutenTAG authors commented that there was an unwanted artifact in the GutenTAG data set generation (we refer to https://timeeval.github.io/evaluation-paper/notebooks/Datasets.html). We therefore used the new GutenTAG data sets provided by the authors at https://github.com/TimeEval/GutenTAG/blob/main/generation_configs/multivariate-test-cases.py, and only removed the instances where the number of time series (D = 500) caused exceedingly large computing times.

Figure 9: Left: aggregated results for the TimeEval 1084 GutenTAG synthetic data sets in the
form of box plots for the ROC AUC and RANGE PR AUC metrics, and below them the points the
metrics have been computed from - where each point represents a metric score from comparing an
anomaly score to the underlying truth. Right: autorank summary (see Herbold (2020)) ranking of
methods, showing fourth place ranking for our purely topological TADA method.

algorithm      #xp    # > .9   #rank1   med time (s)   iqr time (s)
Atol-IF       1174       859      373            3.8            0.4
KMeansAD      1174       383      233            0.8            0.1
Spectral-IF   1174       750      302            1.7            0.2
SubKNN        1174       970      996            0.9            0.3
TADA          1174       530      231            4.6            0.5
TorskAD        733        70       21           14.4            1.3

Table 2: Summary statistics on the GutenTAG problem set for the algorithms evaluated. Even
though it is ranked fourth by the statistical pairwise-ranking in Figure 9, our unsupervised method
TADA is able to solve roughly half of the problems, and is competitive on 231 of them. The other
topological method Atol-IF ranks second and is able to solve 859 of the 1174 problems. In seconds,
the computation median time (’med time’) and interquartile range (’iqr time’) are standard with
respect to the data sizes involved - computations are performed on a single laptop. TorskAD failed
to return scores on a number of data sets due to unknown bugs.

Statistical summaries and results of the experiment on the synthetic data sets are shown in Figure 9 and in Table 2. As a reminder, the SubKNN, KMeansAD and TorskAD methods
were the top three performing methods from the largest anomaly detection benchmark
to date (see Table 3 from Schmidl et al. 2022). Our TADA procedure manages to solve
roughly half of the problems and is a top contender among competitors for about a quarter
of them. Atol-IF performs better than TADA in this instance, which is not surprising as

Figure 10: Correlations between algorithm RANGE PR AUC scores on the GutenTAG data set
collections. The two topological methods TADA and Atol-IF correlate at .86, and the highest
correlation reaches .87 for Atol-IF and Spectral-IF. This indicates that the algorithms involved each
solve different GutenTAG problems although some overlap naturally occurs.

isolation forest retains much more information from training than TADA, which also implies
heavier memory loads. Overall, SubKNN performs best on those data sets, while TADA and Atol-IF show good performance; in some instances only the topological methods manage to solve the problem, see for instance Figure 11. These results demonstrate
competitiveness of our methodology in the unsupervised anomaly detection learning context.

4.3 Exathlon real data sets


Lastly we turn to a real collection of data sets: the 15 Exathlon data sets from Jacob et al. (2021), consisting of data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster, where the anomalies are intentional disturbances of those jobs.
Using the same metrics, the same collection of anomaly detection methods, and the exact same calibration as in the previous TimeEval experiment, we produce the following results. The main statistical summaries are presented in Figure 12 and Table 3. Overall the topological

Figure 11: Left: zoom in on a GutenTAG problem instance (sensors selected for visualization) with
a variance anomaly for the last two sensors (purple and red), and the ground truth (last row). Right:
problem ground truth (top row) and the corresponding anomaly scores of each method with their
RANGE PR AUC metric in parenthesis. The topological methods TADA and Atol-IF get a good
metric score and manage to find the anomalies, while no other method does.

Figure 12: Left: aggregated results for the 35 Exathlon real data sets in the form of box plots for
the ROC AUC and RANGE PR AUC metrics, and below them the points the metrics have been
computed from - where each point represents a metric score from comparing an anomaly score to the
underlying truth. Right: autorank summary (see Herbold (2020)) ranking of methods on the real
data sets, showing second and third place ranking for the topological methods Atol-IF and TADA.

methods are strong competitors for these data sets, with TADA the most often top-ranked method. Due to the real nature of the data sets, it is not surprising that the studied methods do not 'solve' them as well as they solve the GutenTAG data sets or the

algorithm      #xp   # > .9   #rank1   med time (s)   iqr time (s)
Atol-IF         15        0        3           10.0          448.3
KMeansAD        15        1        1            3.3            2.6
Spectral-IF     15        0        4            6.2            1.5
SubKNN          15        1        2            8.6           13.4
TADA            15        1        5           11.5          450.7
TorskAD         15        0        0           34.3           37.4

Table 3: Summary statistics on the Exathlon real data problems for the algorithms evaluated. Even though it is ranked third by the statistical ranking in Figure 12, our unsupervised method TADA most often attains the top RANGE PR AUC score (#rank1 column) over all problems, which hints that it is able to solve different sorts of anomaly detection problems than the others. In seconds, the computation median time ('med time') and interquartile range ('iqr time') are high for the topological methods; see the comments in the text. Computations are performed on a single laptop.

Topological Wheels data sets. We show in Figure 13 the one instance where TADA is able to solve the problem completely, and highlight that no calibration was needed to do so.

Figure 13: Left: zoom in on data set 3_2_1000000_71. Right: ground truth (top row) and anomaly scores for the six methods with their RANGE PR AUC score in parentheses. While all methods are
able to locate the beginning of the anomaly period, only TADA manages to catch it in its entirety.

One drawback of the topological methods appearing here is the high variance in execution
time, which originates from computing topological features on a great number of sensors.
Since our implementation of Algorithm 6 is naive, we point out that there are strategies for optimizing computation times (e.g., the ripser library, subsampling, or sensible clustering of channels); those strategies are outside the scope of this paper.

4.4 Score interpretability


The anomaly score we introduce is constructed by compressing the mean measure of persistence diagrams into K centroids and analyzing the main distributional features of the resulting embedding. Once these centers are learned, it is possible to engineer anomaly scores with respect to a particular center, or possibly to a set of centers, e.g. the centers associated with a particular homology dimension. Let us examine the first possibility and introduce the center-targeted scores:
\[
\tilde s_i = \hat\Sigma_{ii}^{-1/2}\,\big|v_i - \hat\mu_i\big|, \tag{8}
\]

where µ̂, Σ̂ are the estimated mean and covariance of the vectorization v of Algorithm 5
(time indices are implied and omitted for this discussion). These scores can be interpreted as
testing for anomalies with respect to a single embedding dimension, as if the vectorization
had independent components. These center-targeted scores allow one to analyze an original anomaly by looking at the score deviations of each vector component. Because the vector
components are integrated from a learned centroid, the scores can be traced back to a specific
region in R2 , see for instance Figure 14.
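As a direct sketch of (8), assuming the vectorizations and the estimated mean and covariance (e.g. from the minimum covariance determinant fit) are available:

import numpy as np

def center_targeted_scores(v, mu_hat, sigma_hat):
    # Per-coordinate scores of (8): deviation of each embedding coordinate from
    # its estimated mean, in units of its estimated standard deviation.
    return np.abs(v - mu_hat) / np.sqrt(np.diag(sigma_hat))

# Toy usage with a 6-dimensional embedding (K = 6 centroids).
rng = np.random.default_rng(1)
V = rng.normal(size=(100, 6))            # vectorizations v_1, ..., v_100
mu_hat, sigma_hat = V.mean(axis=0), np.cov(V.T)
print(center_targeted_scores(V[0], mu_hat, sigma_hat))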

Figure 14: Left: Overall score (top curve, in blue), scores s̃i (middle curves, the three top ones for features corresponding to homology dimension 0, the three bottom ones for features corresponding to homology dimension 1) and ground truth (bottom curve) on a single Topological Wheels data set. The last homology dimension 1 score has the strongest correspondence to the overall score, and matches
the underlying truth almost exactly. Right: s̃i scores of an abnormal persistence diagram on this
data set, next to the associated quantization centers (colored stars) of Algorithm 3. The scores are
written in a font size proportional to them, so that the more abnormal scores appear bigger. In this
instance the quantization centers in dimension 0 and 1 with the highest persistence react to this
diagram, hinting at a change in the latent data structure as the highest persistence diagram points
are usually associated with signal in comparison with points nearer the diagonal.

This leads to valuable interpretations. For instance, if an abnormal score of TADA were caused by a large deviation in a component related to a homology dimension 1 center, it is likely that at that time an abnormal dependence cycle persists for a longer or shorter period of

time than in the rest of the time series, and therefore that the dependence pattern has globally changed over that period of time. See for instance the illustration on the Topological Wheels problem in Figure 14, where globally changing the dependence pattern between sensors is exactly how the abnormal data was produced. In the case where the produced score is deemed abnormal due to a shift in several components related to dimension 0 centers, this indicates an anomaly in the connectivity structure that does not affect the higher-order homological features. In this case, the anomaly could for instance be attributed to a fault (such as a breakdown) in one of the original sensors.

5. Conclusion
It is common knowledge that no single anomaly detection method can identify all kinds of anomalies. The framework introduced in this paper is relevant for detecting abnormal
topological changes in the global structure of data.
For the motivating example of detecting changes in the dependence patterns between
channels of a multivariate time series, our method turns out to be competitive with respect
to other state-of-the-art approaches on various benchmark data sets. Naturally, there are
many different sorts of anomalies that the proposed method is not able to detect at all. For
instance, since the topological embedding is invariant to graph isomorphism, any anomaly linked to node permutation (a change of node labeling) cannot be caught. The same is true for homothetic jumps: if the signals were simultaneously multiplied by the same factor, the correlation-based similarity matrix would remain unchanged, leading to an unchanged topological embedding. While such invariances are generally thought of as limiting, they can also be a welcome feature if the considered problem exhibits the same invariance, for instance when there are labeling uncertainties in the sensors collecting the data.
The topological anomaly detection finds anomalies that other methods do not seem to discover. It is generally understood that topological information is a form of global information that is complementary to the information gathered by more traditional approaches, e.g. spectral detectors. While confirming this, the above numerical experiments also suggest that topological information is commonly present in various real and synthetic data sets. Therefore, for practical purposes, it is probably best to use our method in combination with other dedicated methods, for instance one that focuses on 'local' data shifts, such as the SubKNN method.
Focusing on the case of detecting anomalies in the dependence structure of multivariate time series, it appears that the only parameter that requires careful tuning in our method is the window size (or temporal resolution) ∆, as is probably also the case for most existing procedures (see Section 4). Designing methods to empirically tune this window size, or to combine the outputs of our method at different resolutions, would be a relevant addition to our work; this is left for future research.
Finally, let us recall the flexibility of the framework we introduce. First, it is not tied to detecting changes in correlation structures of time series: we may use Algorithm 1 with other dissimilarity measures between channels, and more generally the last two steps may apply verbatim whenever a sequence of persistence diagrams can be built from data (for instance arising from filtrations on meshed shapes, images, graphs, etc.). Second, the vectorization we propose with Algorithms 3 and 5 does not necessarily take a sequence of persistence

diagrams as input: any sequence of measures may be vectorized the same way. It may find applications in monitoring sequences of point process realizations, such as the evolution of species distributions, see, e.g., Renner et al. (2015). Finally, one may process
the output of the vectorization procedure in other ways than building an anomaly score.
For instance, using these vectorizations as inputs of any neural network, or change-points
detection procedures such as KCP (Arlot et al. 2019) could provide a dedicated method to
retrieve change points of a global structure.

6. Acknowledgements
This work has been supported by the French government under the ’France 2030’ program,
as part of the SystemX Technological Research Institute within the Confiance.ai project.
The authors also thank the ANR TopAI chair in Artificial Intelligence (ANR–19–CHIA–0001)
for financial support.
The authors are grateful to Bernard Delyon for valuable comments and suggestions
concerning convergence rates in β-mixing cases.

7. Proofs
Most of the proposed results are adaptations of proofs from the independent case to the dependent one. Particular interest lies in the concentration results for this framework; we list the important ones in the following subsection.

7.1 Probabilistic tools for β-mixing concentration


In the derivations to follow, extensive use will be made of a consequence of Berbee's Lemma.

Lemma 8 (Doukhan et al., 1995, Proposition 2) Let (Xi)i≥1 be a sequence of random variables taking their values in a Polish space X, and, for j ≥ 0, denote by
\[
b_j = \mathbb E\left[\sup_{B\in\sigma(j+1,+\infty)} \big|P(B \mid \sigma(-\infty,j)) - P(B)\big|\right].
\]
Then there exists a sequence (X̃i)i≥1 of independent random variables such that, for any i ≥ 1, X̃i and Xi have the same distribution and P(Xi ≠ X̃i) ≤ bi.
The above lemma allows one to translate standard concentration bounds from the i.i.d. framework to the dependent case, where dependence is measured in terms of β-mixing coefficients. Let us recall here the general definition of β-mixing coefficients from Definition 4. For a sequence of random variables (Zt)t∈Z (not assumed stationary), the β-mixing coefficient of order q is
\[
\beta(q) = \sup_{t\in\mathbb Z}\ \mathbb E\left[\sup_{B\in\sigma(t+q,+\infty)} \big|P(B \mid \sigma(-\infty,t)) - P(B)\big|\right].
\]
For instance, β(q) = 0 for all q ≥ 1 when the sequence is i.i.d., and β(q) = 0 for all q > m when it is m-dependent. If the sequence (Zt)t∈Z is assumed to be stationary, β(q) may be written as
\[
\beta(q) = \mathbb E\big(d_{TV}\big(P_{(X_q,X_{q+1},\dots)\mid\sigma(\dots,X_0)},\ P_{(X_q,X_{q+1},\dots)}\big)\big),
\]

where $d_{TV}$ denotes the total variation distance. We will make use of the following adaptation
of Bernstein’s inequality to the dependent case.

Theorem 9 (Doukhan, 1994, Theorem 4) Let (Xt)t∈Z be a sequence of (real) variables with β-mixing coefficients (β(q))q∈N∗ that satisfies

1. ∀t ∈ Z, E(Xt) = 0,

2. ∀(t, n) ∈ Z × N, $\mathbb E\big(\sum_{j=1}^n X_{t+j}\big)^2 \le n\sigma^2$,

3. ∀t, |Xt| ≤ M a.s..

Then, for every x > 0,
\[
P\left(\frac1n\sum_{t=1}^n X_t \ge 2\sigma\sqrt{\frac xn} + \frac{4Mx}{3n}\right) \le 4e^{-x} + 2\beta\left(\left\lceil\frac{n}{17}\right\rceil - 1\right).
\]

To apply Theorem 9, a bound on the variance term is needed. Such bounds are available in
the stationary case under slightly milder assumptions (see, e.g., Rio 1993). For our purpose,
a straightforward application of (Rio, 1993, Theorem 1.2 a)) will be sufficient; it is stated below.

Lemma 10 Let (Xt) denote a centered and stationary sequence of real variables with β-mixing coefficients (β(q))q∈N∗, such that |Xt| ≤ M a.s.. Then it holds
\[
\frac1n\,\mathbb E\left(\sum_{j=1}^n X_j\right)^2 \le 4M^2 \int_0^1 \beta^{-1}(u)\,du,
\]
where $\beta^{-1}(u) = \sum_{k\in\mathbb N} \mathbf 1_{\beta(k) > u}$.
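Note that, by Fubini's theorem, the integral appearing in Lemma 10 is simply the sum of the mixing coefficients:
\[
\int_0^1 \beta^{-1}(u)\,du = \int_0^1 \sum_{k\in\mathbb N}\mathbf 1_{\beta(k)>u}\,du = \sum_{k\in\mathbb N}\beta(k),
\]
since β(k) ≤ 1 for all k. The variance proxy of Lemma 10 is thus finite as soon as the mixing coefficients are summable, e.g. under a polynomial decay β(q) ≤ Cq^{−a} with a > 1, or under any geometric decay.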

7.2 Proofs for Section 3


This section gathers the proofs of our theoretical results: convergence of the algorithms (Theorems 5 and 6) and type I error control of the subsequent testing procedure (Proposition 7).

7.2.1 Proof of Theorem 5


Proof [Proof of Theorem 5] We begin with the proof of Theorem 5. It follows the proof of (Chazal et al., 2021, Theorem 9) in the i.i.d. case, with adaptations to cope with dependence using Lemma 8.
To apply Lemma 8, first note that the space M(R, M ), endowed with the Levy-Prokhorov
metric, is a Polish space (see, e.g., Prokhorov 1956, Theorem 1.11). Using Lemma 8 as in
(Doukhan et al., 1995, Proof of Proposition 2) yields the existence of X̃1 , . . . , X̃n such that,
denoting by Yk (resp. Ỹk ) the vector (X(k−1)q+1 , . . . , Xkq ) (resp. (X̃(k−1)q+1 , . . . , X̃kq )), for
1 6 k 6 n/q, it holds:

• For every k > 1 Yk has the same distribution as Ỹk , and P(Ỹk 6= Yk ) 6 β(q).

• The random variables (Ỹ2k)k≥1 are independent, as well as the variables (Ỹ2k−1)k≥1.
For any c ∈ B(0, R)^k, we denote by m̂(c) (resp. m̃(c)) the vector of centroids defined, for all j = 1, . . . , k, by
\[
\hat m(c)_j = \frac{\int u\,\mathbf 1_{W_j(c)}(u)\,\bar X_n(du)}{\hat p_j(c)},
\qquad
\tilde m(c)_j = \frac{\int u\,\mathbf 1_{W_j(c)}(u)\,\bar{\tilde X}_n(du)}{\tilde p_j(c)},
\]
where p̂j(c) (resp. p̃j(c)) denotes $\bar X_n(W_j(c))$ (resp. $\bar{\tilde X}_n(W_j(c))$), adopting the convention m̂j(c) = m̃j(c) = 0 when the corresponding cell weight is null.
The following lemma ensures that m̂(c) contracts toward c∗ , provided c ∈ B(c∗ , R0 ).

Lemma 11 With probability larger than $1 - 16e^{-c_1 n p_{\min}/(qM)} - 2e^{-x}$, it holds, for every c ∈ B(c∗, R0),
\[
\|\hat m(c) - c^*\|^2 \le \frac34\|c - c^*\|^2 + C_1\frac{D_{n/q}^2}{p_{\min}^2} + C_2\frac{kR^2M^2}{p_{\min}^2}\left(\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\right)^2,
\]
where $D_{n/q} = \frac{RM\sqrt q}{\sqrt n}\left(k\sqrt{d\log(k)} + \sqrt x\right)$.

The proof of Lemma 11 is postponed to Section 7.2.3. Equipped with Lemma 11, we first prove recursively that, if c(0) ∈ B(c∗, R0), then w.h.p., for all t ≥ 0, c(t) ∈ B(c∗, R0). We let Ω1 be defined as
\[
\Omega_1 = \left\{ C_2\frac{kR^2M^2}{p_{\min}^2}\left(\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\right)^2 \le R_0^2/8 \right\}.
\]
Noting that $\mathbb E\big[\big(\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\big)^2\big] \le \mathbb E\big[\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\big] = \beta(q)$, Markov's inequality yields
\[
P(\Omega_1^c) \le C\,\frac{kM^2}{\kappa_0^2 p_{\min}^2}\,\beta(q).
\]
Choosing $x = c_1(n/q)\kappa_0^2 p_{\min}^2/M^2$ in Lemma 11, for c1 small enough, yields, for (n/q) large enough,
\[
\|\hat m(c) - c^*\|^2 \le \frac34 R_0^2 + \frac{R_0^2}{8} + \frac{R_0^2}{8} = R_0^2,
\]
with probability larger than $1 - 18e^{-c_1 n\kappa_0^2 p_{\min}^2/(qM^2)} - C\frac{kM^2}{\kappa_0^2 p_{\min}^2}\beta(q)$, provided c ∈ B(c∗, R0). Denoting by Ω2 the probability event on which the above equation holds, a straightforward recursion entails that, if c(0) ∈ B(c∗, R0), then, for all t ≥ 1, c(t) = m̂(c(t−1)) ∈ B(c∗, R0) on Ω2.
Then, using Lemma 11 iteratively yields that, on Ω2 ∩ Ωx, where P(Ωx^c) ≤ 2e^{−x}, for all t ≥ 1, provided c(0) ∈ B(c∗, R0),
\[
\|c^{(t)} - c^*\|^2 \le \left(\frac34\right)^t\|c^{(0)} - c^*\|^2 + C_1\frac{D_{n/q}^2}{p_{\min}^2} + C_2\frac{kR^2M^2}{p_{\min}^2}\left(\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\right)^2, \tag{9}
\]
where C1 and C2 absorb the geometric sum $\sum_{j\ge 0}(3/4)^j = 4$ arising from iterating Lemma 11.

Theorem 5 now easily follows. For the first inequality, let t ≥ 1; then, using Markov's inequality again gives
\[
P\left(\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i} \ge \sqrt{q/n}\right) \le \sqrt{\frac nq}\,\beta(q).
\]

Thus, the assumption β²(q)/q³ ≤ n⁻³ entails that
\[
\|c^{(t)} - c^*\|^2 \le \left(\frac34\right)^t R_0^2 + C_1\frac{D_{n/q}^2}{p_{\min}^2} + C_2\frac{kR^2M^2}{p_{\min}^2(n/q)},
\]
with probability larger than $1 - 18e^{-c_1 n\kappa_0^2 p_{\min}^2/(qM^2)} - C\frac{kM^2}{\kappa_0^2 p_{\min}^2}\beta(q) - q/n - 2e^{-x}$, which is larger than $1 - C\frac{qkM^2}{n\kappa_0^2 p_{\min}^2} - 2e^{-x}$.
For the second inequality, denote by Zt the random variable
\[
Z_t = \left( \|c^{(t)} - c^*\|^2 - \left(\tfrac34\right)^t \|c^{(0)} - c^*\|^2 - C_1\frac{qR^2M^2k^2d\log(k)}{np_{\min}^2} - C_2\frac{kR^2M^2}{p_{\min}^2}\left(\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\right)^2 \right)_+ \mathbf 1_{\Omega_2},
\]
and remark that (9) entails
\[
P\left(Z_t \ge C\,\frac{R^2M^2q}{n}\,x\right) \le P(\Omega_x^c) \le 2e^{-x}.
\]
We deduce that
\[
\mathbb E(Z_t) \le C\,\frac{qR^2M^2}{n},
\]
which leads to

\[
\begin{aligned}
\mathbb E(\|c^{(t)} - c^*\|^2) &\le \mathbb E(\|c^{(t)} - c^*\|^2 \mathbf 1_{\Omega_2}) + 4kR^2M\,P(\Omega_2^c) \\
&\le \mathbb E(Z_t) + \left(\tfrac34\right)^t \|c^{(0)} - c^*\|^2 + C_1\frac{qR^2M^2k^2d\log(k)}{np_{\min}^2} \\
&\quad + C_2\frac{kR^2M^2}{p_{\min}^2}\,\mathbb E\left[\left(\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\right)^2\right] + 4kR^2M\,P(\Omega_2^c).
\end{aligned}
\]

Noting that
\[
\mathbb E\left[\left(\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\right)^2\right] \le \mathbb E\left[\frac1n\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}\right] = \beta(q),
\]

and using
\[
\begin{aligned}
P(\Omega_2^c) &\le e^{-c_1 n\kappa_0^2 p_{\min}^2/(qM^2)} + C\,\frac{kM^2}{\kappa_0^2 p_{\min}^2}\,\beta(q) \\
&\le C\,\frac{kM^2}{\kappa_0^2 p_{\min}^2}\left(\beta(q) + (q/n)\right) \\
&\le C\,\frac{qkM^2}{n\kappa_0^2 p_{\min}^2}
\end{aligned}
\]
whenever β(q) ≤ q/n, leads to the result.

7.2.2 Proof of Theorem 6


Proof [Proof of Theorem 6] This proof follows the steps of (Chazal et al., 2021, Proof of
Lemma 18).
As in the proof of Lemma 11, let X̃1 , . . . , X̃n be such that, denoting by Yk (resp. Ỹk )
the vector (X(k−1)q+1 , . . . , Xkq ) (resp. (X̃(k−1)q+1 , . . . , X̃kq )), for 1 6 k 6 n/q, it holds:

• For every k > 1 Yk has the same distribution as Ỹk , and P(Ỹk 6= Yk ) 6 β(q).

• The random variables (Ỹ2k)k≥1 are independent, as well as the variables (Ỹ2k−1)k≥1.
Let A⊥⊥ denote the event
\[
A_{\perp\!\!\!\perp} = \left\{ \forall j = 1, \dots, n/q, \quad Y_j = \tilde Y_j \right\}.
\]
A standard union bound yields that $P(A_{\perp\!\!\!\perp}^c) \le \frac{n}{q}\beta(q)$. On A⊥⊥, the minibatches used by
Algorithm 4 may be considered as independent, so that the main lines of (Chazal et al.,
2021, Proof of Lemma 18) readily applies, replacing Xi ’s by X̃i ’s. In what follows we let c̃(t)
denote the output of the t-th iteration of Algorithm 4 based on X̃1 , . . . , X̃n .
Assume that n ≥ k, and $q \ge C\frac{M^2}{p_{\min}^2}\log(n)$, for a large enough constant C that only depends on $\int_0^1 \beta^{-1}(u)\,du$, to be fixed later. For t ≤ n/(4q) = T, let At,1 and At,3 denote the events
\[
A_{t,1} = \left\{ \forall j = 1,\dots,k, \quad |\tilde p_j(t) - p_j(t)| \le \frac{p_{\min}}{128} \right\},
\]
\[
A_{t,3} = \left\{ \forall j = 1,\dots,k, \quad \left\| \int (\tilde c^{(t)}_j - u)\,\mathbf 1_{W_j(\tilde c^{(t)})}(u)\,\big(\bar{\tilde X}_{B^{(3)}_t} - \mathbb E(X)\big)(du) \right\| \le 8R\sqrt{\frac{Mkd\,p_{\min}}{C}} \right\},
\]
where $\tilde p_j(t) = \bar{\tilde X}_{B^{(1)}_t}(W_j(\tilde c^{(t)}))$. Then, according to Theorem 9 with x = 2 log(n), and to Lemma 10 to bound the corresponding σ, it holds, for j ∈ {1, 3}, $P(A_{t,j}^c) \le 4dk/n^2 + 2kd\,\beta(q/18)$, for n large enough.
Further, define
\[
A_{\le t} = \bigcap_{j \le t} A_{j,1} \cap A_{j,3}.
\]

Then, provided that $q \ge c_0\frac{k^2 d M^2}{p_{\min}^2\kappa_0^2}\log(n)$, where c0 only depends on $\int_0^1 \beta^{-1}(u)\,du$, we may prove recursively that
\[
\forall p \le t, \quad \tilde c^{(p)} \in B(c^*, R_0)
\]
on A≤t whenever c̃(0) = c(0) ∈ B(c∗, R0) (first step of the proof of (Chazal et al., 2021, Lemma 18)).
Next, denoting by $a_t = \|\tilde c^{(t)} - c^*\|^2 \mathbf 1_{A_{\le t}}$, we may write
\[
\mathbb E(a_{t+1}) \le \mathbb E(\|\tilde c^{(t+1)} - c^*\|^2 \mathbf 1_{A_{t+1,1}} \mathbf 1_{A_{\le t}}) + R_1,
\]
with
\[
R_1 \le 4kR^2\,P(A_{t+1,3}^c) \le 16k^2dR^2\left(n^{-2} + \beta(q/18)\right) \le 32k^2dR^2(q/n)^2,
\]
recalling that β(q/18)/q² ≤ n⁻². Proceeding as in (Chazal et al., 2021, Proof of Lemma 18), we may further bound

\[
\mathbb E(\|\tilde c^{(t+1)} - c^*\|^2 \mathbf 1_{A_{t+1,1}} \mathbf 1_{A_{\le t}}) \le \left(1 - \frac{2 - K_1}{t+1}\right)\mathbb E(a_t) + \frac{12kdMR^2}{p_{\min}(t+1)^2},
\]
for some $K_1 \le 0.5$. Noticing that $k \le M/p_{\min}$ and $t + 1 \le T = n/(4q)$ yields that
\[
\mathbb E(a_{t+1}) \le \left(1 - \frac{2 - K_1}{t+1}\right)\mathbb E(a_t) + \frac{14kdMR^2}{p_{\min}(t+1)^2}.
\]

Following (Chazal et al., 2021, Proof of Theorem 10), a standard recursion entails
\[
\mathbb E(a_t) \le \frac{28kdMR^2}{p_{\min}t},
\]
for t ≤ n/(4q). At last, since $\|\tilde c^{(T)} - c^*\|^2 \mathbf 1_{A_{\perp\!\!\!\perp}} = \|c^{(T)} - c^*\|^2 \mathbf 1_{A_{\perp\!\!\!\perp}}$, we conclude that
\[
\begin{aligned}
\mathbb E\|c^{(T)} - c^*\|^2 &\le \mathbb E(\|\tilde c^{(T)} - c^*\|^2) + 4kR^2\,P(A_{\perp\!\!\!\perp}^c) \\
&\le \mathbb E(\|\tilde c^{(T)} - c^*\|^2 \mathbf 1_{A_{\le T}}) + 4kR^2\,P(A_{\le T}^c) + \frac{4kR^2}{(n/q)} \\
&\le \mathbb E(a_T) + \frac{16k^2R^2d}{(n/q)} \\
&\le 128\,\frac{kdMR^2}{p_{\min}(n/q)},
\end{aligned}
\]
where k ≤ M/pmin and T = (n/q)/4 have been used.

7.2.3 Proof of Lemma 11


Proof [Proof of Lemma 11] Assume that c ∈ B(c∗, R0), for some optimal c∗ ∈ Copt. Then, for any K > 0, it holds
\[
\|\hat m(c) - c^*\|^2 \le (1+K)\|\tilde m(c) - c^*\|^2 + (1+K^{-1})\|\hat m(c) - \tilde m(c)\|^2. \tag{10}
\]

The first term of the right-hand side may be controlled using a slight adaptation of (Chazal et al., 2021, Lemma 22).

Lemma 12 With probability larger than 1 − 16e^{−x}, for all c ∈ B(0, R)^k and j ∈ [[1, k]], it holds
\[
\tilde p_j(c) \ge p_j(c) - \sqrt{\left(\frac{8Mc_0 q\log(k)\log(2nN_{\max})}{n} + \frac{8Mqx}{n}\right)p_j(c)},
\]
\[
\tilde p_j(c) \le p_j(c) + \frac{8Mc_0 q\log(k)\log(2nN_{\max})}{n} + \frac{8Mqx}{n} + \sqrt{\left(\frac{8Mc_0 q\log(k)\log(2nN_{\max})}{n} + \frac{8Mqx}{n}\right)p_j(c)},
\]
where c0 is an absolute constant. Moreover, with probability larger than 1 − 2e^{−x}, it holds
\[
\sup_{c\in B(0,R)^k}\left\|\left(\int (c_j - u)\,\mathbf 1_{W_j(c)}(u)\,(\bar{\tilde X}_n - \mathbb E(X))(du)\right)_{j=1,\dots,k}\right\| \le C_0\,\frac{RM\sqrt q}{\sqrt n}\left(k\sqrt{d\log(k)} + \sqrt x\right),
\]
where C0 is an absolute constant.
Proof [Proof of Lemma 12] We intend here to recover the standard i.i.d. bounds given in (Chazal et al., 2021, Lemma 22). To this aim, we let p̃j,0(c) and p̃j,1(c) be defined by
\[
\tilde p_{j,r}(c) = \frac{2q}{n}\sum_{s=1}^{n/2q} \bar{\tilde Y}_{2s-r}(W_j(c)),
\]
for r ∈ {0, 1}, where $\bar{\tilde Y}_{2s-r} = \frac1q\sum_{t=(2s-r-1)q+1}^{(2s-r)q} \tilde X_t$ is a measure in M(R, M), with total number of support points bounded by qNmax, and remark that
\[
\tilde p_j(c) = \frac12\left(\tilde p_{j,0}(c) + \tilde p_{j,1}(c)\right).
\]
Since $\mathbb E(\bar{\tilde Y}_{2s-r})(W_j(c)) = p_j(c)$, and the p̃j,r(c)'s are sums of n/2q independent measures evaluated on Wj(c), we may readily apply (Chazal et al., 2021, Lemma 22), replacing n by n/(2q), to each of them, leading to the deviation bounds on the p̃j(c)'s.
For the third inequality of Lemma 12, denoting by
\[
\bar{\tilde X}_{n,j} = \frac{2q}{n}\sum_{s=1}^{n/2q} \bar{\tilde Y}_{2s-j},
\]

for j ∈ {0, 1}, it holds, for any c ∈ B(0, R)^k,
\[
\begin{aligned}
&\left\|\left(\int (c_j - u)\,\mathbf 1_{W_j(c)}(u)\,(\bar{\tilde X}_n - \mathbb E(X))(du)\right)_{j=1,\dots,k}\right\| \\
&\qquad\le \frac12\left(\left\|\left(\int (c_j - u)\,\mathbf 1_{W_j(c)}(u)\,(\bar{\tilde X}_{n,0} - \mathbb E(X))(du)\right)_{j=1,\dots,k}\right\|\right. \\
&\qquad\qquad \left. + \left\|\left(\int (c_j - u)\,\mathbf 1_{W_j(c)}(u)\,(\bar{\tilde X}_{n,1} - \mathbb E(X))(du)\right)_{j=1,\dots,k}\right\|\right).
\end{aligned}
\]
Since each of the $\bar{\tilde X}_{n,j}$'s is an i.i.d. sum of discrete measures (the $\bar{\tilde Y}_{2s-j}$'s), (Chazal et al., 2021, Lemma 22) readily applies (with sample size n/(2q)), giving the result.
We now proceed with the first term in (10), as in (Chazal et al., 2021, Proof of Lemma 17). Using the first two inequalities of Lemma 12 with x = c1npmin/M yields a probability event Ω1 on which
\[
\tilde p_j(c) \ge \frac{63}{64}p_j(c) - \frac{p_{\min}}{64} \ge \frac{31}{32}p_{\min},
\qquad
\tilde p_j(c) \le \frac{33}{32}p_j(c^*).
\]
Combining this with the last inequality of Lemma 12 yields, for n large enough and all c ∈ B(c∗, R0), with probability larger than $1 - 16e^{-c_1 n p_{\min}/(qM)} - 2e^{-x}$,
\[
\|\tilde m(c) - c^*\|^2 \le 0.65\|c - c^*\|^2 + \frac{C}{p_{\min}^2}D_{n/q}^2. \tag{11}
\]

The precise derivation of (11) may be found in (Chazal et al., 2021, Proof of Lemma 17,
pp.34-35). Plugging (11) into (10) leads to, for a small enough K,
3 C
km̂(c) − c∗ k2 6 kc − c∗ k2 + 2 Dn/q
2
+ C2 km̂(c) − m̃(c)k2 ,
4 pmin

with probability larger than 1 − 16e−c1 npmin /n − 2e−x .


It remains to control the last term $\|\hat m(c) - \tilde m(c)\|^2$. To do so, note that, for every j = 1, . . . , k,
\[
|\hat p_j(c) - \tilde p_j(c)| = \left|\frac1n\sum_{i=1}^n X_i(W_j(c)) - \tilde X_i(W_j(c))\right| \le \frac Mn\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}, \tag{12}
\]
and
\[
\left\|\int u\,\mathbf 1_{W_j(c)}(u)\,(\bar X_n - \bar{\tilde X}_n)(du)\right\| \le \frac{2RM}{n}\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}.
\]

On Ω1, it holds, for every j = 1, . . . , k,
\[
\begin{aligned}
\|\hat m_j(c) - \tilde m_j(c)\| &= \left\|\frac{\int u\,\mathbf 1_{W_j(c)}(u)\,\bar X_n(du)}{\hat p_j(c)} - \frac{\int u\,\mathbf 1_{W_j(c)}(u)\,\bar{\tilde X}_n(du)}{\tilde p_j(c)}\right\| \\
&\le \left\|\int u\,\mathbf 1_{W_j(c)}(u)\,\bar X_n(du)\right\| \left|\frac{1}{\hat p_j(c)} - \frac{1}{\tilde p_j(c)}\right| + \frac{1}{\tilde p_j(c)}\left\|\int u\,\mathbf 1_{W_j(c)}(u)\,(\bar X_n - \bar{\tilde X}_n)(du)\right\| \\
&\le R\,\frac{|\hat p_j(c) - \tilde p_j(c)|}{\tilde p_j(c)} + \frac{2RM}{n\,\tilde p_j(c)}\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i} \\
&\le C\,\frac{RM}{np_{\min}}\sum_{i=1}^n \mathbf 1_{X_i\ne\tilde X_i}.
\end{aligned}
\]
Squaring and summing with respect to j gives the result.

7.2.4 Proof of Proposition 7


Proof [Proof of Proposition 7] Let Z1, . . . , Zn denote the sequence ‖ṽ1‖, . . . , ‖ṽn‖, which is a stationary β-mixing sequence of real-valued random variables. For t ∈ R, we let
\[
F_n(t) = \frac1n\sum_{i=1}^n \mathbf 1_{Z_i > t},
\]
F(t) = P(Z > t), and let t̂ be such that Fn(t̂) ≤ α − δ. In the i.i.d. case, we might bound $\sup_t (F(t) - F_n(t))/\sqrt{F_n(t)}$ using a standard inequality such as in (Boucheron et al., 2005, Section 5.1.2).
As for the proofs of Theorems 5 and 6, we compare with the i.i.d. case by introducing auxiliary variables. We let Z̃1, . . . , Z̃n be such that, denoting by Yk (resp. Ỹk) the vector (Z(k−1)q+1, . . . , Zkq) (resp. (Z̃(k−1)q+1, . . . , Z̃kq)), it holds:

• For every 1 ≤ k ≤ n/q, Yk ∼ Ỹk, and P(Yk ≠ Ỹk) ≤ β(q).

• (Ỹ2k)k≥1 are independent, as well as (Ỹ2k−1)k≥1.

Let F̃n(t) denote $\frac1n\sum_{i=1}^n \mathbf 1_{\tilde Z_i > t}$. Then, for any t ∈ R, we have
\[
\tilde F_n(t) \le F_n(t) + \frac1n\sum_{i=1}^n \mathbf 1_{Z_i\ne\tilde Z_i}.
\]

If Ω1 is the event $\left\{\frac1n\sum_{i=1}^n \mathbf 1_{Z_i\ne\tilde Z_i} \le \sqrt{\alpha q/n}\right\}$, which has probability larger than $1 - \beta(q)\sqrt{n/(\alpha q)}$ (using Markov's inequality as before), then on Ω1 it readily holds
\[
\tilde F_n(\hat t) \le F_n(\hat t) + \sqrt{\frac{\alpha q}{n}} \le \alpha - \delta + \sqrt{\frac{\alpha q}{n}},
\]
so that we may write, on the same event,
\[
F(\hat t) = F(\hat t) - \tilde F_n(\hat t) + \tilde F_n(\hat t) \le \alpha - \delta + \sqrt{\frac{\alpha q}{n}} + (F(\hat t) - \tilde F_n(\hat t)). \tag{13}
\]

It remains to control the stochastic term $(F(\hat t) - \tilde F_n(\hat t))$. To do so, we denote by
\[
\tilde X_{j,0}(t) = \frac1q\sum_{i=2(j-1)q+1}^{(2j-1)q} \mathbf 1_{\tilde Z_i > t},
\qquad
\tilde X_{j,1}(t) = \frac1q\sum_{i=(2j-1)q+1}^{2jq} \mathbf 1_{\tilde Z_i > t},
\]
for j ∈ [[1, n/(2q)]] and t ∈ R. Note that, for any σ ∈ {0, 1}, the X̃j,σ's are i.i.d., take values in [0, 1], and have expectation F(t). Next, we define, for 1 ≤ j ≤ n/2q and t ∈ R,
\[
\tilde F_{n,0}(t) = \frac{2q}{n}\sum_{j=1}^{n/2q} \tilde X_{j,0}(t),
\qquad
\tilde F_{n,1}(t) = \frac{2q}{n}\sum_{j=1}^{n/2q} \tilde X_{j,1}(t),
\]
and we note that $\tilde F_n(t) = \frac12(\tilde F_{n,0}(t) + \tilde F_{n,1}(t))$. Since the F̃n,σ's are sums of i.i.d. random variables, the following concentration bound follows.

Lemma 13 For j ∈ {0, 1}, and x such that (n/2q)x² ≥ 1, it holds
\[
P\left(\sup_{t\in\mathbb R} \frac{F(t) - \tilde F_{n,j}(t)}{\sqrt{F(t)}} \ge 2x\right) \le 2n\,e^{-(n/2q)x^2}.
\]

A proof of Lemma 13 is postponed to Section 7.2.5. Now, choosing $x = 2\sqrt{q\log(n)/n}$ entails that, with probability larger than $1 - 4(q/n) - \beta(q)\sqrt{n/(\alpha q)}$,
\[
F(\hat t) \le \alpha - \delta + \sqrt{\frac{\alpha q}{n}} + 4\sqrt{\frac{q\log(n)}{n}}\sqrt{F(\hat t)},
\]

which leads to
\[
\sqrt{F(\hat t)} \le 2\sqrt{\frac{q\log(n)}{n}} + \sqrt{\alpha - \delta + \sqrt{\frac{\alpha q}{n}} + 4\,\frac{q\log(n)}{n}}.
\]
Choosing $\delta \ge 4\sqrt\alpha\sqrt{\frac{q\log(n)}{n}} + \sqrt{\frac{\alpha q}{n}}$ ensures that the right-hand side is smaller than $\sqrt\alpha$, and hence that F(t̂) ≤ α.

7.2.5 Proof of Lemma 13


Proof [Proof of Lemma 13] The proof follows that of (Chazal et al., 2021, Lemma 22) verbatim, with the exception of the capacity bound, which we discuss now. To lighten notation we assume that we have an n/(2q) sample of Yi's, with $Y_i = (Z_{(i-1)q+1}, \dots, Z_{iq}) \in \mathbb R^q$, and we consider the set of functionals
\[
\mathcal F = \left\{ y = (z_1, \dots, z_q) \mapsto \frac1q\sum_{i=1}^q \mathbf 1_{z_i > t} \;\middle|\; t \in \mathbb R \right\}.
\]

Following (Chazal et al., 2021, Lemma 22), if $S_{\mathcal F}(y_1, \dots, y_{n/(2q)})$ denotes the cardinality of $\{(f(y_1), \dots, f(y_{n/(2q)})) \mid f \in \mathcal F\}$, we have to bound
\[
S_{\mathcal F}(Y_1, \dots, Y_{n/(2q)}, Y_1', \dots, Y_{n/(2q)}'),
\]
where the $Y_i'$'s are i.i.d. copies of the $Y_i$'s. Since, for every $y_1, \dots, y_{n/q}$, recalling that $y_i = (z_{(i-1)q+1}, \dots, z_{iq})$, it holds
\[
S_{\mathcal F}(y_1, \dots, y_{n/q}) \le \left|\left\{(\mathbf 1_{z_1 > t}, \dots, \mathbf 1_{z_n > t}) \mid t \in \mathbb R\right\}\right|,
\]
we deduce that
\[
S_{\mathcal F}(Y_1, \dots, Y_{n/(2q)}, Y_1', \dots, Y_{n/(2q)}') \le n.
\]
The remainder of the proof follows (Chazal et al., 2021, Lemma 22) verbatim.

References
Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick
Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier.
Persistence images: a stable vector representation of persistent homology. J. Mach. Learn.
Res., 18:Paper No. 8, 35, 2017. ISSN 1532-4435.
Alekh Agarwal and John C. Duchi. The generalization ability of online algorithms for depen-
dent data. IEEE Trans. Inform. Theory, 59(1):573–587, 2013. ISSN 0018-9448,1557-9654.
doi: 10.1109/TIT.2012.2212414. URL https://doi.org/10.1109/TIT.2012.2212414.
Sylvain Arlot, Alain Celisse, and Zaid Harchaoui. A kernel multiple change-point algorithm
via model selection. J. Mach. Learn. Res., 20:Paper No. 162, 56, 2019. ISSN 1532-
4435,1533-7928.

Jean-Daniel Boissonnat, Frédéric Chazal, and Mariette Yvinec. Geometric and Topological
Inference, volume 57. Cambridge University Press, 2018.

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey
of some recent advances. ESAIM Probab. Stat., 9:323–375, 2005. ISSN 1292-8100,1262-3318.
doi: 10.1051/ps:2005018. URL https://doi.org/10.1051/ps:2005018.

Anass El Yaagoubi Bourakna, Moo K. Chung, and Hernando Ombao. Modeling and
simulating dependence in networks using topological data analysis, 2022.

Claire Brécheteau, Aurélie Fischer, and Clément Levrard. Robust Bregman clustering.
The Annals of Statistics, 49(3):1679 – 1701, 2021. doi: 10.1214/20-AOS2018. URL
https://doi.org/10.1214/20-AOS2018.

Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001. doi: 10.1023/A:1010950718922.

P. Bruillard, K. Nowak, and E. Purvine. Anomaly detection using persistent homology. In 2016 Cybersecurity Symposium (CYBERSEC), pages 7–12, Los Alamitos, CA, USA, April 2016. IEEE Computer Society. doi: 10.1109/CYBERSEC.2016.009. URL https://doi.ieeecomputersociety.org/10.1109/CYBERSEC.2016.009.

Mathieu Carrière, Frédéric Chazal, Yuichi Ike, Théo Lacombe, Martin Royer, and Yuhei
Umeda. Perslay: A neural network layer for persistence diagrams and new graph topological
signatures. In International Conference on Artificial Intelligence and Statistics, pages
2786–2796. PMLR, 2020.

Frédéric Chazal and Vincent Divol. The density of expected persistence diagrams and its
kernel based estimation. In 34th International Symposium on Computational Geometry,
volume 99 of LIPIcs. Leibniz Int. Proc. Inform., pages Art. No. 26, 15. Schloss Dagstuhl.
Leibniz-Zent. Inform., Wadern, 2018.

Frédéric Chazal and Bertrand Michel. An introduction to topological data analysis: funda-
mental and practical aspects for data scientists. Frontiers in artificial intelligence, 4:108,
2021.

Frédéric Chazal, Vin de Silva, and Steve Oudot. Persistence stability for geometric complexes.
Geometriae Dedicata, 173(1):193–214, 2014.

Frédéric Chazal, Vin de Silva, Marc Glisse, and Steve Oudot. The structure and stability of
persistence modules. Springer International Publishing, 2016.

Frédéric Chazal, Clément Levrard, and Martin Royer. Clustering of measures via mean
measure quantization. Electronic Journal of Statistics, 15(1):2060 – 2104, 2021. doi:
10.1214/21-EJS1834. URL https://doi.org/10.1214/21-EJS1834.

Stéphane Chrétien, Ben Gao, Astrid Thebault-Guiochon, and Rémi Vaucher. Time topologi-
cal analysis of eeg using signature theory, 2024.

Meryll Dindin, Yuhei Umeda, and Frederic Chazal. Topological data analysis for arrhythmia
detection through modular neural networks, 2019.

Vincent Divol and Frédéric Chazal. The density of expected persistence diagrams and its
kernel based estimation. Journal of Computational Geometry, 10(2):127–153, 2019.

P. Doukhan, P. Massart, and E. Rio. Invariance principles for absolutely regular empirical
processes. Ann. Inst. H. Poincaré Probab. Statist., 31(2):393–427, 1995. ISSN 0246-0203.
URL http://www.numdam.org/item?id=AIHPB_1995__31_2_393_0.

Paul Doukhan. Mixing, volume 85 of Lecture Notes in Statistics. Springer-Verlag, New York,
1994. ISBN 0-387-94214-9. doi: 10.1007/978-1-4612-2642-0. URL https://doi.org/10.
1007/978-1-4612-2642-0. Properties and examples.

Herbert Edelsbrunner and John Harer. Computational topology: an introduction. American Mathematical Society, 2010.

G. Petri, P. Expert, F. Turkheimer, R. Carhart-Harris, D. Nutt, P. J. Hellyer, and F. Vaccarino. Homological scaffolds of brain functional networks. Journal of the Royal Society Interface, 11(101), 2014. doi: 10.1098/rsif.2014.0873.

Niklas Heim and James E. Avery. Adaptive anomaly detection in chaotic time series
with a spatially aware echo state network. ArXiv, abs/1909.01709, 2019. URL https:
//api.semanticscholar.org/CorpusID:202541761.

Steffen Herbold. Autorank: A python package for automated ranking of classifiers. Journal
of Open Source Software, 5(48):2173, 2020. doi: 10.21105/joss.02173. URL https:
//doi.org/10.21105/joss.02173.

Thi Kieu Khanh Ho, Ali Karami, and Narges Armanfard. Graph-based time-series anomaly
detection: A survey. arXiv preprint arXiv:2302.00058, 2023.

Christoph D. Hofer, Roland Kwitt, and Marc Niethammer. Learning representations of persistence barcodes. Journal of Machine Learning Research, 20(126):1–45, 2019.

Mia Hubert, Michiel Debruyne, and Peter J. Rousseeuw. Minimum covariance determinant
and extensions. Wiley Interdiscip. Rev. Comput. Stat., 10(3):e1421, 11, 2018. ISSN 1939-
5108,1939-0068. doi: 10.1002/wics.1421. URL https://doi.org/10.1002/wics.1421.

Vincent Jacob, Fei Song, Arnaud Stiegler, Bijan Rad, Yanlei Diao, and Nesime Tatbul. Exathlon: a benchmark for explainable anomaly detection over time series. Proceedings of the VLDB Endowment (PVLDB), 14(11):2613–2626, 2021. ISSN 2150-8097. doi: 10.14778/3476249.3476307. URL https://doi.org/10.14778/3476249.3476307.

Soham Jana, Jianqing Fan, and Sanjeev Kulkarni. A general theory for robust clustering via
trimmed mean, 2024.

Clément Levrard. Nonasymptotic bounds for vector quantization in hilbert spaces. Ann.
Statist., 43(2):592–619, 04 2015. doi: 10.1214/14-AOS1293. URL https://doi.org/10.
1214/14-AOS1293.

Clément Levrard. Quantization/clustering: when and why does k-means work. Journal de
la Société Française de Statistiques, 159(1), 2018.

Shuya Li, Wenbin Song, Chao Zhao, Yifeng Zhang, Weiming Shen, Jing Hai, Jiawei Lu, and
Yingshi Xie. An anomaly detection method for multiple time series based on similarity
measurement and louvain algorithm. Procedia Computer Science, 200:1857–1866, 2022.

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE
International Conference on Data Mining, pages 413–422, 2008. doi: 10.1109/ICDM.2008.
17.

Hendrik P. Lopuhaa and Peter J. Rousseeuw. Breakdown Points of Affine Equivariant


Estimators of Multivariate Location and Covariance Matrices. The Annals of Statistics,
19(1):229 – 248, 1991. doi: 10.1214/aos/1176347978. URL https://doi.org/10.1214/
aos/1176347978.

Daniel J. McDonald, Cosma Rohilla Shalizi, and Mark Schervish. Estimating beta-mixing
coefficients via histograms. Electron. J. Stat., 9(2):2855–2883, 2015. ISSN 1935-7524. doi:
10.1214/15-EJS1094. URL https://doi.org/10.1214/15-EJS1094.

Ahmed Hossam Mohammed, Mercedes Cabrerizo, Alberto Pinzon, Ilker Yaylali, Prasanna
Jayakar, and Malek Adjouadi. Graph neural networks in eeg spike detection. Artificial
Intelligence in Medicine, 145:102663, 2023. ISSN 0933-3657. doi: https://doi.org/10.1016/
j.artmed.2023.102663. URL https://www.sciencedirect.com/science/article/pii/
S093336572300177X.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and
β-mixing processes. J. Mach. Learn. Res., 11:789–814, 2010. ISSN 1532-4435,1533-7928.

Hernando Ombao and Marco Pinto. Spectral dependence, 2021.

John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S. Tsay, Aaron Elmore, and Michael J.
Franklin. Volume under the surface: a new accuracy evaluation measure for time-series
anomaly detection. Proc. VLDB Endow., 15(11):2774–2787, jul 2022a. ISSN 2150-8097.
doi: 10.14778/3551793.3551830. URL https://doi.org/10.14778/3551793.3551830.

John Paparrizos, Yuhao Kang, Paul Boniol, Ruey S. Tsay, Themis Palpanas, and Michael J.
Franklin. Tsb-uad: an end-to-end benchmark suite for univariate time-series anomaly
detection. Proc. VLDB Endow., 15(8):1697–1711, apr 2022b. ISSN 2150-8097. doi:
10.14778/3529337.3529354. URL https://doi.org/10.14778/3529337.3529354.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Tuan D. Pham and Lanh T. Tran. Some mixing properties of time series models. Stochas-
tic Process. Appl., 19(2):297–303, 1985. ISSN 0304-4149,1879-209X. doi: 10.1016/
0304-4149(85)90031-6. URL https://doi.org/10.1016/0304-4149(85)90031-6.

Yu. V. Prokhorov. Convergence of random processes and limit theorems in probability theory. Teor. Veroyatnost. i Primenen., 1:177–238, 1956. ISSN 0040-361X.

Nalini Ravishanker and Renjie Chen. Topological data analysis (tda) for time series, 2019.

Ian W. Renner, Jane Elith, Adrian Baddeley, William Fithian, Trevor Hastie, Steven J.
Phillips, Gordana Popovic, and David I. Warton. Point process models for presence-
only analysis. Methods in Ecology and Evolution, 6(4):366–379, 2015. doi: 10.1111/
2041-210X.12352. URL https://besjournals.onlinelibrary.wiley.com/doi/abs/
10.1111/2041-210X.12352.

Emmanuel Rio. Covariance inequalities for strongly mixing processes. Ann. Inst. H. Poincaré
Probab. Statist., 29(4):587–597, 1993. ISSN 0246-0203. URL http://www.numdam.org/
item?id=AIHPB_1993__29_4_587_0.

Peter J. Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum
covariance determinant estimator. Technometrics, 41(3):212–223, 1999. doi: 10.
1080/00401706.1999.10485670. URL https://www.tandfonline.com/doi/abs/10.1080/
00401706.1999.10485670.

Martin Royer, Frederic Chazal, Clément Levrard, Yuhei Umeda, and Yuichi Ike. Atol:
Measure vectorization for automatic topologically-oriented learning. In Arindam Baner-
jee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on
Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning
Research, pages 1000–1008. PMLR, 13–15 Apr 2021. URL https://proceedings.mlr.
press/v130/royer21a.html.

Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly detection in time
series: A comprehensive evaluation. Proceedings of the VLDB Endowment (PVLDB), 15
(9):1779–1797, 2022. doi: 10.14778/3538598.3538602.

Sondre Sørbø and Massimiliano Ruocco. Navigating the metric maze: a taxonomy of
evaluation metrics for anomaly detection in time series. Data Mining and Knowledge
Discovery, pages 1–42, 11 2023. doi: 10.1007/s10618-023-00988-8.

Cheng Tang and Claire Monteleoni. On lloyd’s algorithm: New theoretical insights for
clustering in practice. In Arthur Gretton and Christian C. Robert, editors, Proceedings of
the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of
Proceedings of Machine Learning Research, pages 1280–1289, Cadiz, Spain, 09–11 May
2016. PMLR. URL http://proceedings.mlr.press/v51/tang16b.html.

The GUDHI Project. GUDHI User and Reference Manual. GUDHI Editorial Board, 2015.
URL http://gudhi.gforge.inria.fr/doc/latest/.

Yuhei Umeda, Junji Kaneko, and Hideyuki Kikuchi. Topological data analysis and its application to time-series data analysis. 2019. URL https://api.semanticscholar.org/CorpusID:225065707.

Phillip Wenig, Sebastian Schmidl, and Thorsten Papenbrock. TimeEval: A benchmarking toolkit for time series anomaly detection algorithms. Proceedings of the VLDB Endowment (PVLDB), 15(12):3678–3681, 2022. doi: 10.14778/3554821.3554873.

Weichen Wu, Jisu Kim, and Alessandro Rinaldo. On the estimation of persistence intensity
functions and linear representations of persistence diagrams. In International Conference
on Artificial Intelligence and Statistics, pages 3610–3618. PMLR, 2024.

Takehisa Yairi, Yoshikiyo Kato, and Koichi Hori. Fault detection by mining association
rules from house-keeping data. 2001.

Yu Zheng, Huan Yee Koh, Ming Jin, Lianhua Chi, Khoa T Phan, Shirui Pan, Yi-Ping Phoebe
Chen, and Wei Xiang. Correlation-aware spatial–temporal graph learning for multivariate
time-series anomaly detection. IEEE Transactions on Neural Networks and Learning
Systems, 2023.
