Abstract
The topic of this paper is testing the assumption of exchangeability, which is the standard assumption in mainstream machine learning. The common approaches are online testing by betting (such as conformal testing) and the older batch testing using p-values (as in classical hypothesis testing). The approach of this paper is intermediate in that we are interested in batch testing by betting; as a result, p-values are replaced by e-values. As a first step in this direction, this paper concentrates on the Markov model as the alternative. The null hypothesis of exchangeability is formalized as a Kolmogorov-type compression model, and the Bayes mixture of the Markov model w.r. to the uniform prior is taken as the simple alternative hypothesis. Using e-values instead of p-values leads to a computationally efficient testing procedure. Two appendixes discuss connections with the algorithmic theory of randomness; in particular, the test proposed in this paper can be interpreted as a poor man’s version of Kolmogorov’s deficiency of randomness.
1 Introduction
Exchangeability is the fundamental assumption in machine learning. Traditional machine learning studies prediction under exchangeability (see, e.g., Vapnik, 1998), while newer methods consider deviations from exchangeability (see, e.g., Quiñonero-Candela et al., 2009). The role of exchangeability in conformal prediction, as a subarea of machine learning, is briefly reviewed in (Vovk et al., 2022, Sect. 13.5.1).
Testing the assumption of exchangeability is a traditional topic in conformal prediction (see, e.g., Vovk et al., 2022, Part III). It is done in the online mode and is based on conformal test martingales. This area is often referred to as conformal testing.
The classical approach to testing exchangeability, which has been developed in statistics since at least 1943 (Wald & Wolfowitz, 1943), proceeds in the batch mode: we are given the data sequence as one batch rather than getting its elements sequentially one by one; see (Lehmann, 2006, Sect. 7.2) for a review. As always in classical hypothesis testing, testing exchangeability in the batch mode is based on p-values.
In this paper we will adapt standard methods of conformal testing to testing exchangeability in the batch mode. In particular, p-values will be replaced by e-values (Grünwald et al., 2023; Vovk & Wang, 2021), which are widely used in conformal testing: namely, conformal test martingales are obtained by compounding e-values. An important advantage of e-values is that their use facilitates efficient computations.
The null hypothesis of exchangeability will be defined in Sect. 2 using the terminology of compression modelling, widely used in conformal prediction (Vovk et al., 2022, Chap. 11). Compression modelling is an algorithm-free version of Kolmogorov’s way of stochastic modelling: cf. Vovk (2001), Vovk and Shafer (2003), V’yugin (2019, Sect. 2), and Vovk et al. (2022, Sect. 11.6.1). Kolmogorov’s original version will be discussed in Appendix 1.
In Sect. 2 we also define e-variables, which are functions for producing e-values in testing exchangeability (or another null hypothesis). We will derive our main e-variable as the likelihood ratio for a Markovian alternative hypothesis, which we will introduce in Sect. 4. A simple optimality property of the likelihood ratios is derived in Sect. 3.
After defining our main alternative hypothesis in Sect. 4, we derive an efficient algorithm for computing the corresponding e-variable. The power of this e-variable is the topic of Sect. 5. The algorithm’s performance in view of the results of Sect. 5 is studied in Sect. 6 using simulated data. Section 7 concludes.
Appendix 1 describes Kolmogorov’s original ideal picture of algorithmic randomness. In the following Appendix 2 we will discuss possible ways of making this picture more practical.
2 Testing exchangeability
We consider the simplest binary case, and our observation space is \(\textbf{Z}:=\{0,1\}\). Fix an integer \(N>1\), which we will refer to as the time horizon. We are interested in binary data sequences \((z_1,\dots ,z_N)\in \Omega :=\textbf{Z}^N\). A Kolmogorov compression model (KCM) is a summarising statistic \(t:\Omega \rightarrow \Sigma\), where \(\Sigma\) is a finite set (the summary space), together with the implicit statement that given the summary \(t(z_1,\dots ,z_N)\) (for which we do not make any stochastic assumptions) the actual data sequence \((z_1,\dots ,z_N)\) is generated from the uniform probability measure on the set \(t^{-1}(t(z_1,\dots ,z_N))\) of all data sequences compatible with the summary. Our null hypothesis is the KCM, which we call the exchangeability compression model (ECM), \(t_E(z_1,\dots ,z_N):=z_1+\dots +z_N\). (In the current binary case this is equivalent to the more standard definition
used in (Vovk et al., 2022, Sect. 11.3.1), \(t(z_1,\dots ,z_N):=\{z_1,\dots ,z_N\}\), where \(\{\dots \}\) stands for a multiset.)
KCM and ECM are two of the three main classes of models used in this paper. The third, largest, class will be introduced later in this section and called BCM. Therefore, the inclusions between the classes will be

$$\begin{aligned} \{\textrm{ECM}\} \subset \textrm{KCM} \subset \textrm{BCM}. \end{aligned}$$
(1)
Let us say that a probability measure P on \(\Omega\) agrees with a summarising statistic t if the data sequences with the same summary have the same P-probability. A probability measure P on \(\Omega\) is exchangeable if \(P(\{(z_1,\dots ,z_N)\})\) depends on \(z_1,\dots ,z_N\) only via \(z_1+\dots +z_N\) (equivalently, via the multiset \(\{z_1,\dots ,z_N\}\)).
Lemma 1
The exchangeable probability measures on \(\Omega\) are exactly the probability measures that agree with the ECM (the mixtures of the uniform probability measures on \(t_E^{-1}(k)\), \(k\in \{0,\dots ,N\}\)).
The easy proof of Lemma 1 is omitted. The lemma shows that, in terms of standard statistical modelling, we can define our null hypothesis as the set of all exchangeable probability measures on \(\Omega\).
An e-variable w.r. to a probability measure is a nonnegative function on \(\Omega\) with expectation at most 1. An exchangeability e-variable is a function \(E:\Omega \rightarrow [0,\infty )\) whose average over each \(t_E^{-1}(k)\) is at most 1. Such a function E can be used for testing the assumption of exchangeability: if E is chosen in advance, observing a very large \(E(\omega )\) for the realized outcome \(\omega \in \Omega\) casts doubt on the exchangeability assumption.
Alternatively (and equivalently), an exchangeability e-variable may be defined as an e-variable w.r. to every exchangeable probability measure.
Proposition 2
The two meanings of an exchangeability e-variable coincide.
Proof
If the average of E over each \(t_E^{-1}(k)\) is at most 1, it will be an e-variable w.r. to each exchangeable probability measure by Lemma 1.
Now suppose E is an e-variable w.r. to each exchangeable probability measure. Since the uniform probability measure on \(t_E^{-1}(k)\) is exchangeable, the average of E over \(t_E^{-1}(k)\) will be at most 1. \(\square\)
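As a brute-force illustration (a Python sketch of ours, not part of the paper), for small N one can check the blockwise definition directly: a function \(E:\Omega \rightarrow [0,\infty )\) is an exchangeability e-variable if and only if its arithmetic mean over each level set of \(z_1+\dots +z_N\) is at most 1.

```python
from itertools import product
from fractions import Fraction

def is_exchangeability_e_variable(E, N):
    """Check that the mean of E over each level set of z1+...+zN is <= 1."""
    classes = {}
    for z in product((0, 1), repeat=N):
        classes.setdefault(sum(z), []).append(z)
    return all(
        sum(Fraction(E(z)) for z in zs) <= len(zs)  # sum <= |class| means mean <= 1
        for zs in classes.values()
    )

N = 5
# Puts all its mass on one sequence of the class with one 1: that class has
# N elements, so the class mean is exactly 1, and the other classes average 0.
E_ok = lambda z: N if z == (1,) + (0,) * (N - 1) else 0
# A constant 2 has mean 2 over every class, so it is not an e-variable.
E_bad = lambda z: 2
```

Here `is_exchangeability_e_variable(E_ok, N)` holds while the constant 2 fails the check.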
All null hypotheses discussed in this paper will be KCMs. In the main part of the paper we will concentrate on the ECM, but in this section and the next we will also give more general definitions. An e-variable w.r. to a KCM t is a function \(E:\Omega \rightarrow [0,\infty )\) such that the arithmetic mean of E over \(t^{-1}(\sigma )\) is at most 1 for every \(\sigma \in t(\Omega )\). E-values are values taken by e-variables.
2.1 Disintegration of the alternative hypothesis
Let us fix a simple alternative hypothesis Q, which is a probability measure on \(\Omega\). Our statistical procedures will depend on Q only via the corresponding batch compression model (BCM). A BCM is a pair (t, P) such that \(t:\Omega \rightarrow \Sigma\) is a summarising statistic and \(P:\Sigma \hookrightarrow \Omega\) [to use the notation of (Vovk et al., 2022, Sect. A.4)] is a Markov kernel such that \(P(\sigma )\) is concentrated on \(t^{-1}(\sigma )\) for each \(\sigma \in \Sigma\). As before, we refer to \(t(\omega )\) as the summary of \(\omega\). Kolmogorov compression models are a special case in which each \(P(\sigma )\) is the uniform probability measure on \(t^{-1}(\sigma )\).
Remark 1
Batch compression models are standard and are often used without giving them any name, as in Lauritzen (1988). They are the batch counterpart of online compression models used in conformal prediction (Vovk et al., 2022, Chap. 11). The three classes shown in (1) are used in different contexts in this paper: general BCMs serve as alternative hypotheses, the null hypothesis of interest in the main part of the paper is the ECM, and in the appendix we will discuss more general KCMs as null hypotheses.
With an alternative hypothesis Q and a summarising statistic \(t:\Omega \rightarrow \Sigma\) (serving as null hypothesis) we associate the alternative Markov kernel \(\sigma \in \Sigma \mapsto Q_{\sigma }\) defined by

$$\begin{aligned} Q_{\sigma }(A) := \frac{Q(A\cap t^{-1}(\sigma ))}{Q(t^{-1}(\sigma ))}, \quad A\subseteq \Omega . \end{aligned}$$
(2)
(We are mainly interested in alternative hypotheses Q for which the denominator of (2) is always positive, but in general we could set, e.g., \(0/0:=1/2\) in our binary context.) As compared with Q, the alternative Markov kernel loses the information about \(Q(t^{-1}(\sigma ))\) for \(\sigma \in \Sigma\). (And of course, the reader should keep in mind that alternative Markov kernels and Markov alternative hypotheses are completely different objects, despite both being named after Andrei Andreevich Markov Sr.)
3 Frequentist performance of e-variables
Suppose Q (the alternative probability measure) is the true data-generating distribution, and we keep generating data sequences \((z_1,\dots ,z_N)\in \Omega\) from Q in the IID fashion. The following lemma allows us to define the efficiency of an e-variable via its frequentist performance when we keep applying it repeatedly to accumulate capital. This is a special case of Kelly’s criterion (Kelly, 1956).
Lemma 3
Consider an e-variable E w.r. to a Kolmogorov compression model \(t:\Omega \rightarrow \Sigma\). For any alternative probability measure Q on \(\Omega\), the limit

$$\begin{aligned} \lim _{I\rightarrow \infty }\frac{1}{I}\ln \prod _{i=1}^{I}E(z_1^i,\dots ,z_N^i), \end{aligned}$$
(3)

where \((z_1^i,\dots ,z_N^i)\) is the ith data sequence generated from Q independently, exists \(Q^{\infty }\)-almost surely. Moreover, for all E and Q, the limit (3) equals

$$\begin{aligned} {{\,\textrm{ep}\,}}_Q(E) := \int \ln E \,\textrm{d}Q. \end{aligned}$$
(4)
The interpretation of (3) is that our capital \(\prod _{i=1}^I E(z_1^i,\dots ,z_N^i)\) grows exponentially fast when betting repeatedly using E (we will see later, in Lemma 4, that we can indeed expect it to grow rather than shrink if we can guess a good Q), and its rate of growth is given by the expression (4), which we will refer to as the e-power of E under the alternative Q.
Proof
It suffices to rewrite (3) as

$$\begin{aligned} \lim _{I\rightarrow \infty }\frac{1}{I}\sum _{i=1}^{I}\ln E(z_1^i,\dots ,z_N^i) \end{aligned}$$
and apply Kolmogorov’s law of large numbers to the IID random variables \(\ln E(z_1^i,\dots ,z_N^i)\) with expectation \(\int \ln E \,\textrm{d}Q\) (which exists and is finite since the sample space is assumed to be finite). \(\square\)
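Lemma 3 can be illustrated numerically: build a small exchangeability e-variable by normalizing arbitrary positive weights within each type class, pick a Markov alternative Q, and check that the empirical average of \(\ln E\) over many IID draws from Q approaches \(\int \ln E\,\textrm{d}Q\). A sketch (all concrete choices and names are ours):

```python
import math
import random
from itertools import product

N = 4
# An exchangeability e-variable: arbitrary positive weights, normalized to
# have mean exactly 1 within each level set of sum(z).
raw = {z: 2.0 ** z[0] for z in product((0, 1), repeat=N)}  # favours z starting with 1
classes = {}
for z in raw:
    classes.setdefault(sum(z), []).append(z)
E = {}
for zs in classes.values():
    mean = sum(raw[z] for z in zs) / len(zs)
    for z in zs:
        E[z] = raw[z] / mean

# Alternative Q: symmetric Markov chain with flip probability 0.1.
def q_prob(z, flip=0.1):
    p = 0.5
    for a, b in zip(z, z[1:]):
        p *= flip if a != b else 1 - flip
    return p

def q_sample(rng, flip=0.1):
    z = [int(rng.random() < 0.5)]
    for _ in range(N - 1):
        z.append(1 - z[-1] if rng.random() < flip else z[-1])
    return tuple(z)

e_power = sum(q_prob(z) * math.log(E[z]) for z in E)  # exact integral of ln E dQ

rng = random.Random(0)
I = 200_000
mc = sum(math.log(E[q_sample(rng)]) for _ in range(I)) / I
```

With this seed, `mc` agrees with `e_power` to well within Monte Carlo error.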
To justify the expression (4) using frequentist considerations, we do not really need the IID picture, as emphasized by Neyman (Neyman, 1977, Sect. 10). When generating \(z_1^i,\dots ,z_N^i\) for different i, we may test different Kolmogorov compression models \(t=t_i\), perhaps with different time horizons \(N=N_i\), against different alternatives \(Q=Q_i\) and using different \(E_i\). The corresponding generalization of Lemma 3 states that the long-term rate of growth of our capital will be asymptotically close to the arithmetic average of \(\int \ln E_i \,\textrm{d}Q_i\). It will involve certain regularity conditions needed for the applicability of the martingale strong law of large numbers [e.g., in the form of (Shafer & Vovk 2019, Chap. 4), which allows non-stochastic choice of \(N_i\), \(t_i\), \(Q_i\), and \(E_i\)]. If the alternative hypothesis does not hold in all trials, Lemma 3 is still applicable to the trials where it does hold.
Now it is easy to find the optimal, in the sense of \({{\,\textrm{ep}\,}}_Q\), e-variable; it will be the ratio of the alternative Markov kernel to the null hypothesis.
Lemma 4
The maximum of \({{\,\textrm{ep}\,}}_Q\) is attained at

$$\begin{aligned} E(\omega ) := \left| t^{-1}(t(\omega ))\right| \frac{Q(\{\omega \})}{Q(t^{-1}(t(\omega )))}, \quad \omega \in \Omega . \end{aligned}$$
(5)
In this case,

$$\begin{aligned} {{\,\textrm{mep}\,}}(Q) := {{\,\textrm{ep}\,}}_Q(E) = \int \ln \left| t^{-1}(t(\omega ))\right| \,Q(\textrm{d}\omega ) + H(t_*Q) - H(Q), \end{aligned}$$
(6)

where \(t_*Q\) (a probability measure on the summary space \(\Sigma\)) is the push-forward measure

$$\begin{aligned} (t_*Q)(\{\sigma \}) := Q(t^{-1}(\sigma )), \quad \sigma \in \Sigma , \end{aligned}$$

of Q by t (the summarising statistic of the null hypothesis), and \(H(\cdot )\) stands for the entropy.
We will call \({{\,\textrm{mep}\,}}(Q)\) defined by (6) the maximum e-power of the alternative Q. A sizeable \({{\,\textrm{mep}\,}}(Q)\) for a plausible alternative Q means that the testing problem is not hopeless and has some potential.
The guarantee given by Lemma 3, however, is frequentist and not applicable if testing is done only once, in which case we also want the optimal e-variable (5) not to be too volatile.
Proof
In this paper we let \(U_A\) stand for the uniform probability measure on a finite non-empty set A. The optimization \(\int \ln E \,\textrm{d}Q\rightarrow \max\) can be performed inside each block \(t^{-1}(\sigma )\) separately. Using the nonnegativity of the Kullback–Leibler divergence, we have, for each \(\sigma \in t(\Omega )\),
for each e-variable \(E'\) w.r. to t, which implies the first statement (about (5)) of the lemma. The second statement (6) follows from
where \({{\,\mathrm{\textrm{KL}}\,}}\) stands for the Kullback–Leibler divergence. \(\square\)
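Since the display for (5) is not reproduced in this version, here is a brute-force Python sketch of the optimal e-variable as the text describes it — the density of the alternative Markov kernel w.r. to the uniform measure on each block, which works out to \(\left| t^{-1}(t(\omega ))\right| Q(\{\omega \})/Q(t^{-1}(t(\omega )))\) (our reconstruction) — together with a numerical check of its optimality against an ELB-style competitor. The Markov alternative and all names are our choices.

```python
import math
from itertools import product
from math import comb

N = 6

def q_prob(z, p01=0.2, p10=0.3):  # a fixed Markov alternative (our choice)
    p = 0.5
    for a, b in zip(z, z[1:]):
        p *= (p01 if b else 1 - p01) if a == 0 else ((1 - p10) if b else p10)
    return p

omega = list(product((0, 1), repeat=N))
q_class = {k: sum(q_prob(z) for z in omega if sum(z) == k) for k in range(N + 1)}

def e_opt(z):
    """Optimal e-variable: |class| * Q({z}) / Q(class)."""
    k = sum(z)
    return comb(N, k) * q_prob(z) / q_class[k]

def e_elb(z):
    """ELB-style competitor: |class| * Q({z})."""
    return comb(N, sum(z)) * q_prob(z)

ep_opt = sum(q_prob(z) * math.log(e_opt(z)) for z in omega)
ep_elb = sum(q_prob(z) * math.log(e_elb(z)) for z in omega)
```

The class means of `e_opt` are 1, and `ep_opt >= ep_elb`, as Lemma 4 predicts.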
4 An explicit algorithm for Markov alternatives
Starting from this section we will consider a specific alternative hypothesis obtained by mixing Markov probability measures. The corresponding exchangeability e-variable will be computable in linear time, O(N).
First let us fix some terminology. The exchangeability summary, or exchangeability type, of a data sequence \(z_1,\dots ,z_N\) is the pair \((N_0,N_1)\) of the numbers of 0s and 1s in it. (It carries the same information as just the number of 1s, but we prefer a symmetric definition despite some redundancy.) By a “substring” we always mean a contiguous substring. The Markov type of \(z_1,\dots ,z_N\) is the sextuple \((F,N_{00},N_{01},N_{10},N_{11},L)\), where \(N_{i,j}\) is the number of times (i, j) occurs as a substring in the sequence \(z_1,\dots ,z_N\) (with the comma often omitted), and F and L are the first and last bits of the sequence.
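Both types can be computed in a single pass over the sequence (a small sketch; the function names are ours):

```python
def exchangeability_type(z):
    """(N0, N1): the numbers of 0s and 1s in a binary sequence z."""
    n1 = sum(z)
    return (len(z) - n1, n1)

def markov_type(z):
    """(F, N00, N01, N10, N11, L) for a nonempty binary sequence z."""
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(z, z[1:]):  # contiguous pairs only
        n[(a, b)] += 1
    return (z[0], n[(0, 0)], n[(0, 1)], n[(1, 0)], n[(1, 1)], z[-1])
```

For example, the sequence 01100 has exchangeability type (3, 2) and Markov type (0, 1, 1, 1, 1, 0).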
As our alternative hypothesis, we will take the uniform mixture of the Markov probability measures, defined as follows: \(\pi _{01}\) and \(\pi _{10}\) are generated independently from the uniform distribution \(U_{[0,1]}\) on [0, 1]; the first bit is chosen as 1 with probability 1/2, and after that each 0 is followed by 1 with probability \(\pi _{01}\), and each 1 is followed by 0 with probability \(\pi _{10}\). Let us compute the probability of a sequence of Markov type \((F,N_{00},\dots ,N_{11},L)\) under this probability measure:

$$\begin{aligned} \frac{1}{2} \int _0^1\int _0^1 (1-\pi _{01})^{N_{00}} \pi _{01}^{N_{01}} \pi _{10}^{N_{10}} (1-\pi _{10})^{N_{11}} \,\textrm{d}\pi _{01} \,\textrm{d}\pi _{10} = \frac{1}{2}\, \frac{N_{00}!\,N_{01}!}{(N_{0*}+1)!}\, \frac{N_{10}!\,N_{11}!}{(N_{1*}+1)!}, \end{aligned}$$
(7)

where \(N_{i*}:=N_{i,0}+N_{i,1}\). If \(N_{1-F}=0\), this probability is \(\frac{1}{2N}\) (which in fact agrees with the general expression (7)). We will refer to (7) as the UMM probability measure, or UMM alternative, where “UMM” stands for “uniformly mixed Markov”.
The uniform prior in (7) is used for mathematical convenience and computational efficiency, and it is discussed in greater detail at the end of Appendix 2.
For future use, set \(\pi _{00}:=1-\pi _{01}\) and \(\pi _{11}:=1-\pi _{10}\).
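The mixture probability can be computed exactly using the Beta integral \(\int _0^1 \pi ^a(1-\pi )^b\,\textrm{d}\pi = a!\,b!/(a+b+1)!\). The closed form below is our reconstruction of (7), consistent with the special case \(1/(2N)\) mentioned in the text:

```python
from fractions import Fraction
from itertools import product
from math import factorial

def umm_prob(z):
    """Exact probability of a binary sequence z under the UMM measure."""
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(z, z[1:]):
        n[(a, b)] += 1
    n0s = n[(0, 0)] + n[(0, 1)]  # N_{0*}
    n1s = n[(1, 0)] + n[(1, 1)]  # N_{1*}
    return (Fraction(1, 2)
            * Fraction(factorial(n[(0, 0)]) * factorial(n[(0, 1)]), factorial(n0s + 1))
            * Fraction(factorial(n[(1, 0)]) * factorial(n[(1, 1)]), factorial(n1s + 1)))
```

The probabilities sum to 1 over all of \(\{0,1\}^N\), and the all-ones sequence of length 3 gets probability \(1/6 = 1/(2N)\).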
Following (Vovk et al., 2022, Chap. 9), which in turn follows (Ramdas et al., 2022), let us define the lower benchmark

$$\begin{aligned} \textrm{LB}(z_1,\dots ,z_N) := \frac{Q(\{(z_1,\dots ,z_N)\})}{(N_0/N)^{N_0}(N_1/N)^{N_1}} \end{aligned}$$
(8)
as the ratio of the UMM alternative (7) to the maximum likelihood under the IID model (which consists of the IID probability measures \(B^N\), B being a probability measure on \(\{0,1\}\)). The idea behind the lower benchmark is that, for any IID probability measure \(B^N\), it is an e-variable w.r. to \(B^N\), i.e., satisfies \(\int \textrm{LB}\,\textrm{d}B^N\le 1\).
However, the IID model is not our null hypothesis, and our null hypothesis of exchangeability is slightly more challenging. Replacing in (8) the maximum likelihood over the IID model by the maximum likelihood over the exchangeable probability measures, we obtain the exchangeability lower benchmark

$$\begin{aligned} \textrm{ELB}(z_1,\dots ,z_N) := \left( {\begin{array}{c}N\\ N_1\end{array}}\right) Q(\{(z_1,\dots ,z_N)\}). \end{aligned}$$
(9)
The exchangeability lower benchmark (9) is a bona fide exchangeability e-variable.
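Since the displays (8) and (9) are not reproduced in this version, the following sketch implements the two benchmarks from their verbal descriptions (our reconstruction): the maximum likelihood under the IID model is \((N_0/N)^{N_0}(N_1/N)^{N_1}\), and the maximum likelihood over the exchangeable measures is \(1/\binom{N}{N_1}\), attained by the uniform measure on the type class.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial

def umm_prob(z):  # exact UMM probability via Beta integrals (our reconstruction)
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(z, z[1:]):
        n[(a, b)] += 1
    return (Fraction(1, 2)
            * Fraction(factorial(n[0, 0]) * factorial(n[0, 1]), factorial(n[0, 0] + n[0, 1] + 1))
            * Fraction(factorial(n[1, 0]) * factorial(n[1, 1]), factorial(n[1, 0] + n[1, 1] + 1)))

def lb(z):
    """Lower benchmark: Q({z}) over the IID maximum likelihood."""
    n1 = sum(z); n0 = len(z) - n1; n = len(z)
    ml = Fraction(n0, n) ** n0 * Fraction(n1, n) ** n1  # 0**0 == 1 in Python
    return umm_prob(z) / ml

def elb(z):
    """Exchangeability lower benchmark: Q({z}) over the exchangeable maximum likelihood."""
    return comb(len(z), sum(z)) * umm_prob(z)

N = 6
omega = list(product((0, 1), repeat=N))
# mean of ELB over the class {sum(z) == 2}; at most 1 for an exchangeability e-variable
elb_mean_2 = sum(elb(z) for z in omega if sum(z) == 2) / comb(N, 2)
# expectation of LB under Bernoulli(1/3)^N; at most 1 for an e-variable w.r. to it
lb_exp = sum(lb(z) * Fraction(1, 3) ** sum(z) * Fraction(2, 3) ** (N - sum(z)) for z in omega)
```

Both checks confirm the e-variable properties claimed in the text.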
However, our main object of interest in this paper is the more efficient (in the sense of its e-power) e-variable given by Lemma 4 with t being the exchangeability model and Q being the UMM alternative (7). We will refer to this optimal e-variable as the uniformly mixed Markov (UMM) e-variable. A more explicit expression for it and a way of computing it are given below as (14) and Algorithm 1, respectively.
Remark 2
In the spirit of (Koning 2024, Theorem 2) the value of the UMM e-variable on a data sequence \(z_1,\dots ,z_N\) can be written as

$$\begin{aligned} \frac{Q(\{(z_1,\dots ,z_N)\})}{\frac{1}{N!}\sum _{\sigma } Q(\{(z_{\sigma (1)},\dots ,z_{\sigma (N)})\})}, \end{aligned}$$
(10)
where Q is given by (7) and \(\sigma\) ranges over the permutations of \(\{1,\dots ,N\}\). Indeed, the denominator of (10) equals the average of \(Q(\{\omega \})\) over \(\omega \in t^{-1}(t(z_1,\dots ,z_N))\), and so the whole expression (10) equals (5) for \(\omega =(z_1,\dots ,z_N)\) (and t the exchangeability model).
In fact the UMM e-variable dominates the exchangeability lower benchmark. Indeed, the exchangeability lower benchmark replaces the right-hand side of (5) by \(\left| t^{-1}(t(\omega ))\right| Q(\{\omega \})\), and so ignores the denominator in (2). Namely, we have

$$\begin{aligned} \textrm{UMM}(\omega ) = \frac{\textrm{ELB}(\omega )}{Q(t^{-1}(t(\omega )))} \ge \textrm{ELB}(\omega ). \end{aligned}$$
For the e-power of the exchangeability lower benchmark we have the formula (6) with the second term \(H(t_*Q)\) omitted. Indeed, according to the proof of Lemma 4, that term corresponds to the denominator in (2), which the lower benchmark ignores.
The UMM e-variable and the lower benchmark are not comparable. On the one hand, the lower benchmark is not an exchangeability e-variable in general; it is only an e-variable w.r. to the narrower IID model. This tends to make the lower benchmark larger. On the other hand, the lower benchmark is not admissible under any IID probability measure \(B^N\), in the sense of \(\int \textrm{LB}\,\textrm{d}B^N < 1\), while the UMM e-variable is admissible under any exchangeable probability measure Q, meaning \(\int \textrm{UMM}\,\textrm{d}Q = 1\). This tends to make the UMM e-variable larger.
Remark 3
Notice that the difference between the assumptions of IID and exchangeability, while non-existent in the case of infinite data sequences (by de Finetti’s theorem, every exchangeable probability measure on \(\{0,1\}^{\infty }\) is a mixture of IID probability measures), becomes important for finite data sequences. The difference is quantified in Vovk (1986).
In the rest of this section we will see how to compute the UMM e-variable efficiently, i.e., the likelihood ratio of the UMM alternative Markov kernel (2) to the null Markov kernel. In our derivation we will use the terminology of (Vovk et al., 2005, Sect. 8.6) (such as “Markov graph”) and consider an arbitrary finite observation space \(\textbf{Z}\) (instead of \(\{0,1\}\), as in the rest of this paper); to avoid trivialities, let us assume \(|\textbf{Z}|>1\). We will also use the following facts (Vovk et al., 2005, Lemmas 8.5 and 8.6), which are versions of standard results in graph theory (the BEST theorem and the Matrix-Tree theorem).
Lemma 5
In any Markov graph \(\sigma\) with the set of vertices V the number of Eulerian paths from the source to the sink equals
where \(T(\sigma )\) is the number of spanning out-trees in the underlying digraph rooted at the source, \(N_{u,v}\) is the number of darts leading from u to v, and \({{\,\textrm{out}\,}}(\cdot )\) is the number of darts leaving a given vertex.
Proof
According to Theorem VI.28 in Tutte (1984) [and using the terminology of (Tutte 1984, Chap. VI)], the number of Eulerian tours in the underlying digraph is
If the source and sink coincide, the number of Eulerian paths is obtained by multiplying this expression by \({{\,\textrm{out}\,}}(\text {source})\). Finally, we erase the identities of different darts going from u to v for each pair of vertices (u, v) by dividing by \(N_{u,v}!\); the resulting expression agrees with (11).
Now suppose the source and sink are different vertices. Create a new digraph by adding another dart leading from the sink to the source. The number of Eulerian paths from the source to the sink in the old digraph will be equal to the number of Eulerian tours in the new graph, i.e.,
where \({{\,\textrm{out}\,}}\) refers to the old digraph. It remains to erase the identities of different darts going from u to v for each pair of vertices (u, v) in the old digraph; the resulting expression again agrees with (11). Alternatively, we can combine the two cases by always adding another dart leading from the sink to the source. \(\square\)
Lemma 6
To find the number \(T(\sigma )\) of spanning out-trees rooted at the source in the underlying digraph of a Markov graph \(\sigma\) with vertices \(z_1,\dots ,z_n\) (\(z_1\) being the source),
-
create the \(n\times n\) matrix with the elements \(a_{i,j}=-N_{z_i,z_j}\);
-
change the diagonal elements so that each column sums to 0;
-
compute the co-factor of \(a_{1,1}\).
Proof
This lemma can be derived from Theorem VI.28 in Tutte (1984). In that theorem we obtain \(T(\sigma )\) by computing the co-factor of any diagonal element \(a_{i,i}\), but that theorem is about Eulerian digraphs. We can make the underlying digraph of our Markov graph Eulerian by connecting the sink to the source. This operation does not affect the number of out-trees rooted at the source and does not change the co-factor of \(a_{1,1}\). \(\square\)
Let us specialize Lemmas 5 and 6 to the binary case \(\textbf{Z}:=\{0,1\}\).
Corollary 7
Let \(\sigma\) be a Markov graph with vertices in \(\{0,1\}\) and with \(F\in \{0,1\}\) as its source. The number of Eulerian paths from the source to the sink equals
where \(N_i:={{\,\textrm{in}\,}}(i)+1_{\{F=i\}}\) (\({{\,\textrm{in}\,}}(i)\) being the number of darts entering i, so that \(N_i\) is the number of occurrences of i on any Eulerian path) and \(N_{i,j}\) (with the comma often omitted) is the number of darts leading from i to j.
Proof
The case \(N_0\wedge N_1=0\) is obvious, so we will assume \(N_0\wedge N_1>0\). The number of spanning out-trees rooted at the source in the underlying digraph is

$$\begin{aligned} T(\sigma ) = N_{F,1-F}; \end{aligned}$$

this follows from Lemma 6 and is obvious anyway. It remains to plug this into Lemma 5: if the source F and sink L coincide, \(F=L\), we obtain
for the number of Eulerian paths from the source to the sink, and if \(F\ne L\), we obtain
both expressions agree with (12). \(\square\)
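The counts of Corollary 7 can be cross-checked by exhaustive enumeration; since the display (12) is not reproduced in this version, the sketch below only groups all sequences by Markov type and verifies a small count by hand.

```python
from collections import Counter
from itertools import product

def markov_type(z):
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(z, z[1:]):
        n[(a, b)] += 1
    return (z[0], n[(0, 0)], n[(0, 1)], n[(1, 0)], n[(1, 1)], z[-1])

N = 4
counts = Counter(markov_type(z) for z in product((0, 1), repeat=N))
```

For instance, the Markov type \((F,N_{00},N_{01},N_{10},N_{11},L)=(0,1,1,1,0,0)\) is realized by exactly the two sequences 0100 and 0010, and the counts over all types sum to \(2^N\).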
Combining (7) and (12), we obtain the total alternative weight (i.e., probability under the alternative hypothesis) of
for all data sequences of a given Markov type \(\sigma\).
Under the null hypothesis the probability of a data sequence of exchangeability type \((N_0,N_1)\) is

$$\begin{aligned} \left( {\begin{array}{c}N\\ N_1\end{array}}\right) ^{-1} = \frac{N_0!\,N_1!}{N!}, \end{aligned}$$
and so the likelihood ratio (the alternative over the ECM as the null hypothesis) is
(see (7) and (13)), where the \(\sigma\) in \(\sum _{\sigma }\) ranges over the Markov types \((f,n_{00},\dots ,n_{11},l)\) compatible with the exchangeability type \((N_0,N_1)\). The equality in (14) holds when \(N_0\wedge N_1>0\); in the case \(N_0\wedge N_1=0\) the likelihood ratio is 1 (and we will treat this case separately in Algorithm 1).
The expression (14) (interpreted as 1 when \(N_0\wedge N_1=0\)) is our main object of interest in this paper; remember that we refer to it as the UMM e-variable.
It remains to explain how to compute the second sum \(\sum _{\sigma }\) in (14) (which is twice as large as \(\sum _{\sigma }W(\sigma )\); in particular, it sums to 2 over all exchangeability types). Assume \(N_0\wedge N_1>0\) and remember that \(N\ge 2\). For \(\sigma =(f,n_{00},\dots ,n_{11},l)\) with \(f=l=0\) (which is only possible when \(N_0\ge 2\)), each such addend in the sum is
A specific Markov type \((f,n_{00},\dots ,n_{11},l)\) is determined (once we know that \(f=l=0\)) by \(n_{01}\), and its other components can be found from the equalities
The valid values for \(n_{01}\) are between 1 and \((N_0-1)\wedge N_1\), and so the part of the sum \(\sum _{\sigma }\) corresponding to such \(\sigma\) is
Both sides are well defined since \(N_0\ge 2\).
For \(\sigma\) with \(f=0\) and \(l=1\), the part of the sum \(\sum _{\sigma }\) corresponding to such \(\sigma\) is
For \(\sigma\) with \(f=1\) and \(l=0\), the part of the sum \(\sum _{\sigma }\) corresponding to such \(\sigma\) is
Finally, for \(\sigma\) with \(f=l=1\), the part of the sum \(\sum _{\sigma }\) corresponding to such \(\sigma\) is
Both sides of (18) are well defined since \(N_1\ge 2\).
We can simplify the sum of (15), (16), (17), and (18) as follows. If \(N_0<N_1\), the sum simplifies to
and if \(N_0=N_1\), the sum simplifies to \(2/(N_0+1)\). (There is no need to consider the case \(N_1<N_0\) because of the symmetry between \(N_0\) and \(N_1\).) Therefore, the sum over \(\sigma\) on the right-hand side of (14) is
The overall algorithm is presented as Algorithm 1. The value of the uniformly mixed Markov e-variable \(\textrm{UMM}\) is computed according to (14), and the value \(\textrm{ELB}\) of the exchangeability lower benchmark in line 5 is just (14) with the sum over the Markov types \(\sigma\) omitted. The variable \(\text {Sum}\) is set in lines 6–9 to \(\sum _{\sigma }W(\sigma )\) and computed according to (19). The output is returned by the return command, and the algorithm stops as soon as the first such command is issued.
The computational complexity of Algorithm 1 is clearly optimal (to within a constant factor) both time-wise and memory-wise. Namely, the algorithm requires O(N) steps and O(1) memory.
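Algorithm 1 itself is not reproduced in this version, but its output can be checked against a brute-force computation: by Lemma 4 with Q the UMM alternative, the UMM e-variable equals \(\binom{N}{N_1}Q(\{\omega \})/Q(t_E^{-1}(t_E(\omega )))\) (our reading of the text). The sketch below takes exponential time, unlike the O(N) Algorithm 1, and verifies two properties stated in the text: the average over each exchangeability class is exactly 1, and UMM dominates ELB.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial

def umm_prob(z):  # exact UMM probability (our Beta-integral reconstruction of (7))
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(z, z[1:]):
        n[(a, b)] += 1
    return (Fraction(1, 2)
            * Fraction(factorial(n[0, 0]) * factorial(n[0, 1]), factorial(n[0, 0] + n[0, 1] + 1))
            * Fraction(factorial(n[1, 0]) * factorial(n[1, 1]), factorial(n[1, 0] + n[1, 1] + 1)))

N = 8
omega = list(product((0, 1), repeat=N))
q_class = {k: sum(umm_prob(w) for w in omega if sum(w) == k) for k in range(N + 1)}

def umm_e(z):
    """UMM e-variable by brute force: binom(N, k) * Q({z}) / Q(class of z)."""
    k = sum(z)
    return comb(N, k) * umm_prob(z) / q_class[k]

# The average over each class is exactly 1 (shown here for k = 3) ...
mean_k3 = sum(umm_e(w) for w in omega if sum(w) == 3) / comb(N, 3)
# ... and UMM dominates ELB, since Q(class) <= 1.
dominates = all(umm_e(w) >= comb(N, sum(w)) * umm_prob(w) for w in omega)
```

All arithmetic is exact (rational), so `mean_k3` equals 1 exactly.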
5 Maximum e-power of the UMM alternative
In this section we will compute the asymptotic efficiency of the UMM e-variable under the UMM alternative. (In the next section, however, we will see the weakness of our notion of efficiency: it has a long-run frequency interpretation, but the logarithm of the UMM e-variable can be extremely volatile, and so its mathematical expectation can be very different from what we actually expect to observe.)
Proposition 8
Under the UMM alternative Q, the asymptotic e-power of the UMM e-variable \(\textrm{UMM}\) (for time horizon N) satisfies
The same expression gives the asymptotic e-power of the exchangeability lower benchmark (and of the lower benchmark).
Proof
Let us compute separately the three components after the “\(=\)” in (6), starting from the last one. When estimating \(-H(Q)\), we need to estimate the frequencies \(N_{00}\), \(N_{01}\), \(N_{10}\), \(N_{11}\) for a Markov chain with transition probabilities \(\pi _{i,j}\). To this end, we define a new Markov chain whose states are the pairs \(z_i z_{i+1}\), \(i=1,\dots ,N-1\), of adjacent states of the old Markov chain with the matrix of transition probabilities

$$\begin{aligned} \begin{pmatrix} \pi _{00} &{} \pi _{01} &{} 0 &{} 0 \\ 0 &{} 0 &{} \pi _{10} &{} \pi _{11} \\ \pi _{00} &{} \pi _{01} &{} 0 &{} 0 \\ 0 &{} 0 &{} \pi _{10} &{} \pi _{11} \end{pmatrix}; \end{aligned}$$
the rows and columns of this matrix are labelled by the states 00, 01, 10, and 11 of the new Markov chain, in this order. The stationary probabilities for this \(4\times 4\) matrix are

$$\begin{aligned} (\pi _0\pi _{00},\ \pi _0\pi _{01},\ \pi _1\pi _{10},\ \pi _1\pi _{11}), \end{aligned}$$

where \(\pi _0:=\pi _{10}/(\pi _{01}+\pi _{10})\) and \(\pi _1:=\pi _{01}/(\pi _{01}+\pi _{10})\).
Now, assuming that the observations are generated from a Markov chain with transition probabilities \(\pi _{i,j}\), we obtain (cf. (7))
(we are ignoring special cases such as \(N_{00}=0\), which should be considered separately). To find the expectation under the Bayes mixture of the Markov model with the uniform prior on \((\pi _{01},\pi _{10})\), we integrate
Now let us estimate the first term
after the “\(=\)” in (6). Set \(K:=\sigma\) (this is the number of 1s), and suppose the observations are generated from a Markov chain with given transition probabilities \(\pi _{01}\) and \(\pi _{10}\). We then have
where \(\pi _0\) and \(\pi _1\) are the stationary probabilities

$$\begin{aligned} \pi _0 = \frac{\pi _{10}}{\pi _{01}+\pi _{10}}, \qquad \pi _1 = \frac{\pi _{01}}{\pi _{01}+\pi _{10}} \end{aligned}$$
of the Markov chain. It remains to take the integral
The final term \(H(t_*Q)\) in (6) can be ignored. Indeed, using the last expression in (7), we can bound the probability \((t_*Q)(\{K\})\), for any \(K\in \{1,\dots ,N-1\}\), by 1 from above and by \(1/(2N^3)\) from below:
(the expression after the first “\(\ge\)” being the probability of the sequence consisting of K 1s followed by \(N-K\) 0s). Therefore, \(H(t_*Q)=O(\ln N)\). (As always, the extreme cases \(K\in \{0,N\}\) should be considered separately.)
Combining (20) and (21), we obtain the coefficient
in front of N in the asymptotic expression for \({{\,\textrm{ep}\,}}_Q(\textrm{UMM})\).
The proof shows that the asymptotic e-power is the same for the exchangeability lower benchmark, and a simple calculation using Stirling’s formula (see, e.g., [Vovk et al. 2022, Proposition 9.2]) shows that we also have the same asymptotic e-power for the lower benchmark. \(\square\)
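The stationary probabilities used in the proof (our reconstruction, as the display is elided: \(\pi _0=\pi _{10}/(\pi _{01}+\pi _{10})\) and \(\pi _1=\pi _{01}/(\pi _{01}+\pi _{10})\)) can be verified numerically:

```python
def stationary(p01, p10):
    """Stationary distribution of the two-state chain with flip probabilities p01, p10."""
    s = p01 + p10
    return p10 / s, p01 / s

p01, p10 = 0.3, 0.8
pi0, pi1 = stationary(p01, p10)
# one step of the chain leaves (pi0, pi1) unchanged
next0 = pi0 * (1 - p01) + pi1 * p10
next1 = pi0 * p01 + pi1 * (1 - p10)
```

One step of the transition matrix maps `(pi0, pi1)` back to itself, confirming stationarity.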
Proposition 8 states that the e-powers of the UMM e-variable and of the exchangeability lower benchmark are close asymptotically, and its proof gives a crude argument that is still sufficient to demonstrate this. The following corollary of the previous section’s results establishes much more precise relations between the UMM e-variable and the exchangeability lower benchmark.
Corollary 9
It is always true that

$$\begin{aligned} \textrm{UMM} \le 2N\cdot \textrm{ELB}. \end{aligned}$$
(24)
Moreover,
Proof
In the case \(N_0\wedge N_1>0\), the relation (25) follows from (19). If \(N_0=0\) or \(N_1=0\), the expression on the right-hand side of (25) becomes 2N, which agrees with the last expression (which simplifies to 1/(2N)) on the right-hand side of the chain (7).
For a fixed sum \(N_0+N_1\), the maximum of the right-hand side of (25) is attained for \(N_0=0\) or \(N_1=0\), and the maximum is 2N. This proves (24). \(\square\)
6 Computational experiments
In this section we will conduct three groups of experiments involving the two lower benchmarks and the UMM exchangeability e-variable. The first group is the main one, and in it the true data distribution is a specific Markov probability measure with the initial probability of 1 equal to 1/2. In this case, we define another benchmark [as in (Vovk et al. 2022, Sect. 9.2.5)],
the upper benchmark, as

$$\begin{aligned} \textrm{UB}(z_1,\dots ,z_N) := \frac{Q(\{(z_1,\dots ,z_N)\})}{\pi _0^{N_0}\,\pi _1^{N_1}} \end{aligned}$$
(26)
(cf. (7)), where \(\pi _0\) and \(\pi _1\) are the stationary probabilities under the true data-generating distribution. We can see that the upper benchmark is an e-variable (a likelihood ratio) only w.r. to one specific IID probability measure, and so it is not even an IID e-variable, let alone an exchangeability e-variable. Therefore, we should not be surprised if the upper benchmark exceeds a bona fide exchangeability e-variable; there are two elements of cheating in interpreting the upper benchmark as a measure of evidence against the null hypothesis of exchangeability: first, it tests IID rather than exchangeability, and second, it tests only one individual IID measure.
Our results for specific Markov alternatives are given in Fig. 1. This figure contains boxplots for \(K:=10^5\) simulations of four values: the exchangeability lower benchmark \(\textrm{ELB}\) (given by (9)), the lower benchmark \(\textrm{LB}\) (given by (8)), the upper benchmark \(\textrm{UB}\) (given by (26)), and the UMM exchangeability e-variable \(\textrm{UMM}\) (given by Algorithm 1). Only two of these, \(\textrm{ELB}\) and \(\textrm{UMM}\), are bona fide exchangeability e-variables.
The time horizon N and the transition probabilities for the two panels are given in the caption.
In both panels of Fig. 1 we consider symmetric Markov chains, \(\pi _{01}=\pi _{10}\), as alternatives to exchangeability. The observations are generated from those alternative probability measures. In the left panel we consider an “easy” case, \(\pi _{01}=0.1\), in the sense of being easily distinguishable from the case of exchangeability, \(\pi _{01}=0.5\). The case in the right panel, \(\pi _{01}=0.4\), is closer to exchangeability and thus more difficult. To decide which e-values are most interesting in practice I used Jeffreys’s (Jeffreys, 1961, Appendix 2) rule of thumb involving thresholds for e-values between \(10^{1/2}\) and 100. In the easy case, \(N=20\) observations are sufficient for the UMM e-variable to produce typical e-values that are of the same order of magnitude as Jeffreys’s thresholds. In the difficult case, we need more observations, and we set \(N:=400\).
UMM performs better than LB in both panels and, of course, better than ELB (we know that UMM dominates ELB). ELB and LB often fail to achieve Jeffreys’s low threshold of \(10^{1/2}\) for substantial evidence against the null hypothesis. It is interesting that \(\textrm{UMM}\) is often even higher than the upper benchmark, as in the right panel of Fig. 1.
Table 1 gives more precise numerical values that can be read off Fig. 1 only very approximately. The bars stand for the empirical averages of the decimal logarithms of ELB, LB, and UMM over the same \(K:=10^5\) simulations as in Fig. 1. The table also gives the difference between the empirical averages of the UMM and ELB and the upper bound for the difference given by (24).
According to Corollary 9, the UMM e-value cannot differ from the exchangeability lower benchmark by much. The upper bound (24) holds and is not excessively loose.
Figure 2 describes the second group of experiments and explores the behaviour of ELB, LB, UB, and UMM under the null hypothesis (as suggested by a referee). In the left panel the probability of 1 is 0.5, and all four are valid e-variables; while UB is not valid under exchangeability in general, it is valid under this particular exchangeable probability measure. The number of observations is \(N=20\). The UMM e-variable performs best in this case. The right panel has 0.1 as the probability of 1, which makes UB (still based on \(\pi _0=\pi _1=0.5\)) very invalid. Among the valid e-variables UMM still performs best.
The third group of experiments involves generating the binary observations from the UMM alternative (which is not Markov any longer). The explicit formula for this alternative is given in (7), but it is easier to generate \(\pi _{01}\) and \(\pi _{10}\) from the uniform distribution on \([0,1]^2\) and then generate the observations from the Markov chain with these parameters. This interpretation of the UMM alternative shows that our algorithm for testing exchangeability is now in a hostile environment: with a sizeable probability we will get \(\pi _{01}\approx \pi _{10}\), i.e., difficult data sequences that look almost exchangeable.
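The two-stage description of the UMM alternative just given can be transcribed directly; again an illustrative sketch, with the fair-coin initial state as an assumption.

```python
import random

def sample_from_umm(n, seed=None):
    """Draw (p01, p10) uniformly on [0,1]^2, then run the Markov chain.

    This mirrors the two-stage description of the UMM alternative in the
    text; the fair-coin initial state is an assumption of the sketch.
    """
    rng = random.Random(seed)
    p01, p10 = rng.random(), rng.random()
    x = [rng.randrange(2)]
    for _ in range(n - 1):
        p_one = p01 if x[-1] == 0 else 1 - p10
        x.append(1 if rng.random() < p_one else 0)
    return x, (p01, p10)

# With sizeable probability the drawn p01 and p10 are close to each other,
# producing a "hostile" data sequence that looks almost exchangeable.
seq, (p01, p10) = sample_from_umm(1000, seed=0)
```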
Figure 3 shows results for this case; in the expression (26) for the upper benchmark, we still set \(\pi _{0}:=\pi _{1}:=0.5\). It is striking how spread out the distributions for the three benchmarks and the UMM e-variable are, demonstrating the hostile nature of the testing environment. They are also skewed, with the mean very different from the median. To obtain UMM e-values that are consistently in Jeffreys’s range, now we need much larger values of N, such as \(10^3\), shown in the left panel of Fig. 3. The lack of validity for the upper benchmark is very obvious in Fig. 3: it takes much larger values, and I did not even bother to include the whole boxplots for it.
Table 2, which is analogous to Table 1, gives more precise numbers related to Fig. 3. As before, the bars stand for the empirical averages of the decimal logarithms over \(K=10^5\) replications, and N is the time horizon. Now we also have “as.”, the common theoretical asymptotic value for the UMM e-variable and exchangeability lower benchmark obtained from (23) by dividing by \(\ln 10\) (to convert natural logarithms to decimal ones) and multiplying by the sample size N. As expected, the approximation is least accurate for \(N=10^3\). The table also gives the average differences between the UMM e-variable and exchangeability lower benchmark on the \(\log _{10}\) scale, together with the upper bound given by (24). The upper bound still holds.
7 Conclusion
In this paper the algorithm for computing the UMM e-variable was fully developed only in the binary case. A natural next step would be to extend it to an arbitrary finite observation space \(\textbf{Z}\). (A big chunk of Sect. 4, following (Vovk et al., 2005, Sect. 8.6), presented the combinatorics for an arbitrary finite observation space \(\textbf{Z}\).) An interesting question is the computational complexity of such an extension of Algorithm 1 in general, as a function of N and \(|\textbf{Z}|\).
The topic of this paper has been testing the exchangeability compression model in the batch mode using Markov alternatives. There are many other interesting null hypotheses among Kolmogorov compression models, and there are many interesting alternatives. For example, in Vovk et al. (2022, Chap. 9) we discussed, alongside Markov alternatives, detecting changepoints. Our discussion there was in the online mode, but for changepoint detection the batch mode is no less important (Vovk et al., 2022, Remark 8.19); e.g., its role has been increasing in bioinformatics (including DNA analysis). Using e-values in changepoint detection is particularly convenient when multiple hypothesis testing is involved (as it often is in batch changepoint detection).
Availability of data and materials
Not applicable (all computational experiments in this paper are computer simulations and do not involve any real-world data).
Code availability
The code is available on the webpage http://www.alrw.net/ (under Working Paper 38).
Notes
In this paper, our notation for logarithms is \(\ln\) (natural) and \(\log\) (binary, used only in Appendix 1).
References
Asarin, E. A. (1987). Some properties of Kolmogorov \(\Delta\)-random finite sequences. Theory of Probability and Its Applications, 32, 507–508.
Asarin, E. A. (1988). On some properties of finite objects random in the algorithmic sense. Soviet Mathematics Doklady, 36, 109–112.
Grünwald, P., de Heide, R., & Koolen, W. M. (2023). Safe testing. Technical Report arXiv:1906.07801 [math.ST], arXiv.org e-Print archive. Journal version is to appear in the Journal of the Royal Statistical Society B (with discussion).
Grünwald, P. D. (2007). The minimum description length principle. MIT Press.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford University Press.
Kelly, J. L. (1956). A new interpretation of information rate. Bell System Technical Journal, 35, 917–926.
Kolmogorov, A. N. (1968). Logical basis for information theory and probability theory. IEEE Transactions on Information Theory, IT-14, 662–664.
Kolmogorov, A. N. (1983). Combinatorial foundations of information theory and the calculus of probabilities. Russian Mathematical Surveys, 38, 29–40.
Kolmogorov, A. N., & Uspensky, V. A. (1987). Algorithms and randomness. Theory of Probability and Its Applications, 32, 389–412.
Koning, N. W. (2024). Post-hoc and anytime valid inference for exchangeability and group invariance. Technical Report arXiv:2310.01153 [math.ST], arXiv.org e-Print archive.
Lauritzen, S. L. (1988). Extremal families and systems of sufficient statistics. Springer.
Lehmann, E. L. (2006). Nonparametrics: Statistical methods based on ranks (1st ed.). Springer.
Lindley, D. V. (2006). Understanding Uncertainty. Wiley.
Martin-Löf, P. (1966). The definition of random sequences. Information and Control, 9, 602–619.
Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese, 36, 97–131.
Novikov, G. (2016). Relations between randomness deficiencies. Technical Report arXiv:1608.08246 [math.LO], arXiv.org e-Print archive. Published in Lecture Notes in Computer Science 10307:338–350 (2017).
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. D. (Eds.). (2009). Dataset shift in machine learning. MIT Press.
Ramdas, A., Ruf, J., Larsson, M., & Koolen, W. M. (2022). Testing exchangeability: Fork-convexity, supermartingales and e-processes. International Journal of Approximate Reasoning, 141, 83–109.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11, 416–431.
Semenov, A., Shen, A., & Vereshchagin, N. (2024). Kolmogorov’s last discovery? (Kolmogorov and algorithmic statistics). Theory of Probability and Its Applications, 68, 582–606.
Shafer, G., & Vovk, V. (2019). Game-theoretic foundations for probability and finance. Wiley.
Shen, A., Uspensky, V. A., & Vereshchagin, N. (2017). Kolmogorov complexity and algorithmic randomness. American Mathematical Society.
Takeuchi, J., Kawabata, T., & Barron, A. R. (2013). Properties of Jeffreys mixture for Markov sources. IEEE Transactions on Information Theory, 59, 438–457.
Tutte, W. T. (1984). Graph theory. Addison-Wesley.
Uspensky, V. A., & Semenov, A. L. (1993). Algorithms: Main ideas and applications. Kluwer.
Vapnik, V. N. (1998). Statistical learning theory. Wiley.
Vovk, V. (1986). On the concept of the Bernoulli property. Russian Mathematical Surveys, 41, 247–248. Another English translation with proofs: arXiv:1612.08859 (2016).
Vovk, V. (2001). Kolmogorov’s complexity conception of probability. In V. F. Hendricks, S. A. Pedersen, & K. F. Jørgensen (Eds.), Probability theory: Philosophy, recent history and relations to science (pp. 51–69). Kluwer.
Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world (1st ed.). Springer. Section 8.6 of the first edition is not part of the second edition.
Vovk, V., Gammerman, A., & Shafer, G. (2022). Algorithmic learning in a random world (2nd ed.). Springer.
Vovk, V., & Shafer, G. (2003). Kolmogorov’s contributions to the foundations of probability. Problems of Information Transmission, 39, 21–31.
Vovk, V., & Shafer, G. (2023). A conversation with A. Philip Dawid. Statistical Science. Accepted for publication.
Vovk, V., & V’yugin, V. V. (1993). On the empirical validity of the Bayesian method. Journal of the Royal Statistical Society B, 55, 253–266.
Vovk, V., & Wang, R. (2021). E-values: Calibration, combination, and applications. Annals of Statistics, 49, 1736–1754.
V’yugin, V.V. (2019). Kolmogorov complexity in the USSR (1975–1982): Isolation and its end. Technical Report arXiv:1907.05056 [cs.GL], arXiv.org e-Print archive.
Wald, A., & Wolfowitz, J. (1943). An exact test for randomness in the non-parametric case based on serial correlation. Annals of Mathematical Statistics, 14, 378–388.
Acknowledgements
Many thanks to Wouter Koolen and the anonymous reviewers for useful comments and corrections. Research on this paper has been partially supported by Mitie.
Funding
This research has been supported by Mitie.
Author information
Contributions
Vladimir Vovk is the sole author; he conceived the idea and wrote it up.
Ethics declarations
Conflict of interest
The author has no conflict of interest to declare.
Ethics approval
Not applicable.
Consent to participate
Vladimir Vovk (sole author) gives his consent to participate.
Consent for publication
Vladimir Vovk (sole author) gives his consent to publication.
Additional information
Editors: Henrik Boström, Eyke Hüllermeier, Ulf Johansson, Khuong An Nguyen, Aaditya Ramdas.
Appendices
Appendix 1: Algorithmic theory of randomness
The topic of this appendix is Kolmogorov’s original approach to compression modelling. While in the main part of the paper we avoided using computability theory, here it will play an important role.
Kolmogorov’s complexity models were introduced, in their most complete form, in what appears to be Kolmogorov’s last talk. It was given on 14 October 1982 at what later became known as the Kolmogorov seminar; see (Semenov et al., 2024, note 12), containing Shen’s notes taken during the talk, and (Vovk, 2001, Sect. 4). The Kolmogorov seminar at Moscow State University was opened by Kolmogorov on 28 October 1981, and Kolmogorov gave two talks in it, on 26 November 1981 and 14 October 1982 (Semenov et al., 2024, note 12); the two talks were conflated in my paper (Vovk, 2001, Sect. 4).
All results listed in this appendix are either well known or immediately follow from well-known results.
1.1 Mathematical results
In this appendix, we assume a fixed sufficiently large aggregate X of constructive objects in the sense of Uspensky and Semenov (1993, Sect. 1.0.6). In particular, X contains the integers, the finite binary sequences, and the finite sets of those. Let us use the notation C(x) for the Kolmogorov complexity of x, \(C(x\mid y)\) for the conditional Kolmogorov complexity of x given y, K(x) for the prefix complexity of x, and \(K(x\mid y)\) for the conditional prefix complexity of x given y. Here x and y are any constructive objects from X. See, e.g., (Shen et al., 2017, Chaps. 1, 2, and 4) for definitions.
Kolmogorov’s definition of the randomness deficiency of an element x of a finite set \(A\subset X\) is
\(d_A^C(x) := \log |A| - C(x\mid A)\)  (27)
(Kolmogorov & Uspensky, 1987, Sect. 2.3).
Informally, x is random in A if \(d_A^C(x)\) is small. (And Kolmogorov called x \(\Delta\)-random in A if \(d_A^C(x)\le \Delta\).)
Martin-Löf (1966) showed that Kolmogorov’s definition (27) can be stated in terms of p-values. Let A be a finite non-empty subset of X; remember that \(U_A\) is the uniform probability measure on A. A function \(f:A\rightarrow [0,1]\) is a p-variable if \(U_A(\{x\in A: f(x)\le \epsilon \})\le \epsilon\) for all \(\epsilon \in [0,1]\).
A family P of functions \(P_A:A\rightarrow [0,1]\), A ranging over the finite non-empty subsets of X, is a p-test if
-
the function \((A,x)\mapsto P_A(x)\) is upper semicomputable, i.e., there is an algorithm that eventually stops on input \((A,x,\epsilon )\), where \(\epsilon\) is a rational number, if and only if \(P_A(x)<\epsilon\), and
-
for each finite non-empty \(A\subset X\), \(P_A\) is a p-variable.
The values taken by p-variables are p-values.
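As a small illustration of the notion of a p-variable under the uniform measure \(U_A\) (a sketch assuming the standard defining property \(U_A(f\le \epsilon )\le \epsilon\); since \(U_A(f\le \epsilon )\) only jumps at the attained values of f, it suffices to check \(\epsilon\) there):

```python
from fractions import Fraction

def is_p_variable(values):
    """Check U_A(f <= eps) <= eps, with U_A uniform on the finite set A.

    `values` lists f(x) for each x in A; since U_A(f <= eps) only jumps
    at the attained values of f, it suffices to check eps at those values.
    """
    n = len(values)
    for eps in values:
        if Fraction(sum(1 for v in values if v <= eps), n) > eps:
            return False
    return True

# The classical p-value rank/|A| for a total order on A is a p-variable,
ranks = [Fraction(r, 5) for r in range(1, 6)]
assert is_p_variable(ranks)
# while the constant 0 is not: U_A(f <= 0) = 1 > 0.
assert not is_p_variable([Fraction(0)] * 5)
```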
Lemma 10
There exists a universal p-test \({\tilde{P}}\), in the sense that for any p-test P there exists a positive constant c such that \({\tilde{P}} \le c P\).
The proof of Lemma 10 is standard (cf., e.g., Shen et al. 2017, Theorem 39). Fix a universal p-test \({\tilde{P}}\). The universal p-test is unique to within a constant factor, and it is customary in the algorithmic theory of randomness to disregard such differences, which we will also do in this appendix.
Remark 4
The usual definitions in the algorithmic theory of randomness are given in terms of \(-\log P\), but for simplicity let us discard the minus logarithm, following (Vovk & V’yugin, 1993).
Now we can state Martin-Löf’s result expressing Kolmogorov’s deficiency of randomness via the universal p-test.
Proposition 11
There exists a constant \(c>0\) such that, for all A and \(x\in A\),
Proof
Martin-Löf states and proves a slightly less general result in Martin-Löf (1966, Sect. II, Theorem on p. 607) (see also Martin-Löf, 1966, Sect. V, Theorem on p. 616), but his argument is general. Since, for each finite set \(A\subset X\) and each \(n\in \{0,1,\dots \}\), we have
we will also have
which implies the part
of (28).
To prove the other part of (28), i.e.,
it suffices to establish that, for some c (perhaps a different one),
A ranging over the finite non-empty subsets of X. The last inequality (with \(c:=0\)) follows immediately from the definition of a p-test. \(\square\)
Prefix complexity K has important technical advantages over C (see, e.g., Shen et al. 2017, Chap. 4), and so a natural modification of (27) is
\(d_A^K(x) := \log |A| - K(x\mid A)\).  (29)
Analogously to expressing (27) in terms of p-values, we can express (29) in terms of e-values.
A function \(f:A\rightarrow [0,\infty )\) on a finite non-empty set \(A\subset X\) is an e-variable if \(\frac{1}{|A|}\sum _{x\in A} f(x)\le 1\), i.e., if its expected value under \(U_A\) is at most 1.
A family E of functions \(E_A:A\rightarrow [0,\infty )\), A ranging over the finite non-empty subsets of X, is an e-test if
-
the function \((A,x)\mapsto E_A(x)\) is lower semicomputable, i.e., there is an algorithm that eventually stops on input (A, x, t), where t is a rational number, if and only if \(E_A(x)>t\), and
-
for each finite non-empty \(A\subset X\), \(E_A\) is an e-variable.
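A quick illustration of the notion of an e-variable under the uniform measure \(U_A\) (a sketch assuming the standard defining property that the mean of f under \(U_A\) is at most 1):

```python
from fractions import Fraction

def is_e_variable(values):
    """An e-variable on a finite set A has mean at most 1 under uniform U_A."""
    return sum(values, Fraction(0)) <= len(values)

# The prototypical e-variable is a likelihood ratio q(x)/U_A({x}) with q a
# probability measure on A: its mean under U_A is the total mass of q, i.e. 1.
q = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
e = [len(q) * qx for qx in q]
assert is_e_variable(e)
assert not is_e_variable([Fraction(2)] * 4)  # constant 2 has mean 2 > 1
```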
Lemma 12
There exists a universal e-test \({\tilde{E}}\), in the sense that for any e-test E there exists a positive constant c such that \({\tilde{E}} \ge E/c\).
The proof of Lemma 12 is again standard (but Shen et al., 2017, Theorem 47, is now more relevant). Fix a universal e-test \({\tilde{E}}\). It is clear that the universal e-test is unique to within a constant factor.
Notice the difference between the universal tests in Lemma 10 and Lemma 12: whereas in the former “universal” means “smallest” (to within a constant factor), in the latter “universal” means “largest”. The following result expresses the prefix version (29) of deficiency of randomness via the universal e-test.
Proposition 13
There exists a constant \(c>0\) such that, for all A and x,
Proposition 13 will follow from two other propositions (Propositions 15 and 16 below), which, despite their simplicity (especially Proposition 16), are of great independent interest.
A function \(f:A\rightarrow [0,1]\) on a finite non-empty set \(A\subset X\) is a subprobability measure (or semimeasure (Shen et al., 2017, Sect. 4.1)) if \(\sum _{x\in A} f(x)\le 1\).
A family m of functions \(m_A:A\rightarrow [0,1]\), A ranging over the finite non-empty subsets of X, is a lower semicomputable subprobability measure if
-
the function \((A,x)\mapsto m_A(x)\) is lower semicomputable, and
-
for each finite non-empty \(A\subset X\), \(m_A\) is a subprobability measure.
Lemma 14
There exists a universal lower semicomputable subprobability measure \({\tilde{m}}\), in the sense that for any lower semicomputable subprobability measure m there exists a positive constant c such that \({\tilde{m}} \ge m/c\).
For a proof of, essentially, Lemma 14, see the proof of Shen et al. (2017, Theorem 47). Let us abbreviate “universal lower semicomputable subprobability measure” to universal measure.
Proposition 15
There exists a constant \(c>0\) such that, for all A and x,
Proof
Follow (Shen et al., 2017, Sect. 4.5). \(\square\)
Proposition 16
There exists a constant \(c>0\) such that, for all A and x,
\({\tilde{E}}_A(x) \le c\,{\tilde{m}}_A(x)\,|A|\) and \({\tilde{m}}_A(x)\,|A| \le c\,{\tilde{E}}_A(x)\).  (31)
Proof
It suffices to notice that \({\tilde{m}}_A(x)|A|\) is an e-test and that \({\tilde{E}}_A(x)/|A|\) is a lower semicomputable subprobability measure. \(\square\)
The interpretation of (31) is that the universal e-test \({\tilde{E}}\) is a likelihood ratio: we divide the universal measure \({\tilde{m}}\) (“universal alternative hypothesis”) by the null uniform probability measure, assigning weight \(1/|A|\) to each \(x\in A\).
Now we can easily prove Proposition 13.
Proof of Proposition 13
Combining the previous propositions, we obtain
i.e., (30). The first equality in (32) just uses the definition of \(d^K_A(x)\), and the inequality “\(\le\)” in (32) is obtained by applying Proposition 15 to \(K(x\mid A)\) and applying Proposition 16 to \({\tilde{E}}_A(x)\). \(\square\)
The complexities C and K are close to each other, and so are the randomness deficiencies \(d^C\) and \(d^K\).
Proposition 17
There is a constant \(c>0\) such that, for all finite non-empty \(A\subset X\) and all \(x\in A\),
and
Proof
See Shen et al. (2017, Theorem 65) for inequalities stronger than (33). For (34), follow the proof of Novikov (2016, Proposition 1). \(\square\)
1.2 Discussion
Kolmogorov’s original definition of randomness deficiency of an element of a finite set is (27). It can be interpreted as the universal p-value on the logarithmic scale (Proposition 11). A natural modification of Kolmogorov’s definition is (29), given in terms of prefix complexity, and it can be interpreted as the universal e-value on the logarithmic scale (Proposition 13).
The simplest context in which these definitions can be used is that of complexity models, in the terminology of Vovk (2001); Vovk and Shafer (2003). A complexity model is a computable partition of the sample space, and the implicit statement about the observed data sequence x is that it is random in the sense of (27) (or (29), which is close to (27) by Proposition 17) in the block \(A\ni x\) of the partition. Let me give several examples of such models, those that are most relevant in the context of this paper. The sample space in all these examples will be \(\{0,1\}^*\).
-
The main complexity model of interest to Kolmogorov (1968, 1983) was that of exchangeability, where the binary sequences in \(\{0,1\}^*\) are divided into the blocks of sequences of the same length and with the same number of 1s. Stripping this complexity model of the algorithmic theory of randomness, we obtain the exchangeability compression model introduced in the main part of the paper (Sect. 2).
-
Another complexity model (Kolmogorov, 1983) is the Markov model, in which the blocks consist of the binary sequences with the identical first element and the same number of substrings 00, 01, 10, and 11. In the terminology of Vovk et al. (2022, Sect. 11.3.4), the exchangeability model is more specific than the Markov model.
-
A further generalization of the exchangeability complexity model is the second-order Markov model (suggested in Kolmogorov’s 1982 seminar talk (Vovk, 2001)), in which the blocks consist of the binary sequences with the identical first and second elements and the same number of substrings 000, 001, 010, 011, 100, 101, 110, and 111.
-
A model not considered by Kolmogorov is the changepoint model (exchangeability with a changepoint), in which the blocks are indexed by \((N,\tau ,K_0,K_1)\), where \(N\in \{2,3,\dots \}\) (the time horizon), \(\tau \in \{1,\dots ,N-1\}\) (the changepoint), \(K_0\in \{0,\dots ,\tau \}\), and \(K_1\in \{0,\dots ,N-\tau \}\), and the block \((N,\tau ,K_0,K_1)\) consists of all binary sequences of length N with \(K_0\) 1s among their first \(\tau\) elements and \(K_1\) 1s among their last \(N-\tau\) elements.
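The partitions just listed are determined by simple summarizing statistics, which can be transcribed directly (the function names below are ours, not the paper’s):

```python
def exchangeability_block(x):
    """Label of the exchangeability-model block: length and number of 1s."""
    return (len(x), sum(x))

def markov_block(x):
    """First element and the counts of the substrings 00, 01, 10, 11."""
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    for a, b in zip(x, x[1:]):
        counts[f"{a}{b}"] += 1
    return (x[0], counts["00"], counts["01"], counts["10"], counts["11"])

def changepoint_block(x, tau):
    """Index (N, tau, K0, K1) of the changepoint-model block containing x."""
    return (len(x), tau, sum(x[:tau]), sum(x[tau:]))

x = [0, 0, 1, 0, 1, 1]
assert exchangeability_block(x) == (6, 3)
assert markov_block(x) == (0, 1, 2, 1, 1)
assert changepoint_block(x, 3) == (6, 3, 1, 2)
```

Two sequences are in the same block of a model exactly when the corresponding statistic takes the same value on both.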
Other complexity models introduced by Kolmogorov were the Gaussian and Poisson models (in his 1982 seminar talk (Semenov et al., 2024, note 12); see also Asarin (1987, 1988) and Vovk (2001, Sect. 4)). A complexity model formalizing the property of being IID rather than exchangeability was introduced in work (Vovk, 1986) done under Kolmogorov’s supervision.
1.3 Stochastic sequences
Kolmogorov’s 1981 seminar talk was devoted to what he called stochastic sequences, which can be interpreted as an overarching structure over complexity models.
Let us say that a binary data sequence \(x\in X\) is \((\alpha ,\beta )\)-stochastic if there is a finite set \(A\subset X\) containing x such that \(C(A)\le \alpha\) and \(d_A^C(x)\le \beta\). And let us say that \(x\in X\) is \(\Delta\)-random w.r. to a complexity model if \(d_A^C(x)\le \Delta\), where A is the block of the complexity model containing x. Data sequences that are modelled using complexity models are stochastic; e.g., for some constant c:
-
if a data sequence of length N is \(\Delta\)-exchangeable (i.e., \(\Delta\)-random w.r. to the exchangeability model), it is \((2\log N+c,\Delta +c)\)-stochastic;
-
if a data sequence of length N is \(\Delta\)-Markov (i.e., \(\Delta\)-random w.r. to the Markov model), it is \((4\log N+c,\Delta +c)\)-stochastic;
-
if a data sequence of length N is \(\Delta\)-Markov of second order, it is \((8\log N+c,\Delta +c)\)-stochastic;
-
if a data sequence of length N is \(\Delta\)-random w.r. to the IID model introduced in Vovk (1986), it is \((\frac{3}{2}\log N+c,\Delta +c)\)-stochastic;
-
if a data sequence of length N is \(\Delta\)-exchangeable with one change point (i.e., \(\Delta\)-random w.r. to the changepoint model), it is \((4\log N+c,\Delta +c)\)-stochastic.
Appendix 2: Quasi-universal e-variables
In this paper we are interested, at least implicitly, in the universal e-test \({\tilde{E}}\) introduced in Lemma 12. It is a fundamental object in that its components \({\tilde{E}}_A\) are the largest e-variables; in this sense they are the most powerful e-variables. By Proposition 16, \({\tilde{E}}_A\) is the likelihood ratio of the universal measure to the null hypothesis \(U_A\). In the main part of the paper we discussed a specific alternative hypothesis (namely, UMM), and the universal measure can be regarded as the universal alternative.
The way the universal measure is constructed in the algorithmic theory of randomness is by averaging over all subprobability measures that are computable in a generalized sense (see, e.g., Shen et al. 2017, Theorem 47, the alternative proof).
The algorithmic theory of randomness, however, provides only an ideal picture. It can serve as a model for more practical approaches, but it is not practical itself. The two most conspicuous reasons are that:
-
the basic quantities used in the algorithmic theory of randomness, such as complexity or randomness deficiency, are not computable, let alone efficiently computable (they are only computable in a generalized sense); in particular, the universal alternative is not computable;
-
these basic quantities are only defined to within a constant (additive or multiplicative).
What we did in the main part of this paper can, however, be regarded as a computable approximation to the ideal picture. The idea (which is an old one; see the references below) is to replace the universal alternative by a Bayesian average of a statistical model that is significantly richer than the null hypothesis. In particular, the UMM exchangeability e-variable discussed in the main part of this paper can be regarded as a practical approximation to the universal e-test \({\tilde{E}}\).
The justification that we had for the UMM e-variable is less convincing than the justification for its ideal counterpart \({\tilde{E}}\): it is the frequentist one given by Lemma 3, which assumes that the observed data sequence is generated by the UMM alternative. Its advantage, however, is that this justification does not involve an arbitrary constant factor.
It would be more in the spirit of the algorithmic theory of randomness to use a different principle for choosing the alternative hypothesis: instead of choosing an alternative probability measure likely to generate the data, we could choose an alternative probability measure likely to lead to a high likelihood ratio of the alternative to the null.
The general scheme of testing exemplified by this paper is that we test a Kolmogorov compression model as null hypothesis, and have a batch compression model with a more detailed summarising statistic as alternative. This paper has the exchangeability compression model as the null and a mixture of the first-order Markov model as the alternative. We can imagine lots of other testing problems of this kind:
-
The exchangeability model as the null, and the uniform mixture of the second-order Markov model as the alternative.
-
The exchangeability model as the null, and a mixture of the uniform mixtures of the kth order Markov models as the alternative; the weights \(w_k\) for those should sum to 1, \(\sum _k w_k = 1\), and tend to 0 as slowly as possible as \(k\rightarrow \infty\) (see below).
-
The first-order Markov model as the null, and a mixture of the second-order Markov model as the alternative.
-
The exchangeability model as the null and the changepoint model as alternative.
-
A changepoint at a postulated time \(\tau\) as the null, and a mixture of probability measures corresponding to a changepoint at a different time as alternative. (In order to obtain confidence regions for the changepoint.)
We can call them instances of quasi-universal testing.
In information theory and statistics, quasi-universal prediction and coding (similar to the quasi-universal testing discussed here) were promoted by Rissanen; see, e.g., Rissanen (1983) and Grünwald’s review (Grünwald, 2007). Rissanen’s suggestion for the weights \(w_k\), \(k=1,2,\dots\), that sum to 1 and tend to 0 slowly was
where the denominator includes all terms that exceed 1 and \(c\approx 0.865\) is the normalizing constant (Rissanen, 1983, Appendix 1).
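Numerically, such slowly decaying weights can be sketched as follows; the base-2 logarithm and the truncation at a finite K are assumptions of this sketch, and the exact construction (including the normalizing constant) is in Rissanen (1983, Appendix 1).

```python
import math

def rissanen_denominator(k, log=math.log2):
    """Product of k, log k, log log k, ..., keeping only the factors > 1."""
    product, term = 1.0, float(k)
    while term > 1.0:
        product *= term
        term = log(term)
    return product

# Truncated and renormalized weights: a numerical stand-in for the ideal
# weights; they sum to 1 and decrease slowly.
K = 10_000
raw = [1.0 / rissanen_denominator(k) for k in range(1, K + 1)]
total = sum(raw)
w = [r / total for r in raw]
```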
In this paper we used the uniform prior on the Markov statistical model to obtain our alternative hypothesis. Another natural choice is Jeffreys priors (Jeffreys, 1961). They have some advantages, to be discussed in the next paragraph, but these advantages are much less pronounced in our current context than in others, where Jeffreys priors, e.g., are invariant w.r. to smooth reparametrizations (Jeffreys, 1961) and attain minimax optimality in some cases (Grünwald, 2007, Sect. 8.2) (perhaps after modifications). They do not always exist, and many Bayesian statisticians find them objectionable (see, e.g., Vovk & Shafer, 2023, Sect. 6). Using the uniform prior in this paper leads to simple analytical expressions and efficient calculations.
A typical advantage of Jeffreys priors over uniform priors is that they assign larger weights to extreme values of parameters. Let us discuss, for simplicity, the priors considered in Ramdas et al. (2022): \(\pi _{01}\) and \(\pi _{10}\) are generated independently from Jeffreys’s probability density
\(f(\theta ) = \frac{1}{\pi \sqrt{\theta (1-\theta )}}, \quad \theta \in (0,1)\)  (36)
(where \(\pi \approx 3.14\) is the standard mathematical constant, not to be confused with \(\pi _{i,j}\) and not used outside this and the next paragraph). These priors are built on top of Jeffreys priors but are not Jeffreys priors themselves (Takeuchi et al., 2013, Sect. 1). They are used in Ramdas et al. (2022) for tackling problems that are similar to ours (using the Markov model as alternative when testing exchangeability).
The density f in (36) dominates the uniform density (if we ignore the constant factor \(\pi\)), and it can be much larger than the uniform density at the ends \(\theta \approx 0\) and \(\theta \approx 1\) of the interval [0, 1]. This is a step towards quasi-universality, but the step is small: we can easily go further and consider, e.g., the beta distribution with density \(f(\theta )\propto \theta ^{\alpha -1}(1-\theta )^{\alpha -1}\) for a small \(\alpha >0\); this would not even complicate calculations. An even better choice would be in the direction of (35), which was an improvement on \(w_k\propto k^{\alpha -1}\), but this would complicate calculations enormously. A natural next step would be to assign small but positive point masses to \(\theta =0\) and \(\theta =1\).
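The effect described above can be checked numerically by comparing the symmetric beta densities Beta(\(\alpha ,\alpha\)): \(\alpha =1\) is the uniform prior used in this paper, \(\alpha =1/2\) is Jeffreys’s density for the Bernoulli model, and a smaller \(\alpha\) pushes even more mass towards the end-points. A sketch; the evaluation point 0.001 is an arbitrary choice.

```python
import math

def beta_density(theta, alpha):
    """Density of Beta(alpha, alpha) at theta in (0,1); alpha = 1 is uniform,
    alpha = 1/2 is Jeffreys's density for the Bernoulli model."""
    const = math.gamma(2 * alpha) / math.gamma(alpha) ** 2
    return const * (theta * (1 - theta)) ** (alpha - 1)

# Near the end-points, Jeffreys's density exceeds the uniform one, and a
# smaller alpha puts even more weight there.
assert beta_density(0.001, 0.5) > beta_density(0.001, 1.0)
assert beta_density(0.001, 0.1) > beta_density(0.001, 0.5)
```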
Using the uniform prior reflects an implicit assumption that we are making in this paper: all four probabilities \(\pi _{i,j}\), \(i,j\in \{0,1\}\), are middling ones (not too close to 0 or 1).
The idea of quasi-universal testing is closely related to Lindley’s “Cromwell’s rule” (see, e.g., Lindley, 2006, Sect. 6.8). A possible interpretation of Cromwell’s rule in our context is that, when designing a suitable e-variable, we should think of all kinds of alternative models (say, Markov models of all orders), and then mix all of them. Cromwell’s rule as stated by Lindley is very general and encompasses two aspects: our statistical models should be as wide as possible, and our priors should be diffuse (at least non-zero).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Vovk, V. Testing exchangeability in the batch mode with e-values and Markov alternatives. Mach Learn 114, 99 (2025). https://doi.org/10.1007/s10994-024-06720-x