1 Introduction

Learning Using Statistical Invariants (LUSI) has been proposed as a new paradigm of intelligence-driven learning. Incorporating domain knowledge into the model itself, through the use of invariants, is a powerful approach to achieving interpretability. This approach not only reduces the reliance on large amounts of data; as evidenced by Rueden et al. (2023), injecting knowledge also demonstrably improves models’ performance on out-of-distribution data, a crucial factor in high-stakes domains. The concept of invariants also holds significant promise for domain adaptation, where a model must effectively transfer knowledge to a new target domain (Zhao et al., 2019). Moreover, Mroueh et al. (2015) have shown that the use of invariants can significantly reduce the amount of data required to train a model. This is particularly advantageous in scenarios where data collection is expensive or resource-intensive.

Despite the promising direction in which LUSI is heading, it has seen very limited empirical success. It is therefore valuable to carefully investigate the underlying mechanisms of LUSI and its statistical invariants, so that we gain a better understanding of the strengths and limitations of the method. In particular, we are interested in examining scenarios where statistical invariants work well and scenarios where they do not.

In this paper, we carefully study the formulation of LUSI, covering both the LUSI loss function and the LUSI-SVM method. We show that while the current formulation of LUSI can work effectively in some scenarios, it fails to achieve its design goal of providing an intelligence-driven paradigm. We also provide empirical evidence via numerical experiments that highlights how the formulation of LUSI can be exploited to construct effective predicates, and that showcases problems with the LUSI formulation that hurt model performance.

2 Preliminary

We briefly introduce the generic learning framework of Learning Using Statistical Invariants (LUSI); please refer to Vapnik and Izmailov (2019) and Vapnik (2019) for details.

In LUSI, the invariant information is encoded in the so-called predicates, which are vectors \(\Phi\) of length N, where N is the number of observations. The predicates \(\Phi\) are designed to exploit the weak convergence of the learning process in the Hilbert space. In practice, we formulate predicates for data pairs \(\{(x_i, y_i)\}\) as

$$\begin{aligned} \sum _i \Phi _j (x_i) f(x_i) = \sum _i \Phi _j(x_i) y_i. \end{aligned}$$

With a slight abuse of notation, we use \(\Phi _j\) for both the function and the vector of its values at the observations; the meaning should be clear from the context. Often, we combine m predicates by summing their outer products to construct a \(\mathcal {P}\) matrix:

$$\begin{aligned} \mathcal {P}= \sum _j^m \Phi _j \Phi _j^T. \end{aligned}$$
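As a concrete illustration, the \(\mathcal {P}\) matrix can be assembled from predicate vectors in a few lines of numpy; the constant and linear predicates and the 1-D inputs below are merely illustrative choices, not prescribed by LUSI.

```python
import numpy as np

# Illustrative 1-D inputs and two simple predicate vectors (constant and linear).
x = np.linspace(-1.0, 1.0, 8)
predicates = [np.ones_like(x), x]            # each Phi_j is a length-N vector

# P = sum_j Phi_j Phi_j^T: an N x N symmetric positive semidefinite matrix
P = sum(np.outer(p, p) for p in predicates)
```

By construction, \(\mathcal {P}\) is symmetric and its rank is at most the number of predicates.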

Thus, the LUSI learning framework aims to minimize the loss function

$$\begin{aligned} \mathcal {L}(X, Y) = (f(X) - Y)^T ( \gamma \mathcal {V}+ \tau \mathcal {P}) (f(X) - Y). \end{aligned}$$

where we choose \(\mathcal {V}\) as part of the model specification process. In scenarios where we do not have prior beliefs about the model, we can simply use I instead of \(\mathcal {V}\).

It is common practice that we would use the kernel method with LUSI loss function. That is, we choose \(\hat{Y} = A^TK\) for the predictions, where K is the kernel. Hence, our loss function is

$$\begin{aligned} \mathcal {L}(X, Y) = (KA - Y)^T ( \gamma I + \tau \mathcal {P}) (KA - Y). \end{aligned}$$

Solving this analytically, we have

$$\begin{aligned} A = (( \gamma I + \tau \mathcal {P}) K+ \eta I_n)^{-1}( \gamma I + \tau \mathcal {P}) Y. \end{aligned}$$
(1)

where \(\eta\) is the Lagrangian multiplier used to solve the constrained optimization, which also serves the purpose of regularization, since it is clear that \(( \gamma \mathcal {V}+ \tau \mathcal {P})\) would cancel unless \(\eta > 0\). Without loss of generality, in the rest of this paper, we omit the bias term in our analysis because it is equivalent to adding a constant column to our data or assuming that the data are centered.
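For concreteness, a minimal numpy sketch of the closed-form solution in Eq. (1) with \(\mathcal {V} = I\) follows; the RBF kernel, the toy data, and the single constant predicate are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def lusi_krr_coefficients(K, P, Y, gamma=1.0, tau=1.0, eta=1e-3):
    """Closed-form LUSI coefficients A of Eq. (1), taking V = I."""
    n = K.shape[0]
    W = gamma * np.eye(n) + tau * P
    return np.linalg.solve(W @ K + eta * np.eye(n), W @ Y)

# Illustrative data: an RBF kernel and a single constant predicate.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
Y = np.sin(X[:, 0])
sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = np.exp(-sq)                   # RBF kernel matrix
Phi = np.ones(20)                 # constant predicate
P = np.outer(Phi, Phi)
A = lusi_krr_coefficients(K, P, Y)
Y_hat = K @ A                     # in-sample predictions with the original kernel
```

Note that the predictions are formed with the original kernel K, a point we return to in Sect. 3.2.2.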

3 Main results

3.1 Loss function

We show that the loss function presented in the LUSI formulation is equivalent to a weighted sum of square loss. Hence, they cannot provide structural information, such as symmetries, to the model.

Theorem 1

Suppose f is a function of a learning task from the data \((X,Y) = \{(x_i, y_i)\}\) using predicates \(\mathcal {P}\) constructed as

$$\begin{aligned} \mathcal {P} = \frac{1}{m}\sum _{j=1}^m \Phi _j\Phi _j^T. \end{aligned}$$
(2)

The loss function

$$\begin{aligned} \mathcal {L} = (Y-f(X))^T(\gamma I + \tau \mathcal {P})(Y-f(X)) \end{aligned}$$
(3)

is equivalent to a weighted squared loss in the rotated space via the eigenmatrix of \(\mathcal {P}\). Furthermore,

$$\begin{aligned} \mathcal {L} = \gamma \sum _i (y_i - f(x_i))^2 + \tau \sum _j^m \left( \sum _i \Phi _j(x_i) (y_i - f(x_i)) \right) ^2. \end{aligned}$$

Proof

$$\begin{aligned} \mathcal {L}&= (Y-f(X))^T(\gamma I+\tau \mathcal {P})(Y-f(X)) \nonumber \\&= (Y-f(X))^T(\gamma I+ Q^T \tilde{\Lambda } Q)(Y-f(X)) \end{aligned}$$
(4)
$$\begin{aligned}&= (Y-f(X))^T(Q^T \Lambda Q)(Y-f(X)) \nonumber \\&= (Q(Y-f(X)))^T\Lambda (Q(Y-f(X))) \end{aligned}$$
(5)

where Eq. (4) is a spectral decomposition of \(\tau \mathcal {P}\) (since \(\mathcal {P}\) is symmetric by construction). Furthermore, by construction of \(\mathcal {P}\),

$$\begin{aligned} \mathcal {L}&= (Y-f(X))^T(\gamma I+\tau \mathcal {P})(Y-f(X)) \nonumber \\&= \gamma \sum _i (y_i - f(x_i))^2 + \tau \sum _j^m \left( \sum _i \Phi _j(x_i) (y_i - f(x_i)) \right) ^2 \end{aligned}$$
(6)

\(\square\)

Theorem 1 shows that the loss function can be considered a regularized sum of squared losses, with the regularizer being a sum of squared weighted residuals. We show an example of exploiting this structure of the loss function in Sect. 6.1.
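The identity in Theorem 1 is easy to verify numerically. In the sketch below, the residuals and predicate vectors are random illustrative data, and \(\mathcal {P}\) is built as a plain sum of outer products (without the 1/m factor) so as to match Eq. (6) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
e = rng.normal(size=n)                       # residuals Y - f(X)
Phi = rng.normal(size=(m, n))                # m predicate vectors of length n
gamma, tau = 1.0, 0.5

# P as a plain sum of outer products, matching Eq. (6)
P = sum(np.outer(Phi[j], Phi[j]) for j in range(m))

quad = e @ (gamma * np.eye(n) + tau * P) @ e                      # Eq. (3)
decomposed = gamma * np.sum(e**2) + tau * np.sum((Phi @ e)**2)    # Eq. (6)
assert np.isclose(quad, decomposed)
```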

Remark 1

After rotation, there are only \(m+1\) unique weights for the loss.

Remark 2

Due to the construction, it can be cumbersome to apply weights to a certain set of observations.

3.1.1 Limitations of the predicates in loss function

As suggested in Vapnik (2019), the goals of predicates are fundamentally different from feature engineering, as they aim to shrink, not increase, the size of the set of admissible functions. However, since all predicates are constructed as weak convergence constraints between the function (the model) and the labels, any model whose output approximates the labels well will automatically satisfy the weak convergence constraints within a certain tolerance.

For any well-trained model, i.e. one that attains minimal prediction error, we can assume that \(|f(x_i)-y_i| < \epsilon / N\); then

$$\begin{aligned} \sum _i^N \Phi (x_i) (f(x_i)- y_i)&\le \sum _i^N \Phi (x_i) |f(x_i)- y_i| \nonumber \\&< \frac{\epsilon }{N} \sum _i^N \Phi (x_i). \end{aligned}$$
(7)

Since \(\Phi\) is predetermined, we can only reduce Eq. (7) by reducing \(\epsilon\), i.e. prediction error of the function f. Hence, these predicates will not help with selecting admissible functions.

Based on Eq. (6), we can apply different weights for different groups using a predicate.

$$\begin{aligned} \Phi (x) = {\left\{ \begin{array}{ll} w, & x \in X^{(1)}\\ 0, & x \in X\setminus X^{(1)} \end{array}\right. } \end{aligned}$$
(8)

where w is some scalar constant and \(X^{(1)} \subsetneqq X\). We will show a simple experiment with this method in Sect. 6.1. More sophisticated predicates might be constructed by exploiting Eq. (5). Ideally, we can choose predicates \(\Phi\) whose eigenvectors rotate the error matrix so that appropriate weights can be applied. However, this can be challenging or infeasible in practice.
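To make Eq. (8) concrete, the sketch below builds such a group-indicator predicate and evaluates the extra penalty it contributes via Eq. (6); the residuals, group membership, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
e = rng.normal(size=n)                       # residuals Y - f(X)
group = rng.random(n) < 0.3                  # membership in the subset X^(1)
w, tau = 2.0, 0.5

Phi = np.where(group, w, 0.0)                # the indicator predicate of Eq. (8)
penalty = tau * (Phi @ e)**2                 # its contribution to Eq. (6)
```

Note that the penalty acts on the aggregated residual of the group rather than on each observation individually (cf. Remark 2).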

3.2 Kernel ridge regression

An example of a LUSI method that warrants special attention is LUSI-SVM, which is a modified version of kernel ridge regression that incorporates predicates. We show that, after some linear algebra, the LUSI solution is nothing but a classical kernel ridge regression solution with a modified kernel. Modifications to the kernel in SVM have been studied in Amari and Wu (1999) and Crammer et al. (2002). Hence, the key to understanding the behavior of LUSI lies in understanding the modification of the kernel.

Proposition 2

The LUSI KRR solution Eq. (1) with \(\mathcal {V}\) set to the identity matrix, i.e.

$$\begin{aligned} A = (( \gamma I_n + \tau \mathcal {P}) K+ \eta I_n)^{-1}( \gamma I_n + \tau \mathcal {P}) Y. \end{aligned}$$
(9)

is equivalent to the KRR solution with a low-rank modification of the kernel. Specifically, the modification term is of rank m, the number of predicates.

Proof

We can simplify the solution as

$$\begin{aligned} A&= (( \gamma I_n + \tau \mathcal {P}) K+ \eta I_n)^{-1}( \gamma I_n + \tau \mathcal {P}) Y\\&= (K+ \eta ( \gamma I_n + \tau \mathcal {P})^{-1})^{-1} Y.\\ \end{aligned}$$

For simplicity, we can absorb \(\tau\) into \(\mathcal {P}\), as we can scale the predicates during construction. Since \(\mathcal {P}\) is symmetric and positive semidefinite by construction, we can decompose \(\mathcal {P}= \Psi \Psi ^T\) via spectral decomposition. Thus, we have

$$\begin{aligned} \left( \gamma I_n + \mathcal {P} \right) ^{-1}&= \left( \gamma I_n + \Psi \Psi ^{T} \right) ^{-1} \\&= \frac{1}{\gamma } I_n - \frac{1}{\gamma ^2} \Psi \left( I_m + \frac{1}{\gamma } \Psi ^{T} \Psi \right) ^{-1} \Psi ^{T}, \end{aligned}$$

where the second equality is due to the Kailath variant of the Woodbury identity (Petersen & Pedersen, 2012). Plugging it back into the LUSI solution of KRR, we have the modified kernel:

$$\begin{aligned} K^\prime = K - \frac{\eta }{\gamma ^2} \Psi \left( I_m + \frac{1}{\gamma } \Psi ^{T} \Psi \right) ^{-1} \Psi ^{T} \end{aligned}$$
(10)

The solution then can be written as

$$\begin{aligned} A = \Big (K^\prime + \frac{\eta }{\gamma } I_n\Big )^{-1} Y. \end{aligned}$$

\(\square\)
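As a sanity check on Proposition 2, the following numpy sketch verifies numerically that the solution of Eq. (9) coincides with classical KRR under the modified kernel of Eq. (10); the kernel, data, and predicates are arbitrary illustrative choices, with \(\tau\) absorbed into \(\Psi\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 15, 2
gamma, eta = 2.0, 0.1
X = 2.0 * rng.normal(size=(n, 3))
Y = rng.normal(size=n)
sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = np.exp(-sq)                               # illustrative RBF kernel
Psi = rng.normal(size=(n, m))                 # predicates; tau*P = Psi Psi^T

# LUSI solution, Eq. (9)
W = gamma * np.eye(n) + Psi @ Psi.T
A_lusi = np.linalg.solve(W @ K + eta * np.eye(n), W @ Y)

# classical KRR with the modified kernel K' of Eq. (10)
M = np.linalg.solve(np.eye(m) + (Psi.T @ Psi) / gamma, Psi.T)
K_mod = K - (eta / gamma**2) * (Psi @ M)
A_krr = np.linalg.solve(K_mod + (eta / gamma) * np.eye(n), Y)

assert np.allclose(A_lusi, A_krr)             # the two solutions coincide
```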

Remark 3

The formulation of LUSI does provide a mechanism to modify the initial kernel. However, we notice that the modification term is constructed as a sum of outer products. This, in a way, defeats the purpose of the kernel method, as it relies on practitioners to provide proper feature expressions explicitly, rather than constructing the kernel modification via kernel functions.

As we have shown, the modification term is of rank m, which is low considering that m is typically much smaller than the sample size. For the modification to be effective, we may want the modification term to be of full (or at least higher) rank. By modifying the coefficient of the identity matrix, we can make the modification term full-rank:

$$\begin{aligned} A&= \left( (\gamma I_n + \tau \mathcal {P})K + \eta I_n \right) ^{-1} (\gamma I_n + \tau \mathcal {P})Y \nonumber \\&= \left( K + \eta (\gamma I_n + \tau \mathcal {P})^{-1} \right) ^{-1} Y\nonumber \\&= \left( K + \eta (\frac{\gamma }{2} I_n + \frac{\gamma }{2} I_n + \tau \mathcal {P})^{-1} \right) ^{-1} Y \end{aligned}$$
(11)
$$\begin{aligned}&= \left( K + \eta (\frac{\gamma }{2} I_n + \Psi \Psi ^{T} )^{-1} \right) ^{-1} Y \nonumber \\&= \left( K + \frac{2\eta }{\gamma } I_n - \frac{4\eta }{\gamma ^2} \Psi \left( I_n + \frac{2}{\gamma } \Psi ^{T} \Psi \right) ^{-1} \Psi ^{T} \right) ^{-1} Y\nonumber \\&= \left( K - \frac{4\eta }{\gamma ^2} \Psi \left( I_n + \frac{2}{\gamma } \Psi ^{T} \Psi \right) ^{-1} \Psi ^{T} + \frac{2\eta }{\gamma } I_n \right) ^{-1} Y, \end{aligned}$$
(12)

In Eq. (11), we split the identity matrix in half, and in Eq. (12), we decompose \(\frac{\gamma }{2} I_n + \tau \mathcal {P}\) into \(\Psi \Psi ^T\). However, this boost in the rank of the modification is less than desirable: 1) the seemingly richer modification of the kernel comes at the cost of increased regularization; 2) the modification term is still of identity-plus-low-rank form, which is still not rich in approximation capacity. In Sect. 6.3, we show another significant limitation of the impact the predicates can have.

3.2.1 Special case: single predicate

Studying the special case of a single predicate serves as an intuitive example of why the LUSI formulation can often reduce the performance of the model. We also illustrate this with numerical experiments in a later section.

Since we only have one predicate, \(\mathcal {P}\) degenerates into a rank-1 matrix with \(\Psi = \Phi\), and the kernel modification term becomes

$$\begin{aligned} {-} \frac{\eta }{\gamma ^2}\left( 1 + \frac{1}{\gamma } \Phi ^{T} \Phi \right) ^{-1} \Phi \Phi ^{T}. \end{aligned}$$
(13)

This is simply a scaled version of \(\Phi \Phi ^T\). It is not hard to see that, regardless of which features we use in the construction of \(\Phi\), \(\Phi \Phi ^T\) does not possess the desirable properties of a kernel. Indeed, for any two observations such that \(\Phi _i, \Phi _j\) are small in numerical value, \((\Phi \Phi ^T)_{i,j}\) will also be small. This contradicts the typical expectation of kernel behavior, which resembles a covariance matrix, i.e. similar features should have a high response.

It is clear that the effectiveness of the LUSI method lies in the modification of the kernel using predicates. Due to the construction, there is no nonlinearity in the increments; therefore, the increment does not represent the similarity between observations as a typical kernel function would.

Remark 4

Using results from random matrix theory, we know that the effect of a rank-1 perturbation of a matrix diminishes as the dimension increases. This implies that, regardless of how we choose the predicate, in a problem with a reasonable amount of data we are unlikely to see any meaningful impact. This can be mitigated by using more predicates; however, by the very philosophy of LUSI, we are supposed to be frugal with the number of predicates.
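The diminishing effect can be illustrated numerically. The sketch below (illustrative RBF kernels and a generic random predicate) compares the relative Frobenius norm of the rank-1 kernel increment, i.e. the single-predicate case of Eq. (10), for a small and a large sample size.

```python
import numpy as np

def rel_modification(n, gamma=1.0, eta=0.1, seed=0):
    """Relative Frobenius norm of the rank-1 kernel increment (single-predicate
    case of Eq. (10)) against an illustrative RBF kernel of size n."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    K = np.exp(-sq)
    Phi = rng.normal(size=n)                 # a generic single predicate
    scale = (eta / gamma**2) / (1.0 + Phi @ Phi / gamma)
    delta = scale * np.outer(Phi, Phi)       # the rank-1 increment
    return np.linalg.norm(delta) / np.linalg.norm(K)

small, large = rel_modification(50), rel_modification(800)
# the relative size of the modification shrinks as n grows
```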

3.2.2 Effectiveness of the kernel increments

As illustrated in Sect. 3.2, while the LUSI solution is equivalent to the KRR solution with a modified kernel, the predictions are still made with the original kernel. This causes a mismatch between the solution coefficients A and the kernel K during the prediction phase, causing the model to perform worse with the predicates when trained on a sufficiently large dataset.

Theorem 3

(Predictive bias of LUSI) For a training set \(\mathcal {D}\), the LUSI solution for A in Eq. (1) induces a predictive bias in the following sense: The function \(f = KA\) is not the minimizer for the following loss

$$\begin{aligned} \mathcal {L}_\mathcal {D}(f) = \sum _\mathcal {D}(f(x_i)-y_i)^2 + \frac{\eta }{\gamma } \Vert f \Vert _K^2. \end{aligned}$$

Proof

Since the LUSI solution is equivalent to kernel ridge regression with modified kernel

$$\begin{aligned} K^\prime = K - \frac{\eta }{\gamma ^2} \Psi \left( I_m + \frac{1}{\gamma } \Psi ^{T} \Psi \right) ^{-1} \Psi ^{T}, \end{aligned}$$

the solution is optimal for function \(f = K^\prime A\) with respect to the loss function

$$\begin{aligned} \mathcal {L}_\mathcal {D}(f) = \sum _\mathcal {D}(f(x_i)-y_i)^2 + \frac{\eta }{\gamma } \Vert f\Vert _K^2. \end{aligned}$$

Therefore, \(f = KA\) is not the minimizer of the loss function above. \(\square\)

Remark 5

Theorem 3 reveals a potential reason why the LUSI method often seems to worsen the performance of the model. Since predicates are only involved during training and are not part of the model, the model will produce biased predictions without the predicates. However, as we show empirically in Sect. 6.3, the LUSI formulation actually imposes rather strong restrictions such that the kernel increments are of very small magnitude, which in turn also makes the predictive bias less pronounced. It is possible that the LUSI solution A can perform better than classical kernel ridge regression: since the sample size is finite, the A of KRR could be suboptimal in testing, and the bias induced by the predicates might compensate for this suboptimality. However, it can be impractical to deliberately find such predicates, and discarding the incremented kernel in testing forgoes an opportunity.

Remark 6

Note that in Eq. (10), the modification is in the form of the subtraction of a positive semidefinite matrix. In some cases, this can potentially reduce the stability of the kernel matrix.

We can decompose the risk of LUSI-SVM \(R_{LUSI}\) as

$$\begin{aligned} R_{LUSI}&= R_{SVM} + R_{K} + R_{n}, \end{aligned}$$

where \(R_{SVM}\) is the risk caused by using SVM, \(R_{K}\) is the risk caused by using a specific kernel matrix (both function and features) K, and \(R_{n}\) is the risk caused by using only a finite sample of size n. Since, by Theorem 3, the kernel modification is not used in testing, LUSI-SVM can only help reduce \(R_{n}\).

Moreover, the effectiveness of LUSI methods in kernel ridge regression is due to the implicit regularization imposed by the Lagrangian multiplier. Without such regularization, the LUSI solution collapses, as in the case of only using a modified loss function. Considering the Reproducing Kernel Hilbert Space (RKHS), Vapnik (2019) obtains the following solution for LUSI

$$\begin{aligned} A = ((\gamma \mathcal {V} + \tau \mathcal {P}) K + \eta I_n)^{-1}(\gamma \mathcal {V} + \tau \mathcal {P})(Y-c I_n), \end{aligned}$$
(14)

where K is the kernel. Since the kernel is fixed before entering the equation, the predicates \(\mathcal {P}\) cannot further increase the performance of the model beyond the ability of K. To see this, note that \(\eta\) is a small regularization weight; letting \(\eta \rightarrow 0\), we have

$$\begin{aligned} \lim _{\eta \rightarrow 0} A&= K^{-1}(\gamma \mathcal {V} + \tau \mathcal {P})^{-1}(\gamma \mathcal {V} + \tau \mathcal {P})(Y - cI_n) \\&= K^{-1}(Y - cI_n). \end{aligned}$$

We can observe that the predicates will only make a difference up to the effects of regularization induced by \(\eta\). Moreover, we know that \(\eta\) is simply the Lagrangian multiplier for the condition \(A^TKA \le B\) for some \(B < \infty\). In a general machine learning setting, it is unlikely that \(A^TKA\) will be large enough in practice to be considered unbounded, which means that the constraint is likely not active.
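This collapse is easy to verify numerically. In the sketch below, the kernel and data are illustrative, a generic positive-definite matrix stands in for \(\gamma \mathcal {V} + \tau \mathcal {P}\), and we take c = 0; the LUSI solution approaches \(K^{-1}Y\) as \(\eta \rightarrow 0\).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
X = 3.0 * rng.normal(size=(n, 2))            # well-spread points: well-conditioned K
Y = rng.normal(size=n)
sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = np.exp(-sq) + 1e-9 * np.eye(n)           # RBF kernel with a tiny jitter
W = np.eye(n) + np.outer(np.ones(n), np.ones(n))  # stands in for gamma*V + tau*P

def lusi_A(eta):
    # Eq. (14) with c = 0: A = (W K + eta I)^{-1} W Y
    return np.linalg.solve(W @ K + eta * np.eye(n), W @ Y)

A_limit = np.linalg.solve(K, Y)              # the eta -> 0 limit: K^{-1} Y
assert np.allclose(lusi_A(1e-9), A_limit, atol=1e-4)
```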

If we look at this more closely, the purpose of this bounded constraint is to ensure that weak convergence can imply strong convergence (Vapnik & Izmailov, 2019). However, we know that strong convergence and weak convergence are equivalent in the finite domain, meaning that the equivalence between strong and weak convergence is guaranteed without any further restrictions. We summarize this in the following proposition.

Proposition 4

LUSI solution degenerates into the least squares regression solution if we set the regularization coefficient \(\eta = 0\) in Eq. (14).

Remark 7

(\(\mathcal{V}\mathcal{P}\)-LUSI model) Vapnik and Izmailov (2019) also propose a \(\mathcal {V}\)-estimate method to be used alongside the predicates (statistical invariants). Unlike the predicates, the \(\mathcal {V}\) matrix is not constructed by a prescribed method. This allows the matrix to have arbitrary structure and rank, making it more flexible and effective in controlling the admissible functions. For example, by construction, \(\mathcal {P}= \sum _i^m \Phi _i \Phi _i^T\) is of rank m, which is much smaller than the number of observations, i.e. \(\mathcal {P}\) is a low-rank matrix. However, we can construct \(\mathcal {V}\) as a diagonal matrix and overcome the limitations of \(\mathcal {P}\). This further defeats the purpose of statistical invariants, as we are better off constructing a matrix or kernel modification ourselves than using predicates (statistical invariants).

4 Related work

The first known effort of learning using invariant information in addition to data is learning using invariant hints (Abu-Mostafa, 1990). Schulz-Mirbach (1994) proposes integration-calculus-based methods to construct invariant features. This route was more recently followed by Mroueh et al. (2015), Haasdonk et al. (2004), and Haasdonk and Burkhardt (2007), who use Haar integration to build invariant kernels. Wood and Shawe-Taylor (1996) propose a method based on representation theory to construct neural networks that are invariant under any given finite linear group. There have been many recent efforts to build neural networks that are invariant or equivariant to perturbations or group actions, e.g. Yarotsky (2022), Weiler et al. (2018), and Cohen and Welling (2016). In a more general setting, Bronstein et al. (2017, 2021) propose geometric deep learning as a generic framework to incorporate structural information into the model. Invariant representations have also been used in unsupervised learning, e.g. Anselmi et al. (2016).

Unlike the efforts mentioned above to construct neural networks with built-in invariance properties, Vapnik and Izmailov (2019) and Vapnik (2019) proposed the LUSI method to formulate the learning process using invariant information through predicates, which are functions of the data that carry invariant information. Vapnik (2019) also proposed the complete learning theory, an attempt to extend classical learning theory by utilizing both strong and weak convergence in the learning process. This learning paradigm is also related to learning using privileged information (Pechyony & Vapnik, 2010; Vapnik & Vashist, 2009): both learning paradigms try to improve the learning outcome by providing information beyond the data itself. In a similar spirit, Rueden et al. (2023) theoretically describe how knowledge injection can help with machine learning.

Most recently, Zhu et al. (2024) proposed a refinement of LUSI via a structured V matrix based on cluster information from the input data.

5 Case studies of predicates

Building upon the previous analysis, we examine several examples of predicates as illustrated in Vapnik and Izmailov (2019) and Vapnik (2019).

5.1 Simple predicates

There are several simple predicates suggested in Vapnik (2019):

$$\begin{aligned} \Phi (x) = 1, \;\; \Phi (x) = x,\;\; \Phi (x) = x^2. \end{aligned}$$

Since they are all designed to satisfy the constraints

$$\begin{aligned} \sum _i \Phi (x_i) f(x_i) = \sum _i \Phi (x_i) y_i, \end{aligned}$$

it is equivalent to require

$$\begin{aligned} \sum _i \Phi (x_i) (f(x_i) - y_i) = 0. \end{aligned}$$

These weak constraints could be useful in an ideal scenario in which we could efficiently remove functions that do not satisfy the conditions, and in which the learning time were directly related to the size of the set of admissible functions, so that a solution could be found faster thanks to the predicates. However, there is no feasible algorithm to achieve this. The LUSI algorithm usually comes down to a constrained optimization problem, which is usually not faster with more constraints.
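The point that a model fitting the data well automatically satisfies such constraints can be seen in a few lines of numpy; the data, the near-perfect model f, and the constant, linear, and quadratic predicates below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = np.sin(x)
f = y + rng.normal(scale=1e-4, size=100)     # a model that fits the data well

# Constant, linear, and quadratic predicates all yield near-zero
# invariant residuals automatically, without ever being imposed.
for Phi in (np.ones_like(x), x, x**2):
    assert abs(Phi @ (f - y)) < 0.1
```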

Remark 8

It is clear that the predicates cannot reject any solution that is consistent with the empirical data. Unfortunately, this is a fundamental flaw: it implies that regardless of how much knowledge we have, we cannot refine the set of solutions consistent with the data, which contradicts the motivation of the LUSI methods, i.e. proposing a paradigm of intelligence-driven learning.

5.2 Tangent distance

The use of tangent distance in machine learning was first proposed in Simard et al. (2000). Loosely speaking, the goal of tangent distance is to find a minimal distance between two input features over a set of transformations. It has been shown to help improve the accuracy of classifiers for learning tasks such as MNIST. Vapnik (2019) suggests that tangent distances can be used as predicates for 2D image classification tasks. The scheme of the predicate is simply to calculate the tangent distance of each observation to a fixed (pre-selected) image.

Beyond the common issues with predicates, we argue that a predicate constructed this way loses the valuable information that tangent distances could provide, which have been shown to be effective when used in feature engineering. This implies that the mechanism of the LUSI method is unable to utilize the information effectively. We make this idea concrete in the following sections.

5.2.1 Distance to original images

To utilize this idea, we consider a data set generated by tangent-transforming two original images. We can then calculate the tangent distance of each image with respect to these two original images. Since we cannot really know which image is the original, we can simply choose one at random. The predicates are

$$\begin{aligned} \Phi _1(x)&= \operatorname {dist} _{T}(x, x_1^*) \end{aligned}$$
(15)
$$\begin{aligned} \Phi _2(x)&= \operatorname {dist} _{T}(x, x_2^*) \end{aligned}$$
(16)

where \(x_1^*, x_2^*\) are the chosen original images, and \(\operatorname {dist} _T\) is the tangent distance. As features, \(\Phi _1, \Phi _2\) clearly provide crucial information about the problem; one can simply create a classification rule that returns the class with the shorter tangent distance. However, as predicates, they only enter the modeling process as

$$\begin{aligned} \sum _x \Phi _i(x) f(x) = \sum _x \Phi _i(x) y_x, \end{aligned}$$

which is equivalent to

$$\begin{aligned} \sum _x \Phi _i(x) (f(x) - y_x) = 0. \end{aligned}$$

As we have already shown, this condition is most easily satisfied by \(f(x) \approx y_x\). When f(x) differs from \(y_x\), the condition requires residual cancellations that are hard to interpret meaningfully.

Furthermore, using the tangent distance this way alone does not help us learn a better function that utilizes the tangent distance. Incorporating the tangent distance as an extra feature, on the other hand, would unavoidably raise the VC dimension of the function class, contradicting the purpose and objectives of LUSI.

6 Numerical experiments

6.1 Predicates in loss functions

In these experiments, we will exploit the structure of the loss function used in LUSI and show how we can come up with effective predicates.

As suggested in Sect. 3.1.1, we can exploit the LUSI loss function to achieve better performance in case of an imbalanced data set. To focus on the role of the loss function, we will use a simple neural network to approximate an arbitrary function in this experiment.

We choose the common data set “Magic” (Bock, 2004) for this experiment, since it is a relatively easy data set, which lets us focus on illustrating the effects of the loss functions. Based on the “Magic” dataset, we generate imbalanced training and balanced testing data: we sampled 5000 observations labeled “gamma” and 1000 observations labeled “hadron” as the training dataset, and sampled a disjoint 3000 observations per class as the testing dataset. To obtain more stable statistics, we repeat the sampling process 5 times and report the average.

Following our analysis in Sect. 3.1.1, we will choose the predicate

$$\begin{aligned} \Phi (x) = {\left\{ \begin{array}{ll} 0, & y_x = \text {gamma} \\ 1, & y_x = \text {hadron}, \end{array}\right. } \end{aligned}$$
(17)

where \(y_x\) denotes the label y corresponding to the input x. Since we oversampled observations labeled “gamma”, we would like to regularize the false positives with the predicate, i.e. when the ground truth is “hadron” but the model predicts “gamma”, the loss is higher. According to Eq. (6), this loss function can be rewritten as

$$\begin{aligned} \mathcal {L}(X, Y) = \sum _i (y_i - f(x_i))^2 + \lambda \left( \sum _i \Phi (x_i) (f(x_i) - y_i) \right) ^2. \end{aligned}$$

We also compare the model performance with a weighted squared loss function, i.e. we use \(\lambda \sum _i \Phi (x_i) (f(x_i) - y_i)^2\) as the second term. Furthermore, we also experiment with the simple predicates introduced in Vapnik and Izmailov (2019): we choose the predicates to be the first moment of each feature, plus the constant predicate, i.e. 11 predicates in total.
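A minimal numpy sketch of the two penalties being compared follows: the LUSI predicate penalty from Eq. (6) with the predicate of Eq. (17) versus the per-observation weighted squared loss. The toy labels and predictions are illustrative, not the “Magic” data.

```python
import numpy as np

def lusi_loss(y, f, phi, lam):
    """Squared loss plus the LUSI predicate penalty (aggregated residual)."""
    e = f - y
    return np.sum(e**2) + lam * (phi @ e)**2

def weighted_loss(y, f, phi, lam):
    """Squared loss plus per-observation weighted squared residuals."""
    e = f - y
    return np.sum(e**2) + lam * np.sum(phi * e**2)

# Illustrative: phi flags the minority class, as in Eq. (17).
y = np.array([0, 0, 0, 1, 1], dtype=float)
f = np.array([0.1, -0.2, 0.0, 0.6, 0.7])
phi = (y == 1).astype(float)
l1 = lusi_loss(y, f, phi, lam=2.0)      # penalizes the summed minority residual
l2 = weighted_loss(y, f, phi, lam=2.0)  # penalizes each minority residual
```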

Fig. 1
figure 1

Average testing accuracy for different loss functions and regularizations. The weight-adjusted loss function and the weighted squared loss function help improve testing accuracy, and accuracy is higher with stronger regularization. Accuracies for these two loss functions converge for larger \(\lambda\). The model using moments as predicates has in general the worst performance

Figure 1 shows that both regularization schemes significantly and consistently outperform the baseline (without regularization). The LUSI method using the predicate of Eq. (17) achieves the highest accuracy in most cases; however, the weighted sum of squares converges to it as we increase the regularization. Without understanding the underlying mechanism, we might instead choose the first moment of each feature as our predicates; although this uses more predicates, its accuracy is the worst among the four methods in the experiments.

Remark 9

The fact that some predicates can worsen the performance of the model raises concern for LUSI. Predicates should ideally have no negative impact, even if they fail to rule out all undesired functions.

6.2 Predicates for 2D images

Vapnik and Izmailov (2019) suggest that we can use heuristics to guide the model to look at the center of the image using predicates. Without loss of generality, we ran an experiment with the MNIST dataset, from which we kept only images labeled “8” or “9”. Since the difference between the numbers 8 and 9 is mostly in the bottom half, we can use the heuristic “focus on the bottom halves to tell the numbers apart”. Figure 2 shows some examples of images used in the experiment.

Fig. 2
figure 2

Examples of images of number “8”, and “9”. It is clear that we can focus on the lower halves of the images to distinguish them

Following the suggestion of Vapnik and Izmailov (2019), we create a predicate that calculates the sum of the pixel values in the region of interest; in this case, we take the sum of the pixel values in the bottom halves of the images. To verify whether our predicate achieves its purpose, we use a simple neural network (fully connected, with 1 hidden layer) and Layerwise Relevance Propagation (LRP) (Bach et al., 2015). LRP allows us to visualize where in the input image the model focuses. This, in turn, lets us know whether the predicate actually achieves its purpose, regardless of the performance of the model. Specifically, we use the predicate

$$\begin{aligned} \Phi (x) = \sum _{i \in \mathcal {I}} z_i x^{(i)}, \end{aligned}$$
(18)

where \(\mathcal {I}\) is the set of indices of all pixels in image x, \(x^{(i)}\) is the standardized pixel value of image x at index \(i \in \mathcal {I}\), and \(z_i = 1\) if \(x^{(i)}\) is in the lower half of image x, otherwise \(z_i = 0\).
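A minimal numpy reading of the predicate in Eq. (18) follows, under the assumption that “standardized” means per-image standardization; the toy image is illustrative.

```python
import numpy as np

def lower_half_predicate(img):
    """Eq. (18): sum of standardized pixel values over the lower half of an image."""
    h = img.shape[0]
    z = np.zeros_like(img)
    z[h // 2:, :] = 1.0                       # indicator of the lower half
    std = (img - img.mean()) / (img.std() + 1e-8)  # per-image standardization
    return float((z * std).sum())

img = np.zeros((28, 28))
img[20:24, 10:18] = 1.0                       # bright strokes in the lower half
value = lower_half_predicate(img)             # positive: mass in the lower half
```

A bright region in the lower half yields a positive predicate value, while a bright region confined to the upper half yields a negative one.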

For comparison, we also show that we can achieve the desired outcome by using dropout on the upper half of the images. By training with a dropout layer, we effectively force the model not to rely too much on the upper half of the image. To obtain a noticeable effect, we choose the dropout probability to be 0.6. Since we do not focus on the accuracy of the model in this experiment, choosing the best parameter is not important. Nevertheless, we note that using the dropout layer has only a minimal effect on prediction accuracy in this classification task.

Figure 3 shows that after using the dropout layer, the upper halves of the images show relatively lighter color, indicating a lesser relevance to the prediction. Furthermore, the lower halves also show different colors that correspond to the label, i.e. for the image of the number “9”, the lower halves have more red pixels, while there are more blue pixels for the image of the number “8”. This indicates that the model actually learns to solve the task and is leveraging some of the heuristics of “looking at the lower halves” via the dropout layer.

Fig. 3

LRP heatmap for sampled testing images. Due to dropout, most of the relevance comes from the lower halves. Furthermore, the lower halves of the number “9” tend to be red, indicating positive contributions, while the lower halves of the number “8” tend to be blue, indicating negative contributions (Color figure online)

In contrast, as shown in Fig. 4, when we use the predicate Eq. (18), there is no distinguishable pattern on the heatmap, which means that the predicate does not guide the model to make better predictions.

Fig. 4

LRP heatmap for sampled testing images for model trained using predicates. There is no distinguishable pattern on the heatmap, meaning the predicate fails to guide the model to make better predictions (Color figure online)

6.3 Predicates as incremented kernels

As illustrated in Sect. 3.2, the closed-form LUSI solution is equivalent to KRR with an incremented kernel. In this experiment, we demonstrate that we can potentially construct effective predicates by exploiting this fact, and we illustrate the limitations of this approach.

Recall that the increment to the kernel is

$$\begin{aligned} -\frac{\eta }{\gamma ^2}\Psi (I_m + \frac{1}{\gamma }\Psi ^T \Psi )^{-1} \Psi ^T&= - \frac{\eta }{\gamma ^2} \Psi (I_m + \frac{1}{\gamma }V\Lambda V^T)^{-1} \Psi ^T\\&= - \frac{\eta }{\gamma ^2} \Psi V (I_m + \frac{1}{\gamma }\Lambda )^{-1} (\Psi V)^T\\&= - \frac{\eta }{\gamma ^2} \sum _i \mu _i V^{(i)} V^{(i)T}\\&= - \frac{\eta }{\gamma ^2} V \Pi V^T. \end{aligned}$$

where \(V^{(i)}\) is the ith column of the matrix V, which is the eigenmatrix of both \(\mathcal {P}\) and \(\Psi\), \(\mu _i = \lambda _i / (1+ \frac{1}{\gamma }\lambda _i)\) is the ith diagonal entry of the diagonal matrix \(\Pi\), and \(\lambda _i\) (the ith diagonal entry of \(\Lambda\)) is the corresponding eigenvalue of \(\mathcal {P}\) (\(\mathcal {P}= \Psi \Psi ^T\) by construction). The last equality above gives an opportunity to construct effective predicates. Since kernels are additive, we can construct a more effective kernel using an increment; this has been used in the literature, e.g. Chin and Suter (2007). We can approximate the incremented kernel using predicates, as suggested earlier. Suppose we can eigen-decompose a kernel increment

$$\begin{aligned} K^{Inc}&= \frac{\eta }{\gamma ^2\beta } GausKernel(x^{a}, x^{b})\\&= \frac{\eta }{\gamma ^2\beta } V \tilde{\Pi } V^T\\&= \frac{\eta }{\gamma ^2} V \Pi V^T\\ \end{aligned}$$

where \(\tilde{\Pi } = \beta \Pi\) is the diagonal matrix containing the eigenvalues of the kernel matrix \(GausKernel(x^{a}, x^{b})\), and \(\beta\) is a constant that ensures that all \(\lambda _i\) (the eigenvalues of the predicate matrix \(\mathcal {P}\)) are positive. Thus, the predicate matrix is

$$\begin{aligned} \mathcal {P}= V\Lambda V^T, \end{aligned}$$

where \(\Lambda\) is a diagonal matrix whose diagonal entries are \(\lambda _i = \mu _i/(1 - \mu _i /\gamma )\), and the incremented kernel is \(K^\prime = K - K^{Inc}\) (the minus sign is due to Eq. (10)). We can also keep only the m largest values among the diagonal entries, so that \(\mathcal {P}\) is effectively the predicate matrix constructed from m predicates.
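This construction can be checked numerically. The sketch below (numpy, with illustrative names and a synthetic feature matrix) eigendecomposes a Gaussian kernel increment, rescales with \(\beta\) so that \(\mathcal P\) is positive semidefinite, and verifies that the recovered eigenvalues \(\lambda_i\) map back through \(\mu_i = \lambda_i/(1+\lambda_i/\gamma)\) to reproduce \(K^{Inc}\):

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))   # features reserved for the increment
eta, gamma = 1.0, 1.0

G = gaussian_kernel(X)
tilde_lam, V = np.linalg.eigh(G)   # G = V diag(tilde_lam) V^T

# beta rescales the eigenvalues so that every mu_i = tilde_lam_i / beta
# stays below gamma, which keeps lambda_i = mu_i / (1 - mu_i / gamma) >= 0,
# i.e. the recovered predicate matrix P is positive semidefinite.
beta = 2.0 * tilde_lam.max() / gamma
mu = tilde_lam / beta
lam = mu / (1.0 - mu / gamma)
P = V @ np.diag(lam) @ V.T          # recovered predicate matrix

# Map lambda back to mu and rebuild the increment in its spectral form.
mu_back = lam / (1.0 + lam / gamma)
K_inc = (eta / gamma**2) * V @ np.diag(mu_back) @ V.T
```

Note how large \(\beta\) must be relative to the kernel's spectrum: this is exactly the shrinkage that limits the increment's effect, as discussed in the remark below.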

Remark 10

While we can reverse engineer the kernel increment to calculate the predicates above, the specific form the increment must take makes this approach rather limited. Since \(\mathcal {P}\) has to be positive semidefinite, we have to adjust \(\beta\) so that \(\lambda _i \ge 0\). Doing so leads to kernel increments with very small values, which diminishes the effect of the incremented kernel. On the other hand, \(\eta\) cannot be large without increasing the regularization, and \(\gamma\) cannot be too small, otherwise \(\lambda _i\) will be negative. This means that we cannot compensate for the small magnitude with \(\eta\) or \(\gamma\). Hence, these restrictions leave valid increments with only a very limited impact on the modeling process. That is, it is not hard to find an increment for the kernel that helps the model perform better, but it is hard to find one that can be expressed within the LUSI formulation. See also Remark 6.

We demonstrate the effects of the predicates constructed through an incremented kernel, especially the limitations mentioned in Remark 10, with several datasets. Specifically, we show that the same kernel increment can effectively improve the performance of the model, while restricting the increment to admit a LUSI formulation almost removes its effect. Incidentally, this largely moots the kernel mismatch problem, since the LUSI formulation excludes most, if not all, significant increments.

In this experiment, we compare 4 different models: (1) baseline (kernel ridge regression); (2) LUSI-SVM; (3) LUSI-SVM with adjusted kernel (LUSIAdj) during predictions (see Theorem 3); (4) incremented kernel KRR without restrictions (the incremented kernel method with LUSI restrictions is equivalent to model (3) by construction). Specifically, we do the following to remove the LUSI restrictions in model (4): (1) change the sign to “+” (see Remark 6); (2) remove the scaling factor \(1/\beta\) that shrinks the kernel matrix. For simplicity of formulation, we focus on binary classification tasks. We follow a very simple strategy to construct our models: we use only part of the features in the models and construct a kernel increment from the remaining unused features. We randomly sample 1200 observations from each dataset to save computational time, since calculating the predicates can be computationally expensive.
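A minimal sketch of the baseline (model (1)) versus the unrestricted incremented kernel KRR (model (4)), on synthetic data with illustrative names; the actual experiments use the real datasets and the LUSI models as described:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_fit_predict(K_tr, K_te, y_tr, gamma=1.0):
    # Kernel ridge regression: alpha = (K + gamma I)^{-1} y
    alpha = np.linalg.solve(K_tr + gamma * np.eye(len(y_tr)), y_tr)
    return K_te @ alpha

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
y = np.sign(X[:, 2] + X[:, 3])       # labels driven by the held-out features
tr, te = slice(0, 150), slice(150, 200)
used, held = X[:, :2], X[:, 2:]      # model features vs. increment features

K_tr = gaussian_kernel(used[tr], used[tr])
K_te = gaussian_kernel(used[te], used[tr])
Kinc_tr = gaussian_kernel(held[tr], held[tr])
Kinc_te = gaussian_kernel(held[te], held[tr])

base = krr_fit_predict(K_tr, K_te, y[tr])
# Unrestricted increment: "+" sign, no 1/beta shrinkage
incr = krr_fit_predict(K_tr + Kinc_tr, K_te + Kinc_te, y[tr])

acc = lambda p: (np.sign(p) == y[te]).mean()
```

Because the labels here depend only on the held-out features, the increment carries the label signal and the unrestricted model benefits, while the baseline cannot.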

Figure 5 shows the test accuracy on the MAGIC (Bock, 2004) dataset, where we use only 2 features in the models and use the rest as predicates or kernel increments. Note that the incremented kernel method enjoys a significant performance gain from the increment, while both LUSI and LUSIAdj show no significant difference from the baseline, despite being based on the same kernel matrix increment.

Fig. 5

Test accuracy comparison for 4 different models on the MAGIC dataset. Simply removing the restriction on the kernel matrix enables significant improvements, while LUSIAdj has no significant effect due to the scaling effect of \(\eta /(\beta \gamma ^2)\)

We also carried out the experiment with other datasets (Lohweg, 2012; Street, 1993) and collect the results in Table 1. We notice that, for all datasets that we tested, only the incremented kernel method is able to consistently and significantly outperform the baseline. Although the LUSI formulation allows us to approximate the incremented kernels, its restrictions essentially remove their effects. Interestingly, LUSI with adjusted prediction seems to perform slightly worse than LUSI. We speculate that this is related to the minus sign of the increments, which potentially hurts the positive definiteness of the kernel matrix. We leave further investigation of these subtle effects to future work.

Table 1 Statistical testing results

7 Discussion: the design goal of statistical invariants

Throughout the paper, we have shown that while LUSI can be useful and effective in some scenarios, it fails to achieve its design goal of intelligence-driven learning. The motivation of LUSI is not to propose a learning algorithm that improves model performance by some moderate amount, but to inject knowledge into the learning process so that we can reach our learning objective with less data. From what we have seen thus far, the reasons why the LUSI method fails can be illustrated as follows. Looking at the statistical invariants

$$\begin{aligned} \sum _i \Phi (x_i) f(x_i) = \sum _i \Phi (x_i) y_i, \end{aligned}$$

it is clear that any solution that is consistent with the data will automatically satisfy the predicate condition. The predicates can only impose restrictions via the output of the model, making them play the same role as the labels (arguably a weaker one).
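This triviality is easy to see numerically: any model that fits the training labels exactly satisfies the invariant for every predicate, informative or not. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=100).astype(float)  # training labels
f_of_x = y.copy()        # any model consistent with the data: f(x_i) = y_i

# The invariant sum_i Phi(x_i) f(x_i) = sum_i Phi(x_i) y_i holds for
# *every* choice of predicate values Phi(x_i).
for _ in range(1000):
    phi = rng.standard_normal(100)               # arbitrary predicate
    assert np.isclose((phi * f_of_x).sum(), (phi * y).sum())
```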

Furthermore, in the case of KRR, we have illustrated that the LUSI method is equivalent to a slight modification of the kernel. However, modifying the kernel does not fundamentally change the number of admissible solutions to the problem. This contradicts the design goal of LUSI, which is to shrink the set of admissible functions. In addition, LUSI also suffers from the “kernel mismatch” problem and from the strong constraints that prevent kernel modifications from being effective, as we demonstrate in Sect. 6.3.

Lastly, we have shown that even in the cases where the LUSI method works, it is equivalent to an existing method with more convoluted steps, and the reason these models work also differs from the stated intent. Moreover, as we show in Sect. 6.3, even though there is potential to exploit the incremented kernel structure, the restrictions make it extremely hard to find effective predicates.

Philosophically, in an intelligence-driven learning paradigm, having more knowledge should not hurt the performance of the model. However, it is common for a LUSI method to perform worse with less-than-ideal predicates, even when a performance-improving predicate exists (see also Sect. 1). This further suggests the failure of LUSI to achieve its design goal. Future research should focus on designing a feasible method that can effectively incorporate predicates as parts of a model, in addition to controlling the training loss during training.

8 Conclusion

In our view, the fatal flaw of LUSI is not that the method cannot work in any context, but rather that it can only work through convoluted reverse engineering and with limited success (as seen in Sect. 6.1, and the fairly challenging case in Sect. 6.3). The premise of LUSI is to use our domain knowledge to reduce the need for data. Unfortunately, the current formulation of LUSI cannot rule out any solution that is consistent with the data, regardless of the amount of knowledge we have. This goes against the philosophy of LUSI and defeats its purpose.