License: CC BY 4.0
arXiv:2505.11199v3 [cs.CL] 02 Mar 2026

The Counting Power of Transformers

Marco Sälzer
RPTU Kaiserslautern-Landau
Kaiserslautern, Germany
marco.saelzer@rptu.de
Chris Köcher
MPI-SWS
Kaiserslautern, Germany
ckoecher@mpi-sws.org
Alexander Kozachinskiy
Centro Nacional de Inteligencia Artificial
Santiago, Chile
alexander.kozachinskyi@cenia.cl
Georg Zetzsche
MPI-SWS
Kaiserslautern, Germany
georg@mpi-sws.org
Anthony Widjaja Lin
MPI-SWS and RPTU Kaiserslautern-Landau
Kaiserslautern, Germany
awlin@mpi-sws.org
Abstract

Counting properties (e.g., determining whether certain tokens occur more often than other tokens in a given input text) have played a significant role in the study of the expressiveness of transformers. In this paper, we provide a formal framework for investigating the counting power of transformers. We argue that all existing results demonstrate transformers’ expressivity only for (semi-)linear counting properties, i.e., those expressible as a boolean combination of linear inequalities. Our main result is that transformers can express counting properties that are highly nonlinear. More precisely, we prove that transformers can capture all semialgebraic counting properties, i.e., those expressible as a boolean combination of inequalities between arbitrary multivariate polynomials (of any degree). Among others, these generalize the counting properties that can be captured by C-RASP softmax transformers, which capture only linear counting properties.

To complement this result, we exhibit a natural subclass of (softmax) transformers that completely characterizes semialgebraic counting properties. Through a connection with Hilbert’s tenth problem, this expressivity of transformers also yields a new undecidability result for analyzing an extremely simple transformer model, surprisingly one with neither positional encodings (i.e., NoPE-transformers) nor masking. We also experimentally validate the trainability of such counting properties.

1 Introduction

Transformers (Vaswani et al., 2017) have emerged in recent years as a powerful model with a plethora of successful applications including (among others) natural language processing, computer vision, and speech recognition. Despite the success of transformers, the question of what transformers can express is still not well understood and has in recent years featured in a rich body of research (e.g. Strobl et al. (2024); Hahn (2020); Pérez et al. (2021); Hao et al. (2022)). In particular, formal language theory provides a formal framework for understanding expressivity issues for sequential models like transformers and Recurrent Neural Networks (RNNs).

One recurring theme when studying the expressibility of transformers is their counting power. Intuitively, counting amounts to asserting an arithmetic relationship between the numbers of occurrences of various tokens in a given text. Counting properties are essentially the class of properties for textual data under consideration in the well-known Vector Space Model (VSM) (cf. Salton et al. (1975); Wong et al. (1985); Shawe-Taylor and Cristianini (2004)), or the similar Bag-of-Words (BoW) model (Harris, 1954), which are known from the information retrieval community to be surprisingly powerful in measuring text similarity (e.g. see Shahmirzadi et al. (2019); Shawe-Taylor and Cristianini (2004)). A simple example of a counting property can be found in a sentiment analysis application (see https://medium.com/data-science/sentiment-analysis-with-text-mining-13dd2b33de27): the number of positive words exceeds the number of negative words in a text. In formal language theory, such a counting property can be formalized as the following language

$$\mathsf{MAJ}:=\{w\in\{a,b\}^{*}:|w|_{a}>|w|_{b}\}, \tag{1}$$

which is often referred to as majority. Here, $|w|_{a}$ (resp. $|w|_{b}$) refers to the number of occurrences of $a$ (resp. $b$) in the string $w$. For example, $\mathtt{a}\mathtt{a}\mathtt{b}\in\mathsf{MAJ}$ but $\mathtt{a}\mathtt{b}\mathtt{b}\notin\mathsf{MAJ}$. Note that “tokens” in NLP are synonymous with “letters” in formal language theory. Another counting property that plays an important role in the theory of expressibility of transformers is the parity language:

$$\mathsf{PARITY}:=\{w\in\{a,b\}^{*}:a\text{ occurs an even number of times in }w\}. \tag{2}$$
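As an illustration only, the two languages (1) and (2) can be written as membership tests on Python strings. The function names `in_maj` and `in_parity` are hypothetical and not part of the paper; the sketch just restates the definitions.

```python
def in_maj(w: str) -> bool:
    """w is in MAJ iff 'a' occurs strictly more often than 'b'."""
    return w.count("a") > w.count("b")


def in_parity(w: str) -> bool:
    """w is in PARITY iff 'a' occurs an even number of times."""
    return w.count("a") % 2 == 0


print(in_maj("aab"), in_maj("abb"))        # True False
print(in_parity("abab"), in_parity("ab"))  # True False
```

Both tests depend only on letter counts, so reordering the input never changes the answer.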

Multiple theoretical and empirical results (e.g. Hahn and Rofin (2024); Chiang and Cholak (2022); Huang et al. (2025); Hahn (2020); Hao et al. (2022); Bhattamishra et al. (2020); Anil et al. (2022); Delétang et al. (2023)) have shown that, while transformers can be efficiently trained for $\mathsf{MAJ}$, this is not the case for $\mathsf{PARITY}$. Several theoretical explanations have been offered, e.g., sensitivity by Hahn and Rofin (2024) and length generalization admitted by limit transformers by Huang et al. (2025).

Thus far, existing results have touched only upon semilinear counting properties. For example, defining $\mathsf{MAJ}$ requires only a linear inequality (i.e. $|w|_{a}>|w|_{b}$). In fact, logical languages, which were devised by Barceló et al. (2024); Yang and Chiang (2024); Huang et al. (2025) to epitomize languages expressible by transformers, permit only linear expressions (e.g. $|w|_{a}+|w|_{b}>2\cdot|w|_{c}$). However, polynomial expressions (cf. Shawe-Taylor and Cristianini (2004)) are also used to express co-occurrence of terms/tokens in a text. For example, using a higher-degree monomial such as

$$\#(\text{nvidia})\cdot\#(\text{intel})\cdot\#(\text{deal}),$$

where $\#(w)$ counts the number of occurrences of a word $w$ in the text, one can emphasize the co-occurrence of “nvidia”, “intel” and “deal” in a text. This motivates the following question:

Research Question.

What counting properties are expressible on transformers? Can they express nonlinear counting properties?

The main contribution of this paper is the following result.

Theorem 1.1.

Transformers can capture all semialgebraic counting properties, i.e., those expressible as a boolean combination of inequalities between multivariate polynomials, where each variable counts the number of occurrences of a specific token in the text.

This means that transformers can capture expressions involving higher-degree polynomials like $7\#(\text{nvidia})\cdot\#(\text{intel})\cdot\#(\text{deal})+2\#(\text{shares})-8\#(\text{war})>10$, or boolean combinations (i.e. unions/intersections) of similar polynomial expressions. Consequently, by the Weierstrass theorem, polynomials can also approximate any continuous function of the numbers of occurrences of tokens on any bounded range. We prove this theorem using softmax transformers that require neither positional encodings nor positional masking, and we experimentally validate this claim.
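To make the example inequality concrete, the following sketch evaluates it directly on a text. It assumes a lower-cased whitespace tokenization, which is a simplification chosen here for illustration; the helper names are hypothetical, and the code evaluates the polynomial itself rather than simulating a transformer.

```python
from collections import Counter


def count_tokens(text: str) -> Counter:
    """Assumed tokenization: lower-case the text and split on whitespace."""
    return Counter(text.lower().split())


def satisfies_example(text: str) -> bool:
    """Check 7*#(nvidia)*#(intel)*#(deal) + 2*#(shares) - 8*#(war) > 10."""
    c = count_tokens(text)  # missing tokens count as 0
    value = 7 * c["nvidia"] * c["intel"] * c["deal"] + 2 * c["shares"] - 8 * c["war"]
    return value > 10


print(satisfies_example("nvidia intel deal deal"))  # True  (7*1*1*2 = 14)
print(satisfies_example("nvidia intel deal"))       # False (7 is not > 10)
```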

Our next question concerns the expressivity of softmax transformers for capturing counting properties: which class of softmax transformers captures semialgebraic counting properties? To this end, we provide a surprising characterization involving average hard attention (Hao et al., 2022; Pérez et al., 2021), which was devised to “approximate” soft attention by attending to all positions with maximum attention score and forwarding their average. In particular, Average Hard Attention Transformers (AHATs) with only uniform layers (written AHAT[U]), that is, where the maximum attention score is achieved at every position, immediately form a subclass of SoftMax Attention Transformers (SMAT). In the sequel, we write $\mathsf{NoPE\text{-}AHAT}$ (resp. $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$) to mean AHAT (resp. AHAT[U]) that use neither Positional Encodings (PEs) nor positional masking.

Theorem 1.2.

$\mathsf{NoPE\text{-}AHAT}$ and $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ capture precisely the semialgebraic counting properties. In particular, as far as expressing counting properties is concerned, $\mathsf{NoPE\text{-}AHAT}$ is a subset of $\mathsf{SMAT}$.

This is surprising, since it is still a major open problem whether AHAT are captured by SMAT (Yang and Chiang, 2024; Hahn, 2020; Yang et al., 2024b) for general (not necessarily counting) properties.

A corollary of Theorem 1.1, combined with Matiyasevich’s celebrated solution to Hilbert’s 10th Problem (Matiyasevich, 1993), is a kind of universality (i.e. Turing-completeness) of transformers. More precisely, any recursively enumerable counting property $P\subseteq\Sigma^{*}$ can be represented in terms of a program that, given an input string $w\in\Sigma^{*}$, feeds each string $wv$ (where $v\in\Gamma^{*}$, for some alphabet $\Gamma$ with $\Gamma\cap\Sigma=\emptyset$) into a transformer $T$ and accepts if $T$ accepts some $wv$. In this case, we say that $P$ is a projection of the language accepted by $T$. In fact, we show that transformers $T$ with only two attention layers are sufficient to achieve this result:

Theorem 1.3.

Every recursively enumerable counting property is a projection of a language recognized by a $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$, and thus by an $\mathsf{SMAT}$. Here, two attention layers in $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ and $\mathsf{SMAT}$ are sufficient.

Similarly, our results yield an undecidability result for analyzing an extremely simple transformer model—surprisingly with neither positional encodings nor masking:

Theorem 1.4.

Given a $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ or $\mathsf{SMAT}$ (with just two attention layers), it is undecidable whether its language is empty.

Recent results (cf. Sälzer et al., 2025) require a substantially more complex architecture to achieve such an undecidability result, i.e., one with powerful positional encodings and average hard attention.

Finally, how do general transformers compare with other machine learning models as far as capturing counting properties is concerned? To this end, let us discuss two models. The first is the class of polynomial separators, which are obtained by mapping to a higher dimension and looking for a linear separator in this higher dimension. This is a standard technique in the classical machine learning literature, where one can apply techniques like Support Vector Machines (SVM) (e.g. using a polynomial kernel) in the Vector Space Model (VSM) (Salton et al. (1975); Wong et al. (1985); also see Chapter 10 of Shawe-Taylor and Cristianini (2004)). Our result shows that transformers generalize such counting properties: not only can polynomial counting properties be captured, but also boolean combinations thereof. The second is the model called C-RASP (Huang et al., 2025), which is a simple declarative language that formalizes the so-called RASP-L conjecture (Zhou et al., 2024) capturing “efficiently learnable” properties on transformers. In particular, C-RASP allows only linear counting terms. We prove that C-RASP can capture only linear counting properties. Since our experiments supporting Theorem 1.1 reveal that counting properties like $L_{k}:=\{w\in\{a,b\}^{+}:|w|_{a}^{k}\geq|w|_{b}\}$ are also efficiently learnable for $k\geq 2$, it follows that C-RASP is only a partial characterization of efficiently learnable properties.

[Figure: inclusion diagram relating the classes $\mathsf{NoPE\text{-}AHAT}[\leq 1]$, $\mathsf{QFPA}$, $\mathsf{NoPE\text{-}AHAT}$, $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$, $\mathsf{SemiAlg}$, $\mathsf{AHAT}[\mathsf{U}]$, $\mathsf{SMAT}$, and $\mathsf{AHAT}$ via $\subsetneq$ and $\subseteq$ edges, with edge labels Theorem B.2 and Theorem 1.2.]
Figure 1: Visualization of our results.

Organization.

We recall transformer models and define our framework for studying counting properties in Section 2. We then show how to capture semialgebraic counting properties using transformers in Section 3. In Section 4, we provide a natural subclass of softmax transformers that completely characterizes semialgebraic counting properties. In Section 5, we show applications of our semialgebraic results for a better understanding of the expressiveness of transformers, e.g., universality/undecidability and a comparison to work on C-RASP transformers. We report our experimental results in Section 6 and conclude in Section 7. Some details have been relegated to the Appendix.

2 Framework: Transformers and Counting Properties

Formal language theory primer

We assume some basic understanding of formal language theory (at the level of a standard undergraduate textbook by Sipser (2013)) and will only fix some notation.

Let $\Sigma=\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}$ be an alphabet. A language is a set of strings over $\Sigma$. We write $\Sigma^{*}$ (resp. $\Sigma^{+}$) to mean the set of all strings (resp. all nonempty strings) over $\Sigma$. We write $|w|$ to denote the length of $w$. For each $a\in\Sigma$, we write $|w|_{a}$ to mean the number of occurrences of $a$ in $w$. A language $K\subseteq\Sigma^{*}$ is a projection of a language $L\subseteq\Sigma^{*}$ if there is a subalphabet $\Gamma\subseteq\Sigma$ such that $K$ is obtained from $L$ by deleting all occurrences of letters in $\Gamma$ from words in $L$. For a class $\mathcal{C}$ of languages, we denote by $\mathsf{Proj}(\mathcal{C})$ the class of projections of languages in $\mathcal{C}$.

We will touch upon regular languages and recursively enumerable languages (see Sipser (2013) for details). In summary, regular languages are languages that can be described by regular expressions. Recursively enumerable languages are those that are recognized by (possibly nonterminating) Turing machines. The class of such languages is denoted $\mathsf{RE}$. In particular, a machine model is said to be Turing-complete if it can capture all recursively enumerable languages.

For an alphabet $\Sigma=\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}$, we define the Parikh image (a.k.a. Parikh map) as the function $\Psi\colon\Sigma^{*}\to\mathbb{N}^{m}$, where $\Psi(w)[i]:=|w|_{\mathtt{a}_{i}}$ is the number of $\mathtt{a}_{i}$’s in $w$. Intuitively, the Parikh image of a word $w$ provides the letter counts in $w$, e.g., over $\Sigma=\{\mathtt{a},\mathtt{b}\}$, we have $\Psi(abaa)=(3,1)$. The Parikh map can also be extended to a language $L$; that is, $\Psi(L)=\{\Psi(w):w\in L\}\subseteq\mathbb{N}^{|\Sigma|}$. For example, if $L=\{\mathtt{a}^{n}\mathtt{b}^{n}\mathtt{a}^{n}:n\geq 0\}$ is a language over $\Sigma=\{\mathtt{a},\mathtt{b}\}$, we have $\Psi(L)=\{(2n,n):n\geq 0\}$.
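The Parikh map can be sketched in a few lines. The function name `parikh` and the convention of passing the alphabet as an ordered string are illustrative choices made here, not notation from the paper.

```python
def parikh(w: str, alphabet: str) -> tuple:
    """Return the letter counts of w, in the order given by `alphabet`."""
    return tuple(w.count(a) for a in alphabet)


print(parikh("abaa", "ab"))  # (3, 1)

# Parikh image of a finite fragment of L = { a^n b^n a^n : n >= 0 }.
fragment = {parikh("a" * n + "b" * n + "a" * n, "ab") for n in range(4)}
print(sorted(fragment))      # [(0, 0), (2, 1), (4, 2), (6, 3)]
```

The second computation only inspects $n<4$, so it illustrates rather than proves that $\Psi(L)=\{(2n,n):n\geq 0\}$.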

2.1 Transformers

We now recall the formal definition of transformers. Loosely speaking, a transformer is a composition of finitely many attention layers, each converting a sequence $\sigma$ of $\mathbb{R}^{d}$-vectors into another sequence $\sigma^{\prime}$ of $\mathbb{R}^{k}$-vectors, for some $d$ and $k$. To turn a transformer $T$ into a language recognizer, we have to embed each letter of the finite alphabet $\Sigma$ as an $\mathbb{R}^{d}$-vector, where $d$ is smaller than the dimension of the first attention layer. For example, for $\Sigma=\{\mathtt{a},\mathtt{b},\mathtt{c}\}$, the one-hot embeddings of $\mathtt{a}$, $\mathtt{b}$, $\mathtt{c}$ are (respectively) $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. Finally, to determine acceptance, we simply run $T$ on the embedding of the input string $w$ as a sequence of vectors (possibly expanded with positional information) and check whether the last vector $\bm{v}$ satisfies that the dot product $\bm{v}\cdot\bm{t}$ is greater than $0$ (for some pre-defined vector $\bm{t}$ of weights). In particular, $w$ is accepted by $T$ iff $\bm{v}\cdot\bm{t}>0$.

Example.

Suppose we are given the input string $w=\mathtt{a}\mathtt{b}\mathtt{a}\mathtt{c}$. Additionally, suppose we use the positional embedding $p\colon n\mapsto 1/n$. Then, checking whether $T$ accepts $w$ amounts to running $T$ on the sequence $\sigma$:

$$(1,0,0,1)\;(0,1,0,1/2)\;(1,0,0,1/3)\;(0,0,1,1/4).$$

After running $T$ on $\sigma$, the resulting sequence is of the form $\bm{v}_{1},\bm{v}_{2},\bm{v}_{3},\bm{v}_{4}$. Determining whether $T$ accepts $w$ amounts to checking whether $\bm{t}\cdot\bm{v}_{1}>0$. For example, $\bm{v}_{1},\bm{v}_{2},\bm{v}_{3},\bm{v}_{4}$ could be:

$$(1,1,7,1,1)\;(2,3,1,10,1/2)\;(1,8,0,8,1/3)\;(0,0,1,-1,1/4)$$

which will be accepted whenever $\bm{t}=(1,0,0,1,0)$, since then $\bm{t}\cdot\bm{v}_{1}=1+1=2>0$.
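The input sequence and the dot-product test of this example can be reproduced with exact rational arithmetic. This is a minimal sketch: the names `embed` and `dot` are hypothetical, the transformer $T$ itself is not simulated, and $\bm{v}_{1}$ is simply copied from the example.

```python
from fractions import Fraction

ONE_HOT = {"a": (1, 0, 0), "b": (0, 1, 0), "c": (0, 0, 1)}


def embed(w: str) -> list:
    """One-hot letter embedding extended by the positional value 1/position."""
    return [ONE_HOT[ch] + (Fraction(1, pos),) for pos, ch in enumerate(w, start=1)]


def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


sigma = embed("abac")    # the four input vectors of the example
t = (1, 0, 0, 1, 0)      # acceptance weights from the example
v1 = (1, 1, 7, 1, 1)     # example output vector, taken from the text
print(dot(t, v1) > 0)    # True
```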

Next we formalize the definition of transformers by defining how each attention layer functions.

ReLU networks.

We first define ReLU networks, which are used inside an attention layer. A ReLU node $v$ is a function $\mathbb{Q}^{m}\to\mathbb{Q}$, where $m\in\mathbb{N}$ is referred to as the input dimension, and is defined as $v(x_{1},\dotsc,x_{m})=\max(0,b+\sum_{i=1}^{m}w_{i}x_{i})$, where $w_{i}\in\mathbb{Q}$ are the weights, and $b\in\mathbb{Q}$ is the bias. [In practice, GeLU and SwiGLU are also used instead of ReLU, which we do not consider in this paper.] A ReLU layer $\ell$ is a tuple of ReLU nodes $(v_{1},\dotsc,v_{n})$, all having the same input dimension $m$, computing a function $\mathbb{Q}^{m}\to\mathbb{Q}^{n}$, where $n\in\mathbb{N}$ is referred to as the output dimension. Finally, a ReLU network $\mathcal{N}$ is a tuple of ReLU layers $(\ell_{1},\dotsc,\ell_{k})$, such that the input dimension of $\ell_{i+1}$ is equal to the output dimension of $\ell_{i}$. It computes a function $\mathbb{Q}^{m_{1}}\to\mathbb{Q}^{n_{k}}$, given by $\mathcal{N}(x_{1},\dotsc,x_{m_{1}})=\ell_{k}(\dotsb\ell_{1}(x_{1},\dotsc,x_{m_{1}})\dotsb)$.
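The three definitions translate almost literally into code. The sketch below uses exact rationals; the representation of a node as a `(weights, bias)` pair and all function names are assumptions made for illustration.

```python
from fractions import Fraction


def relu_node(weights, bias, xs):
    """max(0, b + sum_i w_i * x_i) for one node."""
    return max(Fraction(0), bias + sum(w * x for w, x in zip(weights, xs)))


def relu_layer(nodes, xs):
    """A layer is a list of (weights, bias) pairs sharing the same input."""
    return [relu_node(w, b, xs) for w, b in nodes]


def relu_network(layers, xs):
    """Feed the output of each layer into the next one."""
    for nodes in layers:
        xs = relu_layer(nodes, xs)
    return xs


# Two-layer network computing |x| = ReLU(x) + ReLU(-x).
abs_net = [
    [([1], 0), ([-1], 0)],  # layer 1: ReLU(x) and ReLU(-x)
    [([1, 1], 0)],          # layer 2: sum of both (non-negative, so ReLU is harmless)
]
print(relu_network(abs_net, [Fraction(-3)]) == [3])  # True
```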

Attention layers

Each attention layer involves a weight normalizer $\texttt{wt}\colon\mathbb{R}^{*}\to\mathbb{R}^{*}$, which turns any sequence of $n$ weights into another sequence of $n$ weights. Two widely used weight normalizers are:

  1. The softmax normalizer $\mathrm{softmax}$. That is, given a sequence $\sigma=x_{1},\ldots,x_{n}\in\mathbb{R}$, define $\mathrm{softmax}(\sigma):=y_{1},\ldots,y_{n}$, where $y_{i}:=\frac{e^{x_{i}}}{\sum_{j=1}^{n}e^{x_{j}}}$.

  2. The averaging hard attention normalizer $\mathrm{aha}$. We define $\mathrm{aha}(\sigma):=y_{1},\ldots,y_{n}$, where

    $$y_{i}:=\begin{cases}1/|P|&\text{if }x_{i}=\max(\sigma),\\ 0&\text{otherwise,}\end{cases}$$

    where $P$ consists of the positions $i$ in $\sigma$ such that $x_{i}$ is maximum in $\sigma$. That is, $\mathrm{aha}$ behaves like $\mathrm{softmax}$ but maps all non-maximum weights to $0$, and all maximum weights to $1/|P|$.

One can also allow a temperature scaling $\tau>0$ for $\mathrm{softmax}$, i.e., $\mathrm{softmax}_{\tau}(\sigma)=y_{1},\ldots,y_{n}$ with $y_{i}:=\frac{e^{x_{i}/\tau}}{\sum_{j=1}^{n}e^{x_{j}/\tau}}$. This is not relevant in our paper since our proof works for any $\tau>0$.
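Both normalizers, including the temperature parameter, can be sketched as follows. Subtracting the maximum before exponentiating is a standard numerical-stability trick added here; it does not change the mathematical result. The function names are illustrative.

```python
import math
from fractions import Fraction


def softmax(xs, tau=1.0):
    """Temperature-scaled softmax: y_i = exp(x_i/tau) / sum_j exp(x_j/tau)."""
    m = max(xs)  # shift by the maximum for numerical stability only
    exps = [math.exp((x - m) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aha(xs):
    """Average hard attention: weight 1/|P| on maximal positions, 0 elsewhere."""
    m = max(xs)
    P = [i for i, x in enumerate(xs) if x == m]  # positions attaining the maximum
    return [Fraction(1, len(P)) if x == m else Fraction(0) for x in xs]


print(aha([1, 3, 3, 0]) == [0, Fraction(1, 2), Fraction(1, 2), 0])  # True
print(softmax([2.0, 2.0, 2.0, 2.0]))  # [0.25, 0.25, 0.25, 0.25]
```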

An attention layer is a function $\lambda\colon(\mathbb{R}^{d})^{*}\to(\mathbb{R}^{e})^{*}$, given by affine maps $Q,K\colon\mathbb{R}^{d}\to\mathbb{R}^{m}$, $V\colon\mathbb{R}^{d}\to\mathbb{R}^{k}$ (query, key, and value matrices) and a ReLU neural net $\mathcal{N}\colon\mathbb{Q}^{d+k}\to\mathbb{Q}^{e}$. Given an input sequence $x=(\bm{x}_{1},\ldots,\bm{x}_{n})\in(\mathbb{Q}^{d})^{n}$, the output sequence $y=(\bm{y}_{1},\ldots,\bm{y}_{n})\in(\mathbb{Q}^{e})^{n}$ is computed as follows. First, one computes the sequences of key, query, and value vectors: $\bm{k}_{i}=K\bm{x}_{i}$, $\bm{q}_{i}=Q\bm{x}_{i}$, $\bm{v}_{i}=V\bm{x}_{i}$, for each $i=1,\ldots,n$. Then we define $\bm{y}_{i}=\mathcal{N}(\bm{x}_{i},\bm{a}_{i})$, with $\bm{a}_{i}=\sum_{j=1}^{n}\bm{w}(j)\bm{v}_{j}$, where $\bm{w}=\texttt{wt}(\{\langle\bm{k}_{i},\bm{q}_{j}\rangle\}_{j=1}^{n})$.

We say that $\lambda$ is a softmax (resp. aha) layer if $\texttt{wt}=\mathrm{softmax}$ (resp. $\mathrm{aha}$). We say that it is a uniform-$\mathrm{aha}$ layer if it is an $\mathrm{aha}$ layer such that $K\bm{x}=Q\bm{x}=\bm{0}$ for all $\bm{x}$, and hence $\langle K\bm{x},Q\bm{y}\rangle=0$ for all $\bm{x}$ and $\bm{y}$. Note that a uniform-$\mathrm{aha}$ layer is both an $\mathrm{aha}$ layer and a $\mathrm{softmax}$ layer, since

$$\mathrm{softmax}(s_{1},\ldots,s_{n})=\mathrm{softmax}_{\tau}(s_{1},\ldots,s_{n})=\mathrm{aha}(s_{1},\ldots,s_{n})=[1/n,\ldots,1/n],$$

whenever $s_{1}=\cdots=s_{n}$, which is guaranteed for uniform-$\mathrm{aha}$ layers. This holds for all $\tau>0$.
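The coincidence of the two normalizers on uniform layers can be checked on a small input. The sketch below computes only the attention averages $\bm{a}_{i}$ and omits the ReLU network $\mathcal{N}$; maps are passed as Python callables, and all names are hypothetical.

```python
import math


def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aha(xs):
    best = max(xs)
    num_max = sum(1 for x in xs if x == best)
    return [1 / num_max if x == best else 0.0 for x in xs]


def attention_averages(xs, Q, K, V, wt):
    """a_i = sum_j w(j) * v_j with w = wt(<k_i, q_j> for j = 1..n), as in the text."""
    qs, ks, vs = [Q(x) for x in xs], [K(x) for x in xs], [V(x) for x in xs]
    n, out = len(xs), []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(ks[i], qs[j])) for j in range(n)]
        w = wt(scores)
        out.append([sum(w[j] * vs[j][c] for j in range(n)) for c in range(len(vs[0]))])
    return out


zero = lambda x: [0.0]        # K = Q = 0 makes every attention score equal
identity = lambda x: list(x)  # V copies the input vector
xs = [[1, 0], [1, 0], [0, 1], [0, 0]]
soft = attention_averages(xs, zero, zero, identity, softmax)
hard = attention_averages(xs, zero, zero, identity, aha)
print(soft == hard, soft[0])  # True [0.5, 0.25]
```

With all scores equal, every position receives the plain average of the value vectors under either normalizer.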

Remark.

Some papers (e.g. Yang et al. (2024a); Huang et al. (2025); Yang and Chiang (2024)) apply strict future masking, which means that attention is only applied to positions up to the current position ii. Our work does not apply masking.

Defining transformers.

To define a transformer and its language, we first extend the finite alphabet $\Sigma$ with an end marker $\mathdollar\notin\Sigma$. That is, $\Gamma:=\Sigma\cup\{\mathdollar\}$. A transformer with $\ell$ layers over a finite alphabet $\Sigma$ is then a function $T\colon\Sigma^{+}\to\{0,1\}$, given by: (i) the “input embedding” function $\iota\colon\Gamma\to\mathbb{Q}^{d_{1}}$, (ii) the positional encoding $p\colon\mathbb{N}^{2}\to\mathbb{R}^{d_{1}}$, and (iii) a sequence of layers $\lambda_{1}\colon(\mathbb{R}^{d_{1}})^{*}\to(\mathbb{R}^{d_{2}})^{*},\ldots,\lambda_{\ell}\colon(\mathbb{R}^{d_{\ell}})^{*}\to(\mathbb{R}^{d_{\ell+1}})^{*}$. Given an input word $w=a_{1}\cdots a_{n}\in\Sigma^{n}$, the output $T(w)$ is computed as follows. First, we set $\bm{x}_{1}=\iota(a_{1})+p(n+1,1),~\ldots,~\bm{x}_{n}=\iota(a_{n})+p(n+1,n),~\bm{x}_{n+1}=\iota(\mathdollar)+p(n+1,n+1)$. Then we compute $(\bm{y}_{1},\ldots,\bm{y}_{n+1})=\lambda_{\ell}(\lambda_{\ell-1}(\cdots\lambda_{1}(\bm{x}_{1},\ldots,\bm{x}_{n+1})\cdots))$, and we set $T(w)=1$ if and only if $\bm{y}_{n}[1]>0$, and $T(w)=0$ otherwise. The language $L(T)$ accepted by $T$ is defined as $\{w\in\Sigma^{*}\colon T(w)=1\}$. We say that $T$ has no positional encoding (NoPE) if the positional encoding is a constant function.

Remark.

Several studies (e.g., Merrill and Sabharwal (2023b); Sälzer et al. (2025); Li and Cotterell (2025)) consider the capabilities of transformers in the context of restricted precision, such as assuming computations are carried out with finite representation sizes. We do not focus on these aspects, but note that it is easy to see that our key results, such as Proposition 3.1, also apply under so-called log-precision assumptions (cf. Merrill and Sabharwal (2023b); also see Merrill and Sabharwal (2023a)) for rational numbers. This means that the binary representation size of a number $p/q\in\mathbb{Q}$ grows logarithmically with the length of the input.

A Softmax Attention Transformer is a transformer using only $\mathrm{softmax}$ layers whereas an AHA Transformer is a transformer using only $\mathrm{aha}$ layers. By $\mathsf{SMAT}$ we denote the class of all languages accepted by softmax attention transformers and by $\mathsf{AHAT}$ we denote the class of all languages accepted by AHA transformers. To each class of transformer languages we append “$[\mathsf{U}]$” to denote languages of transformers with only uniform layers, e.g. $\mathsf{AHAT}[\mathsf{U}]$. We prepend “$\mathsf{NoPE}$” to denote only languages of transformers with no positional encoding, e.g. $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$. Note that all transformer models we are considering in this paper have only one attention head.

2.2 Counting Properties

We now define a framework for studying the counting ability of transformers. Intuitively, our framework focuses on “counting properties”. As we shall see below, we can build many interesting formal languages with the help of purely counting properties.

Given a permutation $\pi\colon\{1,\ldots,n\}\to\{1,\ldots,n\}$ and a string $w=w_{1}\cdots w_{n}$ of length $n$, the string $\pi(w):=w_{\pi(1)}\cdots w_{\pi(n)}$ is obtained by permuting the letters in $w$ according to $\pi$.

Definition 2.1.

A counting property over the alphabet $\Sigma$ is a permutation-closed language $L$, i.e., for each $w\in\Sigma^{*}$, it is the case that $w\in L$ iff $\pi(w)\in L$ for each permutation $\pi$ over $\{1,\ldots,|w|\}$.

Examples of counting properties are $\mathsf{MAJ}$ and $\mathsf{PARITY}$ (see (1), (2)). We often identify a counting property $L$ with its set $\Psi(L)\subseteq\mathbb{N}^{|\Sigma|}$ of letter counts (i.e. its Parikh image). By $\mathsf{PI}$, we denote the class of counting properties over $\Sigma$. Counting properties are also called permutation-invariant or “proportion-invariant” languages, e.g., see Pérez et al. (2021); Barceló et al. (2024).
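Permutation closure can be tested by brute force on short words. The check below is only bounded evidence, not a proof; the helper `is_permutation_closed` and the two example predicates are hypothetical, and words are represented as tuples of letters.

```python
from itertools import permutations, product


def is_permutation_closed(pred, alphabet, max_len):
    """Check pred(w) == pred(pi(w)) for all words w up to length max_len."""
    for n in range(max_len + 1):
        for w in product(alphabet, repeat=n):
            if any(pred(w) != pred(p) for p in permutations(w)):
                return False
    return True


maj = lambda w: w.count("a") > w.count("b")  # MAJ: depends only on counts
a_then_b = lambda w: list(w) == sorted(w)    # a*b*: depends on letter order

print(is_permutation_closed(maj, "ab", 4))       # True
print(is_permutation_closed(a_then_b, "ab", 4))  # False ("ba" vs "ab")
```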

Why counting properties?

Certainly, many languages of interest have both a “counting component” and an “order component”. Take, for example, the language $L_{1}=\{\mathtt{a}^{n}\mathtt{b}^{n}\mathtt{c}^{n}:n\geq 0\}$. Our framework focuses on purely counting properties for several reasons. Firstly, it abstracts away non-counting components that cannot be captured by the model. Secondly, many formal languages $L$ of interest can be constructed by taking the intersection of a counting property $P$ and an order (and counting-insensitive) language $L^{\prime}$. For example, $L_{1}$ above can be written as $P\cap L^{\prime}$, where $P=\{w\in\Sigma^{*}:|w|_{\mathtt{a}}=|w|_{\mathtt{b}}=|w|_{\mathtt{c}}\}$ and $L^{\prime}=\mathtt{a}^{*}\mathtt{b}^{*}\mathtt{c}^{*}$. Finally, multiple key languages in the literature on the expressivity of transformers are in fact counting properties (e.g. $\mathsf{MAJ}$ and $\mathsf{PARITY}$).
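The decomposition $L_{1}=P\cap L^{\prime}$ can be sketched with a count test and a regular expression. The function names are illustrative, and the final comparison only covers words up to length 6.

```python
import re
from itertools import product


def in_P(w: str) -> bool:
    """Counting component: equally many a's, b's and c's."""
    return w.count("a") == w.count("b") == w.count("c")


def in_L_prime(w: str) -> bool:
    """Order component: the regular language a*b*c*."""
    return re.fullmatch(r"a*b*c*", w) is not None


def in_L1(w: str) -> bool:
    """L1 = { a^n b^n c^n } expressed as the intersection of P and L'."""
    return in_P(w) and in_L_prime(w)


def in_L1_direct(w: str) -> bool:
    n = len(w) // 3
    return w == "a" * n + "b" * n + "c" * n


words = ("".join(t) for k in range(7) for t in product("abc", repeat=k))
print(all(in_L1(w) == in_L1_direct(w) for w in words))  # True
```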

3 Capturing Semialgebraic Counting Properties

A subset $S\subseteq\mathbb{N}^{m}$ is semi-algebraic if it is a Boolean combination of sets of the form $S_{p}=\{\bm{x}\in\mathbb{N}^{m}\mid p(\bm{x})>0\}$ for some polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$. A language $L\subseteq\Sigma^{*}$ is semi-algebraic if there is a semi-algebraic set $S\subseteq\mathbb{N}^{m}$ and $\Sigma=\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}$ such that $L=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}^{*}\mid\Psi(w)\in S\}$. Let $\mathsf{SemiAlg}$ denote the class of semi-algebraic languages. An example is

$$\mathsf{SQRT}=\{w\in\{\mathtt{a},\mathtt{b}\}^{*}\mid|w|_{\mathtt{a}}<|w|/\sqrt{2}\}, \tag{3}$$

since $|w|_{\mathtt{a}}<|w|/\sqrt{2}$ if and only if $2|w|_{\mathtt{a}}^{2}<|w|^{2}$. Likewise, extending the coefficients of our polynomials to rational numbers does not increase the expressiveness of semialgebraic sets, e.g., $\tfrac{7}{3}xy+y^{2}>8x-3$ can be rewritten as $7xy+3y^{2}>24x-9$. Note that for every $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$, the set $\{\bm{x}\in\mathbb{N}^{m}\mid p(\bm{x})=0\}$ is semi-algebraic, because $p(\bm{x})$ is an integer and hence $p(\bm{x})=0$ if and only if $-p(\bm{x})^{2}+1>0$. Thus, every solution set to polynomial equations is also semi-algebraic.

We now show Theorem 1.1. Since $\mathsf{AHAT}[\mathsf{U}]\subseteq\mathsf{SMAT}$, it suffices to construct an $\mathsf{AHAT}[\mathsf{U}]$. We will even construct a $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$. The key ingredient is:

Proposition 3.1.

For every polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$, the language $L_{p>0}=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}^{*}\mid p(\Psi(w))>0\}$ belongs to $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$. Thus, $L_{p>0}$ is in $\mathsf{SMAT}$.

Let us see why Proposition 3.1 implies $\mathsf{SemiAlg}\subseteq\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$. First, the complement of each language $L_{p>0}$ can be obtained, because $p(\bm{x})>0$ is violated if and only if $-p(\bm{x})+1>0$. Moreover, $\mathsf{NoPE\text{-}AHAT}$ is closed under union and intersection (we prove a stronger fact in Section A.2). We can thus accept all Boolean combinations of languages of the form $L_{p>0}$, and hence $\mathsf{SemiAlg}$.

To show Proposition 3.1, we will use polynomials that are homogeneous, meaning all monomials have the same degree. Note that given an arbitrary polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$ of degree $d$, we can consider the polynomial $q\in\mathbb{Z}[X_{0},\ldots,X_{m}]$ with $q=X_{0}^{d}\,p(\tfrac{X_{1}}{X_{0}},\ldots,\tfrac{X_{m}}{X_{0}})$, which is homogeneous. It has the property that $p(x_{1},\ldots,x_{m})>0$ if and only if $q(1,x_{1},\ldots,x_{m})>0$. Therefore, from now on, we assume that we have a homogeneous polynomial $q\in\mathbb{Z}[X_{0},\ldots,X_{m}]$ and want to construct an AHAT[U] for the language $K_{q}=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}^{*}\mid q(1,\bm{x})>0\text{ for }\bm{x}=\Psi(w)\}$.

To simplify notation, we denote the end marker by $\mathtt{a}_{0}$. Thus, the input will be a string $w\in\{\mathtt{a}_{0},\ldots,\mathtt{a}_{m}\}^{+}$ that contains $\mathtt{a}_{0}$ exactly once, at the end. Since $|w|_{\mathtt{a}_{0}}=1$ is satisfied automatically, our AHAT[U] only has to check that $q(x_{0},\ldots,x_{m})>0$, where $x_{i}=|w|_{\mathtt{a}_{i}}$. The input encoding is the map $\{\mathtt{a}_{0},\ldots,\mathtt{a}_{m}\}\to\mathbb{Q}^{m+1}$ with $\mathtt{a}_{i}\mapsto\bm{e}_{i}$, where $\bm{e}_{i}\in\mathbb{Q}^{m+1}$ is the $i$-th unit vector (with indices starting at $0$).
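The homogenization step can be sketched on a sparse polynomial representation. Here a polynomial is a dictionary from exponent tuples to integer coefficients, which is a representation chosen for this illustration; the example polynomial is $p=7xy+3y^{2}-24x+9$, so that $p>0$ is the inequality $7xy+3y^{2}>24x-9$.

```python
def homogenize(p: dict) -> dict:
    """Map {(e1..em): c} to {(e0, e1..em): c} so that every monomial has degree d."""
    d = max(sum(exps) for exps in p)  # degree of p (p assumed non-empty)
    return {(d - sum(exps),) + exps: c for exps, c in p.items()}


def evaluate(poly: dict, point) -> int:
    """Evaluate a sparse polynomial at an integer point."""
    total = 0
    for exps, c in poly.items():
        term = c
        for x, e in zip(point, exps):
            term *= x ** e
        total += term
    return total


p = {(1, 1): 7, (0, 2): 3, (1, 0): -24, (0, 0): 9}  # 7xy + 3y^2 - 24x + 9
q = homogenize(p)                                   # 7xy + 3y^2 - 24*x0*x + 9*x0^2
print(all(sum(exps) == 2 for exps in q))            # True: q is homogeneous
print(all(evaluate(q, (1, x, y)) == evaluate(p, (x, y))
          for x in range(6) for y in range(6)))     # True: q(1, x, y) = p(x, y)
```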

Overall idea

Roughly speaking, we implement multiplication via averaging as follows. For each letter $\mathtt{a}_{i}$, we have a gadget that can multiply an existing entry $y\in[0,1]$ (in each vector) by $\tfrac{x_{i}}{n+1}$ (recall that $n$ is the overall word length). This is done by first multiplying the existing entries either (i) by $1$ if the current letter is $\mathtt{a}_{i}$ or (ii) by $0$ if the current letter is not $\mathtt{a}_{i}$. This is achieved using a ReLU layer, by observing that for $u\in[0,1]$ and $v\in\{0,1\}$, we have $u\cdot v=\operatorname{ReLU}(u-(1-v))$. After this, we average over the entire input in this component. Since we make sure that all the entries we multiplied with $0$ or $1$ had the same value $y\in[0,1]$, taking the average will result in the value $\tfrac{y\cdot x_{i}}{n+1}$. Repeating this for a monomial $x_{i_{1}}\cdots x_{i_{d}}$, we arrive at the value $\tfrac{x_{i_{1}}\cdots x_{i_{d}}}{(n+1)^{d}}$. Since our homogenization step ensured that all our monomials have the same degree $d$, adding up the entries corresponding to the monomials, weighted by their coefficients, will yield $\tfrac{p(\Psi(w))}{(n+1)^{d}}$. Finally, the latter quantity is positive if and only if $p(\Psi(w))>0$.

Step I: Compute frequencies

Our AHAT[U] first uses an attention layer to compute $m+1$ new components, where the $i$-th component holds $\tfrac{x_{i}}{n+1}$ and $n+1$ is the length of the input (including the end marker). This is easily done by attending to all positions and computing the averages of the first $m+1$ components. To simplify notation, we will index vectors starting with index $0$.
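Step I amounts to averaging one-hot vectors over all positions. The sketch below mimics this with exact rationals; representing letters by their indices $1,\ldots,m$ and the end marker by $0$ is a convention adopted here for illustration.

```python
from fractions import Fraction


def one_hot(i: int, size: int) -> list:
    """Unit vector e_i of the given size, indexed from 0."""
    return [Fraction(int(k == i)) for k in range(size)]


def frequencies(letters, m: int) -> list:
    """Uniform attention over the encoded input: component i holds x_i / (n+1)."""
    seq = [one_hot(i, m + 1) for i in list(letters) + [0]]  # append end marker a_0
    return [sum(v[c] for v in seq) / len(seq) for c in range(m + 1)]


# Word a_1 a_1 a_2 over m = 2 letters: x_0 = 1, x_1 = 2, x_2 = 1, n + 1 = 4.
print(frequencies([1, 1, 2], 2) == [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)])  # True
```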

Step II: Multiplication gadgets

Second, we have a sequence of gadgets (each consisting of one ReLU layer and one attention layer) that perform the multiplication. Each gadget introduces a new component, and does not change the existing components. Between gadget executions, the following additional invariants are upheld: (i) Overall, a gadget does not change existing components: it introduces one new component. (ii) The components {0,,m}\{0,\ldots,m\} are called the initial components. (iii) All other components are uniform, i.e. they are the same across all positions. (iv) The uniform components carry values in [0,1][0,1]. Thus, we will call components 0,,m0,\ldots,m the initial components; and we call components >m>m the uniform components.

Our gadgets do the following. Suppose we have already produced $\ell$ additional components. For each initial component $i\in[0,m]$ and uniform component $j\in[m+1,m+1+\ell]$, the gadget $\mathsf{omult}(\ell,i,j)$ introduces a new component that carries the value $\frac{x_{i}\cdot y_{j}}{n+1}$, where $y_{j}$ is the value in component $j$ of all vectors. Recall that we use $x_{i}$ to denote the number of occurrences of $\mathtt{a}_{i}$ in the input for $i\in[0,m]$.

We implement the gadget $\mathsf{omult}(\ell,i,j)$ using some ReLU layers and an attention layer. Suppose that before, we have the vector $\bm{u}_{p}\in\mathbb{Q}^{m+1+\ell}$ in position $p$. First, using ReLU layers, we introduce a new component that in position $p$ has the value $\bm{u}_{p}[i]\cdot\bm{u}_{p}[j]$. This can be achieved since $\bm{u}_{p}[i]\in\{0,1\}$ and $\bm{u}_{p}[j]\in[0,1]$: notice that $\bm{u}_{p}[i]\cdot\bm{u}_{p}[j]=\operatorname{ReLU}(\bm{u}_{p}[j]-(1-\bm{u}_{p}[i]))$. Indeed, if $\bm{u}_{p}[i]=1$, then this evaluates to $\bm{u}_{p}[j]$; if $\bm{u}_{p}[i]=0$, then we get $\operatorname{ReLU}(\bm{u}_{p}[j]-1)=0$. We then use uniform attention to compute the average of this new $\bm{u}_{p}[i]\cdot\bm{u}_{p}[j]$-component across all vectors. Since there are $n+1$ vectors, exactly $x_{i}$ of them have $\bm{u}_{p}[i]=1$, and also $\bm{u}_{p}[j]=y_{j}$, we get the desired $\tfrac{x_{i}\cdot y_{j}}{n+1}$.
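A single gadget can be simulated directly on a list of per-position vectors. The following is a minimal sketch, assuming exact rational arithmetic; the names `relu` and `omult_step` and the list-of-lists representation are our own illustrative choices and not part of the formal construction.

```python
from fractions import Fraction


def relu(v):
    """ReLU on exact rationals."""
    return v if v > 0 else Fraction(0)


def omult_step(vectors, i, j):
    """Append one new uniform component holding x_i * y_j / (n + 1).

    `vectors` is a list of per-position vectors; component i is a 0/1
    indicator and component j is uniform with a value in [0, 1].
    """
    # ReLU layer: u[i] * u[j] == ReLU(u[j] - (1 - u[i])) for u[i] in {0, 1}.
    gated = [relu(u[j] - (1 - u[i])) for u in vectors]
    # Uniform attention: average the gated component over all positions.
    average = sum(gated, Fraction(0)) / len(vectors)
    return [u + [average] for u in vectors]
```

As a usage example, take the input $\mathtt{a}_{1}\mathtt{a}_{1}\mathtt{a}_{2}\mathtt{a}_{0}$ with one-hot initial components and a uniform component $y=\tfrac12$ at index 3. Applying `omult_step(vectors, 1, 3)` adds a component with value $\tfrac{2\cdot\frac12}{4}=\tfrac14$ at every position.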

Step III: Computing the polynomial

We now use our gadgets to compute the value of the polynomial. For each monomial of $q$, say $X_{i_{1}}\cdots X_{i_{d}}$, we use $d-1$ gadgets to compute $x_{i_{1}}\cdots x_{i_{d}}/(n+1)^{d}$: the frequency computation in the beginning yields $x_{i_{1}}/(n+1)$, and then we use gadgets to compute $x_{i_{1}}x_{i_{2}}/(n+1)^{2}$, $x_{i_{1}}x_{i_{2}}x_{i_{3}}/(n+1)^{3}$, etc., until $x_{i_{1}}\cdots x_{i_{d}}/(n+1)^{d}$. Finally, we use a ReLU layer to multiply each monomial with a rational coefficient, and compute the sum of all the monomials. Thus, we have computed $q(x_{0},\ldots,x_{m})/(n+1)^{d}$. We accept if and only if $q(x_{0},\ldots,x_{m})/(n+1)^{d}>0$. Note that this is the case if and only if $q(x_{0},\ldots,x_{m})>0$.
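The three steps can be traced on letter counts alone, since every intermediate value depends only on $x_{0},\ldots,x_{m}$. The sketch below is illustrative and makes its own assumptions: a homogeneous polynomial is given as a list of pairs (coefficient, list of variable indices), `counts[0]` is the end marker count (always 1), and the function names are hypothetical.

```python
from fractions import Fraction


def scaled_monomial(counts, indices):
    """Return x_{i_1} ... x_{i_d} / (n + 1)^d.

    One frequency step is followed by d - 1 gadget steps, each of which
    multiplies the current uniform value y by x_i / (n + 1).
    """
    length = sum(counts)                       # n + 1, end marker included
    y = Fraction(counts[indices[0]], length)   # Step I: frequency
    for i in indices[1:]:                      # Step II: one gadget per factor
        y = Fraction(counts[i]) * y / length
    return y


def accepts(counts, monomials):
    """Step III: weighted sum of the scaled monomials, then a sign test."""
    value = sum(
        (Fraction(c) * scaled_monomial(counts, idx) for c, idx in monomials),
        Fraction(0),
    )
    return value > 0


# q = X_1^2 - X_0 * X_2, the homogenization of p = X_1^2 - X_2
q = [(1, [1, 1]), (-1, [0, 2])]
```

With these definitions, `accepts([1, 3, 5], q)` holds because $3^{2}>5$, whereas `accepts([1, 2, 5], q)` does not because $2^{2}<5$.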

This completes the proof of Proposition 3.1 and thus $\mathsf{SemiAlg}\subseteq\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$. We remark that the embedding dimension and the number of layers of our transformer in Proposition 3.1 depend on the degree $d$ and the number $M$ of monomials in $p$. We require $O(d)$ layers, each layer increasing the degree of the computed monomials by one. In the appendix, we show that polynomials of degree $d$ are accepted by $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ using at most $d$ attention layers (see Proposition A.1). The embedding dimension is $O(dM)$ because we store the value of each monomial in a separate dimension.

4 Characterizing semi-algebraic counting properties

We have shown that $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]\subseteq\mathsf{SMAT}$ can capture semi-algebraic counting properties. We now prove that the subclass $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ precisely characterizes $\mathsf{SemiAlg}$.

Proposition 4.1.

$\mathsf{NoPE\text{-}AHAT}\subseteq\mathsf{SemiAlg}$.

Proof.

Suppose that $\Sigma=\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}$ is our alphabet, $\mathtt{a}_{0}$ the end marker, and $x_{i}\in\mathbb{N}$ the number of occurrences of $\mathtt{a}_{i}$ in the input. We say that a position $p$ is an $\mathtt{a}_{i}$-position if the input holds $\mathtt{a}_{i}$ at position $p$. Notice that an AHAT without positional encoding cannot distinguish vectors that come from the same input letter. This means that, in any layer, any two $\mathtt{a}_{i}$-positions will hold the same vector. Thus, the vector sequence on layer $\ell$ is described by rational vectors $\bm{u}_{\ell,0},\ldots,\bm{u}_{\ell,m}$, where $\bm{u}_{\ell,i}$ is the vector at all the $\mathtt{a}_{i}$-positions on layer $\ell$. Moreover, for each $i$, the set of positions maximizing an attention score also either contains all $\mathtt{a}_{i}$-positions, or none of them. Therefore, if the AHAT has $a$ attention layers, there are at most $((2^{m+1})^{m+1})^{a}=2^{(m+1)^{2}a}$ possible ways to choose the positions of maximal score: on each attention layer, and for each $i\in[0,m]$, we select a subset of the $m+1$ letters. For each ReLU node and each $i$, there are two ways its expression $\operatorname{ReLU}(v)$ can be evaluated: as $0$ or as $v$. Thus, if there are $r$ ReLU nodes, then there are $2^{r}$ ways to evaluate all those nodes.

For each of these $2^{r+(m+1)^{2}a}$ choices, we construct a conjunction of polynomial inequalities that verify that (i) this choice actually maximized scores, and (ii) the resulting vector at the right-most position in the last layer satisfies the accepting condition. This is easy to do by building, for each layer $\ell$ and each $i$, expressions in $x_{1},\ldots,x_{m}$ for the vectors $\bm{u}_{\ell,i}$, assuming our choice above. These expressions have the form $p(x_{1},\ldots,x_{m})/q(x_{1},\ldots,x_{m})$ (averaging can introduce denominators). Here, once we have expressions for $\bm{u}_{\ell,i}$, we can use them to build expressions for $\bm{u}_{\ell+1,i}$ by following the definition of AHAT. Checking (i) and (ii) is then also easy, because inequalities involving quotients $p(x_{1},\ldots,x_{m})/q(x_{1},\ldots,x_{m})$ can be turned into polynomial inequalities by multiplying with common denominators. Finally, we take a disjunction over all $2^{r+(m+1)^{2}a}$ conjunctions. ∎
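To illustrate the last step, here is a generic way to clear denominators. It is a sketch for illustration rather than a step taken verbatim from the proof, and it assumes polynomials $p_{1},q_{1},p_{2},q_{2}$ whose denominators $q_{1},q_{2}$ do not vanish on the inputs considered:

```latex
\frac{p_{1}}{q_{1}} > \frac{p_{2}}{q_{2}}
\iff \frac{p_{1}q_{2}-p_{2}q_{1}}{q_{1}q_{2}} > 0
\iff (p_{1}q_{2}-p_{2}q_{1})\cdot q_{1}q_{2} > 0 .
```

The second equivalence uses that a quotient with nonzero denominator has the same sign as the product of its numerator and denominator, so the comparison becomes a single polynomial inequality.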

Inexpressibility of $\mathsf{PARITY}$.

Our characterization of $\mathsf{NoPE\text{-}AHAT}$ (i.e. Proposition 4.1) implies an interesting inexpressibility result regarding $\mathsf{PARITY}$ (see (2)):

Corollary 4.2.

$\mathsf{PARITY}$ does not belong to $\mathsf{NoPE\text{-}AHAT}$.

$\mathsf{PARITY}$ is known to be accepted by AHAT (Barceló et al., 2024) and by SMAT (Chiang and Cholak, 2022) (with PE). Inexpressibility of $\mathsf{PARITY}$ in a length-generalizable subclass of $\mathsf{SMAT}$ and $\mathsf{AHAT}$ (with strict future masking and positional encodings) is known (Huang et al., 2025). Similarly, $\mathsf{PARITY}$ is not expressible by $\mathsf{SMAT}$ with strict future masking (Hahn, 2020). Corollary 4.2 complements these results and is an easy corollary of Proposition 4.1 (see Section A.3).

5 Applications

5.1 Universality and undecidability of transformers

Let us discuss why universality/undecidability (i.e. Theorems 1.3 and 1.4) follow from Theorem 1.2. First, by the well-known “MRDP” theorem (Matiyasevich, 1993), due to Matiyasevich, Robinson, Davis, and Putnam, every language in $\mathsf{RE}\cap\mathsf{PI}$ is a projection of a language of the form $L_{p}=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}^{*}\mid p(\Psi(w))=0\}$, where $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$ is a polynomial. Since $L_{p}$ belongs to $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$, we thus obtain Theorem 1.3. Furthermore, since our translation from polynomials to $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ (and thus $\mathsf{SMAT}$) is effective, this also implies Theorem 1.4: by the MRDP theorem (which is also effective), it is undecidable whether a given polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$ has a solution. Using our translations, we can turn such a $p$ into a $\mathsf{NoPE\text{-}AHAT}$ (or $\mathsf{SMAT}$) whose language is non-empty if and only if $p$ has a solution.
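One way to see that an equation $p=0$ is itself a semialgebraic condition is to write it as a Boolean combination of strict inequalities. This is a sketch, assuming (as in the languages $L_{p>0}$ of the appendix) that the basic sets are of the form $p>0$:

```latex
p(\bm{x}) = 0
\iff \neg\bigl(p(\bm{x}) > 0\bigr) \wedge \neg\bigl(-p(\bm{x}) > 0\bigr).
```

Both $p$ and $-p$ are integer polynomials of the same degree, so the right-hand side stays within the same degree bound.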

Using only two layers

In fact, in Theorems 1.3 and 1.4, we even claim that two layers suffice for universality and undecidability. Let us sketch this here. First, our construction above yields a $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ of at most $\ell$ layers, provided that the polynomials in the semialgebraic set all have degree $\leq\ell$ (see Appendix A). In particular, we show that for each $\ell$, $\mathsf{NoPE\text{-}AHAT}[\ell,\mathsf{U}]$ is closed under union and intersection (see Section A.2). Furthermore, we rely on the well-known fact that the set of solutions of a polynomial equation $p=0$ can always be written as the projection of the set of solutions of a system of quadratic equations. Since by our stronger version of Theorem 1.2, intersections of solution sets of quadratic equations only require a $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ with $\leq 2$ layers, this yields the stronger versions of Theorems 1.3 and 1.4. See Appendix B for details (where we also show that with just one layer, Theorems 1.3 and 1.4 do not hold).
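As a small worked example of the reduction to quadratic equations (our own illustration; the variable $z$ is an auxiliary unknown introduced only for this purpose), consider the cubic equation $x^{3}-y=0$ over the natural numbers:

```latex
x^{3} - y = 0
\iff \exists z \in \mathbb{N}\colon\;
     z - x^{2} = 0 \;\wedge\; z\,x - y = 0 .
```

Each equation on the right has degree at most $2$, and projecting away $z$ recovers exactly the solutions of the original equation. Higher degrees can be handled in the same way by naming further intermediate products.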

5.2 Comparison with C-RASP and LTL with Counting

C-RASP (Huang et al., 2025; Yang and Chiang, 2024) is a simple programming language that can be converted into softmax transformers. In particular, it is a subset of the so-called LTL with Counting (Yang and Chiang, 2024; Barceló et al., 2024). For example, $\{w\in\{a,b\}^{*}:|w|_{a}=|w|_{b}\}$ can be written as the following formula in LTL with Counting: $\overrightarrow{\#a}=\overrightarrow{\#b}$. In particular, only linear expressions can be constructed in such formulas. We show in the appendix that LTL with Counting (and therefore C-RASP) captures only (semi)linear counting properties, i.e., boolean combinations of linear inequalities (and modular arithmetic), and hence not languages like $L_{k}:=\{w\in\{a,b\}^{*}:|w|_{a}^{k}\geq|w|_{b}\}$ with $k\geq 2$.

Proposition 5.1.

LTL with Counting can define only (semi)linear counting properties.

6 Experiments

In this section, we experimentally complement our main result (cf. Theorem 1.1) that transformers can capture solutions of polynomial equations of higher degree. In particular, our results suggest that softmax transformers should be able to learn languages encoding solutions of polynomial equations.

We test our hypothesis on extensions of $\mathsf{MAJ}$ with polynomial inequalities. That is, the language $L_{k}$ is defined by $L_{k}=\{w\in\{a,b\}^{+}\mid|w|_{b}\leq(|w|_{a})^{k}\}$, representing the set of solutions of the simple inequality $y\leq x^{k}$ (with $x=|w|_{a}$ and $y=|w|_{b}$).

Do softmax transformer classifiers perform well on the language $L_{k}$? Additionally, can we observe tendencies of length generalization?

In other words, the task of the transformer $T$ is a binary classification: $T$ accepts $w$ if $w\in L_{k}$ and rejects $w$ if $w\notin L_{k}$.
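The target labels for this classification task are straightforward to compute from letter counts. The following is a minimal sketch; the function name `in_Lk` is hypothetical and this is not the code used in our experiments.

```python
def in_Lk(word, k):
    """Return True iff |w|_b <= (|w|_a)^k for a non-empty word over {a, b}."""
    if not word or set(word) - {"a", "b"}:
        raise ValueError("expected a non-empty word over the alphabet {a, b}")
    return word.count("b") <= word.count("a") ** k
```

For instance, with $k=2$ the word `aabbbb` is a positive example (four occurrences of `b` against $2^{2}=4$), while `aabbbbb` is a negative one.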

We train softmax encoders without positional encoding, and otherwise in line with the vanilla model introduced by Vaswani et al. (2017), as binary classifiers using components offered by PyTorch's nn module. Training uses a balanced dataset of $5\cdot 10^{5}$ data points sampled for each $L_{k}$, $k=1,\dotsc,5$, consisting of words of length up to 500. In all experiments, we train for a single epoch and choose the best model via early stopping based on the binary cross-entropy loss combined with a sigmoid, the typical metric for models outputting a probability for binary classification, offered in a numerically stable version by PyTorch in the form of BCEWithLogitsLoss, on a validation dataset sampled from the same distribution and of the same size as the training dataset. To partially explore the hyperparameter space, we conduct a grid search over the number of layers (1 to 5) and the number of heads per layer (1, 2, or 4). In all experiments, we fixed the input features to 32, the feedforward dimension to 64, and the dropout rate to 0.3, and optimized using the AdamW optimizer with a learning rate of $10^{-4}$ and weight decay of 0.01, as offered by PyTorch's optim package.
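The size of the search space described above can be made explicit by enumerating the grid. This is only a sketch of the configuration space: the dictionary keys are hypothetical names chosen for illustration, and no training code is implied.

```python
from itertools import product

# Hyperparameters held fixed in all runs (key names are illustrative).
FIXED = {
    "input_features": 32,
    "feedforward_dim": 64,
    "dropout": 0.3,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
}


def grid():
    """Yield one configuration per (number of layers, number of heads) pair."""
    for num_layers, num_heads in product(range(1, 6), (1, 2, 4)):
        yield {"num_layers": num_layers, "num_heads": num_heads, **FIXED}


configs = list(grid())
```

The grid contains $5\cdot 3=15$ configurations, and each head count divides the input feature size of 32, as multi-head attention requires.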

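| $k$ | Val. Perf. (loss) | Test Perf. (loss/accuracy) | Gen. Perf. (loss/accuracy) |
|---|---|---|---|
| 1 | 0.015 | 0.016/0.99 | 0.301/0.95 |
| 2 | 0.024 | 0.033/0.99 | 0.324/0.94 |
| 3 | 0.023 | 0.021/0.99 | 0.299/0.96 |
| 4 | 0.019 | 0.020/0.99 | 0.099/0.97 |
| 5 | 0.020 | 0.024/0.99 | 0.107/0.96 |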
[Plot: loss (logarithmic y-axis, $10^{-3}$ to $10^{0}$) against $k=1,\ldots,5$ for Val. Perf., Test Perf., and Gen. Perf.]
Figure 2: Performance of softmax transformer classifiers for $L_{k}$ ($k=1$ to $5$). Validation Performance (Val. Perf.): BCEWithLogitsLoss on validation data. Test Performance (Test Perf.): BCEWithLogitsLoss and Accuracy (separated by /) on test data. Generalization Performance (Gen. Perf.): BCEWithLogitsLoss and Accuracy (separated by /) on generalization test set. The y-axis uses a logarithmic scale to accommodate the different orders of magnitude in the results.

Figure 2 presents the outcome of our experiments. The table on the left-hand side reports the best observed performance on the validation dataset (first column) and on a balanced test dataset drawn from the same distribution as the training and validation data (second column). In particular, this test dataset also includes only words of length up to 500. The final column reports the performance on another balanced test dataset containing words of length 501 to 1000, which we use to probe length generalization. The plot on the right visualizes the same results.

Generally, we observe very high performance, with an accuracy of $\geq 0.99$ on the in-distribution test dataset. Additionally, while the performance on the test dataset with longer words decreases, it remains relatively high, with an accuracy of $\geq 0.94$ in all instances. We expect that this gap in performance would decrease with a more extensive experimental setup. Therefore, we infer that our trained encoders perform well and that the results are consistent with length generalization, indicating that the model can capture the semantics of $L_{k}$. In Appendix D we report additional results, showing strong performance, with a decrease in performance on longer inputs.

7 Concluding Remarks

Related Work.

Much work has been done in recent years on the expressiveness of transformers for general (not necessarily counting) properties (see Strobl et al. (2024)). Counting properties (e.g., the languages $\mathsf{PARITY}$ and $\mathsf{MAJ}$) have frequently featured in research on transformer expressivity, which highlights their importance. Various theoretical transformer models have been used in the literature, employing different assumptions on the attention mechanisms (hardmax attention vs. softmax attention), positional encodings, etc. For example, a large proportion of results use hardmax attention, which is not used by practical transformers (which instead use softmax attention). In addition, some works (e.g. Pérez et al. (2021); Barceló et al. (2024)) employ extremely complex positional encodings with no restrictions. That said, several recent works have adopted more practical models. In particular, the works of Yang and Chiang (2024); Huang et al. (2025); Yang et al. (2024b; 2025) employ softmax attention transformers and simple classes of positional encodings (causal masking, local, etc.). Our results also employ a similar model (AHAT[U] and SMAT); in fact, we proved that semialgebraic counting properties can be captured by transformers without any positional encodings. Yang et al. (2025) gave a restriction of softmax attention transformers with bounded finite precision outside the attention computation, which characterizes C-RASP. Our experimental results seem to suggest that this transformer model only lower-bounds the expressivity of real-world transformers, which can capture counting properties beyond C-RASP.

Concerning verification of transformers, we mention the works by Yang et al. (2024a) and Bergsträßer et al. (2026), showing that reasoning about Unique-Hard Attention Transformers (UHAT) is decidable and EXPSPACE-complete. UHAT is known to overapproximate what can be captured by softmax transformers with bounded finite precision (Li and Cotterell, 2025). We also mention the recent work (Yang et al., 2026), showing that verifying C-RASP is undecidable.

Potential Applications in NLP.

By the Weierstrass approximation theorem, polynomials can approximate any continuous function of the numbers of token occurrences on bounded domains. This suggests that transformers can solve practical NLP tasks that require computation of nonlinear statistics in the word frequencies.

Counting properties are tightly connected to the Vector Space Model (VSM) (Salton et al., 1975; Wong et al., 1985; Shahmirzadi et al., 2019), which has applications in text classification and similarity analysis, where the standard method has been to employ Support Vector Machines (SVM) together with kernel analysis (e.g. using polynomial kernels). Our results imply that transformers are expressive enough to perform such tasks. In VSM, a document $D$ is a vector $v_{D}$ indexed by “terms” that may occur in $D$. That is, $v_{D}[t]$ is the number of occurrences of $t$ in $D$. To compare similarity between two documents $D,D^{\prime}$, we may consider the Euclidean distance between $v_{D}$ and $v_{D^{\prime}}$, which requires a polynomial. Also, there are often challenges including “related terms” (e.g. husband, wife, and spouse), which are missed when we only use the aforementioned metric. Thus, a similarity measure is often learned (see Section 10.2.2 in (Shawe-Taylor and Cristianini, 2004), where VSM is used in combination with polynomial kernels). Our results show that transformers can solve such a task. A related task is the problem of determining proximity to a human-written text, as dictated by Zipf's law (Zipf, 1935), which states that the frequency of the $k$-th most frequent word is proportional to $1/k$ in a natural language. As above, we may use the Euclidean distance to compare a document $D$ with a predetermined Zipf-vector. This results in a polynomial, and our results show this can be captured by transformers.
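To make the polynomial explicit: comparing a Euclidean distance against a threshold $r$ is equivalent to comparing the squared distance against $r^{2}$, and the squared distance is a degree-2 polynomial in the term counts. The sketch below illustrates this on toy documents; the helper names and the vocabulary are our own illustrative choices.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Term-count vector v_D of a tokenized document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


def squared_distance(u, v):
    """Squared Euclidean distance, a degree-2 polynomial in the counts."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


vocabulary = ["husband", "wife", "spouse"]
v1 = count_vector("husband wife wife".split(), vocabulary)
v2 = count_vector("spouse wife".split(), vocabulary)
```

Here `v1` is $(1,2,0)$ and `v2` is $(0,1,1)$, so the squared distance is $1+1+1=3$. The example also shows the "related terms" issue: the two documents are far apart under this metric although husband and spouse are related.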

Future Work.

We mention several open problems. Firstly, can softmax attention transformers with causal masking capture counting properties beyond semialgebraic sets? Secondly, our work has identified a gap in the formalization of the RASP-L conjecture by Huang et al. (2025). That is, transformers can capture and efficiently learn semialgebraic counting properties, which are beyond the language C-RASP. It is open whether the extension of C-RASP with inequalities over nonlinear polynomials can still be captured by softmax transformers.

Acknowledgments

We thank David Chiang, Michael Hahn, Andy Yang, and anonymous reviewers for their feedback.

Marco Sälzer, Chris Köcher, Georg Zetzsche, and Anthony Lin are funded by the European Union (ERC, LASD, 101089343 and FINABIS, 101077902). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Alexander Kozachinskiy is funded by the National Center for Artificial Intelligence CENIA (FB210017, Basal ANID, and ANID Fondecyt Iniciación grant 11250060).

References

  • C. Anil, Y. Wu, A. Andreassen, A. Lewkowycz, V. Misra, V. V. Ramasesh, A. Slone, G. Gur-Ari, E. Dyer, and B. Neyshabur (2022) Exploring length generalization in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.).
  • P. Barceló, A. Kozachinskiy, A. W. Lin, and V. V. Podolskii (2024) Logical languages accepted by transformer encoders with hard attention. In ICLR.
  • P. Bergsträßer, R. Cotterell, and A. W. Lin (2026) Transformers are inherently succinct. In The Fourteenth International Conference on Learning Representations.
  • S. Bhattamishra, K. Ahuja, and N. Goyal (2020) On the ability and limitations of transformers to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 7096–7116.
  • D. Chiang and P. Cholak (2022) Overcoming a theoretical limitation of self-attention. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio (Eds.), pp. 7654–7664.
  • D. Chistikov (2024) An introduction to the theory of linear integer arithmetic (invited paper). In 44th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2024, December 16-18, 2024, Gandhinagar, Gujarat, India, S. Barman and S. Lasota (Eds.), LIPIcs, Vol. 323, pp. 1:1–1:36.
  • G. Delétang, A. Ruoss, J. Grau-Moya, T. Genewein, L. K. Wenliang, E. Catt, C. Cundy, M. Hutter, S. Legg, J. Veness, and P. A. Ortega (2023) Neural networks and the chomsky hierarchy. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
  • C. Haase (2018) A survival guide to presburger arithmetic. ACM SIGLOG News 5 (3), pp. 67–82.
  • M. Hahn and M. Rofin (2024) Why are sensitive functions hard for transformers?. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), pp. 14973–15008.
  • M. Hahn (2020) Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics 8, pp. 156–171.
  • Y. Hao, D. Angluin, and R. Frank (2022) Formal language recognition by hard attention transformers: perspectives from circuit complexity. Trans. Assoc. Comput. Linguistics 10, pp. 800–810.
  • Z. Harris (1954) Distributional structure. Word 10 (2-3), pp. 146–162.
  • X. Huang, A. Yang, S. Bhattamishra, Y. R. Sarrof, A. Krebs, H. Zhou, P. Nakkiran, and M. Hahn (2025) A formal framework for understanding length generalization in transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025.
  • J. Li and R. Cotterell (2025) Characterizing the expressivity of fixed-precision transformer language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  • Y. V. Matiyasevich (1993) Hilbert’s tenth problem. MIT Press, Cambridge, Massachusetts.
  • W. Merrill and A. Sabharwal (2023a) A logic for expressing log-precision transformers. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.).
  • W. Merrill and A. Sabharwal (2023b) The parallelism tradeoff: limitations of log-precision transformers. Trans. Assoc. Comput. Linguistics 11, pp. 531–545.
  • J. Pérez, P. Barceló, and J. Marinkovic (2021) Attention is turing-complete. J. Mach. Learn. Res. 22, pp. 75:1–75:35.
  • G. Salton, A. Wong, and C. Yang (1975) A vector space model for automatic indexing. Commun. ACM 18 (11), pp. 613–620.
  • M. Sälzer, E. Alsmann, and M. Lange (2025) Transformer encoder satisfiability: complexity and impact on formal reasoning. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025.
  • O. Shahmirzadi, A. Lugowski, and K. Younge (2019) Text similarity in vector space models: A comparative study. In 18th IEEE International Conference On Machine Learning And Applications, ICMLA 2019, Boca Raton, FL, USA, December 16-19, 2019, M. A. Wani, T. M. Khoshgoftaar, D. Wang, H. Wang, and N. Seliya (Eds.), pp. 659–666.
  • J. Shawe-Taylor and N. Cristianini (2004) Kernel methods for pattern analysis. Illustrated edition, Cambridge University Press. ISBN 0521813972.
  • M. Sipser (2013) Introduction to the theory of computation. Third edition, Course Technology, Boston, MA. ISBN 113318779X.
  • L. Strobl, W. Merrill, G. Weiss, D. Chiang, and D. Angluin (2024) What formal languages can transformers express? A survey. Trans. Assoc. Comput. Linguistics 12, pp. 543–561.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
  • S. K. M. Wong, W. Ziarko, and P. C. N. Wong (1985) Generalized vector space model in information retrieval. In Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval, Montréal, Québec, Canada, June 5-7, 1985, J. Tague (Ed.), pp. 18–25.
  • A. Yang, P. Bergsträßer, G. Zetzsche, D. Chiang, and A. Lin (2026) Length generalization bounds for transformers. Note: Under submission (preprint: https://zenodo.org/records/18800700).
  • A. Yang, M. Cadilhac, and D. Chiang (2025) Knee-deep in c-RASP: a transformer depth hierarchy. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  • A. Yang, D. Chiang, and D. Angluin (2024a) Masked hard-attention transformers recognize exactly the star-free languages. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37, pp. 10202–10235.
  • A. Yang and D. Chiang (2024) Counting like transformers: compiling temporal counting logic into softmax transformers. CoRR abs/2404.04393.
  • A. Yang, L. Strobl, D. Chiang, and D. Angluin (2024b) Simulating hard attention using soft attention. CoRR abs/2412.09925.
  • H. Zhou, A. Bradley, E. Littwin, N. Razin, O. Saremi, J. M. Susskind, S. Bengio, and P. Nakkiran (2024) What algorithms can transformers learn? A study in length generalization. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024.
  • G. K. Zipf (1935) The psychobiology of language: an introduction to dynamic philology. Houghton Mifflin, Boston, MA.

Appendix A Translating semialgebraic sets to $\mathsf{NoPE\text{-}AHAT}$

A.1 Fine-grained analysis of polynomial degree vs. depth

In this subsection, we show the inclusion $\mathsf{SemiAlg}\subseteq\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$. In fact, we show a stronger statement (Proposition A.1), which requires some notation. By $\mathsf{SemiAlg}[\leq\ell]$ we denote the restriction of the class $\mathsf{SemiAlg}$ to the semi-algebraic languages $L\subseteq\Sigma^{*}$ such that the underlying semi-algebraic set $S\subseteq\mathbb{N}^{m}$ is a Boolean combination of sets $S_{p}$ where $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$ are polynomials of degree $\leq\ell$. In particular, we have $\mathsf{SemiAlg}[\leq 1]=\mathsf{QFPA}$. Our construction for $\mathsf{SemiAlg}\subseteq\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ actually shows the following:

Proposition A.1.

For each $\ell>0$ we have $\mathsf{SemiAlg}[\leq\ell]\subseteq\mathsf{NoPE\text{-}AHAT}[\leq\ell,\mathsf{U}]$.

For showing Proposition A.1, we need some more technical definitions. Let $T$ be an AHAT with input embedding $\iota\colon\Sigma\to\mathbb{Q}^{d_{1}}$ and layers $\lambda_{1}\colon(\mathbb{Q}^{d_{1}})^{*}\to(\mathbb{Q}^{d_{2}})^{*},\ldots,\lambda_{\ell}\colon(\mathbb{Q}^{d_{\ell}})^{*}\to(\mathbb{Q}^{d_{\ell+1}})^{*}$. We define the function $f_{T}\colon\Sigma^{+}\to\mathbb{Q}$ as follows: for a word $w=a_{1}a_{2}\ldots a_{n}\in\Sigma^{+}$, if $\lambda_{1}\circ\cdots\circ\lambda_{\ell}(\iota(a_{1}),\ldots,\iota(a_{n}))=(\bm{y}_{1},\ldots,\bm{y}_{n})$, then $f_{T}(w)=\bm{y}_{n}[1]$. In other words, we have $f_{T}(w)>0$ iff $T(w)=1$.

Proposition A.2.

For every polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$ of degree $\ell$, the language $L_{p>0}=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}^{*}\mid p(\Psi(w))>0\}$ belongs to $\mathsf{NoPE\text{-}AHAT}[\leq\ell,\mathsf{U}]$.

To show Proposition 3.1, we will use polynomials that are homogeneous, meaning all monomials have the same degree. Note that given an arbitrary polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$ of degree $\ell$, we can consider the polynomial $q\in\mathbb{Z}[X_{0},\ldots,X_{m}]$ with $q=X_{0}^{\ell}\,p(\tfrac{X_{1}}{X_{0}},\ldots,\tfrac{X_{m}}{X_{0}})$, which is homogeneous. It has the property that $p(x_{1},\ldots,x_{m})>0$ if and only if $q(1,x_{1},\ldots,x_{m})>0$. Therefore, from now on, we assume that we have a homogeneous polynomial $q\in\mathbb{Z}[X_{0},\ldots,X_{m}]$ and want to construct an AHAT for the language $K_{q}=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}^{*}\mid q(1,\bm{x})>0\text{ for }\bm{x}=\Psi(w)\}$.
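The homogenization step can be checked mechanically. The following is a minimal sketch under our own representation assumptions: a polynomial is a dictionary mapping exponent tuples to integer coefficients, and the function names `homogenize` and `evaluate` are hypothetical.

```python
def homogenize(p):
    """Pad every monomial with a power of X_0 up to the degree of p.

    `p` maps exponent tuples (e_1, ..., e_m) to coefficients; the result
    maps tuples (e_0, e_1, ..., e_m) whose entries all sum to deg(p).
    """
    degree = max(sum(exps) for exps in p)
    return {(degree - sum(exps),) + tuple(exps): c for exps, c in p.items()}


def evaluate(poly, point):
    """Evaluate a polynomial in the dictionary representation at a point."""
    total = 0
    for exps, coeff in poly.items():
        term = coeff
        for x, e in zip(point, exps):
            term *= x ** e
        total += term
    return total


# p = X_1^2 - X_2 + 3 becomes q = X_1^2 - X_0 * X_2 + 3 * X_0^2
p = {(2, 0): 1, (0, 1): -1, (0, 0): 3}
q = homogenize(p)
```

Setting $X_{0}=1$ recovers the original polynomial, so `evaluate(q, (1, 4, 5))` and `evaluate(p, (4, 5))` agree (both give $16-5+3=14$).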

To simplify notation, we denote the end marker $\$$ by $\mathtt{a}_{0}$. Thus, the input will be a string $w\in\{\mathtt{a}_{0},\ldots,\mathtt{a}_{m}\}^{+}$ that contains $\mathtt{a}_{0}$ exactly once, at the end. Since $|w|_{\mathtt{a}_{0}}=1$ is satisfied automatically, our AHAT only has to check that $q(x_{0},\ldots,x_{m})>0$, where $x_{i}=|w|_{\mathtt{a}_{i}}$. The input encoding is the map $\{\mathtt{a}_{0},\ldots,\mathtt{a}_{m}\}\to\mathbb{Q}^{m+1}$ with $\mathtt{a}_{i}\mapsto\bm{e}_{i}$, where $\bm{e}_{i}\in\mathbb{Q}^{m+1}$ is the $i$-th unit vector.

In a first lemma we show that each monomial of $q$ can be computed by a NoPE-AHAT with $\ell$ uniform attention layers.

Lemma A.3.

For every monomial $r\in\mathbb{Z}[X_{0},X_{1},\ldots,X_{m}]$ of degree $\ell$, there is a NoPE-AHAT $T$ with $\ell$ uniform attention layers such that

$$f_{T}(w)=\frac{r(\Psi(w))}{|w|^{\ell}}$$

for each word $w\in\Sigma^{*}$. In particular, we have $f_{T}(w\$)>0$ if and only if $r(\Psi(w))>0$.

Proof.

We use the word embedding $\iota\colon\Sigma\to\mathbb{Q}^{m+1}$ with $\iota(\mathtt{a}_{i})=\bm{e}_{i}$ for each $i\in[0,m]$.

Step I: Compute frequencies

Our AHAT first uses an attention layer to compute $m+1$ new components, where the $i$-th component holds $\tfrac{x_{i}}{n+1}$ and $n+1$ is the length of the input (including the end marker). This is easily done by attending to all positions and computing the averages of the first $m+1$ components. To simplify notation, we index vectors starting with index $0$.

Step II: Multiplication gadgets

Second, we have a sequence of gadgets (each consisting of one uniform attention layer and one ReLU layer). Each gadget introduces one new component and does not change the existing components. We call the components $0,\ldots,m$ the initial components, and we call the components $>m$ the uniform components. Between gadget executions, the following invariants are upheld: (i) The uniform components are the same across all positions. (ii) The uniform components carry values in $[0,1]$.

Our gadgets do the following. Suppose we have already produced $k$ additional components. For each initial component $i\in[0,m]$ and uniform component $j\in[m+1,m+1+k]$, the gadget $\mathsf{omult}(k,i,j)$ introduces a new component that carries the value $\frac{x_{i}\cdot y_{j}}{n+1}$, where $y_{j}$ is the value in component $j$ of all vectors. Recall that we use $x_{i}$ to denote the number of occurrences of $\mathtt{a}_{i}$ in the input for $i\in[0,m]$.

We implement the gadget $\mathsf{omult}(k,i,j)$ using some ReLU layers and an attention layer. Suppose that before, we have the vector $\bm{u}_{p}\in\mathbb{Q}^{m+1+k}$ in position $p$. First, using ReLU layers, we introduce a new component that in position $p$ has the value $\bm{u}_{p}[i]\cdot\bm{u}_{p}[j]$. This can be achieved since $\bm{u}_{p}[i]\in\{0,1\}$ and $\bm{u}_{p}[j]\in[0,1]$: Notice that $\bm{u}_{p}[i]\cdot\bm{u}_{p}[j]=\operatorname{ReLU}(\bm{u}_{p}[j]-(1-\bm{u}_{p}[i]))$. Indeed, if $\bm{u}_{p}[i]=1$, then this evaluates to $\bm{u}_{p}[j]$; if $\bm{u}_{p}[i]=0$, then we get $\operatorname{ReLU}(\bm{u}_{p}[j]-1)=0$. We then use uniform attention to compute the average of this new $\bm{u}_{p}[i]\cdot\bm{u}_{p}[j]$-component across all vectors. Since there are $n+1$ vectors, exactly $x_{i}$ of them have $\bm{u}_{p}[i]=1$, and all of them have $\bm{u}_{p}[j]=y_{j}$, we get the desired $\tfrac{x_{i}\cdot y_{j}}{n+1}$.

Step III: Computing the monomial

We now use our gadgets to compute the value of the monomial. Let $r(X_{0},\ldots,X_{m})=\alpha\cdot X_{i_{1}}\cdots X_{i_{\ell}}$. We use $\ell-1$ gadgets to compute $x_{i_{1}}\cdots x_{i_{\ell}}/(n+1)^{\ell}$: The frequency computation in the beginning yields $x_{i_{1}}/(n+1)$, and then we use gadgets to compute $x_{i_{1}}x_{i_{2}}/(n+1)^{2}$, $x_{i_{1}}x_{i_{2}}x_{i_{3}}/(n+1)^{3}$, etc. until $x_{i_{1}}\cdots x_{i_{\ell}}/(n+1)^{\ell}$. Finally, we use a ReLU layer to multiply $x_{i_{1}}\cdots x_{i_{\ell}}/(n+1)^{\ell}$ with $\alpha$. Thus, we have computed $r(x_{0},\ldots,x_{m})/(n+1)^{\ell}$. ∎
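Steps I to III can be simulated numerically. The following is a minimal sketch under stated assumptions: the function name `monomial_by_averaging`, the encoding of a word as a list of letter indices, and the use of exact rationals are choices made for this illustration, and the code models only the arithmetic of the layers, not an actual transformer.

```python
from fractions import Fraction


def relu(v):
    return max(v, Fraction(0))


def monomial_by_averaging(word, indices, alpha=1):
    """Compute alpha * x_{i_1} ... x_{i_l} / (n+1)^l by repeated averaging.

    word: list of letter indices in 0..m; index 0 is the end marker, which
          is assumed to occur exactly once, at the end.
    indices: the list [i_1, ..., i_l] describing the monomial.
    """
    length = len(word)  # this is n + 1

    def one_hot(pos, i):
        # Initial component i at position pos (value in {0, 1}).
        return Fraction(1 if word[pos] == i else 0)

    # Step I: uniform attention averages the initial component i_1.
    y = sum(one_hot(p, indices[0]) for p in range(length)) / length

    # Step II/III: one gadget per remaining factor.
    for i in indices[1:]:
        # ReLU layer: u_p[i] * y, using u_p[i] in {0, 1} and y in [0, 1].
        products = [relu(y - (1 - one_hot(p, i))) for p in range(length)]
        # Uniform attention layer: average over all n + 1 positions.
        y = sum(products) / length

    # Final scaling by the coefficient alpha.
    return alpha * y
```

For the word with letter indices `[1, 1, 2, 1, 2, 0]` (so $x_{0}=1$, $x_{1}=3$, $x_{2}=2$, $n+1=6$) and the monomial $X_{1}X_{2}X_{0}$, the sketch uses one averaging step plus two gadgets, matching the count of $\ell=3$ uniform attention layers.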

A.2 Combining $\mathsf{NoPE\text{-}AHAT}[\mathsf{U}]$ without additional layers

The following lemma states that two NoPE-AHATs with only uniform attention layers can be run in parallel, resulting in a NoPE-AHAT with the same number of uniform layers. Their outputs can also be combined via a ReLU neural network. In particular, $\mathsf{NoPE\text{-}AHAT}[\leq\ell,\mathsf{U}]$ is closed under union and intersection.

Lemma A.4.

Let $T_{1},T_{2}$ be two NoPE-AHATs with $\ell$ uniform attention layers and let $\mathcal{N}$ be a ReLU neural network computing a function $\mathcal{N}\colon\mathbb{Q}^{2}\to\mathbb{Q}$. Then there is a NoPE-AHAT $T_{\mathcal{N}}$ with $\ell$ uniform attention layers computing $f_{T_{\mathcal{N}}}(w\$)=\mathcal{N}(f_{T_{1}}(w\$),f_{T_{2}}(w\$))$.

Proof.

The idea of $T_{\mathcal{N}}$ is that it concatenates the components of $T_{1}$ with those of $T_{2}$ and always keeps the two sets of components disjoint. By uniformity, we are able to apply the attention layers of $T_{1}$ and $T_{2}$ in parallel. In the last attention layer, we can simply apply $\mathcal{N}$ to the first components of $T_{1}$ and $T_{2}$.

By $\iota_{i}\colon\Sigma\to\mathbb{Q}^{d_{1,i}}$ we denote the word embedding of $T_{i}$. From this we construct a new word embedding $\iota\colon\Sigma\to\mathbb{Q}^{d_{1,1}+d_{1,2}}$ with $\iota(\mathtt{a}_{j})=(\iota_{1}(\mathtt{a}_{j}),\iota_{2}(\mathtt{a}_{j}))$ for each $j\in[0,m]$.

Now, let $\lambda_{k,i}\colon\mathbb{Q}^{d_{k,i}}\to\mathbb{Q}^{d_{k+1,i}}$ be the $k$-th layer of $T_{i}$ for $1\leq k\leq\ell$. By $K_{i},Q_{i},V_{i}$, and $\mathcal{N}_{i}$ we denote the parameters of $\lambda_{k,i}$. Since $\lambda_{k,i}$ is uniform, the key and query maps $K_{i}$ and $Q_{i}$ are constantly mapping to zero. We now construct a uniform layer $\lambda_{k}\colon\mathbb{Q}^{d_{k,1}+d_{k,2}}\to\mathbb{Q}^{d_{k+1,1}+d_{k+1,2}}$ composed of $\lambda_{k,1}$ and $\lambda_{k,2}$: the key and query maps $K$ and $Q$ still map to zero. If $V_{i}(\bm{x}_{i})=A_{i}\bm{x}_{i}+\bm{b}_{i}$, then we define the new value map $V$ by

$$V\binom{\bm{x}_{1}}{\bm{x}_{2}}=\begin{pmatrix}A_{1}&\bm{0}\\ \bm{0}&A_{2}\end{pmatrix}\binom{\bm{x}_{1}}{\bm{x}_{2}}+\binom{\bm{b}_{1}}{\bm{b}_{2}}=\binom{A_{1}\bm{x}_{1}+\bm{b}_{1}}{A_{2}\bm{x}_{2}+\bm{b}_{2}}=\binom{V_{1}(\bm{x}_{1})}{V_{2}(\bm{x}_{2})}\,.$$

By this definition, we obtain that the attention vectors $\vec{a}_{j}$ in $\lambda_{k}$ are the concatenation of the attention vectors $\vec{a}_{j,1}$ and $\vec{a}_{j,2}$ in $\lambda_{k,1}$ and $\lambda_{k,2}$, respectively. Similarly, we combine $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ in parallel, resulting in an FFN computing $\binom{\mathcal{N}_{1}(\bm{x}_{j,1},\vec{a}_{j,1})}{\mathcal{N}_{2}(\bm{x}_{j,2},\vec{a}_{j,2})}$.

Finally, in the last layer, we add the FFN $\mathcal{N}^{\prime}$ that takes the first components of the outputs $\mathcal{N}_{i}(\bm{x}_{j,i},\vec{a}_{j,i})$ for $i\in\{1,2\}$ and simulates $\mathcal{N}$ on these two numbers. ∎
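The parallel composition of value maps in the proof of Lemma A.4 can be checked on small data. This is a minimal sketch, assuming affine maps given as integer matrices and bias vectors; the helper names `affine`, `block_diag`, and `uniform_attention` and the example data are hypothetical and only model one uniform attention step.

```python
from fractions import Fraction


def affine(A, b, x):
    """Apply the affine map x -> A x + b (A as a list of rows)."""
    return [sum(a * v for a, v in zip(row, x)) + bi for row, bi in zip(A, b)]


def block_diag(A1, A2):
    """Block-diagonal matrix with A1 in the top left and A2 in the bottom right."""
    cols1, cols2 = len(A1[0]), len(A2[0])
    top = [list(row) + [0] * cols2 for row in A1]
    bottom = [[0] * cols1 + list(row) for row in A2]
    return top + bottom


def uniform_attention(vectors, A, b):
    """Uniform attention: the average of the value vectors over all positions."""
    values = [affine(A, b, x) for x in vectors]
    n = len(values)
    return [Fraction(sum(col), n) for col in zip(*values)]


# Hypothetical per-position vectors of T_1 (dimension 2) and T_2 (dimension 1).
xs1 = [[1, 0], [0, 1], [1, 1]]
xs2 = [[2], [4], [6]]
A1, b1 = [[1, 2]], [0]
A2, b2 = [[1], [-1]], [1, 0]

# Concatenate components position by position and combine the value maps.
combined = [u + v for u, v in zip(xs1, xs2)]
stacked = uniform_attention(combined, block_diag(A1, A2), b1 + b2)
```

Here `stacked` coincides with the concatenation of the two separately computed attention vectors, which is the observation that lets both transformers share the same attention layers.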

[Figure 3 diagram: the embeddings $\iota_{1},\iota_{2}$, the attention vectors $\vec{a}_{1},\vec{a}_{2}$, and the FFNs $\mathcal{N}_{1},\mathcal{N}_{2}$ of $T_{1}$ and $T_{2}$ are stacked componentwise to form $T$.]
Figure 3: Visualization of the proof of Lemma A.4.

Recall that from a polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$ we constructed a homogeneous polynomial $q\in\mathbb{Z}[X_{0},X_{1},\ldots,X_{m}]$ such that $p(\bm{x})>0$ if and only if $q(1,\bm{x})>0$ holds for all vectors $\bm{x}\in\mathbb{Q}^{m}$. Let $r_{1},\ldots,r_{k}\in\mathbb{Z}[X_{0},X_{1},\ldots,X_{m}]$ be the monomials in $q$. Since $q$ is homogeneous, all monomials have the same degree $\ell$. Lemma A.3 yields NoPE-AHATs $T_{1},\ldots,T_{k}$ that compute the monomials $r_{i}$. Each of these AHATs has exactly $\ell$ uniform attention layers. Finally, we can apply Lemma A.4 to construct a NoPE-AHAT $T$ with $\ell$ uniform layers computing $f_{T}(w\$)=\frac{q(\Psi(w\$))}{|w\$|^{\ell}}$ (since addition is an affine map). Then $T$ accepts $w$ iff $\frac{q(\Psi(w\$))}{|w\$|^{\ell}}>0$ iff $q(\Psi(w\$))>0$ iff $p(\Psi(w))>0$. In other words, $T$ accepts the language $L_{p>0}$.

A.3 Inexpressibility of $\mathsf{PARITY}$

Proof of Corollary 4.2.

By Theorem 1.2, it suffices to show that $\mathsf{PARITY}$ is not semi-algebraic. Suppose it is. Then there is a disjunction of conjunctions of polynomial inequalities that characterizes $\mathsf{PARITY}$. The polynomials are over $\mathbb{Z}[X,Y]$, where $X$ is the variable for $\mathtt{a}$'s and $Y$ is the variable for $\mathtt{b}$'s. By plugging in $Y=0$, we conclude that the set of even numbers is semi-algebraic. Hence, there is a disjunction $\bigvee_{i=1}^{n}\bigwedge_{j=1}^{m}p_{i,j}(X)>0$ of conjunctions that is satisfied exactly for the even numbers. This implies that for some $i$, there are infinitely many even numbers $k$ such that $\bigwedge_{j=1}^{m}p_{i,j}(k)>0$. Therefore, for every $j\in[1,m]$, the polynomial $p_{i,j}$ is positive at infinitely many natural numbers, so it is either a positive constant or has a positive leading coefficient. But then, $\bigwedge_{j=1}^{m}p_{i,j}(k)>0$ must hold for all sufficiently large $k$, not just the even ones, a contradiction. ∎

Appendix B Parametric analysis

In this section, we study how the expressive power of NoPE-AHAT[U] and SMAT depends on the number of attention layers. In particular, we show that Theorems 1.3 and 1.4 hold already in the case of two layers. The main insight of this proof is that the number of layers needed to express a semialgebraic set depends on the degrees of the involved polynomials (see Proposition A.1): Note that our sketch of a $\mathsf{NoPE\text{-}AHAT}$ for $L_{p>0}$ in Section 3 directly yields a $\mathsf{NoPE\text{-}AHAT}$ with $\ell$ layers, where $\ell$ is the degree of $p$. For Proposition A.1, one then has to show that Boolean combinations of such sets can be expressed without growing the number of attention layers. See Appendix A for details.

Capturing $\mathsf{RE}$ with two layers

From Proposition A.1, we can now deduce the two-attention-layer version of Theorems 1.3 and 1.4. The first ingredient is the following version of the MRDP theorem on Diophantine sets (Matiyasevich, 1993):

Theorem B.1.

Let $\Sigma=\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}$. A language $L\subseteq\Sigma^{*}$ belongs to $\mathsf{RE}\cap\mathsf{PI}$ if and only if there is a $k\in\mathbb{N}$ and a polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m+k}]$ such that $L=\pi_{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}}(K)$, where

$$K=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m+k}\}^{*}\mid p(\Psi(w))=0\}.$$

In other words, every language in $\mathsf{RE}\cap\mathsf{PI}$ is a projection of a language of the form $L_{p}=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m}\}^{*}\mid p(\Psi(w))=0\}$, where $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$ is a polynomial. Thus, it suffices to place $L_{p}$ in $\mathsf{Proj}(\mathsf{NoPE\text{-}AHAT}[\leq 2,\mathsf{U}])$. First observe that in Theorem 1.2, we use one attention layer for each multiplication, so this avenue is closed if we want to stay within two attention layers. Instead, we use that for every polynomial $p\in\mathbb{Z}[X_{1},\ldots,X_{m}]$, there are quadratic (i.e. degree $\leq 2$) polynomials $q_{1},\ldots,q_{r}\in\mathbb{Z}[X_{1},\ldots,X_{m+k}]$ for some $r,k\geq 0$ such that for $\bm{x}\in\mathbb{N}^{m}$, we have $p(\bm{x})=0$ if and only if there is some $\bm{y}\in\mathbb{N}^{k}$ with $q_{1}(\bm{x},\bm{y})=0,\ldots,q_{r}(\bm{x},\bm{y})=0$: Just introduce a fresh variable for each multiplication in $p$ and use the $q_{i}$ to assign these fresh variables. Since the language $K:=\{w\in\{\mathtt{a}_{1},\ldots,\mathtt{a}_{m+k}\}^{*}\mid q_{1}(\Psi(w))=\cdots=q_{r}(\Psi(w))=0\}$ belongs to $\mathsf{SemiAlg}[\leq 2]$ (since the $q_{i}$ have degree $\leq 2$) and $L_{p}$ is a projection of $K$, this means that $L_{p}$ belongs to $\mathsf{Proj}(\mathsf{SemiAlg}[\leq 2])$. By Proposition A.1, $\mathsf{Proj}(\mathsf{SemiAlg}[\leq 2])\subseteq\mathsf{Proj}(\mathsf{NoPE\text{-}AHAT}[\leq 2,\mathsf{U}])$.
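As a small worked instance of the degree reduction (the polynomial below is a hypothetical example chosen for illustration, not one used in the paper), take a cubic $p$ and introduce one fresh variable $Y_{1}$ for the product $X_{1}X_{2}$:

```latex
\begin{align*}
  p(X_1,X_2,X_3) &= X_1X_2X_3 - X_1 - 4, \\
  q_1(X_1,X_2,X_3,Y_1) &= Y_1 - X_1X_2, \\
  q_2(X_1,X_2,X_3,Y_1) &= Y_1X_3 - X_1 - 4 .
\end{align*}
% For x in N^3: p(x) = 0 iff there is y_1 in N with q_1(x,y_1) = q_2(x,y_1) = 0,
% because q_1 = 0 forces y_1 = x_1 x_2, and then q_2(x,y_1) = p(x).
% Example: x = (1,5,1) with y_1 = 5 gives q_1 = 5 - 5 = 0 and q_2 = 5 - 1 - 4 = 0.
```

Both $q_{1}$ and $q_{2}$ have degree $2$, and the witness $y_{1}=x_{1}x_{2}$ is a natural number whenever $x_{1},x_{2}$ are, so the existential quantifier can indeed range over $\mathbb{N}$.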

NoPE AHAT with a single layer

The fact that two layers suffice for universality among counting properties raises the question of whether this is even possible with a single attention layer. We show here that this is not the case:

Theorem B.2.

$\mathsf{NoPE\text{-}AHAT}[\leq 1]=\mathsf{NoPE\text{-}AHAT}[\leq 1,\mathsf{U}]=\mathsf{QFPA}$.

This means that, with a single attention layer, NoPE-AHAT can recognize precisely those counting properties expressible using quantifier-free Presburger formulas. Since satisfiability of Presburger arithmetic is well-known to be decidable (Haase, 2018; Chistikov, 2024), this implies that universality and undecidability of $\mathsf{NoPE\text{-}AHAT}$ (as we have shown for two attention layers) do not hold with just one attention layer. However, we leave open whether $\mathsf{SMAT}$ with one attention layer has a decidable emptiness problem.

Before going into details, let us sketch the proof of Theorem B.2. For the inclusion $\mathsf{NoPE\text{-}AHAT}[\leq 1]\subseteq\mathsf{QFPA}$, we proceed similarly to Proposition 4.1, while observing that the inequalities we have to verify are all linear inequalities: This is because a single attention layer averages only once. Conversely, the inclusion $\mathsf{QFPA}\subseteq\mathsf{NoPE\text{-}AHAT}[\leq 1,\mathsf{U}]$ follows easily from Proposition A.1.

Proof of Theorem B.2.

We begin by proving that $\mathsf{NoPE\text{-}AHAT}[\leq 1]\subseteq\mathsf{QFPA}$. Let $T$ be an AHAT with input embedding $\iota\colon\Sigma\cup\{\$\}\to\mathbb{Q}^{d}$, a single AHA layer $\lambda$ utilising affine maps $Q,K\in\mathbb{Q}^{m\times d}$, $V\in\mathbb{Q}^{k\times d}$, given as matrices, and the ReLU network $\mathcal{N}\colon\mathbb{Q}^{d+k}\to\mathbb{Q}^{e}$. Our goal is to construct a quantifier-free PA formula $\varphi_{T}$ with variables $x_{i}$ for $i\in\{1,\dotsc,|\Sigma|\}$ such that $\Psi^{-1}(\llbracket\varphi_{T}\rrbracket)=\{w\in\Sigma^{*}\mid T\text{ accepts }w\$\}$. In the following, we assume $\Sigma=\{a_{1},\dotsc,a_{m}\}$ and denote $\Sigma\cup\{\$\}$ by $\Sigma^{\prime}$.

First, we observe that for all words $w\in\Sigma^{*}$, the output of $T$ given $w\$$ is computed by

$$\mathcal{N}\left(\iota(\$),\frac{1}{|w\$|_{a_{i_{1}}}+\dotsb+|w\$|_{a_{i_{h}}}}\sum_{j=1}^{h}|w\$|_{a_{i_{j}}}V\iota(a_{i_{j}})\right),$$

where $\Gamma=\{a_{i_{1}},\dotsc,a_{i_{h}}\}\subseteq\Sigma^{\prime}$ is exactly the subset of symbols $a_{i_{j}}$ occurring in $w\$$ that maximise $\langle Q\iota(a_{i_{j}}),K\iota(\$)\rangle$. We construct $\varphi_{T}$ such that it mirrors exactly this computational structure. We have $\varphi_{T}=\bigvee_{\Gamma\subseteq\Sigma^{\prime}}\varphi_{\Gamma}$, where $\bigvee$ ranges over those subsets $\Gamma$ where $\langle Q\iota(a_{i_{j}}),K\iota(\$)\rangle$ is maximal for precisely the $a_{i_{j}}\in\Gamma$. The subformula $\varphi_{\Gamma}$ is defined as follows. For now, we assume that $\$\notin\Gamma$ and introduce some auxiliary formulas. Throughout the following construction steps, we assume that atomic formulas are normalised to the form $c_{1}x_{1}+\dotsb+c_{n}x_{n}\leq b$.

Given the ReLU network $\mathcal{N}$, it is straightforward to construct a quantifier-free PA formula $\varphi^{\mathcal{N}}$ such that $\llbracket\varphi^{\mathcal{N}}\rrbracket$ exactly includes those $(x_{1},\dotsc,x_{d+k})\in\mathbb{N}^{d+k}$ satisfying $\mathcal{N}(x_{1},\dotsc,x_{d+k})_{1}>0$, where $\mathcal{N}(\cdot)_{1}$ denotes the first output dimension of $\mathcal{N}$. The key idea here is that the computation of a single ReLU node $v(x_{1},\dotsc,x_{d+k})=y$, with weights $c_{i}$ and bias $b$ of $\mathcal{N}$, is described by the quantifier-free PA formula $(c_{1}x_{1}+\dotsb+c_{d+k}x_{d+k}+b\leq 0\land 0=y)\lor(c_{1}x_{1}+\dotsb+c_{d+k}x_{d+k}+b>0\land c_{1}x_{1}+\dotsb+c_{d+k}x_{d+k}+b=y)$. Then, by nesting this construction iteratively from the last layer to the first layer of $\mathcal{N}$, and finally replacing $=y$ with $>0$ in the atomic formulas related to the first output dimension of $\mathcal{N}$, we achieve the construction of $\varphi^{\mathcal{N}}$. This nesting and replacement also ensures that $\varphi^{\mathcal{N}}$ includes only the variables $x_{1},\dotsc,x_{d+k}$.

Let $\Gamma\subseteq\Sigma$ such that $\Gamma=\{a_{i_{1}},\dotsc,a_{i_{h}}\}$. Consider the ReLU network $\mathcal{N}$, the value matrix $V$, and the embedding $\iota$. We construct a quantifier-free PA formula $\varphi^{\mathcal{N},V,\iota}_{\Gamma}$ such that $\llbracket\varphi^{\mathcal{N},V,\iota}_{\Gamma}\rrbracket$ exactly includes those $(x_{i_{1}},\dotsc,x_{i_{h}})\in\mathbb{N}^{h}$ satisfying $\mathcal{N}(\iota(\$),\frac{1}{x_{i_{1}}+\dotsb+x_{i_{h}}}\sum_{j=1}^{h}x_{i_{j}}V\iota(a_{i_{j}}))_{1}>0$. To do so, we adjust the formula $\varphi^{\mathcal{N}}$ as described in the following. To account for the fixed input $\iota(\$)$, we replace each occurrence of $x_{1}$ to $x_{d}$ in $\varphi^{\mathcal{N}}$ by the respective entry of $\iota(\$)$. Furthermore, to handle the specific form of the input $\frac{1}{x_{i_{1}}+\dotsb+x_{i_{h}}}\sum_{j=1}^{h}x_{i_{j}}V\iota(a_{i_{j}})$, we first replace each occurrence of $x_{d+l}$ with $l\in\{1,\dotsc,k\}$ in the already modified $\varphi^{\mathcal{N}}$ by:

$$(v_{l1}\iota(a_{i_{1}})_{1}+\dotsb+v_{ld}\iota(a_{i_{1}})_{d})x_{i_{1}}+\dotsb+(v_{l1}\iota(a_{i_{h}})_{1}+\dotsb+v_{ld}\iota(a_{i_{h}})_{d})x_{i_{h}},$$

where $v_{lj}$ are the respective entries of $V$. Lastly, we replace each atomic constraint $c_{1}x_{i_{1}}+\dotsb+c_{h}x_{i_{h}}\leq b$ in the adjusted formula with $(c_{1}-b)x_{i_{1}}+\dotsb+(c_{h}-b)x_{i_{h}}\leq 0$ to adjust for the factor $\frac{1}{x_{i_{1}}+\dotsb+x_{i_{h}}}$ present in the input.
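The last rewriting step rests on the equivalence between $\frac{1}{S}\sum_{j}c_{j}x_{j}\leq b$ and $\sum_{j}(c_{j}-b)x_{j}\leq 0$ for $S=\sum_{j}x_{j}>0$. The following minimal sketch checks this equivalence by brute force on a small box of natural numbers; the function names, the integer coefficients, and the chosen bound are assumptions made for illustration, and the case $S=0$ is skipped because the attended symbols occur in the input.

```python
from fractions import Fraction
from itertools import product


def averaged_constraint(c, b, x):
    """(1/S) * sum_j c_j x_j <= b, where S = sum_j x_j is assumed positive."""
    s = sum(x)
    return sum(Fraction(cj * xj, s) for cj, xj in zip(c, x)) <= b


def linear_constraint(c, b, x):
    """The rewritten, purely linear constraint sum_j (c_j - b) x_j <= 0."""
    return sum((cj - b) * xj for cj, xj in zip(c, x)) <= 0


def agree_on_box(c, b, bound):
    """Check that both constraints agree for all x in {0..bound}^h with S > 0."""
    for x in product(range(bound + 1), repeat=len(c)):
        if sum(x) == 0:
            continue
        if averaged_constraint(c, b, x) != linear_constraint(c, b, x):
            return False
    return True
```

Multiplying the averaged constraint by the positive number $S$ and moving $b\cdot S=\sum_{j}b\,x_{j}$ to the left-hand side gives exactly the linear form, which is why no division remains in the resulting Presburger formula.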

Now, we define $\varphi_{\Gamma}$ as $\varphi^{\mathcal{N},V,\iota}_{\Gamma}$. If $\$\in\Gamma$, we adjust $\varphi^{\mathcal{N},V,\iota}_{\Gamma}$ slightly. Assuming $\$=a_{i_{j}}\in\Gamma$, we replace the variable $x_{i_{j}}$ with the constant $1$ in $\varphi^{\mathcal{N},V,\iota}_{\Gamma}$. Given this construction, it is clear that $\Psi^{-1}(\llbracket\varphi_{T}\rrbracket)=\{w\in\Sigma^{+}\mid T\text{ accepts }w\$\}$, as $\varphi_{T}$ mimics the computation of $T$ for all possible attention situations $\Gamma$.

For the inclusion $\mathsf{QFPA}\subseteq\mathsf{NoPE\text{-}AHAT}[\leq 1,\mathsf{U}]$, we observe that $\mathsf{QFPA}\subseteq\mathsf{SemiAlg}[\leq 1]$, and thus the inclusion follows from Proposition A.1. ∎

Appendix C Counting properties expressible by other models

C.1 Semilinear counting properties

A counting property $P\subseteq\mathbb{N}^{d}$ is said to be semilinear if it can be defined as a boolean combination of inequalities over linear arithmetic expressions (over variables $x_{1},\ldots,x_{d}$ and integer constants) and modulo arithmetic expressions of the form $x_{i}\equiv a\pmod{b}$, where $a,b\in\mathbb{N}$ are fixed constants. In particular, semilinear counting properties cannot define semialgebraic counting properties involving polynomials of degree 2 or above.

It is also convenient to use quantifiers when defining semilinear sets. In particular, they do not increase the expressive power since they can be eliminated. This results in the logic called Presburger arithmetic (PA), which refers to the first-order theory of the structure $\langle\mathbb{N};+,0,1,<\rangle$; see Haase (2018); Chistikov (2024).

C.2 Permutation-invariant languages of LTL with counting

$\mathsf{LTL}[\mathsf{Count}]$ has the following syntax:

$$\begin{aligned}\phi &::= a\mid t\leq t\mid\neg\phi\mid\phi\lor\phi\mid\operatorname{X}\phi\mid\phi\operatorname{U}\phi\\ t &::= k\mid k\cdot\overleftarrow{\#\phi}\mid k\cdot\overrightarrow{\#\phi}\mid t+t\end{aligned}$$

where $a\in\Sigma$ and $k\in\mathbb{Z}$. Next we define the semantics of $\mathsf{LTL}[\mathsf{Count}]$. For any word $w=a_{1}a_{2}\cdots a_{\ell}\in\Sigma^{*}$ with $a_{1},a_{2},\ldots,a_{\ell}\in\Sigma$, for each $1\leq i\leq\ell$, and each formula $\phi\in\mathsf{LTL}[\mathsf{Count}]$, we write $w,i\models\phi$ if the formula $\phi$ is satisfied in $w$ at position $i$. Formally, this relation is defined inductively as follows:

  • $w,i\models a$ (for $a\in\Sigma$) iff $a_{i}=a$,

  • $w,i\models\neg\phi$ iff $w,i\not\models\phi$,

  • $w,i\models\phi\lor\psi$ iff $w,i\models\phi$ or $w,i\models\psi$,

  • $w,i\models\operatorname{X}\phi$ iff $i<\ell$ and $w,i+1\models\phi$,

  • $w,i\models\phi\operatorname{U}\psi$ iff there is $i\leq j\leq\ell$ with $w,j\models\psi$ and for all $i\leq k<j$ we have $w,k\models\phi$,

  • $w,i\models t_{1}\leq t_{2}$ iff $\llbracket t_{1}\rrbracket(w,i)\leq\llbracket t_{2}\rrbracket(w,i)$, where the semantics $\llbracket t\rrbracket\colon\Sigma^{*}\times\mathbb{N}\to\mathbb{Z}$ of a term $t$ is defined as follows: $\llbracket k\rrbracket(w,i)=k$, $\llbracket t_{1}+t_{2}\rrbracket(w,i)=\llbracket t_{1}\rrbracket(w,i)+\llbracket t_{2}\rrbracket(w,i)$, $\llbracket k\cdot\overleftarrow{\#\phi}\rrbracket(w,i)=k\cdot|\{1\leq j<i\mid w,j\models\phi\}|$, and $\llbracket k\cdot\overrightarrow{\#\phi}\rrbracket(w,i)=k\cdot|\{i\leq j\leq\ell\mid w,j\models\phi\}|$.
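The inductive semantics can be turned into a direct evaluator. This is a minimal sketch, assuming formulas and terms are encoded as nested tuples with the hypothetical tags `"sym"`, `"not"`, `"or"`, `"X"`, `"U"`, `"leq"` and `"const"`, `"add"`, `"past"`, `"future"`; it also assumes that the witness position of an until-formula ranges up to the word length. It is a naive recursive implementation meant only to make the definitions concrete.

```python
def holds(w, i, phi):
    """Return True iff w, i |= phi for a 1-based position i in the word w."""
    n = len(w)
    op = phi[0]
    if op == "sym":
        return w[i - 1] == phi[1]
    if op == "not":
        return not holds(w, i, phi[1])
    if op == "or":
        return holds(w, i, phi[1]) or holds(w, i, phi[2])
    if op == "X":
        return i < n and holds(w, i + 1, phi[1])
    if op == "U":
        # Some j >= i satisfies the right argument, and every position
        # k with i <= k < j satisfies the left argument.
        return any(
            holds(w, j, phi[2]) and all(holds(w, k, phi[1]) for k in range(i, j))
            for j in range(i, n + 1)
        )
    if op == "leq":
        return value(w, i, phi[1]) <= value(w, i, phi[2])
    raise ValueError(f"unknown formula tag: {op}")


def value(w, i, t):
    """Evaluate the counting term t at position i of w."""
    n = len(w)
    op = t[0]
    if op == "const":
        return t[1]
    if op == "add":
        return value(w, i, t[1]) + value(w, i, t[2])
    if op == "past":    # k times the number of positions j < i satisfying phi
        return t[1] * sum(holds(w, j, t[2]) for j in range(1, i))
    if op == "future":  # k times the number of positions j >= i satisfying phi
        return t[1] * sum(holds(w, j, t[2]) for j in range(i, n + 1))
    raise ValueError(f"unknown term tag: {op}")


# Hypothetical example: "strictly more a's than b's", evaluated at position 1,
# written as (number of b's) + 1 <= (number of a's).
more_a = ("leq",
          ("add", ("future", 1, ("sym", "b")), ("const", 1)),
          ("future", 1, ("sym", "a")))
```

Evaluated at position 1, `more_a` depends only on the letter counts of the word, so it defines a permutation-invariant language of the kind considered in the theorem that follows.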

Our main result on $\mathsf{LTL}[\mathsf{Count}]$ is the following:

Theorem C.1.

Every permutation-invariant language definable in $\mathsf{LTL}[\mathsf{Count}]$ has a semilinear Parikh image.

Before we can prove Theorem C.1, we need a few more definitions. For an alphabet $\Sigma$, write $\Sigma_{\varepsilon}$ for the set $\Sigma\cup\{\varepsilon\}$. A ($d$-dimensional) Parikh automaton is a tuple $\mathfrak{A}=(Q,\Sigma,\iota,\Delta,(C_{q})_{q\in Q})$ where $Q$ is a finite set of states, $\Sigma$ is the input alphabet, $\iota\in Q$ is an initial state, $\Delta\subseteq Q\times\Sigma_{\varepsilon}\times\mathbb{N}^{d}\times Q$ is a finite transition relation, and $C_{q}\subseteq\mathbb{N}^{d}$ are semilinear target sets. A word $w\in\Sigma^{*}$ is accepted by $\mathfrak{A}$ if there are $a_{1},a_{2},\ldots,a_{\ell}\in\Sigma_{\varepsilon}$ with $w=a_{1}a_{2}\cdots a_{\ell}$, states $q_{0},q_{1},\ldots,q_{\ell}\in Q$, and vectors $\bm{v}_{0},\bm{v}_{1},\ldots,\bm{v}_{\ell}\in\mathbb{N}^{d}$ such that (i) $q_{0}=\iota$ and $\bm{v}_{0}=\bm{0}$, (ii) for each $0\leq i<\ell$ there is a transition $(q_{i},a_{i+1},\bm{x}_{i},q_{i+1})\in\Delta$ with $\bm{v}_{i+1}=\bm{v}_{i}+\bm{x}_{i}$, and (iii) $\bm{v}_{\ell}\in C_{q_{\ell}}$. The accepted language $L(\mathfrak{A})$ of $\mathfrak{A}$ is the set of all words accepted by $\mathfrak{A}$. It is a well-known fact that for each Parikh automaton $\mathfrak{A}$ the accepted language $L(\mathfrak{A})$ has a semilinear Parikh image. Observe that $0$-dimensional Parikh automata are essentially NFA and, hence, accept exactly the regular languages.

A Parikh transducer is a Parikh automaton with input alphabet $\Sigma_{\varepsilon}\times\Gamma_{\varepsilon}$ where $\Sigma$ and $\Gamma$ are two alphabets. The accepted language $L(\mathfrak{A})\subseteq\Sigma^{*}\times\Gamma^{*}$ of a Parikh transducer can also be seen as a map: if $(v,w)\in L(\mathfrak{A})$ then we can see $v$ as the input and $w$ as the output of the transducer. Formally, for an input language $L\subseteq\Sigma^{*}$ a Parikh transducer computes the output $T_{\mathfrak{A}}(L)=\{w\in\Gamma^{*}\mid\exists v\in L\colon(v,w)\in L(\mathfrak{A})\}$. If $L$ is accepted by a Parikh automaton then $T_{\mathfrak{A}}(L)$ is also accepted by a Parikh automaton. To see this, we can take the synchronized product of the Parikh automaton $\mathfrak{B}$ accepting $L$ and $\mathfrak{A}$ (i.e., $\mathfrak{B}$ reads the same letter from the input as $\mathfrak{A}$ in its first component). Accordingly, cascading of Parikh transducers is also possible, i.e., if $\mathfrak{A}$ and $\mathfrak{B}$ are Parikh transducers over $\Sigma_{\varepsilon}\times\Gamma_{\varepsilon}$ and $\Gamma_{\varepsilon}\times\Pi_{\varepsilon}$, we can also construct a Parikh transducer $\mathfrak{C}$ over $\Sigma_{\varepsilon}\times\Pi_{\varepsilon}$ computing $T_{\mathfrak{C}}=T_{\mathfrak{B}}\circ T_{\mathfrak{A}}$.

With the definitions of Parikh automata and Parikh transducers, we are now able to prove Theorem C.1.

Proof.

Let $\phi\in\mathsf{LTL}[\mathsf{Count}]$ be a formula such that the described language $L(\phi)$ is permutation-invariant. We will prove by induction on the structure of $\phi$ that the Parikh image of $L(\phi)$ (or actually a bounded subset of $L(\phi)$) is semilinear. Here, a language $L\subseteq\Sigma^{*}$ is bounded if there are letters $a_{1},a_{2},\ldots,a_{n}\in\Sigma$ with $L\subseteq a_{1}^{*}a_{2}^{*}\cdots a_{n}^{*}$. So, let $a_{1},a_{2},\ldots,a_{n}\in\Sigma$ be distinct letters with $\Sigma=\{a_{1},a_{2},\ldots,a_{n}\}$. Then $L(\phi)\cap a_{1}^{*}a_{2}^{*}\cdots a_{n}^{*}$ is clearly bounded and has the same Parikh image as $L(\phi)$.

For each subformula $\psi$ of $\phi$ we construct a Parikh transducer that labels each position satisfying $\psi$. In the base case, we decorate each letter $a$ by $\bm{b}\in\{0,1\}^{n}$ where $\bm{b}[i]=1$ iff $a_{i}=a$. Note that this transducer handles all atomic formulas $a\in\Sigma$ at once. For $\psi=\chi_{1}\lor\chi_{2}$ we add the decoration $b\in\{0,1\}$ to each letter where $b=1$ iff one of the decorations corresponding to $\chi_{1}$ and $\chi_{2}$ is $1$. There are similar transducers (which do not introduce counters) for the cases $\psi=\neg\chi$, $\psi=\operatorname{X}\chi$, and $\psi=\chi_{1}\operatorname{U}\chi_{2}$. Note that applying these transducers to a bounded language always yields another bounded language.

Now, consider a counting subformula, i.e. $\psi=\sum_{i=1}^{\ell_{1}}k_{i}\cdot\overleftarrow{\#\chi_{i}}+\sum_{i=\ell_{1}+1}^{\ell_{2}}k_{i}\cdot\overrightarrow{\#\chi_{i}}\leq k$. Observe that the set of positions satisfying $\psi$ is convex in the set of positions satisfying any $\chi_{i}$. This is true since we consider only a bounded input language. Hence, we can split the input word into three (possibly empty) intervals: (i) the positions at the beginning of the input that do not satisfy $\psi$, (ii) the positions where all positions satisfying a $\chi_{i}$ also satisfy $\psi$, and (iii) the positions at the end of the input that do not satisfy $\psi$. We describe in the following a Parikh transducer with $3\cdot\ell_{2}$ many counters, one for each of these three intervals and each formula $\chi_{i}$. The transducer guesses the three intervals (note that this is non-deterministic), counts positions satisfying a $\chi_{i}$ accordingly, decorates only the positions in the second interval labeled with a $\chi_{i}$ with $1$ (and everything else with a $0$), and validates in the end our choice of the intervals (via appropriate semilinear target sets ensuring that the inequality in $\psi$ is not satisfied in the first and third interval and is satisfied in the second interval). Clearly, this all can be done in one (non-deterministic) Parikh transducer.

Finally, we have a cascade of (Parikh) transducers decorating each position in a bounded input word with a Boolean value indicating whether $\phi$ holds in that position. If we use $a_{1}^{*}a_{2}^{*}\cdots a_{n}^{*}$ as input language for our transducers (note that this language is regular) and intersect the output with all words decorated with a $1$ in the first position, we obtain a Parikh automaton accepting exactly the language $L(\phi)\cap a_{1}^{*}a_{2}^{*}\cdots a_{n}^{*}$. Since Parikh automata accept only languages with semilinear Parikh image, we infer that $L(\phi)\cap a_{1}^{*}a_{2}^{*}\cdots a_{n}^{*}$ and, hence, $L(\phi)$ have a semilinear Parikh image. ∎

Appendix D Further experimental validation

(i, j)    Val. Perf.    Test Perf.     Gen. Perf.
(1, 3)    0.016         0.02/0.99      0.03/0.99
(3, 2)    0.002         0.003/0.99     0.60/0.93
(3, 3)    0.001         0.002/0.99     2.26/0.85
(4, 2)    0.001         0.001/0.99     0.26/0.96
(5, 1)    0.004         0.004/0.99     0.03/0.99
[Plot: x-axis $k$ (ticks 1 to 5), y-axis Loss (logarithmic scale, $10^{0}$ down to $10^{-3}$); series: Val. Perf., Test Perf., Gen. Perf.]
Figure 4: Performance of softmax transformer classifiers for $L_{i,j}$ (for a selected set of $i$ and $j$ combinations). Validation Performance (Val. Perf.): BCEWithLogitsLoss on validation data. Test Performance (Test Perf.): BCEWithLogitsLoss and Accuracy (separated by /) on test data. Generalization Performance (Gen. Perf.): BCEWithLogitsLoss and Accuracy (separated by /) on generalization test set. The y-axis uses a logarithmic scale to accommodate the different orders of magnitude in the results.

In this section, we report additional experiments addressing a similar research question as posed in Section 6, namely, do softmax transformers perform well on formal languages with inherent non-linear counting properties? To this end, we consider the language

$$L_{i,j}=\{a^{m}b^{n}c^{m^{i}n^{j}}\mid m,n\in\mathbb{N}^{\geq 1}\}$$

for selected values of $i$ and $j$. Clearly, recognising this language requires non-linear counting capabilities. Moreover, in contrast to $L_{k}$ (see Section 6), this language poses a greater challenge in learning tasks due to its structure (all $a$'s precede all $b$'s, which precede all $c$'s) and larger alphabet size.
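For concreteness, membership in $L_{i,j}$ can be decided by a direct check of the block structure and the block lengths. This is a minimal sketch; the function name `in_L` and the string representation of words are assumptions made for illustration and say nothing about how the datasets in the experiments were generated.

```python
import re


def in_L(word, i, j):
    """Decide whether word = a^m b^n c^(m^i * n^j) for some m, n >= 1."""
    match = re.fullmatch(r"(a+)(b+)(c*)", word)
    if match is None:
        return False  # wrong letter order, missing a's or b's, or foreign letters
    m, n, c = (len(group) for group in match.groups())
    return c == m ** i * n ** j
```

For example, with $i=1$ and $j=3$ the word $a^{2}b^{2}c^{16}$ is accepted since $2^{1}\cdot 2^{3}=16$, while any word in which a $b$ precedes an $a$ is rejected regardless of its letter counts.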

The experimental setup is identical to that presented in Section 6. The results are presented in Figure 4 for five distinct combinations of $i$ and $j$. Similar to our previous experiments, the table on the left shows the highest observed performance on the validation dataset (first column) and the best performance on a balanced test dataset derived from the same distribution as the training and validation data (second column). In particular, this dataset also contains only words of length up to 500. The final column represents another balanced test dataset of words of length 501 to 1000, utilised to potentially reveal length generalisation performance. The plot on the right visualises the results reported in the table.

We again observe very high performance of our trained softmax transformers on the in-distribution test dataset (second column), which shares the same distribution as our training dataset. The performance generally remains high on the generalisation test set (third column) as well. We witnessed a decrease compared to the results on the in-distribution test in the case of $L_{3,3}$ (accuracy of 0.85). A general decrease in performance on longer inputs is expected and also witnessed in other studies (cf. Huang et al. (2025)), but it also indicates that focused studies are essential to reveal rigorous insights into the relationship between the expressibility of polynomial counting properties we established and their practical learnability.
