Tracking the emergence of linguistic structure
in self-supervised models learning from speech

Marianne de Heer Kloots¹, Martijn Bentum², Hosein Mohebbi³,
Charlotte Pouw¹, Gaofei Shen³, Willem Zuidema¹
¹Institute for Logic, Language and Computation, University of Amsterdam, The Netherlands
²Centre for Language Studies, Radboud University, The Netherlands
³Cognitive Science and Artificial Intelligence, Tilburg University, The Netherlands
Corresponding author: m.l.s.deheerkloots@uva.nl

Abstract

Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).

Figure 1: (A) We train a set of six models on the same dataset, consisting of 831 hours of Dutch speech recordings. (B) We explore how results vary between model architectures with minimal differences in training set-up (Wav2Vec2, HuBERT-I1, HuBERT-I2). (C) We probe each model’s internal representations for nine types of linguistic structure, which differ in their degrees of abstraction from the acoustic signal and in their timescales of information integration. We compare results for each structure across model layers as well as training steps.

1 Introduction

Much progress in speech technology has been driven by the use of self-supervised learning (SSL) algorithms, which learn powerful representations of spoken language based on unlabelled audio recordings of speech (e.g. Baevski et al., 2020; Hsu et al., 2021; Chen et al., 2022; Poli et al., 2025). As foundation models for the speech signal, speech SSL model (S3M) representations form a crucial component in many state-of-the-art systems for downstream speech technology tasks, including tasks targeting the signal’s language content, like automatic speech recognition (e.g. Omnilingual ASR Team et al., 2025) and spoken language modeling (Arora et al., 2025).

Current S3Ms are typically Transformer architectures optimized on large collections of recordings with relatively simple pre-training objectives (e.g. masked audio segment prediction; see Mohamed et al., 2022, for review). Given these primarily acoustic objectives, it is not obvious that such models should learn to encode more abstract structural information about spoken language. Nevertheless, a substantial amount of work has assessed the linguistic structures encoded in S3M-internal states, often finding that they vastly outperform acoustic baselines when probed for linguistic information, including at the level of phonemes (Martin et al., 2023), words (Pasad et al., 2024) and sentences (Shen et al., 2023).

Speech-based SSL models have also been proposed as candidate systems for the modelling of human language learning and processing, given their ability to operate on more realistic input signals than text-based models (Dupoux, 2018). One challenge that speech poses to both humans and machine learning systems alike is the reliable identification of linguistic units in continuous signals, which themselves vary widely across speakers, contexts, and acoustic conditions (i.e. the lack of invariance problem; Perkell and Klatt, 1986; Heald et al., 2016). While cognitive theories have posited that a sensitivity for symbolic categories is needed to overcome this challenge (e.g. Werker and Tees, 1999; Kuhl, 2004), S3Ms form an interesting counter-perspective to this idea, providing evidence of what linguistic structures can in fact get represented by systems using purely distributional learning mechanisms while operating on continuous signals alone (Schatz et al., 2021; Lavechin et al., 2025).

The linguistic organization of speech importantly follows a hierarchy that can be characterized in terms of both timescale and abstraction: from short-range acoustic features to long-range and abstract syntactic dependencies. How different levels of linguistic structure might interact in human speech perception has long been a central topic of debate: while some have modelled the process of spoken language understanding as a series of relatively encapsulated prelexical, lexical, syntactic and semantic stages (for discussion, see Weber and Scharenborg, 2012; Friederici, 2002; Norris et al., 2016), others emphasize the importance of interactions between structural levels (Magnuson et al., 2018; Marslen-Wilson, 1975; Elman and McClelland, 1988). Similarly, some have characterized the language learning process as a series of successive stages whereby infants first acquire the basic units of speech before they are able to access the higher-level combinations of them (Kuhl, 2000; Werker, 2018), while contrasting views emphasize the role of larger patterns being acquired earlier in learning (Tomasello, 2003; Arnon, 2021), and the use of higher-level structures in learning to identify the basic units (Gleitman, 1990; Feldman et al., 2013).

Do the internal representations of self-supervised speech models reflect the hierarchical organization of linguistic structures? And does this hierarchy characterize the order in which the different linguistic structures emerge? We here report on a range of interpretability analyses addressing these questions, by investigating model-internal representations across nine levels of linguistic structure. We study the layerwise distribution of these structures as well as their development across training time, for a new set of Wav2Vec2 and HuBERT models trained on 831 hours of spoken Dutch (Figure 1).

2 Related work

2.1 Layerwise hierarchies and linguistic structure in S3Ms

Where different kinds of acoustic, linguistic, and speaker-related information are best encoded across model hidden layer representations has been studied for a wide range of S3M architectures (Pasad et al., 2021, 2023; Mohamed et al., 2024), as well as models fine-tuned for speech-to-text transcription (Pasad et al., 2021; Roll et al., 2026). These studies have generally found a common pattern across Transformer architectures: local acoustics and voice characteristics peak early, whereas labels abstracting away from such lower-level features (e.g. phone or word identity) are best decodable from middle-to-late model layers. Moreover, training objectives greatly affect the layerwise organization of speech features: whereas the encoding of linguistic features peaks in middle layers of SSL-trained Wav2Vec2 and sharply drops off afterwards (reflecting a specialization towards the model’s acoustic pre-training objective), ASR-finetuned Wav2Vec2 preserves such features into its final layers (Pasad et al., 2021).

What kinds of linguistic structures end up encoded in S3M representations has been the focus of a growing complementary body of research, with individual studies typically targeting separate structural levels. Collectively, these studies suggest that S3Ms encode a rich set of linguistic feature structures ranging across multiple timescales and levels of abstraction: from allophonic variation (Choi et al., 2025) to phoneme categories (Martin et al., 2023), phonotactic constraints (de Heer Kloots and Zuidema, 2024), morphological patterns (Gauthier et al., 2025), lexical information (Pasad et al., 2024), and syntactic structure (Shen et al., 2023). When probed for sentence grammaticality across a range of morphosyntactic phenomena, S3M representations even sometimes match or outperform ASR models trained with textual supervision, demonstrating that grammatical knowledge can indeed arise purely from speech (He et al., 2025).

However, because most linguistic investigations into S3M representations have so far focussed on individual structural levels, with analysis data and methods varying between studies, it is currently unknown whether fine-grained distinctions between levels of linguistic organization (e.g. phone-, syllable-, word- and sentence-level structures) are reflected in S3M layerwise hierarchies. Additionally, since most studies focus on the encoding of English linguistic information in models pre-trained on English speech recordings (with some notable exceptions: Millet and Dunbar, 2022; Shen et al., 2024; De La Fuente and Jurafsky, 2024; Dugonjić et al., 2024; de Heer Kloots et al., 2025), it is an open question to what extent S3M encoding of linguistic information at various structural levels generalizes to other languages.

2.2 Learning dynamics in neural models of (spoken) language

The process of learning linguistic units and meaning from data has been a primary topic of investigation since early connectionist modelling (e.g. McClelland and Elman, 1986; Rumelhart and McClelland, 1986; Rogers and McClelland, 2004). Current research in computational linguistics has started investigating learning dynamics in modern neural language models. In the text domain, model behaviors across training have been found to mimic aspects of human word and syntax learning (Chang and Bergen, 2022; Evanson et al., 2023), and measurements of model-internal components can be related to drops in the loss and increased syntactic abilities (Chen et al., 2024).

In the speech domain, the acquisition of language-specific phonetic and lexical structure has been studied in smaller recurrent SSL models (e.g. Poli et al., 2024; Lavechin et al., 2025), as well as in Transformer-based Wav2Vec2 models trained with audiovisual supervision on corpora of spoken image captions (Khorrami et al., 2023), and in Wav2Vec2 models trained only on speech (Orhan et al., 2026). Both Khorrami et al. and Orhan et al. report that phonemic distinctions precede lexical ones in model representations across training, with syntactic structure (Orhan et al., 2026) emerging later, while both studies also demonstrate some degree of overlap between learning trajectories.

Hence, evidence exists that S3Ms learn to encode various levels of linguistic information across training, and that the encodings of different levels of linguistic structure might show distinct patterns across model layers and training trajectories. Nevertheless, both aspects have not yet been systematically studied for a wide range of features (including acoustic, sublexical, lexical and syntactic structures), while simultaneously exploring effects of architectural variations on both layerwise patterns and learning dynamics. Here, we provide such analyses for a set of six Wav2Vec2 and HuBERT models trained on Dutch.

3 Methods

3.1 Models

Architectures

To explore effects of architectural variation on model representational structure, we include two representative systems applied widely in current speech SSL research, which vary minimally in their architectural configuration: Wav2Vec2 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021); see Figure 1B. For both model architectures we use the same base-sized configuration, consisting of a 7-layer CNN module followed by 12 Transformer layers; all randomly initialized before model training. We train six models in total, with two models (initialized with different random seeds) for each of three variations in training objective: Wav2Vec2 models use a contrastive objective with negative examples extracted from the output of its feature encoder, HuBERT-I1 (iteration 1) models use a pseudo-label prediction objective with pseudo-labels extracted by k-means clustering on acoustic (MFCC) features, and HuBERT-I2 (iteration 2) models use a pseudo-label prediction objective with pseudo-labels extracted by k-means clustering the hidden layer representations of the iteration 1 models. We train HuBERT-I1 models for 100K training steps; pseudo-labels for pre-training HuBERT-I2 were extracted from the 6th layer of the final HuBERT-I1 checkpoints. Wav2Vec2 and HuBERT-I2 models were both trained for 200K training steps. HuBERT-I2 models are equivalent to the HuBERT architecture most commonly applied for downstream applications; we choose to study both iterations here, to disentangle observed effects of contrastive vs. predictive training objective from effects of iterative refinement (following Huo and Dunbar, 2025).

Training data, set-up and checkpoints

We use the fairseq library (Ott et al., 2019) to train all Wav2Vec2 and HuBERT models on 831 hours of Dutch speech recordings from various domains (combining data from the corpus of spoken Dutch (Corpus Gesproken Nederlands; CGN), Schuurman et al., 2003; CommonVoice (CV), Ardila et al., 2020; and Multilingual LibriSpeech (MLS), Pratap et al., 2020). We extract 537 hours of audio from CGN, covering both Dutch and Flemish material while excluding telephone speech and sermons (due to low sample rate and poor audio quality). The remaining dataset contains spontaneous conversations and interviews, along with read speech and news broadcasts. The CGN recordings are segmented into phrases, and only segments between 2 and 15 seconds are used as training inputs. From MLS, we include 211 hours of audiobook segments. An additional 83 hours are sourced from CommonVoice, consisting of short read-aloud sentences. Across the full training set, audio segment durations range from 2 to 20 seconds. We follow the default training recipes for both Wav2Vec2 and HuBERT, only modifying fairseq configurations to allow longer utterance length and per-device batch size. We used 16 Nvidia A100-40GB GPUs for model training. Training for 100k steps took approximately 16 hours for HuBERT-I1 models; training for 200k steps took approximately 50 hours for HuBERT-I2 models, and 100 hours for Wav2Vec2 models. For all models, we save intermediate checkpoints in increasing intervals throughout training (every checkpoint between 1 and 100 steps, every 10th checkpoint between 100 and 1,000 steps, every 100th checkpoint between 1,000 and 10K steps, every 1,000th checkpoint between 10K and 100K steps, and every 10,000th checkpoint between 100K and 200K steps). For our representational analyses in this study, we choose to include every 1000th checkpoint up to 100K steps, after verifying that capable models were trained and training up to this step was stable for all architectures (see Appendix A; Figure S1, Table S1).
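As an illustration of the checkpoint schedule described above, the following sketch enumerates the saved training steps and the subset used for analysis. The function name, the treatment of range boundaries (each boundary step is counted once), and the reading of "every 1000th checkpoint" as "every step divisible by 1,000" are assumptions made for this example; they are not taken from the training code.

```python
def checkpoint_steps(max_step=200_000):
    """Enumerate training steps at which a checkpoint is saved (illustrative)."""
    # Each entry: (upper bound of the range, saving interval within the range).
    # How the range boundaries are handled is an assumption of this sketch.
    schedule = [
        (100, 1),
        (1_000, 10),
        (10_000, 100),
        (100_000, 1_000),
        (200_000, 10_000),
    ]
    steps = set()
    lower = 0
    for upper, interval in schedule:
        top = min(upper, max_step)
        steps.update(range(lower + interval, top + 1, interval))
        lower = upper
        if upper >= max_step:
            break
    return sorted(steps)


all_steps = checkpoint_steps()

# Analysis subset: every 1,000th step up to 100K steps (assumed interpretation).
analysis_steps = [s for s in all_steps if s % 1_000 == 0 and s <= 100_000]
```

Under these assumptions the schedule is dense early in training, where representations change fastest, while the analysis subset is evenly spaced.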

Nonspeech baseline model

In addition to our speech-trained models, we include a baseline comparison model in all our analyses, to ensure that the linguistic structuring detected by our representational probes is in fact a consequence of model training on speech data, rather than a general reflection of input acoustics. As our nonspeech baseline, we use a Wav2Vec2 model trained for 100K steps on 900 hours of non-speech acoustic scenes from AudioSet, for which intermediate training checkpoints have been released by Orhan et al. (2025). This model was verified by Orhan et al. to obtain reasonable performance when fine-tuned on a downstream acoustic validation task (environmental scene classification), and additionally showed some capability for recognizing sequential patterns in non-speech auditory stimuli.

3.2 Representational analyses

We apply a range of analysis techniques targeting different kinds of linguistic structure, probing for model-internal alignment with both categorical (e.g. phone, syllable and word labels) and relational (e.g. continuous acoustic and semantic feature spaces) structures (see Figure 1C).

Analysis data and linguistic annotations

To construct analysis datasets for each of our probes, we first obtain annotations on the level of phones, syllables and words for a subset of recordings in Multilingual LibriSpeech (MLS), which was not included in model training. Taking advantage of the fact that one male Dutch speaker recorded many audiobooks included in MLS, we restrict the subset to this speaker (male_2450) only; by only analyzing recordings from a single speaker, we ensure that results are not confounded by speaker imbalance across analysis recordings. We obtain time-aligned phonetic transcriptions on the level of phones and words by using WebMAUS (https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/WebMAUSBasic) to force-align the audio files and text transcriptions from MLS. For a subset of words we additionally extract syllable segmentations from the CELEX database (Baayen et al., 1996). Finally, for our syntactic analyses (probing for part-of-speech categories and dependency structure), we manually identify a set of 2406 individual sentences (since recordings in MLS are not segmented by sentence), for which we extract part-of-speech (POS) labels and dependency annotations using spaCy (Honnibal et al., 2020); we refer to these annotations, automatically obtained from the sentence text transcriptions, as the ‘true’ POS labels and dependency parses. (We use pipelines based on nl_core_news_lg, for which the spaCy documentation reports 95% POS tagging accuracy and 88% unlabeled attachment score for dependency parsing.)

Embedding extraction

All models included in our study generate one representation frame every 20 ms (i.e. at a frame rate of 50 Hz). We extract representations for each linguistic unit (phone, syllable or word) by feature-slicing, i.e. feeding a full audiobook segment from MLS as input and isolating hidden-state activations within the start and end of each unit. Following common practice (e.g. Choi et al., 2024; Pasad et al., 2024), we subsequently mean-pool frame representations within each linguistic unit to obtain phone-, syllable- and word-level embeddings. We analyze each model’s 512-dimensional CNN output, as well as the 768-dimensional projections to Transformer embedding space (embeds) and all Transformer hidden layers (T1-12).
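The feature-slicing and mean-pooling step can be sketched as follows for a single layer. This is a minimal illustration, not the extraction code used in the study: the function name and the convention for mapping unit boundaries to frame indices (floor for the start, ceiling for the end, at least one frame per unit) are assumptions.

```python
import numpy as np

FRAME_STRIDE_MS = 20  # one frame per 20 ms, as stated in the text


def pool_unit(hidden_states, start_s, end_s):
    """Mean-pool the frames of one layer that fall within a unit.

    hidden_states: array of shape (n_frames, n_dims) for a full segment.
    start_s, end_s: unit boundaries in seconds (e.g. from forced alignment).
    The boundary-to-frame mapping below is an assumed convention.
    """
    start_ms, end_ms = round(start_s * 1000), round(end_s * 1000)
    first = start_ms // FRAME_STRIDE_MS
    last = max(first + 1, -(-end_ms // FRAME_STRIDE_MS))  # ceiling division
    last = min(last, hidden_states.shape[0])
    return hidden_states[first:last].mean(axis=0)


# Toy example: 10 frames (200 ms) of 4-dimensional activations.
hidden = np.arange(40, dtype=float).reshape(10, 4)
word_embedding = pool_unit(hidden, 0.04, 0.10)    # averages frames 2, 3 and 4
short_embedding = pool_unit(hidden, 0.041, 0.045)  # shorter than one frame
```

Because the whole segment is passed through the model before slicing, each pooled embedding can still carry information from the context surrounding the unit.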

Analysis methods

We aim to measure the linguistic structuring of model-internal representations, but S3M representations are known to entangle many different aspects of the speech signal (including e.g. speaker voice and acoustic characteristics as well as multiple linguistic features). How can we quantify the encoding of specific linguistic information related to our target structures, in such highly entangled representations? One commonly applied method for interpreting model hidden state representations involves the use of diagnostic classification probes, i.e. auxiliary classifiers trained to decode target information from model hidden layer states (Hupkes et al., 2018; Chrupała et al., 2020). However, classification probes have been considered suboptimal for analyzing model learning dynamics, with alternative representation space metrics showing more sensitivity to subtle differences between checkpoints and better alignment to changes in model behavioral performance over training (Saphra and Lopez, 2019). We here use a related range of metrics, which similarly target model representation space structure (i.e. measuring relative distances between input stimuli, clustering, and alignment to interpretable feature spaces), rather than classification accuracy. We analyze a total of nine levels of linguistic structure (Figure 1C), choosing our analysis techniques to suit each categorical and relational linguistic feature structure of interest.

For all probes analyzing categorical linguistic information (i.e. phone categories, syllable forms and types, word forms and part-of-speech labels), we use clustering probes based on dimensionality-reduced subspaces extracted using Linear Discriminant Analysis (LDA; Tharwat et al., 2017). We evaluate model-representational clustering according to target category labels by computing the silhouette score on LDA projections optimized for distinctiveness between categories; silhouette scores range between -1 and 1, where 0 corresponds to random structure and more positive scores indicate more well-separated clusters. Details on all clustering analysis datasets are included in Appendix B; Tables S2, S3, S5, S8, S9.
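A clustering probe of this kind could look roughly like the sketch below, which uses scikit-learn. The synthetic data, the function name and the choice to fit LDA on a training split and score the held-out projections are assumptions for illustration; the dataset construction and hyperparameters of the actual probes are described in Appendix B.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import silhouette_score


def clustering_probe(train_x, train_y, test_x, test_y):
    """Silhouette score of held-out embeddings in an LDA subspace."""
    lda = LinearDiscriminantAnalysis()
    lda.fit(train_x, train_y)          # subspace optimized for class separation
    projected = lda.transform(test_x)  # held-out embeddings in that subspace
    return silhouette_score(projected, test_y)


# Synthetic stand-in for unit embeddings: 3 categories in 16 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 16))


def sample(n_per_class):
    x = np.concatenate([c + rng.normal(size=(n_per_class, 16)) for c in centers])
    y = np.repeat(np.arange(3), n_per_class)
    return x, y


train_x, train_y = sample(60)
test_x, test_y = sample(30)

score = clustering_probe(train_x, train_y, test_x, test_y)
# Control: the same embeddings with shuffled labels should score much lower.
chance_score = clustering_probe(
    train_x, rng.permutation(train_y), test_x, rng.permutation(test_y)
)
```

Comparing against a shuffled-label control is one simple way to check that a high score reflects category structure rather than the flexibility of the LDA projection.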

For analyzing alignment to continuous feature spaces (mel-frequency cepstral coefficients (MFCCs) for acoustic, and static text-based word embeddings for semantic alignment), we use Representational Similarity Analysis (RSA; Kriegeskorte et al., 2008). RSA scores are Pearson’s correlation coefficients between the cosine-dissimilarity spaces extracted from model representation and interpretable feature space, respectively; more positive Pearson’s r values indicate higher similarity. For our analyses of semantic alignment, we compute similarity to a set of embeddings from a text-based semantic embedding model (Dutch Fasttext, accessed from the HuggingFace Hub (Wolf et al., 2020); facebook/fasttext-nl-vectors), for which high alignment with human semantic similarity judgments has been reported (Brans and Bloem, 2026). To avoid confounding effects of part-of-speech on our semantic alignment measure, we run analyses separately within part-of-speech category subsets, and to avoid confounding effects of word identity, we exclude distances between samples of the same word. Further details on feature spaces and analysis subsets are included in Appendix B; Table S7.
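The RSA score described above can be sketched in a few lines of NumPy. The function names and the optional `words` argument (used to drop pairs of samples of the same word) are hypothetical, and the toy data only serves to show the expected behaviour; this is not the analysis code of the study.

```python
import numpy as np


def cosine_dissimilarity(x):
    """Pairwise cosine dissimilarity (1 - cosine similarity) between rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def rsa_score(model_emb, feature_emb, words=None):
    """Pearson's r between two cosine-dissimilarity spaces.

    Only the upper triangle is used, so each pair of samples counts once.
    If `words` is given, pairs of samples of the same word are excluded.
    """
    rows, cols = np.triu_indices(len(model_emb), k=1)
    keep = np.ones(len(rows), dtype=bool)
    if words is not None:
        words = np.asarray(words)
        keep = words[rows] != words[cols]
    d_model = cosine_dissimilarity(model_emb)[rows, cols][keep]
    d_feature = cosine_dissimilarity(feature_emb)[rows, cols][keep]
    return float(np.corrcoef(d_model, d_feature)[0, 1])


rng = np.random.default_rng(0)
feature_emb = rng.normal(size=(20, 8))

# A rotated copy preserves all cosine distances, so alignment is (near) perfect.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
aligned = rsa_score(feature_emb @ rotation, feature_emb)

# Unrelated random embeddings should align far less well.
unrelated = rsa_score(rng.normal(size=(20, 8)), feature_emb)

# Excluding same-word pairs: here samples 0 and 1 are tokens of the same word.
labels = ["huis", "huis"] + [f"w{i}" for i in range(18)]
filtered = rsa_score(feature_emb @ rotation, feature_emb, words=labels)
```

Since RSA compares distance structure rather than coordinates, it needs no trained mapping between the two spaces and works even when their dimensionalities differ.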

To evaluate model-internal disambiguation between homophones, i.e., word pairs with identical pronunciations but different meanings, we construct an ABX task (Schatz, 2016), where word triplets A, B and X are identical in pronunciation (i.e. in their phonetic transcription), but only A and X are also identical in meaning (i.e. in their orthographic transcription). Many Dutch homophone pairs, including the ones in our sample, are pairs of present- vs. past-tense verb forms (e.g. [ambəlɑndə] aanbelanden / aanbelandden) or singular vs. plural verb or noun forms (e.g. [briɣadə] brigade / brigaden). Our measure of homophone disambiguation is the binary ABX accuracy on each triplet (1 if AX-similarity > AB-similarity, 0 otherwise); chance accuracy is 0.5.
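The ABX decision rule can be illustrated with the short sketch below. It follows the comparison exactly as worded above (AX-similarity against AB-similarity); the use of cosine similarity and all function names are assumptions, since the similarity measure is not specified in this section.

```python
import numpy as np


def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def abx_correct(a, b, x):
    """1 if A is more similar to X than to B, else 0 (cosine is an assumption)."""
    return int(cosine_similarity(a, x) > cosine_similarity(a, b))


def abx_accuracy(triplets):
    """Mean binary ABX outcome over (A, B, X) embedding triplets."""
    return float(np.mean([abx_correct(a, b, x) for a, b, x in triplets]))


# Toy 2-d embeddings: `same` shares the meaning of `anchor`, `other` does not.
anchor = np.array([1.0, 0.0])
same = np.array([0.9, 0.1])
other = np.array([0.0, 1.0])

hit = abx_correct(anchor, other, same)    # X = same meaning, B = other meaning
miss = abx_correct(anchor, same, other)   # roles swapped, so the rule fails
accuracy = abx_accuracy([(anchor, other, same), (anchor, same, other)])
```

Because all three items share a pronunciation, above-chance accuracy has to come from information beyond the phone sequence, such as context carried in the pooled frames.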

Finally, we use a structural probe paradigm (Hewitt and Manning, 2019) to investigate how well model-internal representations encode syntactic dependency structure between words. Structural probes were originally introduced for text-based models, and evaluate to what extent linear projections from model embedding space can be trained to reflect distances between words in a sentence that align with distances in that sentence’s syntactic dependency parse. We closely follow the original implementation of the structural probe by Hewitt and Manning (2019), replacing the token-level hidden-state activations with our word-level speech model embeddings. Encoding of syntactic structure is evaluated by the Undirected Unlabeled Attachment Score (UUAS; the percentage of correctly placed undirected edges), computed between dependency structures reconstructed from speech model representations, and the set of spaCy-annotated true parses. We confirm that our UUAS measurements reflect some degree of syntactic and not just linear-sequential structure, by comparing results for sequential dependency structures where each word is only linked to its sequential neighbors instead of its syntactic dependants (Appendix B, Figure S3).
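To make the UUAS evaluation concrete, the sketch below reconstructs an undirected tree from a matrix of predicted word distances with a minimum spanning tree and compares its edges to a gold parse. It covers only the scoring step for one sentence, returns a fraction rather than a percentage, and assumes sentences of at least two words; training the linear probe that produces the distances is not shown, and the function names and the toy sentence are hypothetical.

```python
def mst_edges(dist):
    """Undirected minimum spanning tree over words (Prim's algorithm)."""
    n = len(dist)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        # Cheapest edge connecting the current tree to a word outside it.
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.add(frozenset((i, j)))
        in_tree.add(j)
    return edges


def uuas(pred_dist, gold_edges):
    """Fraction of gold undirected edges recovered by the predicted tree."""
    gold = {frozenset(e) for e in gold_edges}
    return len(mst_edges(pred_dist) & gold) / len(gold)


# Toy 4-word sentence whose gold parse links word 1 to words 0, 2 and 3.
gold_edges = [(0, 1), (1, 2), (1, 3)]

# Distances equal to path lengths in the gold tree recover it exactly.
tree_dist = [
    [0, 1, 2, 2],
    [1, 0, 1, 1],
    [2, 1, 0, 2],
    [2, 1, 2, 0],
]
perfect = uuas(tree_dist, gold_edges)

# Purely sequential distances |i - j| link neighbours only.
sequential_dist = [[abs(i - j) for j in range(4)] for i in range(4)]
sequential = uuas(sequential_dist, gold_edges)
```

The sequential case shows why a neighbour-only control is informative: many dependency edges connect adjacent words, so a representation that encodes word order alone already obtains a non-trivial UUAS.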

Variants of all our probing techniques have been applied for interpreting speech model representations before (e.g. LDA and RSA in Bentum et al., 2025; Sauter et al., 2026; lexical ABX in Algayres et al., 2022; structural probing in Dugonjić et al., 2024; Orhan et al., 2026). All techniques except RSA and ABX involve optimizing auxiliary models to decode target information from model representations; in these cases we evaluate on held-out test sets, designing train-test splits to avoid confounds, and using 5-fold cross-validation. For other cases we simply repeat analyses over 5 folds of data to get an estimate of the variance in scores. We include more details on train-test splits and other probing hyperparameters in Appendix B.

4 Results

4.1 Where are different levels of linguistic structure encoded?

We start by examining the layerwise results of our analyses, including only model checkpoints saved at 100K training steps. We verify that our analysis scores capture meaningful layerwise differences in model-internal linguistic structuring: Figure 2 shows qualitative examples for two probing techniques and accompanying layerwise scores computed over all test samples. Figure 3 shows the layerwise probe scores for three models representing the variation in training objectives (results for the other three models are very similar, and included in Figure S2). We generally observe that all analyzed levels of Dutch linguistic structure are better represented in the models trained on Dutch speech as compared to the nonspeech baseline model.

For Wav2Vec2 and HuBERT-I1, earlier model layers show a sequential pattern of peaks at the level of acoustic, phonetic, and syllabic structure, while lexical and syntactic information is mostly concentrated in the same layer (T7). The relative richness of middle model layers aligns with observations in other, English-trained S3Ms, with the drop at final model layers indicating these layers’ specialization for acoustically oriented training objectives (e.g. Huo and Dunbar, 2025).

HuBERT-I2 shows a markedly different pattern: training on higher-layer I1 pseudo-labels drives the model’s final layers to maintain the higher-level linguistic structures encoded in I1’s sixth-layer representations. Interestingly, more abstract structures (beyond word form) are still represented, but now peak before the phonetic, syllable and word form peaks, dropping off again in the final layer. We further note that for all linguistic probes beyond acoustics, HuBERT-I2’s peak scores exceed those of HuBERT-I1, even though HuBERT-I1’s sixth layer representations are used to derive the prediction targets for HuBERT-I2. This indicates that the higher-level pseudo-labels employed for HuBERT-I2 training (extracted from hidden layers which abstract away from local acoustics, rather than from MFCCs), do stimulate the learning of higher-level linguistic structures, even when such structures are imperfectly encoded in the I1 representations used for pseudo-label extraction (e.g. for the part-of-speech and syntactic structure probes, speech-trained HuBERT-I1 does not outperform the nonspeech baseline model at layer T6).

4.2 When do different levels of linguistic structure become encoded?

The learning trajectories (i.e. the evolution of probe scores across training checkpoints) for all structures and one Wav2Vec2 model are visualized in Figure 4. Observing the general pattern across probes, we note that the encoding of most linguistic structures starts increasing right from the start of model training. The earliest stages of model training seem characterized by learning generally useful representations of speech acoustics: acoustic alignment scores are the first to reach their maximum performance across training, showing a sharp increase in the first 10k training steps, and remaining relatively stable afterwards. Around 10k steps, most linguistic probe results for the speech-trained model start outperforming those of the nonspeech baseline. The syntactic probes (analyzing the encoding of POS categories and syntactic dependency structures) form an exception to this rule, with speech-trained model scores only starting to outperform the nonspeech model around 25k and 50k steps, respectively.

In the right column of Figure 4, we visualize the development of layerwise patterns across model training (darker colors indicate higher scores). Here, we mainly observe that layerwise peak patterns are relatively stable across model training, emerging between 10k-50k steps and showing no major shifts throughout further training.

To compare the relative order in which linguistic structures at different levels become encoded, we fit parametric curves through the observed probe scores across all analysis checkpoints (we fit the sigmoid function $f(x)=\frac{a}{1+e^{-k(x-b)}}+c$, using the curve_fit method from the scipy library (Virtanen et al., 2020) to optimize parameters $a$, $b$, $c$ and $k$). Fitted curves for the Wav2Vec2 model in Figure 4 are visualized as lines through the score datapoints obtained for each checkpoint. Obtaining the learning curves for all probes across models allows us to comprehensively visualize the relative differences between all six models and nine levels of structure (Figure 5). We here observe that the relative differences between learning curves for different probes are generally consistent between different seeds of the same model. The replication of relative learning trajectory patterns between model seeds indicates that distinct levels of structure indeed consistently follow different trajectories. We also observe differences between architectural variations: for Wav2Vec2 and HuBERT-I1, learning trajectories show a similar ordering, with acoustic alignment scores reaching their maximum performance first, followed by a group of phone-, syllable- and word-level measures, and finally the part-of-speech and syntactic probes. Conversely, the learning curves of HuBERT-I2 show greater parallelism, diverging from the other models and echoing observed differences in layerwise patterns. The combined pattern of results suggests that HuBERT-I2’s divergent behaviour is not a result of its training objective involving the prediction of categorical labels (which is shared with HuBERT-I1), but rather of the iterative refinement of prediction targets in the HuBERT training process. This complements earlier findings investigating layerwise differences between Wav2Vec2 and HuBERT (Huo and Dunbar, 2025), showing that these differences originate in HuBERT-I2’s training dynamics. Through pseudo-labels, HuBERT-I2 training is guided by highly compressed information from I1 representations. I1 hidden layers abstract away from acoustics and exhibit some degree of linguistic structuring (Figure 3), and it is the linguistic structures best encoded in those I1 (layer T6) representations which first reach their maximum performance across I2’s training (the phone- and syllable-level scores).
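The sigmoid fit used for the learning curves, $f(x)=\frac{a}{1+e^{-k(x-b)}}+c$ optimized with scipy's curve_fit, can be sketched as below. Only the functional form and the use of curve_fit come from the text; the rescaling of training steps to the unit interval, the initial parameter guesses and the synthetic scores are assumptions added to keep the example well-conditioned and self-contained.

```python
import numpy as np
from scipy.optimize import curve_fit


def sigmoid(x, a, b, c, k):
    """f(x) = a / (1 + exp(-k (x - b))) + c"""
    return a / (1.0 + np.exp(-k * (x - b))) + c


def fit_learning_curve(steps, scores):
    """Fit the sigmoid to probe scores; returns (a, b, c, k) in step units."""
    steps = np.asarray(steps, dtype=float)
    scores = np.asarray(scores, dtype=float)
    scale = steps.max()
    x = steps / scale  # rescaling is an assumption, for numerical conditioning
    midpoint = (scores.min() + scores.max()) / 2.0
    p0 = [
        scores.max() - scores.min(),               # a: height of the rise
        x[np.argmin(np.abs(scores - midpoint))],   # b: step at half height
        scores.min(),                              # c: starting level
        10.0,                                      # k: steepness (guess)
    ]
    (a, b, c, k), _ = curve_fit(sigmoid, x, scores, p0=p0, maxfev=10_000)
    return a, b * scale, c, k / scale  # convert b and k back to step units


# Synthetic trajectory: scores rise from 0.1 to 0.7, half-way at 20K steps.
steps = np.arange(1_000, 100_001, 1_000)
scores = sigmoid(steps, 0.6, 20_000, 0.1, 2e-4)

a_fit, b_fit, c_fit, k_fit = fit_learning_curve(steps, scores)
```

In this parameterization $b$ is the training step at which a probe score has completed half of its rise, which gives one natural way to order structures by when they emerge, while $k$ describes how abrupt the transition is.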

Most probe scores across models appear to reach a ceiling before 100K training steps (Figure 4, Figure 5). Can additional training further improve the encoding of linguistic structures, or do these scores indeed approximate the maximum obtainable performance for speech representations optimized with SSL objectives? To investigate this, we further trained the Wav2Vec2 and HuBERT-I2 models up to 200K steps, and measured probe performance on the final 200K checkpoint for each. In Figure 6 we visualize the results for both Wav2Vec2 models and one HuBERT-I2 model (we here exclude HuBERT-I2 seed 1, because it showed instabilities in training beyond 100K steps; see Appendix A).

Across models and probes, score improvements are generally minimal after further training to 200K steps. Only the structural probe, measuring the encoding of syntactic dependencies, shows consistent improvement across models. It is possible that training beyond 100K steps mainly improves the contextualization of model representations beyond the word level, and that other sentence-level information not analyzed here would also show continued improvement (e.g. semantic sentence similarities; Merkx et al., 2021). The empirical limits of decoding such information from S3Ms remain to be explored; Orhan et al. (2026) report that dependency structures extracted from a Wav2Vec2 model (trained on 900 hours of speech up to 400K steps) approximate those from text-based models in accuracy, demonstrating that S3M representations can indeed learn to encode rich syntactic structures after only SSL training.

5 Discussion & Conclusions

Do S3M layerwise representations and learning dynamics reflect the hierarchical organization of linguistic information in speech? We here developed and applied an analysis suite of representational probes to track the encoding of nine levels of linguistic structure across model-internal layers and intermediate training checkpoints.

The observed layerwise patterns and learning trajectories indeed reflect a hierarchy among linguistic structures, though perhaps mostly dominated by timescale: lexical properties related to form, meaning and grammar jointly peak in the same model layers (Figure 3), and learning of more abstract word properties (semantics, homophone disambiguation) sometimes precedes word form (Figure 5). Timescale and abstraction levels are often related: while syntactic structures are arguably more abstract, their late development might also reflect that the contextualization of model representations beyond the word timescale only starts later in training, as accurate encoding of both POS and dependency structures requires the integration of some beyond-word context.

Learning dynamics of text-based language models reflect successive stages where token probabilities shift from mimicking unigram to higher n-gram statistics over the course of model training (Chang and Bergen, 2022; Michaelov et al., 2025; Jumelet et al., 2026). Speech-based models face an additional challenge in first needing to learn what segments of the continuous speech signal in fact constitute the relevant linguistic units to keep track of. Syntactic information consistently emerges last in training (Figure 5); this could indicate that models indeed first need to develop reasonable capability for distinguishing linguistic units (e.g. word forms) before combinatorial structures relating such units can become encoded. However, this hypothesis should ideally be tested in causal experiments (as performed by Chen et al., 2024 for syntactic learning in text-based models); such experiments could for example consist of suppressing the encoding of word form across training and observing its effects on the emergence of syntactic structure. The results we report here nevertheless contribute observational evidence that motivates such future work.

The higher-level prediction objective of HuBERT-I2 increases parallelism in learning and also substantially affects its layerwise representations: multiple levels of linguistic structure end up jointly encoded in higher layers (Figure 3). Interestingly, the encoding of semantic and syntactic information drops off in HuBERT-I2’s final layers, while the encoding of phone, syllable and word forms remains high. Neuroscience research on human speech processing has shown that the brain keeps track of linguistic information across multiple hierarchical levels simultaneously, such that higher-level information can potentially inform lower-level representations (Heilbron et al., 2022; Gwilliams et al., 2025). Whether the encoding of lower-level linguistic units in HuBERT-I2’s final layers similarly benefits from higher-level information represented in earlier layers remains to be explored. From an engineering perspective, the increased linguistic specialization (and degraded acoustic encoding) of representations in models like HuBERT is not necessarily beneficial for all downstream tasks, as noted in recent work developing novel SSL paradigms for the goal of spoken language modeling (Poli et al., 2025).

We note some limitations to the analyses presented here. While layerwise similarity to semantic text embeddings generally aligns with other metrics of semantic content (Pasad et al., 2021, 2024; Sauter et al., 2026) and we took measures to avoid analysis confounds, it is possible that other non-semantic (sub)word information encoded by the Fasttext model affects our measurements. Similarly, homophones with identical phonetic transcriptions may nevertheless exhibit acoustic differences in pronunciation (Seyfarth et al., 2018). Hence, the obtained semantic alignment and homophone disambiguation scores might reflect less abstract, form-related aspects of words beyond meaning-related information. Furthermore, although existing work has shown that language-specific pre-training benefits the encoding of linguistic information in the training language (Poli et al., 2024; de Heer Kloots et al., 2025; Orhan et al., 2026), we did not explore whether training on Dutch speech specifically (vs. another language) is necessary for all findings observed here.

More detailed studies of linguistic representation learning in speech SSL models, as well as its causal links to model behavior, could potentially inform the development of speech technology as well as (psycho)linguistic theory. We hope that the analyses we presented here contribute a useful starting point for such investigations.

6 Acknowledgements

This work used the Dutch national e-infrastructure with the support of the SURF Cooperative using grants no. EINF-8324 and EINF-15179.
MdHK is funded by the Netherlands Organization for Scientific Research (NWO), through Gravitation Grant 024.001.006 to the Language in Interaction Consortium; MB, HM, CP, GS are funded through NWA-ORC grant NWA.1292.19.399 for ‘InDeep’. We thank three anonymous reviewers for helpful feedback on an earlier version of this manuscript.

References

R. Algayres, T. Ricoul, J. Karadayi, H. Laurençon, S. Zaiem, A. Mohamed, B. Sagot, and E. Dupoux (2022) DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon. Transactions of the Association for Computational Linguistics 10, pp. 1051–1065. External Links: ISSN 2307-387X, Link, Document Cited by: §3.2.
R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber (2020) Common Voice: A Massively-Multilingual Speech Corpus. External Links: 1912.06670 Cited by: Table S1, §3.1.
I. Arnon (2021) The Starting Big approach to language learning. Journal of Child Language 48 (5), pp. 937–958 (English). External Links: ISSN 03050009, Link, Document Cited by: §1.
S. Arora, K. Chang, C. Chien, Y. Peng, H. Wu, Y. Adi, E. Dupoux, H. Lee, K. Livescu, and S. Watanabe (2025) On The Landscape of Spoken Language Models: A Comprehensive Survey. Transactions on Machine Learning Research (en). External Links: ISSN 2835-8856, Link Cited by: §1.
R. H. Baayen, R. Piepenbrock, and L. Gulikers (1996) The CELEX Lexical Database. (eng). Note: Publisher: University of Pennsylvania External Links: Link Cited by: §3.2.
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Advances in Neural Information Processing Systems, Vol. 33, pp. 12449–12460. External Links: Link Cited by: §1, §3.1.
M. Bentum, L. ten Bosch, and T. O. Lentz (2025) Word stress in self-supervised speech models: A cross-linguistic comparison. pp. 251–255. External Links: Link, Document Cited by: §3.2.
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. External Links: ISSN 2307-387X, Link, Document Cited by: Table S7.
L. Brans and J. Bloem (2026) Multi-SimLex for Dutch: Benchmarking Embedding- and Prompt-Based Model Performance on Semantic Similarity. In LREC 2026, Cited by: §3.2.
M. Brysbaert, M. Stevens, S. De Deyne, W. Voorspoels, and G. Storms (2014) Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica 150, pp. 80–84. External Links: ISSN 0001-6918, Link, Document Cited by: Table S3.
T. A. Chang and B. K. Bergen (2022) Word Acquisition in Neural Language Models. Transactions of the Association for Computational Linguistics 10, pp. 1–16. External Links: ISSN 2307-387X, Link, Document Cited by: §2.2, §5.
A. Chen, R. Shwartz-Ziv, K. Cho, M. L. Leavitt, and N. Saphra (2024) Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs. In ICLR, External Links: Link Cited by: §2.2, §5.
S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei (2022) WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE Journal of Selected Topics in Signal Processing 16 (6), pp. 1505–1518. External Links: ISSN 1941-0484, Link, Document Cited by: §1.
K. Choi, A. Pasad, T. Nakamura, S. Fukayama, K. Livescu, and S. Watanabe (2024) Self-Supervised Speech Representations are More Phonetic than Semantic. pp. 4578–4582. External Links: Link, Document Cited by: §3.2.
K. Choi, E. Yeo, K. Chang, S. Watanabe, and D. R. Mortensen (2025) Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, pp. 2613–2628. External Links: ISBN 979-8-89176-189-6, Link, Document Cited by: §2.1.
G. Chrupała, B. Higy, and A. Alishahi (2020) Analyzing analytical methods: the case of phonology in neural models of spoken language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 4146–4156. External Links: Link, Document Cited by: §3.2.
M. de Heer Kloots, H. Mohebbi, C. Pouw, G. Shen, W. Zuidema, and M. Bentum (2025) What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. In Proc. Interspeech, pp. 256–260. External Links: Link, Document Cited by: §2.1, §5.
M. de Heer Kloots and W. Zuidema (2024) Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0. In Proc. Interspeech, pp. 4593–4597. External Links: Document Cited by: §2.1.
A. De La Fuente and D. Jurafsky (2024) A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models. In Interspeech 2024, pp. 1290–1294 (en). External Links: Link, Document Cited by: §2.1.
Z. Dugonjić, A. Pupier, B. Lecouteux, and M. Coavoux (2024) What has LeBenchmark learnt about French syntax?. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue (Eds.), Torino, Italia, pp. 17493–17499. External Links: Link Cited by: §2.1, §3.2.
E. Dupoux (2018) Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition 173, pp. 43–59 (en). External Links: ISSN 00100277, Link, Document Cited by: §1.
J. L. Elman and J. L. McClelland (1988) Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language 27 (2), pp. 143–165. External Links: ISSN 0749-596X, Link, Document Cited by: §1.
L. Evanson, Y. Lakretz, and J. R. King (2023) Language acquisition: do children and language models follow similar learning stages?. In Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 12205–12218. External Links: Link, Document Cited by: §2.2.
N. H. Feldman, T. L. Griffiths, S. Goldwater, and J. L. Morgan (2013) A role for the developing lexicon in phonetic category acquisition. Psychological Review 120 (4), pp. 751–778. External Links: ISSN 1939-1471, Document Cited by: §1.
A. D. Friederici (2002) Towards a neural basis of auditory sentence processing. Trends in Cognitive Sciences 6 (2), pp. 78–84 (English). Note: Publisher: Elsevier External Links: ISSN 1364-6613, 1879-307X, Link, Document Cited by: §1.
J. Gauthier, C. Breiss, M. K. Leonard, and E. F. Chang (2025) Emergent morpho-phonological representations in self-supervised speech models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 28067–28086. External Links: ISBN 979-8-89176-332-6, Link, Document Cited by: §2.1.
L. Gleitman (1990) The Structural Sources of Verb Meanings. Language Acquisition 1 (1), pp. 3–55. External Links: ISSN 1048-9223, Link, Document Cited by: §1.
L. Gwilliams, A. Marantz, D. Poeppel, and J. King (2025) Hierarchical dynamic coding coordinates speech comprehension in the human brain. bioRxiv (en). External Links: Link, Document Cited by: §5.
L. He, Q. Wang, X. Jiang, and N. Mesgarani (2025) Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 35338–35353. External Links: ISBN 979-8-89176-332-6, Link, Document Cited by: §2.1.
S. Heald, S. Klos, and H. Nusbaum (2016) Understanding Speech in the Context of Variability. In Neurobiology of Language, pp. 195–208 (en-US). External Links: Link, Document Cited by: §1.
M. Heilbron, K. Armeni, J. Schoffelen, P. Hagoort, and F. P. de Lange (2022) A hierarchy of linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences 119 (32), pp. e2201968119. Note: Publisher: Proceedings of the National Academy of Sciences External Links: Link, Document Cited by: §5.
J. Hewitt and C. D. Manning (2019) A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4129–4138. External Links: Link, Document Cited by: §3.2.
M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd (2020) spaCy: Industrial-strength Natural Language Processing in Python. External Links: Document Cited by: §3.2.
W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021) HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 29, pp. 3451–3460. External Links: ISSN 2329-9290, Link, Document Cited by: §1, §3.1.
R. Huo and E. Dunbar (2025) Iterative Refinement, Not Training Objective, Makes HuBERT Behave Differently from wav2vec 2.0. pp. 261–265. External Links: Link, Document Cited by: §3.1, §4.1, §4.2.
D. Hupkes, S. Veldhoen, and W. Zuidema (2018) Visualisation and ’Diagnostic Classifiers’ Reveal How Recurrent and Recursive Neural Networks Process Hierarchical Structure. Journal of Artificial Intelligence Research 61, pp. 907–926. External Links: ISSN 1076-9757, Link, Document Cited by: §3.2.
J. Jumelet, L. Bylinina, W. Zuidema, and J. Szymanik (2026) Black Big Boxes: Tracing Adjective Order Preferences in Large Language Models. arXiv. Note: arXiv:2407.02136 [cs] External Links: Link, Document Cited by: §5.
E. Keuleers, M. Brysbaert, and B. New (2010) SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods 42 (3), pp. 643–650 (en). External Links: ISSN 1554-3528, Link, Document Cited by: Table S3, Table S5.
K. Khorrami, M. A. Cruz Blandón, and O. Räsänen (2023) Computational Insights to Acquisition of Phonemes, Words, and Word Meanings in Early Language: Sequential or Parallel Acquisition?. Proceedings of the Annual Meeting of the Cognitive Science Society 45 (45) (en). External Links: Link Cited by: §2.2.
N. Kriegeskorte, M. Mur, and P. A. Bandettini (2008) Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2 (4), pp. 1–28 (English). External Links: ISSN 1662-5137, Link, Document Cited by: §3.2.
P. K. Kuhl (2000) A new view of language acquisition. Proceedings of the National Academy of Sciences 97 (22), pp. 11850–11857. Note: Publisher: Proceedings of the National Academy of Sciences External Links: Link, Document Cited by: §1.
P. K. Kuhl (2004) Early language acquisition: cracking the speech code. Nature Reviews Neuroscience 5 (11), pp. 831–843 (en). External Links: ISSN 1471-0048, Link, Document Cited by: §1.
M. Lavechin, M. de Seyssel, H. Titeux, G. Wisniewski, H. Bredin, A. Cristia, and E. Dupoux (2025) Simulating Early Phonetic and Word Learning Without Linguistic Categories. Developmental Science 28 (2). External Links: ISSN 1363-755X, Link, Document Cited by: §1, §2.2.
J. S. Magnuson, D. Mirman, S. Luthra, T. Strauss, and H. D. Harris (2018) Interaction in spoken word recognition models: feedback helps. Frontiers in psychology 9, pp. 369. External Links: Link Cited by: §1.
W. D. Marslen-Wilson (1975) Sentence Perception as an Interactive Parallel Process. Science (EN). Note: Publisher: American Association for the Advancement of Science External Links: Link Cited by: §1.
K. Martin, J. Gauthier, C. Breiss, and R. Levy (2023) Probing Self-supervised Speech Models for Phonetic and Phonemic Information: A Case Study in Aspiration. In INTERSPEECH 2023, pp. 251–255 (en). External Links: Link, Document Cited by: §1, §2.1.
J. L. McClelland and J. L. Elman (1986) The TRACE model of speech perception. Cognitive Psychology 18 (1), pp. 1–86. External Links: ISSN 0010-0285, Link, Document Cited by: §2.2.
B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg, and O. Nieto (2015) Librosa: Audio and Music Signal Analysis in Python. Austin, Texas, pp. 18–24 (en). External Links: Link, Document Cited by: Table S7.
D. Merkx, S. L. Frank, and M. Ernestus (2021) Semantic Sentence Similarity: Size does not Always Matter. pp. 4393–4397. External Links: Link, Document Cited by: §4.2.
J. A. Michaelov, R. P. Levy, and B. Bergen (2025) Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale. (en). External Links: Link Cited by: §5.
T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin (2018) Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Cited by: Table S7.
J. Millet and E. Dunbar (2022) Do self-supervised speech models develop human-like perception biases?. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 7591–7605. External Links: Link, Document Cited by: §2.1.
A. Mohamed, H. Lee, L. Borgholt, J. D. Havtorn, J. Edin, C. Igel, K. Kirchhoff, S. Li, K. Livescu, L. Maaløe, T. N. Sainath, and S. Watanabe (2022) Self-Supervised Speech Representation Learning: A Review. IEEE Journal of Selected Topics in Signal Processing 16 (6), pp. 1179–1210. Note: Conference Name: IEEE Journal of Selected Topics in Signal Processing External Links: ISSN 1941-0484, Document Cited by: §1.
M. Mohamed, O. D. Liu, H. Tang, and S. Goldwater (2024) Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations. pp. 3625–3629. External Links: Link, Document Cited by: §2.1.
D. Norris, J. M. McQueen, and A. Cutler (2016) Prediction, bayesian inference and feedback in speech recognition. Language, cognition and neuroscience 31 (1), pp. 4–18. External Links: Link Cited by: §1.
Omnilingual ASR Team, G. Keren, A. Kozhevnikov, Y. Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu, K. Chan, C. Cheng, J. Chuang, C. Droof, M. Duppenthaler, P. Duquenne, A. Erben, C. Gao, G. M. Gonzalez, K. Lyu, S. Miglani, V. Pratap, K. R. Sadagopan, S. Saleem, A. Turkatenko, A. Ventayol-Boada, Z. Yong, Y. Chung, J. Maillard, R. Moritz, A. Mourachko, M. Williamson, and S. Yates (2025) Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages. External Links: Link Cited by: §1.
P. Orhan, Y. Boubenec, and J. King (2025) The detection of algebraic auditory structures emerges with self-supervised learning. PLOS Computational Biology 21 (9), pp. e1013271 (en). External Links: ISSN 1553-7358, Link, Document Cited by: §3.1.
P. Orhan, P. Diego-Simón, E. Chemla, Y. Lakretz, Y. Boubenec, and J. King (2026) Emergence of Phonemic, Syntactic, and Semantic Representations in Artificial Neural Networks. arXiv. External Links: Link, Document Cited by: §2.2, §3.2, §4.2, §5.
M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: A Fast, Extensible Toolkit for Sequence Modeling. Minneapolis, Minnesota, pp. 48–53. External Links: Document Cited by: §3.1.
A. Pasad, C. Chien, S. Settle, and K. Livescu (2024) What Do Self-Supervised Speech Models Know About Words?. TACL 12, pp. 372–391. External Links: ISSN 2307-387X, Document Cited by: §1, §2.1, §3.2, §5.
A. Pasad, J. Chou, and K. Livescu (2021) Layer-Wise Analysis of a Self-Supervised Speech Representation Model. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 914–921. External Links: Document Cited by: §2.1, §5.
A. Pasad, B. Shi, and K. Livescu (2023) Comparative Layer-Wise Analysis of Self-Supervised Speech Models. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. Note: ISSN: 2379-190X External Links: Link, Document Cited by: §2.1.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: Table S2.
J. S. Perkell and D. H. Klatt (1986) Invariance and Variability in Speech Processes. Psychology Press (en). External Links: ISBN 978-1-317-76829-6 Cited by: §1.
M. Poli, M. Luthra, Y. Benchekroun, Y. Higuchi, M. Gleize, J. Shen, R. Algayres, Y. Chung, M. Assran, J. Pino, and E. Dupoux (2025) SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision. Transactions on Machine Learning Research (en). External Links: ISSN 2835-8856, Link Cited by: §1, §5.
M. Poli, T. Schatz, E. Dupoux, and M. Lavechin (2024) Modeling the initial state of early phonetic learning in infants. Language Development Research 5 (1) (eng). External Links: ISSN 2771-7976, Link, Document Cited by: §2.2, §5.
V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert (2020) MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech, pp. 2757–2761. External Links: 2012.03411, Document Cited by: Table S1, §3.1.
T. T. Rogers and J. L. McClelland (2004) Semantic Cognition: A Parallel Distributed Processing Approach. MIT Press (en). External Links: ISBN 978-0-262-18239-3 Cited by: §2.2.
N. Roll, P. Bhalerao, M. Bartelds, A. Pawar, Y. Tatsumi, T. Ogunremi, C. Shani, C. Graham, M. Sumner, and D. Jurafsky (2026) Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition. arXiv. External Links: Link, Document Cited by: §2.1.
D. E. Rumelhart and J. L. McClelland (1986) On Learning the Past Tenses of English Verbs. In Parallel Distributed Processing, Volume 2: Explorations in the Microstructure of Cognition: Psychological and Biological Models, J.L. McClelland, D.E. Rumelhart, and P. R. Group (Eds.), External Links: Link Cited by: §2.2.
N. Saphra and A. Lopez (2019) Understanding Learning Dynamics Of Language Models with SVCCA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 3257–3267. External Links: Link, Document Cited by: §3.2.
A. Sauter, W. Zuidema, and M. d. H. Kloots (2026) The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), External Links: Link, Document Cited by: §3.2, §5.
T. Schatz, N. H. Feldman, S. Goldwater, X. Cao, and E. Dupoux (2021) Early phonetic learning without phonetic categories: Insights from large-scale simulations on realistic input. Proceedings of the National Academy of Sciences 118 (7), pp. e2001844118. External Links: Link, Document Cited by: §1.
T. Schatz (2016) ABX-Discriminability Measures and Applications. PhD dissertation, Université Paris 6 (UPMC), (en). External Links: Link Cited by: §3.2.
I. Schuurman, M. Schouppe, H. Hoekstra, and T. van der Wouden (2003) CGN, an annotated corpus of spoken Dutch. In Proceedings of 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) at EACL 2003, External Links: Link Cited by: Table S1, §3.1.
S. Seyfarth, M. Garellek, G. Gillingham, F. Ackerman, and R. Malouf (2018) Acoustic differences in morphologically-distinct homophones. Language, Cognition and Neuroscience 33 (1), pp. 32–49. External Links: ISSN 2327-3798, Link, Document Cited by: §5.
G. Shen, A. Alishahi, A. Bisazza, and G. Chrupała (2023) Wave to Syntax: Probing spoken language models for syntax. In Proc. Interspeech, pp. 1259–1263. External Links: Link, Document Cited by: §1, §2.1.
G. Shen, M. Watkins, A. Alishahi, A. Bisazza, and G. Chrupała (2024) Encoding of lexical tone in self-supervised models of spoken language. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 4250–4261. External Links: Document Cited by: §2.1.
A. Tharwat, T. Gaber, A. Ibrahim, and A. E. Hassanien (2017) Linear discriminant analysis: A detailed tutorial. AI Communications 30 (2), pp. 169–190 (EN). External Links: ISSN 0921-7126, Link, Document Cited by: §3.2.
M. Tomasello (2003) Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press (en). External Links: ISBN 978-0-674-01764-1 Cited by: §1.
N. Vaessen, R. Ordelman, and D. A. van Leeuwen (2025) Self-supervised learning of speech representations with Dutch archival data. pp. 1208–1212. External Links: Link, Document Cited by: Table S1.
R. van Son, W. Wesseling, E. Sanders, and H. van den Heuvel (2008) The IFADV Corpus: a Free Dialog Video Corpus. In LREC 2008, N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, and D. Tapias (Eds.), Cited by: Table S1.
P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, pp. 261–272. External Links: Document Cited by: footnote 5.
A. Weber and O. Scharenborg (2012) Models of spoken‐word recognition. WIREs Cognitive Science 3 (3), pp. 387–401 (en). External Links: ISSN 1939-5078, 1939-5086, Link, Document Cited by: §1.
J. F. Werker and R. C. Tees (1999) Influences on infant speech processing: Toward a New Synthesis. Annual Review of Psychology. External Links: Link, Document Cited by: §1.
J. F. Werker (2018) Perceptual beginnings to language acquisition. Applied Psycholinguistics 39 (4), pp. 703–728 (en). External Links: ISSN 0142-7164, 1469-1817, Link, Document Cited by: §1.
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. v. Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv. External Links: Link, Document Cited by: Table S7, footnote 3.
W. Zuidema (2009) A syllable frequency list for Dutch. ILLC Preprint Series (PP-2009-50) (en). External Links: Link Cited by: Table S9.

Appendix A Model validation

We validated that capable speech SSL models were trained by observing that the pre-training losses for each model showed a stable decrease over training (train and validation loss curves are visualized in Figure S1). Additionally, we finetuned a sample of intermediate checkpoints of the Wav2Vec2 and HuBERT-I2 models on Dutch automatic speech recognition (Table S1), confirming that word-error rates are consistently lower for checkpoints later in training, and validating that the trained models also achieve reasonable downstream task performance. We note that results presented in our main text are obtained on checkpoints pre-trained only using SSL objectives, and do not include models fine-tuned for ASR. In further training up to 200K checkpoints, we noted instabilities and degraded downstream performance for HuBERT-I2 seed 1; we therefore exclude this model from our analyses on the 200K checkpoints (Figure 6).

Test set	Step	HuBERT-I2 (1)	HuBERT-I2 (2)	Wav2Vec2 (1)	Wav2Vec2 (2)
IFADV (dialogues)	100	0.97	0.98	0.97	0.97
	1000	0.98	0.97	0.98	0.98
	10000	0.86	0.86	0.91	0.91
	100000	0.67	0.65	0.71	0.70
	200000	0.68	0.66	0.69	0.69
NBest (benchmark)	100	0.83	0.85	0.83	0.82
	1000	0.83	0.83	0.84	0.85
	10000	0.62	0.60	0.68	0.69
	100000	0.26	0.25	0.32	0.32
	200000	0.27	0.26	0.30	0.29
CV (read aloud sentences)	100	0.85	0.86	0.86	0.85
	1000	0.85	0.85	0.89	0.89
	10000	0.65	0.62	0.72	0.72
	100000	0.22	0.22	0.30	0.30
	200000	0.23	0.21	0.28	0.27
MLS (audiobooks)	100	0.67	0.68	0.67	0.67
	1000	0.68	0.68	0.70	0.72
	10000	0.40	0.39	0.48	0.48
	100000	0.15	0.15	0.19	0.20
	200000	0.15	0.14	0.18	0.18
CGN-O (audiobooks)	100	0.52	0.53	0.51	0.51
	1000	0.52	0.52	0.54	0.58
	10000	0.26	0.25	0.32	0.33
	100000	0.08	0.08	0.11	0.11
	200000	0.09	0.07	0.10	0.10

Table S1: Word-error rate (WER) results for the HuBERT-I2 (1 & 2) and Wav2Vec2-NL (1 & 2) models pre-trained for this study, after finetuning intermediate checkpoints for Dutch automatic speech recognition (using the CGN-O component from the Spoken Dutch corpus, Schuurman et al., 2003); evaluated across several test sets: IFADV (van Son et al., 2008); the NBest benchmark (Vaessen et al., 2025); Dutch CommonVoice (Ardila et al., 2020); Multilingual LibriSpeech (Pratap et al., 2020); and a held-out part of CGN-O. Step refers to the pre-training step. Bold values mark the lowest WER per model and test set.
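For readers less familiar with the metric reported in Table S1, the following is a minimal sketch of how a word-error rate can be computed, assuming the standard definition (word-level edit distance divided by the number of reference words, reported as a fraction). The function name and the example sentences are illustrative; the scoring tool actually used for Table S1 is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) transcript pair: one substitution and one deletion.
wer = word_error_rate("de kat zit op de mat", "de kat zat op mat")
print(f"WER = {wer:.2f}")
```

Under this definition, values close to 1 (as for the earliest checkpoints) mean that nearly every reference word is misrecognized, and the score can exceed 1 when the hypothesis contains many insertions.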

Appendix B Representational probing details

B.1 Supplementary results

B.2 Probe dataset design

We here include details on the data used for all our probing analyses, including the number of samples (occurrences of each item sampled from MLS data) and other relevant design choices.

Analysis	Number of categories	Data samples per category	Total samples	Data split constraints
Phone clustering	37 phones (see Table S8)	50	1850	80/20% train-test splits
Syllable form clustering	100 syllable forms (see Table S9)	20	2000	80/20% train-test splits
Syllable type clustering	20 types (see Table S9)	100 (5 syllable forms per type, with 20 samples each)	2000	80/20% train-test splits, syllable forms do not overlap between train & test
Word form clustering	250 word forms	10	2500	80/20% train-test splits
Part-of-speech clustering	4 part-of-speech categories	150 (50 word forms per category, with 3 samples each)	600	80/20% train-test splits, word forms do not overlap between train & test

Table S2: Dataset details for all clustering probes. For each analysis with $N$ categories, we train LDA projections to an $(N-1)$-dimensional subspace (following the default configuration of the scikit-learn library; Pedregosa et al., 2011); i.e. phone LDA projections consist of 36 discriminative directions, etc.
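
The relation between the number of categories and the dimensionality of the LDA subspace can be illustrated with a short sketch. It uses random placeholder features in place of model representations; the feature dimensionality (64), the noise level, and the stratified split are assumptions made for illustration only, not details taken from our pipeline.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for pooled model representations:
# 37 phone categories x 50 samples each, 64-dimensional features.
n_categories, n_per_category, n_features = 37, 50, 64
labels = np.repeat(np.arange(n_categories), n_per_category)
class_means = rng.normal(size=(n_categories, n_features))
features = class_means[labels] + 0.5 * rng.normal(size=(labels.size, n_features))

# 80/20 train-test split (stratification is an assumption of this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

# With n_components left at its default (None), scikit-learn keeps
# min(n_categories - 1, n_features) discriminative directions.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
projected_test = lda.transform(X_test)
print(projected_test.shape)
```

With 37 categories and more than 36 input features, the projected test data has 36 columns, matching the phone example in the caption of Table S2.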

Summary statistics: Word form analysis data
Total word forms	250
Number of samples per word form	10
Total samples	2500
Number of word forms per POS category	109 nouns, 44 adverbs, 25 adjectives, 21 verbs, 17 adpositions, 15 pronouns, 19 other
Duration (ms)	Mean: 538, Std.dev: 220
Length (number of phones)	Mean: 5.94, Std.dev: 2.22
Zipf frequency (Keuleers et al., 2010)	Mean: 3.95, Std.dev: 1.51
Age of acquisition rating (Brysbaert et al., 2014)	Mean: 9.93, Std.dev: 2.79

Table S3: Summary statistics on words used for word form clustering analyses. For our semantic alignment analyses, we randomly sampled 60 nouns, 20 adverbs, 20 adjectives, and 20 verbs out of the available word forms in this dataset, including all samples for each form (1000 samples in total).

Summary statistics: Homophone analysis data
Total number of triplets (A, B, X)	2326
Total number of unique words	4104
Word duration (ms)	Mean: 591, Std.dev: 210
Word length (number of phones)	Mean: 6.51, Std.dev: 2.02

Table S4: Summary statistics on words used for the homophone disambiguation analyses.

Summary statistics: Part-of-speech analysis data
Number of word forms per POS category	50 nouns, 50 adverbs, 50 adjectives, 50 verbs
Number of samples per word form	3
Total samples	600
Duration (ms)	Mean: 434, Std.dev: 165
Length (number of phones)	Mean: 4.85, Std.dev: 1.79
Zipf frequency (Keuleers et al., 2010)	Mean: 5.01, Std.dev: 0.91

Table S5: Summary statistics on words used for the part-of-speech clustering analyses. The words used for part-of-speech clustering are sampled from the sentence dataset used for the syntactic structural probe (Table S6).
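
The constraint that word forms do not overlap between train and test data (Table S2) can be sketched as a split over word forms rather than over individual samples. The function name, the tuple layout, and the choice to hold out 20% of the forms within each category are assumptions for illustration; the sample identifiers are placeholders, not items from our dataset.

```python
import random
from collections import defaultdict


def split_by_word_form(samples, test_fraction=0.2, seed=0):
    """Split (category, word_form, sample_id) tuples so that no word form
    appears in both the train and the test portion."""
    forms_per_category = defaultdict(set)
    for category, word_form, _ in samples:
        forms_per_category[category].add(word_form)

    rng = random.Random(seed)
    test_forms = set()
    for category in sorted(forms_per_category):
        forms = sorted(forms_per_category[category])
        rng.shuffle(forms)
        # Hold out a fixed fraction of the word forms in each category.
        n_test = round(len(forms) * test_fraction)
        test_forms.update((category, form) for form in forms[:n_test])

    train = [s for s in samples if (s[0], s[1]) not in test_forms]
    test = [s for s in samples if (s[0], s[1]) in test_forms]
    return train, test


# Hypothetical placeholder data: 4 categories x 50 word forms x 3 samples each.
categories = ["noun", "adverb", "adjective", "verb"]
samples = [
    (category, f"{category}_{form_index}", sample_index)
    for category in categories
    for form_index in range(50)
    for sample_index in range(3)
]
train, test = split_by_word_form(samples)
```

Splitting at the level of word forms means a probe cannot score well on the test portion simply by recognizing a word it saw during training.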

Summary statistics: Sentence analysis data
Total number of sentences	2406
Number of audiobooks that sentences were sourced from	10
Sentence duration (s)	Mean: 4.15, Std.dev: 2.45
Sentence length (number of words)	Mean: 10.6, Std.dev: 5.13
Word duration (ms)	Mean: 325, Std.dev: 188
Word dependency depth	Mean: 1.73, Std.dev: 1.07

Table S6: Summary statistics on sentences used for the syntactic structural probe analyses.

Analysis	Feature space	Feature details
Acoustic alignment	Mel-frequency cepstral coefficients (MFCCs)	20-dimensional, extracted using the librosa library (McFee et al., 2015), mean-pooled over time
Semantic alignment	Fasttext embeddings	300-dimensional, Fasttext model (Bojanowski et al., 2017) for Dutch released by Mikolov et al. (2018), accessed from the HuggingFace Hub through the transformers library (Wolf et al., 2020)

Table S7: Details on the feature spaces used for the acoustic and semantic alignment analyses. Acoustic alignment was computed on the same set of 1850 samples used for phone clustering (see Table S2).


Syllable type	Syllable forms
CV	də, tə, xə, bə, lə
CVC	vər, dər, lək, kər, təx
CVV	naː, tuː, maː, reː, leː
CVVC	hɛit, voːr, ʋaːr, ʋeːr, hœys
VC	ɔp, ɑf, ɔm, ɪn, ɑx
CVCC	kənt, dərs, rənt, ʋərk, ɣənt
VVC	œyt, aːn, eːn, oːr, eːr
CCVV	tsiː, steː, draː, sxaː, proː
CVVCC	voːrt, teːrt, hoːft, maːkt, beːlt
CCVVC	staːt, sxoːl, slaːn, kleːt, plaːt
VV	oː, aː, eː, iː, yː
CCVC	tjəs, stər, sxɑp, slɑx, stɔr
CCV	stə, tjə, pjə, plə, brə
VCC	ɔnt, ɑxt, ɑnt, ɑrm, ɛks
CCVCC	stɛlt, stərs, stɑnt, brɑxt, stɔrm
CCVVCC	plaːts, krɛixs, staːts, breːkt, sxeːps
CCCVV	sxrei, spreː, straː, sxreː, strɛi
CCCVC	strɑf, sprɔŋ, sxrɪf, sxrɔm, strɛk
CVVCCC	diːnst, ʋaːrts, kaːtst, laːtst, vɛinst
CCCVCC	sprɪŋt, sxrɪkt, strɛkt, strɑft, strɑnt
