Loci Similes: A Benchmark for Extracting Intertextualities
in Latin Literature

Julian Schelb, Michael Wittweiler, Marie Revellio‡∗,
Barbara Feichtinger and Andreas Spitz†∗
Department of Computer and Information Science, University of Konstanz
Department of Latin Philology, University of Konstanz
Department of Archaeology, Classical Philology and Ancient Studies, University of Zurich
{firstname.lastname}@uzh.ch
{firstname.lastname}@uni-konstanz.de
Abstract

Tracing connections between historical texts is an important part of intertextual research, enabling scholars to reconstruct the virtual library of a writer and identify the sources influencing their creative process. These intertextual links manifest in diverse forms, ranging from direct verbatim quotations to subtle allusions and paraphrases disguised by morphological variation. Language models offer a promising path forward due to their capability of capturing semantic similarity beyond lexical overlap. However, the development of new methods for this task is held back by the scarcity of standardized benchmarks and easy-to-use datasets. We address this gap by introducing Loci Similes, a benchmark for Latin intertextuality detection comprising a curated dataset of ~172k text segments containing 545 expert-verified parallels linking Late Antique authors to a corpus of classical authors. Using this data, we establish baselines for retrieval and classification of intertextualities with state-of-the-art LLMs.

∗ These authors contributed equally as project lead.

1 Introduction

Identifying intertextual connections between documents is an important task in classical philology, as it reveals how later works engage with earlier texts and traditions. For centuries, scholars detected intertextual references by relying on memory and the manual collation of Loci Similes, i.e., parallel passages that exhibit lexical, semantic, or thematic resemblance. Although digitization has augmented this process through lexical search tools, most approaches still depend on exact n-gram matching or heuristic filtering (DBLP:journals/dhq/SchroppKRF24). This is limiting for ancient texts, where intertextuality often manifests itself not as verbal quotation, but as subtle allusion, paraphrase, or thematic variation (DBLP:conf/latech/ManjavacasLK19; gong_augmented_2025), often complicated by orthographic volatility (Miller2025Alignment) (see Figure 2).

Example of Historical Text Reuse
Source: Virgil, Aeneid 2.774
Context: Aeneas is terrified when the ghost of his wife Creusa appears to him during the burning of Troy.
“… obstipui, steteruntque comae et uox faucibus haesit.”
(… I was stupefied, my hair stood on end, and my voice stuck in my throat.)
Reuse: Jerome, Epistula 130.5.5
Context: A family’s shocked reaction to Demetrias’s vow of Christian virginity.
“Haesit uox faucibus et inter ruborem atque pallorem metumque ac laetitiam cogitationes uariae mutabantur.”
(The voice stuck in their throat, and between blushing and pallor, fear and joy manifold thoughts kept shifting.)
Figure 1: Example of intertextual reference. Reuse of a classic Vergilian phrase for speechlessness by Jerome. While retaining the semantic core, the author alters the word order to adapt the expression to a different context.

Recovering such textual reuses is not merely a matter of identifying sources. It facilitates research on broader cultural-historical phenomena (tangherlini2024travels). In particular, it supports work on reception and cultural hybridization in Late Antiquity, where pagan texts persist as the rhetorical substrate of elite writing while being recontextualized within emerging Christian discourse. Classical forms often remain recognizable even as their functions shift toward Christian meaning-making, a phenomenon visible in both syntactic stylometry (gorman2016approaching; DBLP:journals/corr/abs-2109-00601) and semantic motifs. In this sense, quotation patterns suggest how Christian authors do not abandon classical texts and their cultural contexts, but reuse their language, redirecting its meanings and connotations within Christian interpretive frameworks.

A case in point is the Church Father Jerome. Recent digital-hermeneutic studies have begun to map his “micro-quotations” (DBLP:journals/dhq/SchroppKRF24), yet the semantic breadth of his reuse remains a challenge. When Jerome alludes to the Augustan poet Virgil, he often retains the semantic core of a hexameter verse while altering its word order or syntax to suit his Christian prose context. In Jerome’s writings, the general tension between his pagan paideia and Christian discourse emerges with particular clarity in the details of how he quotes classical pagan sources and adapts them for his own texts (see Figure 1). Since canonical authors such as Virgil were deeply embedded in the educational curriculum, quoting them often served as a shorthand for shared cultural memory. Late Antique Christian writers like Jerome inherit this repertoire, but their reuse frequently reframes pagan language within Christian contexts, making citation patterns a measurable trace of shifting cultural authority.

Beyond their extraction, systematically mapping these dependencies allows scholars to reconstruct the “virtual library” available to an author, gaining insight into which sources most strongly shaped their writing. Furthermore, analyzing how these connections cluster, ranging from explicit citations to subtle echoes, helps refine theoretical definitions of intertextuality and investigate the rhetorical motivations behind text reuse. However, the development of automated methods for detecting intertextual connections with language models is held back by the absence of a standardized benchmark and easily accessible datasets.

In this paper, we take a step towards addressing this gap by introducing Loci Similes, a benchmark designed to enable researchers to systematically compare and evaluate computational approaches for intertextuality detection.

Our paper makes the following key contributions:

  • Curated benchmark dataset of ~172k Latin text segments, partitioned into query and source corpora, accompanied by a ground truth dataset of 545 expert-verified intertextual links.

  • Evaluation framework that shifts the paradigm from standard query-target sentence matching of Information Retrieval to whole-document comparison, aligning more closely with the practical constraints of philological workflows.

  • Baseline results for embedding models, classification models, and an end-to-end pipeline for intertextuality detection, serving as a foundation for future comparisons.

2 Related Work

2.1 Intertextuality Detection in Latin

Previous research on Latin intertextuality has mainly relied on matching word-level embeddings. DBLP:conf/naacl/BurnsBLCD21 utilized static Word2Vec models trained on lemmatized text to rank potential intertextual phrases, evaluating performance against a dataset of 945 parallels from Valerius Flaccus’ Argonautica dexter_ldquodatabase_2024. Although this approach outperformed traditional lexical methods in capturing semantic similarity, it is limited by the inability of static embeddings to handle context-dependent polysemy.

Addressing this limitation, gong_augmented_2025 substituted static vectors with LatinBERT DBLP:journals/corr/abs-2009-10053, a transformer-based model generating context-aware embeddings. By computing similarities between source words and target bigrams, their system successfully identified allusions in Lucan’s Pharsalia, with a user study confirming its utility for philological interpretation compared to alternative tools. DBLP:conf/latech/ManjavacasLK19 modeled the detection of biblical allusions in Latin sermons as an information retrieval task, finding that semantic FastText embeddings improve ranking over lexical baselines. DBLP:conf/chr/ManjavacasKK20 moved beyond pure retrieval to analyze the contextual drivers of intertextuality in the Patrologia Latina by modeling lexical similarity and thematic embedding as separate axes.

revellio2022zitate introduced a mixed digital-hermeneutic approach, designed to uncover previously undetected Vergilian citations and allusions in Jerome’s letter corpus and, in doing so, to examine Jerome’s patterns of quotation and adaptation while probing the outer limits of quotation itself. Building on this foundation, DBLP:journals/dhq/SchroppKRF24 further elaborate this procedure at the level of automated pre-selection: starting from “Simillima’s” raw two-word matches, they expand and recalibrate the cascade of rule-based filters (e.g., stop-lists, part-of-speech constraints, and lemma distance criteria) in order to make the identification of very short “micro-quotations” feasible at scale.

Both approaches are restricted to n-grams. In contrast, DBLP:conf/acl/Riemenschneider23 applied the Sentence Transformer architecture DBLP:conf/emnlp/ReimersG19 to ancient languages. They introduced SPhilBerta, a multilingual model fine-tuned on parallel sentences in Ancient Greek, Latin, and English, framing the task as binary classification over sentence pairs.

DBLP:journals/corr/abs-2109-00601 applied stylometry to Latin by generating contextual embeddings with LatinBERT for a corpus of 39 classical and medieval texts. Their unsupervised clustering approach to authorship attribution offered quantitative evidence connecting the anonymous Gesta Principum Polonorum to the Monk of Lido.

2.2 Text Reuse in Other Languages

Beyond the specific domain of Latin literature, automated detection of text reuse has been explored across languages and tasks.

Existing general-purpose frameworks draw on literary theory to distinguish between explicit citations and implicit commentary using graph-based approaches DBLP:journals/coling/KuznetsovBEG22, and take advantage of LLM-assisted metadata extraction and vector similarity to detect explicitly marked and implicit cross-document influences DBLP:journals/corr/abs-2410-15145. In general, current trends indicate that traditional lexical methods are outperformed by contextual models; for example, DBLP:conf/starsem/MacLaughlinXS21 modeled reuse detection as a binary classification task using fine-tuned BERT and Longformer models.

Within computational literary studies, researchers have applied contextual models to track reuse between genres and periods. In the bibliometric domain, DBLP:conf/ecir/BertinA18 constructed InTeReC, a corpus of more than 300,000 sentences with citations extracted from English-language PLOS articles. For French fiction, DBLP:journals/corr/abs-2410-17759 applied language models to a corpus of 12,000 novels from the 18th to early 20th centuries. In the Scandinavian context, tangherlini2024travels mapped links between Hans Christian Andersen’s travelogs and fairy tales using Sentence-Transformers and BERTopic to uncover latent motifs. Additionally, liebl-burghardt-2020-shakespeare identified Shakespearean quotes and paraphrases within contemporary fiction.

In addition to general-purpose methods, specialized approaches address low-resource languages and their unique challenges. Miller2025Alignment enhanced reuse detection in Hebrew and Aramaic manuscripts by incorporating fastText embeddings, while gorman2016approaching modeled authorial style in Ancient Greek by analyzing “syntax words” via unsupervised hierarchical clustering. Additionally, DBLP:journals/talip/SharjeelMNNR23 detected cross-lingual reuse in English-Urdu news articles by combining machine translation with classifiers.

2.3 Further Related Tasks

The detection of intertextualities shares concepts with other NLP tasks that identify dependencies or similarities between texts: quote detection DBLP:conf/nodalida/JanickiKM23, paraphrase identification DBLP:journals/es/VrbanecM23, and document alignment DBLP:conf/eacl/MolfeseBTCN24.

Quote detection typically targets direct, explicitly marked quotations within contemporary corpora DBLP:conf/konvens/PetersenFreyB24; DBLP:conf/lrec/ZhangL22; DBLP:conf/ranlp/PapayP19; DBLP:conf/wsdm/VaucherSC021. This contrasts with our focus on historical intertextuality, which often relies on subtle, unmarked allusions rather than verbatim reuse. Similarly, paraphrase detection identifies semantic similarity often in the context of plagiarism DBLP:conf/paclic/MahmoudZ22; DBLP:conf/emnlp/WahleRKG22; DBLP:journals/corr/abs-2303-13989; DBLP:conf/emnlp/WahleGR23. However, intertextuality involves recontextualization, requiring the detection of stylistic echoes where the semantic core has shifted to serve a new argument. Finally, document alignment focuses on matching entire documents between parallel texts or translations DBLP:conf/ranlp/RajithaPSR21; DBLP:journals/corr/abs-2510-15577. In contrast, intertextuality detection operates at a granular level, identifying isolated references between documents that are otherwise structurally and semantically distinct.

Taxonomy of Intertextual Links
  1. Verbatim Quote (Literal Reuse)
    • Marked: Explicitly attributed to a specific author or source (e.g., ut ait Cicero… → “as Cicero says”).
    • Unmarked: Silent formatting or incorporation of exact text strings from other sources.
  2. Paraphrase (Lexical Modification)
    • Minor: Small morphological changes (e.g., case, number, tense, word order).
    • Major: Significant lexical substitution (synonyms) or syntactic restructuring.
  3. Allusion (Semantic Similarity)
    • Single Reference: Shared distinct vocabulary or imagery recontextualized in a new setting.
    • Systemic: Broad stylistic, rhythmic, or thematic imitation on multiple occasions in the document.
Figure 2: Spectrum of intertextuality. References manifest in diverse forms, spanning from easily detectable verbatim quotations to adapted paraphrases and subtle allusions where only a semantic core remains.

3 Dataset Construction

While traditional scholarship has documented numerous Latin intertextual parallels, computational research remains constrained by the lack of standardized benchmark datasets. Aggregating these references is non-trivial, as the data is dispersed across commentaries, indices locorum (“back-of-the-book indexes”), and focused philological case studies. To address this gap, we curated a corpus of ~172k text segments spanning works by multiple Latin authors (see Table 1) and a ground truth dataset of 545 confirmed intertextual links identified within these texts (see Figure 3).

3.1 Latin Corpus Curation

We divide the corpus into a Query Corpus and a Source Corpus. The former comprises works by the Late Antique authors Jerome and Lactantius, totaling ~83k text segments. The latter, which serves as the retrieval target, consists of ~88k segments drawn from ten canonical classical Latin authors: Cicero, Lucretius, Catullus, Virgil, Horace, Tibullus, Propertius, Ovid, Lucan, and Martial. A detailed breakdown of segment counts and token statistics for each author is presented in Table 1. The texts were aggregated from three primary digital repositories: Corpus Corporum, the Tesserae Project, and the OpenGreekandLatin Project. For a complete listing of the specific works, the critical editions used, and their respective provenance, see Appendix A (Table 2). Since we rely exclusively on publicly available texts to compile the corpus, we release the corpus files alongside the ground-truth dataset.

3.2 Ground Truth Construction

To construct our ground truth dataset of 545 confirmed intertextual links, we draw from two complementary sources.

First, we sourced 270 instances of references to Virgil and Cicero in Jerome’s Epistulae from the dataset (https://doi.org/10.11588/data/FVCULR) established by DBLP:journals/dco/SchroppWKRF24, which generally fall into three categories: verbatim quote, paraphrase, and allusion (see Figure 2). Notably, we do not retain all entries from the original dataset on a one-to-one basis. While DBLP:journals/dco/SchroppWKRF24 treat each poetic verse as a separate unit, our dataset operates at an approximate sentence level to ensure alignment between source and reused material. Consequently, we consolidated multiple entries into a single instance in cases where citations span two or more verses. Furthermore, our expert annotators rejected links from previous scholarship that were deemed controversial or lacking sufficient evidence of intertextuality.

Second, we augmented these explicitly attested references with 275 additional intertextual links. We first identified candidates through a rule-based n-gram matching approach published by DBLP:journals/dhq/SchroppKRF24 (https://github.com/MWittweiler/citation_analysis). Their pipeline identifies candidate intertextual pairs by scanning for shared, non-contiguous tokens within a defined window. To ensure meaningful matches, candidates are then refined through a cascade of filters that remove high-frequency stopwords, enforce part-of-speech constraints, and exclude generic collocations based on embedding similarity. Subsequently, the candidates were manually evaluated by domain experts, and only those confirmed to constitute meaningful intertextual links were retained in the dataset.

Figure 3: Distribution of confirmed references. Manually verified intertextual links in the annotated dataset by citing author (Jerome, Lactantius) and source author (including Virgil, Cicero, and others).
Author          Segments  Avg. Tokens  Min  Max  Std. Dev.
Query Corpus
Jerome            74,672        30.94    1  582      22.14
Lactantius         8,444        28.07    1  369      19.01
Total / Avg.      83,116        30.65    1  582      21.86
Source Corpus
Cicero            54,331        28.61    1  847      24.77
Ovid              14,096        25.88    1  396      18.81
Virgil             4,861        29.49    1  351      19.50
Martial            4,114        24.60    2  249      19.28
Lucan              2,952        31.24    2  243      21.80
Horace             2,353        32.39    2  193      24.09
Propertius         1,889        22.12    2  267      17.51
Lucretius          1,826        40.74    3  255      27.25
Catullus             809        27.34    2  149      22.00
Tibullus             788        26.09    2  251      20.66
Total / Avg.      88,019        28.30    1  847      23.20
Table 1: Corpus statistics by author. Breakdown of the dataset into Query (citing authors) and Source (referenced authors) corpora, showing the number of segments and token statistics for each author.

3.3 Annotation Process

Although the filtering process of the candidate generation pipeline removed many false positives, numerous candidates still involved common collocations (e.g., puncto temporis “in a moment”) and required manual exclusion. Therefore, candidates identified by the n-gram pipeline were annotated by a team of four experts in Latin literature (two pre-PhD and two post-PhD researchers), who assessed whether the lexical overlap constituted a meaningful intertextual reference or merely a coincidental overlap or common phrase. Conflicts in the annotation were resolved through group discussion. Our expert annotators used the following three criteria to guide their annotation in identifying meaningful intertextual links:

  1. Use of Uncommon Vocabulary: If the lexical overlap consists of rare or marked words, the likelihood increases that these elements were deliberately borrowed.

  2. Attested Frequency: A specialized Latin corpus database (http://clt.brepolis.net/llta/Search) served to help determine the frequency of overlapping expressions in Latin literature. Expressions appearing exclusively in the candidate source and target passages were treated as strong indicators of a unique intertextual relationship.

  3. Conduit Function: Most importantly, if the cited expression contributes semantic, rhetorical, or cultural information from the source text that cannot be derived from the target passage in isolation, thereby implicitly enriching the interpretation, it is viewed as intertextual.

We release the complete dataset, including the query texts, source corpus, and annotated intertextual links, as a HuggingFace dataset (https://huggingface.co/collections/julian-schelb/datasets-for-latin-intertextuality-search).

4 Evaluation Framework

To benchmark intertextuality detection methods, we also developed an evaluation framework that serves as the basis for the baseline experiments in this study. Our motivation is to reframe the task as a segment-wise comparison of whole documents since this formulation is a more appropriate representation of the work of scholars analyzing historical texts than traditional information retrieval pipelines.

4.1 Task Definition

We define the task of intertextuality detection as the identification of directional dependencies between two texts: a query document (the chronologically later text) and a source document (the earlier text). We formulate this as a retrieval and alignment problem where the objective is to map each text segment in the query document to a candidate set of text segments in the source document. This mapping is inherently one-to-many; a single query sentence may correspond to zero, one, or multiple distinct segments in the source text, exhibiting either semantic equivalence or stylistic imitation.

4.2 Evaluation Metrics

Since we compare all text segments of the query and source documents, and given that many queries will correspond to no intertextual links, we require metrics that effectively account for the correct rejection of non-links (true negatives). We therefore employ error-based metrics normalized by the total number of sentence pairs (N): Segment-Misclassification Rate (SMR), which serves as a global error rate; Global False-Positive Rate (FPR), measuring spurious matches; and Global False-Negative Rate (FNR), measuring missed references. The formal definitions and equations for these metrics are provided in Appendix B.1.
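The formal definitions live in Appendix B.1, which is not reproduced here. As a minimal sketch, assume the three rates are simply error counts divided by N, the number of query-source segment pairs: SMR = (FP + FN) / N, FPR = FP / N, and FNR = FN / N. The function name `error_rates` and the link representation are illustrative choices, not part of the released package.

```python
from typing import Dict, Set, Tuple

Link = Tuple[str, str]  # (query_segment_id, source_segment_id), hypothetical ID scheme


def error_rates(predicted: Set[Link], gold: Set[Link],
                n_query: int, n_source: int) -> Dict[str, float]:
    """Global error rates over all query-source pairs (assumed definitions)."""
    n_pairs = n_query * n_source        # N: every query segment vs. every source segment
    false_pos = len(predicted - gold)   # predicted links absent from the ground truth
    false_neg = len(gold - predicted)   # ground-truth links the system missed
    return {
        "SMR": (false_pos + false_neg) / n_pairs,
        "FPR": false_pos / n_pairs,
        "FNR": false_neg / n_pairs,
    }


gold = {("q1", "s1"), ("q2", "s3")}
predicted = {("q1", "s1"), ("q2", "s2")}
rates = error_rates(predicted, gold, n_query=2, n_source=4)
print(rates)
```

Because N grows with the product of the two document lengths while true links stay rare, all three rates are dominated by true negatives, which is the property the section asks of its metrics.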

4.3 Framework Features

The locisimiles package streamlines the end-to-end workflow by allowing users to load query and source documents split into text segments, which are then compared to generate a list of potential intertextual links. The library provides reference implementations for three core architectures: a classification-only pipeline, a retrieval-only pipeline, and a hybrid retrieve-and-rerank pipeline. Beyond these baselines, the framework is designed to be extensible, allowing researchers to implement and test custom approaches by adding their own pipelines. When a ground-truth dataset is provided, the package evaluates a given pipeline using both standard classification metrics and the error-based metrics described above, which are selected to reflect the reality of scholarly editorial work. The software is released as an open-source Python package, locisimiles (https://github.com/julianschelb/locisimiles), which includes a graphical user interface (GUI) component for ease of use. See Appendix D for details on how to use the package.

5 Baseline Methods

To provide a best-effort baseline for the extraction of intertextualities, we frame the identification of intertextual links as an Information Retrieval (IR) task (see Figure 4). Given a query text segment from the author under investigation (e.g., Jerome), our system queries an index of source texts (e.g., Virgil) to identify potential allusions. The pipeline operates in two phases: first, a bi-encoder model performs dense retrieval to generate a candidate set of potential references per query; second, a classification model performs binary classification to re-rank the top-k candidates. This approach allows us to filter vast amounts of text rapidly using precomputed embeddings while reserving computationally expensive comparisons using the classification model for the most promising candidates.
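The two-phase control flow can be sketched independently of any particular model. In the sketch below, `retrieve` and `classify` are pluggable callables, and the toy word-overlap implementations are stand-ins chosen only so the example runs without model downloads; they are not the bi-encoder or cross-encoder used in the experiments. All function names and the 0.5 threshold are assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def retrieve_and_rerank(
    queries: Sequence[str],
    sources: Sequence[str],
    retrieve: Callable[[str, Sequence[str], int], List[int]],
    classify: Callable[[str, str], float],
    k: int = 10,
    threshold: float = 0.5,
) -> Dict[int, List[Tuple[int, float]]]:
    """Return, per query index, the (source index, score) pairs the classifier accepts."""
    links: Dict[int, List[Tuple[int, float]]] = {}
    for qi, query in enumerate(queries):
        # Stage 1: cheap retrieval narrows the search to k candidates.
        candidate_ids = retrieve(query, sources, k)
        # Stage 2: the expensive pairwise model only sees those k candidates.
        scored = [(si, classify(query, sources[si])) for si in candidate_ids]
        accepted = [(si, p) for si, p in scored if p >= threshold]
        links[qi] = sorted(accepted, key=lambda pair: pair[1], reverse=True)
    return links


# Toy stand-ins (NOT the paper's models): word-overlap retrieval and scoring.
def toy_retrieve(query: str, sources: Sequence[str], k: int) -> List[int]:
    q = set(query.lower().split())
    ranked = sorted(range(len(sources)),
                    key=lambda i: len(q & set(sources[i].lower().split())),
                    reverse=True)
    return ranked[:k]


def toy_classify(query: str, source: str) -> float:
    q, s = set(query.lower().split()), set(source.lower().split())
    return len(q & s) / max(len(q | s), 1)  # Jaccard overlap as a fake probability


sources = ["uox faucibus haesit", "arma uirumque cano", "puncto temporis"]
queries = ["haesit uox faucibus", "nihil simile"]
result = retrieve_and_rerank(queries, sources, toy_retrieve, toy_classify, k=2)
print(result)
```

The classifier is called at most k times per query rather than once per source segment, which is where the efficiency gain described above comes from.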

5.1 Information Retrieval

Unlike traditional methods that rely on sparse lexical overlap, we leverage language models to encode text segments into dense vector representations. Specifically, we treat text segments from the author under investigation (e.g., Jerome) as queries to be matched against an index of source texts (e.g., Virgil). We employ sentence transformer models to generate embeddings for both the query and source segments. During inference, the bi-encoder is used to obtain the cosine similarity between the query embedding and every source embedding in the database. The candidates are then ranked by their similarity scores, identifying the most likely intertextual references.
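The ranking step itself reduces to a cosine-similarity sort. The following sketch uses plain Python lists as stand-in embeddings; in practice the vectors would come from the sentence transformer, and the helper names `cosine` and `top_k` are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          source_vecs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Score the query against every source vector and keep the k most similar."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(source_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


source_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d "embeddings"
ranking = top_k([1.0, 0.1], source_vecs, k=2)
print(ranking)
```

Since source embeddings do not depend on the query, they can be computed once and reused for every query segment.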

Training setup.

To adapt the bi-encoder for this specific domain, we fine-tuned the model using Online Contrastive Loss to produce similar embeddings for pairs constituting an intertextual link and dissimilar embeddings for unrelated segments. Fine-tuning was performed over 4 epochs with a batch size of 32, using a learning rate of 2×10^{-5} and applying a weight decay of 0.01 to prevent overfitting. Consistent with state-of-the-art dense retrieval approaches DBLP:conf/acl/SuSKWHOYSZ023, we distinguish the asymmetric roles of text pairs by prepending the prefixes “Query: ” to target text segments and “Candidate: ” to source text segments prior to encoding.
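For intuition, the sketch below shows the role prefixes and one common formulation of an online contrastive loss over a batch of pair distances (for example, cosine distance, 1 minus cosine similarity). It keeps only "hard" pairs: positives that lie farther apart than the closest negative, and negatives that lie closer than the farthest positive. Whether this matches the exact variant and margin used for the baselines is an assumption; the margin of 0.5 and both function names are illustrative.

```python
from typing import Sequence, Tuple


def add_role_prefixes(query: str, candidate: str) -> Tuple[str, str]:
    """Prepend the asymmetric role markers described in the text."""
    return "Query: " + query, "Candidate: " + candidate


def online_contrastive_loss(distances: Sequence[float], labels: Sequence[int],
                            margin: float = 0.5) -> float:
    """Contrastive loss restricted to hard pairs in the batch (assumed formulation)."""
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("batch needs at least one positive and one negative pair")
    # Hard positives: linked pairs that are farther apart than the closest negative.
    hard_pos = [d for d in pos if d > min(neg)]
    # Hard negatives: unrelated pairs that are closer than the farthest positive.
    hard_neg = [d for d in neg if d < max(pos)]
    pull = sum(d ** 2 for d in hard_pos)                      # pull hard positives together
    push = sum(max(0.0, margin - d) ** 2 for d in hard_neg)   # push hard negatives past the margin
    return pull + push


loss = online_contrastive_loss([0.1, 0.6, 0.3, 0.9], [1, 1, 0, 0])
print(add_role_prefixes("haesit uox faucibus", "uox faucibus haesit"), loss)
```

In the toy batch only the positive at distance 0.6 and the negative at distance 0.3 contribute; the other two pairs are already well separated and are ignored.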

5.2 Binary Classification

Following the initial retrieval stage, we employ a cross-encoder architecture to re-rank the top-k candidates. Unlike the bi-encoder, which generates static embeddings for query and source segments independently, the cross-encoder processes the query and candidate as a single paired input DBLP:conf/emnlp/ReimersG19. This allows the self-attention mechanism to directly compare the two segments at the token level and capture interactions and dependencies between them.

Input construction.

To construct the input for the binary classifier, we concatenate the target sentence (query) and the retrieved source sentence (candidate), separated by the model’s specific special tokens (see Figure 4). To prevent either segment from dominating the input window, we truncate both the query and candidate segments to 50% of the available token budget before concatenation.
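A minimal sketch of this balanced truncation follows. It operates on pre-tokenized lists and uses placeholder special tokens (`<s>`, `</s>`) in a simplified three-token layout; the real layout, special tokens, and maximum length depend on the chosen model's tokenizer, so every default here is an assumption.

```python
from typing import List


def build_pair_input(query_tokens: List[str], candidate_tokens: List[str],
                     max_tokens: int = 512,
                     start: str = "<s>", sep: str = "</s>") -> List[str]:
    """Concatenate query and candidate, giving each half of the remaining budget."""
    n_special = 3                           # one start token and two separators (assumed layout)
    budget = (max_tokens - n_special) // 2  # per-segment share of the window
    query_part = query_tokens[:budget]      # truncate each side independently
    candidate_part = candidate_tokens[:budget]
    return [start] + query_part + [sep] + candidate_part + [sep]


long_query = ["q%d" % i for i in range(10)]
short_candidate = ["c0", "c1"]
pair = build_pair_input(long_query, short_candidate, max_tokens=11)
print(pair)
```

With a fixed half-and-half split, a short candidate leaves part of the window unused; that is the price of guaranteeing that a long segment can never crowd out its partner.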

Training setup.

We fine-tuned the model to predict the probability that a given pair constitutes a valid intertextual link. The model was trained for 4 epochs with a batch size of 32, using the AdamW optimizer with a learning rate of 2×10^{-5}. The model minimizes the cross-entropy loss between the predicted similarity scores and the ground truth binary labels.

Figure 4: Retrieve-and-rerank pipeline. Stage 1: The input text segment acts as a query to retrieve potential candidates from the database. Stage 2: To verify the reference, the query and source candidate are concatenated into a single input sequence to train a binary classifier.

6 Experimental Setup

In this section, we utilize the dataset constructed in Section 3 and the evaluation framework described in Section 4 to establish baseline results for both information retrieval and classification approaches.

6.1 Dataset Split

To ensure statistical robustness, we employ k-fold cross-validation (k=5) on the set of 545 verified positive text segment pairs. For each training fold, we generate contrastive examples by augmenting the assigned positive pairs with n negative samples. In the evaluation phase, we reconstruct a realistic retrieval scenario to test the model’s ability to distinguish genuine citations from background noise: the query document combines the fold’s held-out positive instances with a subset of the query corpus (totaling 937 segments), while the source document consists of 880 source segments.
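The split can be sketched as follows, assuming a simple shuffled partition and purely random negatives; the helper names, the seed handling, and the integer ID scheme are illustrative, and the actual sampling strategies are described in Appendix B. With 545 positives and k=5, each fold holds out exactly 109 pairs.

```python
import random
from typing import Iterator, List, Tuple

Pair = Tuple[int, int]  # (query_segment_id, source_segment_id), hypothetical IDs


def k_fold_splits(pairs: List[Pair], k: int = 5,
                  seed: int = 0) -> Iterator[Tuple[List[Pair], List[Pair]]]:
    """Yield (train, held_out) positive pairs for each of the k folds."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # round-robin partition
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, folds[i]


def add_random_negatives(train: List[Pair], source_ids: List[int],
                         n: int, seed: int = 0) -> List[Tuple[int, int, int]]:
    """Label positives 1 and add n randomly drawn non-linked sources per positive as 0."""
    rng = random.Random(seed)
    positives = set(train)
    examples = [(q, s, 1) for q, s in train]
    for q, _ in train:
        added = 0
        while added < n:                  # assumes far more sources than positives
            s = rng.choice(source_ids)
            if (q, s) not in positives:   # never label a known link as negative
                examples.append((q, s, 0))
                added += 1
    return examples


pairs = [(i, i) for i in range(545)]      # toy stand-in for the 545 verified links
splits = list(k_fold_splits(pairs, k=5))
train, held_out = splits[0]
examples = add_random_negatives(train, source_ids=list(range(880)), n=3)
print(len(train), len(held_out), len(examples))
```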

6.2 Model Configuration and Ablations

To identify the optimal pipeline architecture, we conducted extensive independent evaluations of the dense retrieval and binary classification stages. Details on base model evaluation, negative sampling ablation studies, and hyperparameter configurations are provided in Appendix B.

  • Base Models: We evaluated a range of pre-trained language models to isolate the effects of domain specificity versus model size.

  • Negative Sampling: We compared three distinct training strategies (random pairs, random negatives, and hard negatives) while varying the positive-to-negative ratio from 1:1 to 1:10 to optimize the model’s discrimination capability.

  • Combined Pipeline: For the end-to-end pipeline, we varied the retrieval depth k to identify the operating point that minimizes recall loss while maximizing computational efficiency.

Figure 5: Performance vs. efficiency trade-off in the retrieve-and-rerank pipeline. The pipeline first generates k candidates via embedding cosine similarity, followed by a binary classification stage to label pairs as Reference or No Reference. We compare this against a “Retrieval Only” baseline where the top k candidates are treated as positive predictions, mimicking a scholar manually reviewing the top results. Results are averaged across 5 folds.

7 Experimental Results

In this section, we present our quantitative findings.

7.1 Information Retrieval Results

We evaluated the impact of training data imbalance by varying the positive-to-negative sample ratio from 1:1 to 1:10. Performance improved consistently with more negative examples, and the 1:10 ratio yielded the best results, suggesting potential gains from further increasing the negative sample proportion. Comparing base models showed that large multilingual models outperformed native Latin baselines. E5-large proved to be the most effective, retrieving approximately 61% of relevant references at Recall@10 (72% at Recall@100, 83% at Recall@1000). For a detailed evaluation of negative sampling and base models, see Appendix C.1.
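As a minimal sketch of how Recall@k can be computed in this setting, assume it is the fraction of ground-truth links whose source segment appears among the top k results for the corresponding query. The function name and the ID scheme are illustrative, and the precise definition used for the reported numbers is given in the appendix.

```python
from typing import Dict, List, Set, Tuple


def recall_at_k(rankings: Dict[str, List[str]],
                gold: Set[Tuple[str, str]], k: int) -> float:
    """Share of gold (query, source) links recovered within each query's top k."""
    if not gold:
        return 0.0
    hits = sum(1 for q, s in gold if s in rankings.get(q, [])[:k])
    return hits / len(gold)


rankings = {
    "q1": ["s3", "s1", "s2"],   # ranked source IDs for query q1, best first
    "q2": ["s2", "s4", "s9"],
}
gold = {("q1", "s1"), ("q2", "s9")}
for k in (1, 2, 3):
    print(k, recall_at_k(rankings, gold, k))
```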

7.2 Binary Classification Results

First, training with hard negatives (⟨qry, sim⟩) minimized error rates more effectively than random pairing. This approach enforces finer semantic distinctions, allowing the model to better distinguish genuine intertextuality from mere topical similarity. Second, we analyzed the impact of imbalance in the training data by varying the ratio of positive-to-negative samples from 1:1 to 1:10. Consistent with our findings in the retrieval stage, increasing the volume of negative examples improves the robustness of the model in imbalanced scenarios. Finally, in terms of base model architectures, XLM-RoBERTa Large achieves an average F1-score of 0.5 along with the lowest average SMR of 35 misclassified links per 10,000 candidates. For detailed evaluation results, see Appendix C.2.

7.3 Retrieve and Rerank Results

We evaluate the full two-stage retrieve-and-rerank pipeline, which first generates k candidates via embedding cosine similarity, followed by a binary classification stage to label pairs as Reference or No Reference. As a baseline, we consider a “retrieval only” approach where the top k candidates are treated as positive predictions, simulating the workload of a scholar manually reviewing the top results. Experiments are averaged across 5 folds. For this experiment, we use a larger corpus of 937 query and 880 source segments (~109 ground-truth references per fold). As shown in Figure 5, the “retrieve-and-rerank” pipeline reduces false positives compared to the retrieval-only approach. At k=5, it achieves an F1-score of 0.66 versus 0.26, and maintains an F1 of 0.55 at k=100 where pure retrieval drops to 0.02. Crucially, the SMR remains orders of magnitude lower (0.0019 vs. 0.57 at k=500), indicating that the classifier rejects the vast majority of spurious candidates.

In practical terms, with a reasonable retrieval depth of k=100, the retrieve-and-rerank pipeline recovers 79% of true intertextual references (85 out of 108). A scholar utilizing this system would need to manually review 780 candidate passages to verify the 85 genuine intertextualities, instead of reviewing the prohibitively large number of 93,700 cases with the retrieval-only pipeline (see Table 3). This corresponds to a 99% reduction in workload while retaining nearly 80% of the relevant data.
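The workload figures follow directly from the numbers quoted in this section: the retrieval-only baseline asks the scholar to inspect k candidates for each of the 937 query segments. The short sketch below reproduces the arithmetic; the function name is illustrative.

```python
def review_workload(n_query: int, k: int, reranked_reviews: int,
                    recovered: int, gold: int) -> dict:
    """Compare manual review effort with and without the classification stage."""
    retrieval_only_reviews = n_query * k   # every top-k candidate must be read
    return {
        "retrieval_only_reviews": retrieval_only_reviews,
        "workload_reduction": 1 - reranked_reviews / retrieval_only_reviews,
        "recall": recovered / gold,
    }


stats = review_workload(n_query=937, k=100, reranked_reviews=780,
                        recovered=85, gold=108)
print(stats)
```

Here 937 × 100 gives the 93,700 retrieval-only cases, 780 / 93,700 is under 1% of that workload, and 85 / 108 rounds to the reported 79% recall.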

8 Discussion

Although the bi-encoder reliably retrieves long literal citations and thematic allusions with strong semantic cores, it struggles when minimal lexical reuse occurs in divergent contexts. This results in low similarity scores and therefore low rankings for valid candidates, causing them to be discarded before the classification stage. In general, larger multilingual models trained with higher negative sampling ratios achieve superior retrieval performance. When such low-overlap references do pass the retrieval step, the classifier identifies them correctly. Across classification models, we observe consistently high recall but low precision, suggesting that they struggle to distinguish meaningful intertextuality from the coincidental use of common phrases.

9 Conclusion

In this work, we introduced Loci Similes, a benchmark dataset for Latin intertextuality detection designed to evaluate the capacity of language models to capture semantic similarity beyond exact lexical matching. Our baseline experiments demonstrate that while dense retrieval effectively identifies long literal citations and thematic allusions, detecting subtle “two-word congruencies” remains challenging. Among the evaluated models, E5-large achieved the best retrieval and XLM-RoBERTa Large the best classification results. Overall, our findings suggest that language models offer a promising avenue for this task, but the primary challenge lies in distinguishing meaningful reuse from coincidental lexical overlap, which will require further advances and the development of dedicated architectures for the detection of intertextualities.

Future Work.

We aim to expand our dataset by incorporating existing collections of intertextual links in Latin literature (e.g., dexter_ldquodatabase_2024 and DBLP:journals/dco/SchroppWKRF24) to better enable models to capture wide ranges of stylistic deviations.

Limitations

Data Coverage.

Although our dataset comprises expert-verified positive pairs, it is not (and cannot be) exhaustive. The labeled dataset likely omits some valid references between the selected works. Consequently, it is possible that a few instances classified as “false positives” may represent genuine but undocumented intertextual links, potentially skewing the reported precision.

Labeling Ambiguity.

Defining what constitutes a “reference” in Classical Philology remains a substantial methodological challenge. The boundaries between literal citation, subtle allusion, and general thematic resonance are fluid and subject to differing scholarly definitions. This inherent ambiguity affects the consistency of manual labeling, as annotators may prioritize differing criteria for intertextuality. While the taxonomy in Figure 2 provides a useful analytical framework, it does not apply uniformly across all instances of intertextuality. Many cases exhibit characteristics of multiple categories, and only a minority can be unambiguously assigned to a single class.

Acknowledgments

This work was supported by the German Research Foundation (DFG) in the project “Zitieren als narrative Strategie. Eine digital-hermeneutische Untersuchung von Intertextualitätsphänomenen am Beispiel des Briefcorpus des Kirchenlehrers Hieronymus” [Grant ID: 382880410].

AI Usage Statement

Language model-based AI tools (Codex and GitHub Copilot) were used as coding assistants in the implementation and as writing assistants in drafting parts of the manuscript. The final version of the manuscript was written without AI input.

References

Appendix A Corpus Sources

The texts used to construct our corpus were aggregated from three digital repositories: Corpus Corporum (https://mlat.uzh.ch/), the Tesserae Project (https://github.com/tesserae/tesserae), and the OpenGreekandLatin Project. Table 2 details the specific editions used and Figure 13 shows some examples.

| Author | Work | Source | Edition |
|---|---|---|---|
| Virgil | Aeneid | CC | Greenough (1900) |
| Virgil | Georgics | CC | Greenough (1900) |
| Virgil | Eclogues | CC | Greenough (1900) |
| Ovid | Amores | CC | Ehwald (1907) |
| Ovid | Ars Amatoria | CC | Ehwald (1907) |
| Ovid | Ex Ponto | CC | Wheeler (1939) |
| Ovid | Fasti | CC | Frazer (1933) |
| Ovid | Heroides | CC | Ehwald (1907) |
| Ovid | Ibis | CC | Merkel/Ehwald (1889) |
| Ovid | Medicamina | CC | Ehwald (1907) |
| Ovid | Metamorphoses | CC | Magnus (1892) |
| Ovid | Remedia Amoris | CC | Ehwald (1907) |
| Ovid | Tristia | CC | Wheeler (1939) |
| Martial | Epigrammata | CC | Heraeus (1925) |
| Lucretius | De Rerum Natura | CC | Martin (1934) |
| Lucan | Pharsalia | CC | Weise (1835) |
| Horace | Carmen Saeculare | CC | Shorey (1898) |
| Horace | Carmina | CC | Shorey (1919) |
| Horace | Ars Poetica | CC | Smart (1836) |
| Horace | Epistulae | CC | Fairclough (1929) |
| Horace | Epodes | CC | Vollmer (1912) |
| Horace | Saturae | CC | Smart (1836) |
| Catullus | Carmina | Tess | Merrill |
| Propertius | Elegiae | Tess | Mueller (1898) |
| Tibullus | Elegiae | Tess | Postgate (1915) |
| Cicero | Opera Omnia | Tess | Varia |
| Jerome | Epistulae | OGL | Hilberg (1910) |
| Jerome | Varia | CC | Patrologia Latina (1845) |
Table 2: Corpus data sources. Abbreviations: CC (Corpus Corporum), Tess (Tesserae Project), and OGL (OpenGreekandLatin Project).

Appendix B Experimental Setup Details

Experiments on the classification model and the full retrieve-and-rerank pipeline utilize the directional comparison task between query and source documents, evaluated using 5-fold cross-validation on the 545 verified positive pairs.
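
As a rough illustration of the fold structure, the sketch below deals the 545 positive pairs into five test folds of 109 pairs each. The shuffling seed and the simple round-robin assignment are assumptions for illustration only; they do not necessarily match the split used in our experiments.

```python
import random

def k_fold_indices(n_items, n_folds=5, seed=0):
    """Shuffle item indices and deal them round-robin into n_folds test folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

folds = k_fold_indices(545)

for held_out, test_idx in enumerate(folds):
    # All pairs outside the held-out fold form the training set for this round.
    train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    print(f"fold {held_out}: {len(train_idx)} training pairs, {len(test_idx)} test pairs")
```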

B.1 Evaluation Metrics

Unlike standard information retrieval tasks that focus on ranking top-$k$ candidates, our objective is to classify the entire set of possible links between the query and source documents. Since many text segments in the query document do not have a true positive match in the source document, we require metrics that account for the correct rejection of non-links (true negatives). Consequently, we define $N=\text{TP}+\text{FP}+\text{FN}+\text{TN}$ as the total number of text segment pairs and introduce the following error-based metrics to align more closely with the practical constraints of philological workflows.

  • Segment-Misclassification Rate (SMR): defined as the fraction of all query-source pairs that were misclassified. This serves as a global error rate. Values range from $0$ (perfect retrieval) to $1$ (complete failure).

    $$\text{SMR}=\frac{\text{FP}+\text{FN}}{N}$$
  • Global False-Positive Rate (FPR): defined as the share of the total dataset incorrectly predicted as links. A high FPR indicates a system prone to “over-generating” candidate links.

    $$\text{FPR}=\frac{\text{FP}}{N}$$
  • Global False-Negative Rate (FNR): defined as the share of the total dataset that contains true links missed by the system. A high FNR indicates that genuine intertextual references remain undiscovered.

    $$\text{FNR}=\frac{\text{FN}}{N}$$

These metrics allow us to decompose the total error (SMR) into its constituent types (FPR and FNR), providing insights into whether the model is biased towards over-generation or under-retrieval. We calculate these metrics individually for each query segment and report the mean value averaged over all queries.
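
For concreteness, the three rates can be computed from a confusion matrix as in the sketch below. It operates on aggregated counts, here the fold-averaged XLM-RoBERTa Large row of Table 6, rather than on the per-query averaging described above, so it should be read as an illustration of the definitions and not as our evaluation code. The function name is our own.

```python
def global_error_rates(tp, fp, fn, tn):
    """Return SMR, FPR, and FNR, each normalized by the total number of pairs N."""
    n = tp + fp + fn + tn
    return {
        "SMR": (fp + fn) / n,  # all misclassified pairs
        "FPR": fp / n,         # spurious links
        "FNR": fn / n,         # missed links
    }

# Fold-averaged confusion matrix of XLM-RoBERTa Large (Table 6).
rates = global_error_rates(tp=94, fp=211, fn=14, tn=63656)
for name, value in rates.items():
    print(f"{name} = {value:.4f}")
```

By construction, SMR is the sum of FPR and FNR, which is what allows the total error to be decomposed into over-generation and under-retrieval.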

B.2 Base Model

To identify the most effective architectures for our intertextuality detection pipeline, we conducted a comparative analysis of pre-trained models for both pipeline stages.

Retrieval Models.

For the embedding stage, we evaluated the multilingual E5 family (Small, Base, and Large) (DBLP:journals/corr/abs-2402-05672), the Granite embedding models (107m and 278m) (DBLP:journals/corr/abs-2502-20204), and BGE-M3 (DBLP:journals/corr/abs-2402-03216). Additionally, we included SPhilBerta (DBLP:conf/acl/Riemenschneider23) to benchmark a domain-adapted model pre-trained on classical languages.

Classification Models.

For the rerank stage, we selected ten architectures to analyze the trade-off between multilingual generalization and domain-specific pre-training. Our multilingual baselines include the XLM-RoBERTa family (Base and Large) (DBLP:conf/acl/ConneauKGCWGGOZ20), ModernBERT (Base and Large) (DBLP:conf/acl/WarnerCCWHTGBLA25), and mmBERT (Small and Base) (DBLP:journals/corr/abs-2509-06888). To measure the impact of domain adaptation, we evaluated PhilBerta (DBLP:conf/acl/Riemenschneider23), LaBerta, and RoBERTa-Latin, which are pre-trained on Latin corpora. Finally, we included BERT-Romanian (DBLP:conf/emnlp/DumitrescuAP20) to test cross-lingual transfer from a related Romance language.

B.3 Negative Sampling Methods

The quality of dense cross-encoder models depends on the negative examples used during training. We evaluated three negative sampling strategies:

  • Random pairs ($\langle\text{random},\text{random}\rangle$): Pairs are formed by selecting two completely disjoint segments at random from the corpus.

  • Random negatives ($\langle\text{query},\text{random}\rangle$): For each positive query, we sample a negative candidate uniformly at random from the remaining corpus.

  • Hard negatives ($\langle\text{query},\text{similar}\rangle$): We utilize a pre-trained embedding model to identify “hard negatives”, candidates that are semantically similar to the query but are not true intertextual references.
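
The three strategies can be sketched as follows. This is a simplified illustration under stated assumptions, not our training code: segment identifiers, the toy embeddings, and all function names are hypothetical, and the hard-negative variant simply ranks candidates by cosine similarity to the query.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def random_pairs(corpus_ids, n, rng):
    """<random, random>: n pairs of two distinct segments drawn at random."""
    return [tuple(rng.sample(corpus_ids, 2)) for _ in range(n)]

def random_negatives(query_id, corpus_ids, positives, n, rng):
    """<query, random>: pair the query with uniformly sampled non-reference segments."""
    pool = [c for c in corpus_ids if c != query_id and (query_id, c) not in positives]
    return [(query_id, c) for c in rng.sample(pool, n)]

def hard_negatives(query_id, corpus_ids, positives, n, embeddings):
    """<query, similar>: pair the query with its most similar non-reference segments."""
    pool = [c for c in corpus_ids if c != query_id and (query_id, c) not in positives]
    pool.sort(key=lambda c: cosine(embeddings[query_id], embeddings[c]), reverse=True)
    return [(query_id, c) for c in pool[:n]]

# Toy setup (hypothetical): one query "q", four source segments, one known reference.
embeddings = {"q": [1.0, 0.0], "a": [0.9, 0.1], "b": [0.8, 0.3],
              "c": [0.0, 1.0], "d": [-1.0, 0.0]}
corpus_ids = ["a", "b", "c", "d"]
positives = {("q", "a")}
rng = random.Random(0)

rnd_pairs = random_pairs(corpus_ids, 2, rng)
rnd_negs = random_negatives("q", corpus_ids, positives, 2, rng)
hard_negs = hard_negatives("q", corpus_ids, positives, 2, embeddings)
print(rnd_pairs, rnd_negs, hard_negs)
```

Here the hard negatives for "q" are the segments closest to it in embedding space that are not labeled as references, which is what makes them harder to reject than uniformly sampled ones.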

Appendix C Experimental Results

This section presents the detailed quantitative findings of our ablation studies, benchmarking the impact of negative sampling, model architecture, and hyperparameter tuning across both the retrieval and classification stages.

C.1 Information Retrieval Results

We evaluate the dense retrieval (bi-encoder) component using information retrieval metrics to ensure high-quality candidate filtering. As shown in Table 4 and Figure 6, larger multilingual models consistently achieve higher recall scores. Figure 7 further details this performance by displaying the recall scores for the best-performing model for each fold. Complementing this, Figure 8 indicates that increasing negative training ratios improves recall, while Figure 9 isolates the best-performing learning rate and epoch configuration.
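
For readers less familiar with these metrics, the sketch below shows one common way to compute Recall@$k$ and MRR@$k$ from ranked candidate lists. The toy rankings and gold links are hypothetical, and the exact conventions of the evaluation library used in our experiments may differ in detail.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant source ids that appear among the top-k ranked ids."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant id within the top k, or 0 if none is found."""
    for rank, source_id in enumerate(ranked[:k], start=1):
        if source_id in relevant:
            return 1.0 / rank
    return 0.0

# Toy example (hypothetical): ranked source ids per query and their gold links.
rankings = {"q1": ["s3", "s1", "s7"], "q2": ["s2", "s9", "s4"]}
gold = {"q1": {"s1"}, "q2": {"s4", "s8"}}

k = 3
mean_recall = sum(recall_at_k(rankings[q], gold[q], k) for q in gold) / len(gold)
mrr = sum(reciprocal_rank_at_k(rankings[q], gold[q], k) for q in gold) / len(gold)
print(f"Recall@{k} = {mean_recall:.3f}, MRR@{k} = {mrr:.3f}")
```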

Figure 6: Recall@$k$ comparison across embedding model families. Performance of E5, Granite, BGE-M3, and the domain-specific SPhilBerta on the Latin intertextuality retrieval task. We observe that larger models show better performance, with multilingual E5-large achieving the highest scores overall. Error bands show ±1 std across 5-fold cross-validation.
Figure 7: Recall@$k$ performance of the best embedding model across individual folds. We show recall at varying cutoff values ($k\in\{1,5,10,20,100,1000,10000\}$) for the top-performing model on the Latin intertextuality retrieval task. Each bar represents a single fold from 5-fold cross-validation, with the orange line indicating mean ± std. Performance improves substantially with larger $k$, approaching near-perfect recall at $k=10000$.
Column groups: number of predictions (# Pred.), retrieval only (Ret.), retrieve and rerank (Rer.).

| $k$ | # Pred. Ret. | # Pred. Rer. | Ret. TP | Ret. FP | Ret. FN | Ret. F1 | Rer. TP | Rer. FP | Rer. FN | Rer. F1 | FPR $\times 10^{4}$ Ret. | FPR $\times 10^{4}$ Rer. | SMR $\times 10^{4}$ Ret. | SMR $\times 10^{4}$ Rer. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 1,817 | 4,685 | 81 | 4,595 | 27 | 0.26 | 73 | 118 | 35 | 0.66 | 55.82 | 1.45 | 56.15 | 1.88 |
| 10 | 1,817 | 9,370 | 88 | 9,266 | 20 | 0.15 | 78 | 182 | 30 | 0.66 | 112.57 | 2.22 | 112.82 | 2.59 |
| 20 | 1,817 | 18,740 | 91 | 18,616 | 17 | 0.08 | 80 | 275 | 28 | 0.62 | 226.16 | 3.35 | 226.37 | 3.69 |
| 50 | 1,817 | 46,850 | 95 | 46,674 | 13 | 0.04 | 83 | 474 | 25 | 0.58 | 567.02 | 5.77 | 567.18 | 6.08 |
| 100 | 1,817 | 93,700 | 100 | 93,439 | 8 | 0.02 | 85 | 695 | 23 | 0.55 | 1135.15 | 8.47 | 1135.25 | 8.75 |
| 500 | 1,817 | 468,500 | 108 | 467,592 | 0 | 0.00 | 88 | 1,548 | 20 | 0.49 | 5680.51 | 18.84 | 5680.52 | 19.08 |
| 1000 | 1,817 | 937,000 | 108 | 823,043 | 0 | 0.00 | 89 | 2,031 | 19 | 0.47 | 9998.68 | 24.71 | 9998.68 | 24.95 |

Table 3: Comparison between retrieval-only and retrieve-and-rerank. $k$ denotes the number of most similar candidates retrieved by the bi-encoder that are re-ranked using the classification model. Results are averaged across 5 folds. The re-ranking stage substantially reduces error rates (FPR, SMR), which are scaled by $\times 10^{4}$ for readability.
| Model | Recall@10 | Recall@100 | Recall@1000 | MRR@10 | MRR@100 | MRR@1000 |
|---|---|---|---|---|---|---|
| E5-small | 0.528 ± 0.055 | 0.634 ± 0.035 | 0.756 ± 0.031 | 0.481 ± 0.058 | 0.485 ± 0.058 | 0.485 ± 0.058 |
| E5-base | 0.570 ± 0.034 | 0.673 ± 0.030 | 0.795 ± 0.036 | 0.509 ± 0.058 | 0.513 ± 0.057 | 0.513 ± 0.057 |
| E5-large | 0.609 ± 0.053 | 0.715 ± 0.039 | 0.826 ± 0.017 | 0.554 ± 0.043 | 0.557 ± 0.042 | 0.558 ± 0.042 |
| Granite-107m | 0.358 ± 0.055 | 0.451 ± 0.060 | 0.573 ± 0.075 | 0.298 ± 0.056 | 0.302 ± 0.056 | 0.302 ± 0.056 |
| Granite-278m | 0.427 ± 0.053 | 0.538 ± 0.052 | 0.650 ± 0.061 | 0.377 ± 0.055 | 0.381 ± 0.055 | 0.381 ± 0.055 |
| SPhilBerta | 0.468 ± 0.032 | 0.631 ± 0.030 | 0.778 ± 0.035 | 0.409 ± 0.045 | 0.415 ± 0.045 | 0.415 ± 0.045 |
| BGE-M3 | 0.597 ± 0.032 | 0.698 ± 0.031 | 0.827 ± 0.024 | 0.543 ± 0.033 | 0.546 ± 0.033 | 0.547 ± 0.033 |
Table 4: Performance comparison of different base models for the dense retrieval task. All models were fine-tuned using the same negative sampling strategy. Metrics are reported on the evaluation set. Recall and MRR are given at cutoffs of 10, 100, and 1000 as mean ± standard deviation across 5 folds.
Figure 8: Impact of training data imbalance on retrieval. Heatmap showing recall, MAP, MRR, and NDCG scores at different cutoff values (@10, @100, @1000) across negative sampling proportions (1:1 to 1:10). Values are averaged over 5 cross-validation folds. Colors are normalized per column to highlight relative performance differences. Best values per metric are shown in bold with an orange underline. Higher ratios of negative samples generally improve retrieval performance, with optimal results often achieved at 1:5 or higher proportions.
Figure 9: Hyperparameter sensitivity for the dense retriever. We visualize retrieval metrics (recall, MRR, MAP, and NDCG at $k=1000$) across varying learning rates and training epochs. Results are averaged across 5 folds, where annotated values represent the mean score ± standard deviation.

C.2 Classification Results

We evaluate the binary classification (cross-encoder) stage using standard classification metrics alongside our task-specific error-based metrics (FPR, FNR, SMR). Table 5 demonstrates that sampling non-matching negatives for each query segment ($\langle\text{qry},\text{rnd}\rangle$) minimizes global error rates more effectively than random pairing ($\langle\text{rnd},\text{rnd}\rangle$). Regarding architectures, Table 6 highlights the performance benefits of larger base models. Furthermore, Figure 10 shows that increasing negative sampling ratios improves robustness in imbalanced scenarios, while Figure 11 details the grid search over learning rates and epochs used to identify the optimal stability region for convergence.

Column groups: classification metrics (Prec., Rec., F1, Acc.), global error rates (FPR, FNR, SMR), confusion matrix (TP, FP, FN, TN).

| Sampling Method | Prec. | Rec. | F1 | Acc. | FPR | FNR | SMR | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random pairs $\langle\text{rnd},\text{rnd}\rangle$ | 0.05 | 0.91 | 0.09 | 0.94 | 0.0491 | 0.0002 | 0.0493 | 98 | 3149 | 10 | 60717 |
| Random negatives $\langle\text{qry},\text{rnd}\rangle$ | 0.28 | 0.84 | 0.42 | 0.99 | 0.0050 | 0.0003 | 0.0053 | 91 | 317 | 17 | 63549 |
| Hard negatives $\langle\text{qry},\text{sim}\rangle$ | 0.20 | 0.77 | 0.30 | 0.99 | 0.0096 | 0.0004 | 0.0099 | 84 | 605 | 24 | 63262 |
| Mixed negatives $\langle\text{qry},\text{mix}\rangle$ | 0.26 | 0.85 | 0.35 | 0.80 | 0.2036 | 0.0003 | 0.2038 | 92 | 12553 | 16 | 51313 |

Table 5: Performance comparison of the binary classification model across different negative sampling strategies. Metrics are reported on the evaluation set. FPR, FNR, and SMR denote global error rates normalized by the total number of pairs $N$. Results are averaged across 5 folds.
Column groups: classification metrics (Prec., Rec., F1, Acc.), global error rates (FPR, FNR, SMR), confusion matrix (TP, FP, FN, TN).

| Base Model | Prec. | Rec. | F1 | Acc. | FPR | FNR | SMR | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mmBERT Small | 0.27 | 0.86 | 0.40 | 0.99 | 0.0058 | 0.0002 | 0.0061 | 93 | 372 | 15 | 63494 |
| mmBERT Base | 0.18 | 0.90 | 0.29 | 0.99 | 0.0110 | 0.0002 | 0.0111 | 98 | 705 | 10 | 63161 |
| BERT-Romanian | 0.21 | 0.82 | 0.33 | 0.99 | 0.0066 | 0.0003 | 0.0070 | 88 | 424 | 20 | 63442 |
| PhilBerta | 0.13 | 0.91 | 0.22 | 0.98 | 0.0191 | 0.0002 | 0.0192 | 99 | 1225 | 9 | 62641 |
| LaBerta | 0.05 | 0.79 | 0.09 | 0.96 | 0.0421 | 0.0004 | 0.0425 | 85 | 2679 | 23 | 61188 |
| ModernBERT Large | 0.27 | 0.85 | 0.39 | 0.99 | 0.0067 | 0.0002 | 0.0069 | 93 | 425 | 15 | 63441 |
| ModernBERT Base | 0.21 | 0.84 | 0.33 | 0.99 | 0.0082 | 0.0003 | 0.0085 | 91 | 522 | 17 | 63344 |
| XLM-RoBERTa Large | 0.36 | 0.87 | 0.50 | 1.00 | 0.0033 | 0.0002 | 0.0035 | 94 | 211 | 14 | 63656 |
| XLM-RoBERTa Base | 0.20 | 0.85 | 0.33 | 0.99 | 0.0072 | 0.0003 | 0.0075 | 92 | 460 | 16 | 63407 |
| RoBERTa-Latin | 0.00 | 0.47 | 0.01 | 0.62 | 0.3754 | 0.0009 | 0.3764 | 50 | 24003 | 58 | 39864 |

Table 6: Performance comparison of different pre-trained base models used in the binary classification stage. All models were fine-tuned using the same negative sampling strategy. Metrics are reported on the evaluation set. FPR, FNR, and SMR denote global error rates normalized by the total number of pairs $N$. Results are averaged across 5 folds.
Figure 10: Impact of training data imbalance on classification performance. The plot illustrates the recall metric on the evaluation set as the ratio of negative-to-positive training samples is increased from $1{:}1$ to $1{:}10$. Shaded areas represent the standard deviation across the 5 cross-validation folds.
Figure 11: Hyperparameter sensitivity analysis for the classification model. We visualize key performance metrics (F1 score, accuracy, precision, and recall) across varying learning rates and training epochs. Results are averaged across 5 folds, where annotated values represent the mean score ± standard deviation.

C.3 Combined Pipeline Results

Table 3 compares the retrieval-only baseline with the retrieve-and-rerank pipeline at varying retrieval depths ($k$), relating performance gains to the number of required predictions.

Appendix D Python Package

To facilitate reproducibility and support the digital humanities community, we release the framework described in this paper as an open-source Python package. The locisimiles library (https://github.com/julianschelb/locisimiles) streamlines the detection of intertextual links by implementing a standardized “retrieve-and-rerank” pipeline and calculating the task-specific error metrics (SMR, FPR, FNR) defined in Section B.1.

D.1 Python API

The core functionality allows researchers to load custom query and source documents (in CSV format) and execute the detection pipeline using pre-trained models from the Hugging Face Hub.

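# Assumed import path (not part of the original listing); check the package
# README for the exact module layout before running.
from locisimiles import ClassificationPipeline, Document, pretty_print

# 1. Load query and source documents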
query_doc = Document("query.csv")
source_doc = Document("source.csv")

# 2. Initialize the pipeline
pipeline = ClassificationPipeline(
    classification_name="...",
)

# 3. Run the pipeline
results = pipeline.run(
    query=query_doc,
    source=source_doc,
)

# 4. Display results
pretty_print(results)

D.2 Graphical User Interface

To lower the barrier to entry, the package includes an optional Gradio-based GUI. This interface can be installed via the optional dependency group (pip install ".[gui]") and launched directly from the command line by executing locisimiles-gui.

Workflow.

The application workflow is organized into three sequential stages, as illustrated in Figure 12:

  1. Data Upload: Users ingest custom query and source documents via CSV files.

  2. Configuration: The pipeline is customized by selecting pre-trained models and tuning retrieval parameters (e.g., retrieval depth $k$ and classification confidence thresholds).

  3. Result Exploration: The interactive dashboard presents query segments alongside retrieved source candidates, displaying cosine similarity and classification probability scores, with functionality to export confirmed matches.

Figure 12: Graphical user interface workflow. (A) Data Upload: Users ingest query and source documents via CSV files. (B) Configuration: The pipeline is customized by selecting pre-trained models and tuning the retrieval depth (top-$k$) and classification confidence threshold. (C) Result Exploration: The dashboard presents query segments alongside retrieved source candidates, displaying both cosine similarity and classification probability scores, with options to export matches as a CSV file.
Examples of Intertextual References

Paraphrase (Minor)
Source: Cicero, Catil. 1.1: “Quo usque tandem abutere, Catilina, patientia nostra?”
(How long, Catiline, will you abuse our patience?)
Reuse: Jerome, Epist. 98.22: “… et patientia nostra quasi quodam temeritatis fomite abutentes …”
(… and abusing our patience like some kindling of rashness …)
Comment: Jerome integrates Cicero’s famous invective syntactically by adapting the verb form (abutere $\rightarrow$ abutentes).

Paraphrase (Major)
Source: Virgil, Georg. 4.82: “… ingentes animos angusto in pectore versant …”
(… they wield mighty souls in a tiny breast …)
Reuse: Jerome, Epist. 107.13: “… et in paruis corpusculis ingentes animos intueri!”
(… and to see mighty souls in small bodies!)
Comment: Jerome retains the semantic core but rephrases angusto in pectore to in parvis corpusculis.

Allusion
Source: Cicero, Orat. 33.11: “… sed nihil difficile amanti puto.”
(… but I think nothing is difficult for a lover.)
Reuse: Jerome, Epist. 22.40: “Nihil amantibus durum est, nullus difficilis cupienti labor.”
(Nothing is hard for lovers, no labor difficult for the desirous.)
Comment: Jerome evokes the motif using synonymous but distinct vocabulary (difficile amanti vs. amantibus durum).
Figure 13: Example references. Three instances of text reuse by Jerome included in the ground truth dataset.