Loci Similes: A Benchmark for Extracting Intertextualities
in Latin Literature

Julian Schelb^†, Michael Wittweiler^⋄, Marie Revellio^‡∗,
Barbara Feichtinger^‡ and Andreas Spitz^†∗
^†Department of Computer and Information Science, University of Konstanz
^‡Department of Latin Philology, University of Konstanz
^⋄Department of Archaeology, Classical Philology and Ancient Studies, University of Zurich
{firstname.lastname}@uzh.ch
{firstname.lastname}@uni-konstanz.de

Abstract

Tracing connections between historical texts is an important part of intertextual research, enabling scholars to reconstruct the virtual library of a writer and identify the sources influencing their creative process. These intertextual links manifest in diverse forms, ranging from direct verbatim quotations to subtle allusions and paraphrases disguised by morphological variation. Language models offer a promising path forward due to their capability of capturing semantic similarity beyond lexical overlap. However, the development of new methods for this task is held back by the scarcity of standardized benchmarks and easy-to-use datasets. We address this gap by introducing Loci Similes, a benchmark for Latin intertextuality detection comprising a curated dataset of $\sim$ 172k text segments containing 545 expert-verified parallels linking Late Antique authors to a corpus of classical authors. Using this data, we establish baselines for retrieval and classification of intertextualities with state-of-the-art LLMs.

^∗These authors contributed equally as project leads.

1 Introduction

Identifying intertextual connections between documents is an important task in classical philology, as it reveals how later works engage with earlier texts and traditions. For centuries, scholars detected intertextual references by relying on memory and the manual collation of Loci Similes, i.e., parallel passages that exhibit lexical, semantic, or thematic resemblance. Although digitization has augmented this process through lexical search tools, most approaches still depend on exact n-gram matching or heuristic filtering (DBLP:journals/dhq/SchroppKRF24). This is limiting for ancient texts, where intertextuality often manifests itself not as verbal quotation, but as subtle allusion, paraphrase, or thematic variation (DBLP:conf/latech/ManjavacasLK19; gong_augmented_2025), often complicated by orthographic volatility (Miller2025Alignment) (see Figure 2).

Figure 1: Example of intertextual reference. Reuse of a classic Vergilian phrase for speechlessness by Jerome. While retaining the semantic core, the author alters the word order to adapt the expression to a different context.

Recovering such textual reuses is not merely a matter of identifying sources. It facilitates research on broader cultural-historical phenomena (tangherlini2024travels). In particular, it supports work on reception and cultural hybridization in Late Antiquity, where pagan texts persist as the rhetorical substrate of elite writing while being recontextualized within emerging Christian discourse. Classical forms often remain recognizable even as their functions shift toward Christian meaning-making, a phenomenon visible in both syntactic stylometry (gorman2016approaching; DBLP:journals/corr/abs-2109-00601) and semantic motifs. In this sense, quotation patterns suggest how Christian authors do not abandon classical texts and their cultural contexts, but reuse their language, redirecting its meanings and connotations within Christian interpretive frameworks.

A case in point is the Church Father Jerome. Recent digital-hermeneutic studies have begun to map his “micro-quotations” (DBLP:journals/dhq/SchroppKRF24), yet the semantic breadth of his reuse remains a challenge. When Jerome alludes to the Augustan poet Virgil, he often retains the semantic core of a hexameter verse while altering its word order or syntax to suit his Christian prose context. In Jerome’s writings, the general tension between his pagan paideia and Christian discourse emerges with particular clarity in the details of how he quotes classical pagan sources and adapts them for his own texts (see Figure 1). Since canonical authors such as Virgil were deeply embedded in the educational curriculum, quoting them often served as a shorthand for shared cultural memory. Late Antique Christian writers like Jerome inherit this repertoire, but their reuse frequently reframes pagan language within Christian contexts, making citation patterns a measurable trace of shifting cultural authority.

Beyond their extraction, systematically mapping these dependencies allows scholars to reconstruct the “virtual library” available to an author, gaining insight into which sources most strongly shaped their writing. Furthermore, analyzing how these connections cluster, ranging from explicit citations to subtle echoes, helps refine theoretical definitions of intertextuality and investigate the rhetorical motivations behind text reuse. However, the development of automated methods for detecting intertextual connections with language models is held back by the absence of a standardized benchmark and easily accessible datasets.

In this paper, we take a step towards addressing this gap by introducing Loci Similes, a benchmark designed to enable researchers to systematically compare and evaluate computational approaches for intertextuality detection.

Our paper makes the following key contributions:

• Curated benchmark dataset of $\sim$ 172k Latin text segments, partitioned into query and source corpora, accompanied by a ground truth dataset of $545$ expert-verified intertextual links.
• Evaluation framework that shifts the paradigm from standard query-target sentence matching of Information Retrieval to whole-document comparison, aligning more closely with the practical constraints of philological workflows.
• Baseline results for embedding models, classification models, and an end-to-end pipeline for intertextuality detection, serving as a foundation for future comparisons.

2 Related Work

2.1 Intertextuality Detection in Latin

Previous research on Latin intertextuality has mainly relied on matching word-level embeddings. DBLP:conf/naacl/BurnsBLCD21 utilized static Word2Vec models trained on lemmatized text to rank potential intertextual phrases, evaluating performance against a dataset of 945 parallels from Valerius Flaccus’ Argonautica dexter_ldquodatabase_2024. Although this approach outperformed traditional lexical methods in capturing semantic similarity, it is limited by the inability of static embeddings to handle context-dependent polysemy.

Addressing this limitation, gong_augmented_2025 substituted static vectors with LatinBERT DBLP:journals/corr/abs-2009-10053, a transformer-based model generating context-aware embeddings. By computing similarities between source words and target bigrams, their system successfully identified allusions in Lucan’s Pharsalia, with a user study confirming its utility for philological interpretation compared to alternative tools. DBLP:conf/latech/ManjavacasLK19 modeled the detection of biblical allusions in Latin sermons as an information retrieval task, finding that semantic FastText embeddings improve ranking over lexical baselines. DBLP:conf/chr/ManjavacasKK20 moved beyond pure retrieval to analyze the contextual drivers of intertextuality in the Patrologia Latina by modeling lexical similarity and thematic embedding as separate axes.

revellio2022zitate introduced a mixed digital-hermeneutic approach, designed to uncover previously undetected Vergilian citations and allusions in Jerome’s letter corpus and, in doing so, to examine Jerome’s patterns of quotation and adaption while probing the outer limits of quotation itself. Building on this foundation, DBLP:journals/dhq/SchroppKRF24 further elaborate this procedure at the level of automated pre-selection: starting from “Simillima’s” raw two-word matches, they expand and recalibrate the cascade of rule-based filters (e.g., stop-lists, part-of-speech constraints, and lemma distance criteria) in order to make the identification of very short “micro-quotations” feasible at scale.

Both approaches are restricted to n-grams. In contrast, DBLP:conf/acl/Riemenschneider23 applied the Sentence Transformer architecture DBLP:conf/emnlp/ReimersG19 to ancient languages. They introduced SPhilBerta, a multilingual model fine-tuned on parallel sentences in Ancient Greek, Latin, and English, framing the task as binary classification over sentence pairs.

DBLP:journals/corr/abs-2109-00601 applied stylometry to Latin by generating contextual embeddings with LatinBERT for a corpus of 39 classical and medieval texts. Their unsupervised clustering approach to authorship attribution offered quantitative evidence connecting the anonymous Gesta Principum Polonorum to the Monk of Lido.

2.2 Text Reuse in Other Languages

Beyond the specific domain of Latin literature, automated detection of text reuse has been explored across languages and tasks.

Existing general-purpose frameworks draw on literary theory to distinguish between explicit citations and implicit commentary using graph-based approaches DBLP:journals/coling/KuznetsovBEG22, and take advantage of LLM-assisted metadata extraction and vector similarity to detect explicitly marked and implicit cross-document influences DBLP:journals/corr/abs-2410-15145. In general, current trends indicate that traditional lexical methods are outperformed by contextual models; for example, DBLP:conf/starsem/MacLaughlinXS21 modeled reuse detection as a binary classification task using fine-tuned BERT and Longformer models.

Within computational literary studies, researchers have applied contextual models to track reuse between genres and periods. In the bibliometric domain, DBLP:conf/ecir/BertinA18 constructed InTeReC, a corpus of more than 300,000 sentences with citations extracted from English-language PLOS articles. For French fiction, DBLP:journals/corr/abs-2410-17759 applied language models to a corpus of 12,000 novels from the 18th to early 20th centuries. In the Scandinavian context, tangherlini2024travels mapped links between Hans Christian Andersen’s travelogs and fairy tales using Sentence-Transformers and BERTopic to uncover latent motifs. Additionally, liebl-burghardt-2020-shakespeare identified Shakespearean quotes and paraphrases within contemporary fiction.

In addition to general-purpose methods, specialized approaches address low-resource languages and their unique challenges. Miller2025Alignment enhanced reuse detection in Hebrew and Aramaic manuscripts by incorporating fastText embeddings, while gorman2016approaching modeled authorial style in Ancient Greek by analyzing “syntax words” via unsupervised hierarchical clustering. Additionally, DBLP:journals/talip/SharjeelMNNR23 detected cross-lingual reuse in English-Urdu news articles by combining machine translation with classifiers.

2.3 Further Related Tasks

The detection of intertextualities shares concepts with other NLP tasks that identify dependencies or similarities between texts: quote detection DBLP:conf/nodalida/JanickiKM23, paraphrase identification DBLP:journals/es/VrbanecM23, and document alignment DBLP:conf/eacl/MolfeseBTCN24.

Quote detection typically targets direct, explicitly marked quotations within contemporary corpora DBLP:conf/konvens/PetersenFreyB24; DBLP:conf/lrec/ZhangL22; DBLP:conf/ranlp/PapayP19; DBLP:conf/wsdm/VaucherSC021. This contrasts with our focus on historical intertextuality, which often relies on subtle, unmarked allusions rather than verbatim reuse. Similarly, paraphrase detection identifies semantic similarity often in the context of plagiarism DBLP:conf/paclic/MahmoudZ22; DBLP:conf/emnlp/WahleRKG22; DBLP:journals/corr/abs-2303-13989; DBLP:conf/emnlp/WahleGR23. However, intertextuality involves recontextualization, requiring the detection of stylistic echoes where the semantic core has shifted to serve a new argument. Finally, document alignment focuses on matching entire documents between parallel texts or translations DBLP:conf/ranlp/RajithaPSR21; DBLP:journals/corr/abs-2510-15577. In contrast, intertextuality detection operates at a granular level, identifying isolated references between documents that are otherwise structurally and semantically distinct.

Figure 2: Spectrum of intertextuality. References manifest in diverse forms, spanning from easily detectable verbatim quotations to adapted paraphrases and subtle allusions where only a semantic core remains.

3 Dataset Construction

While traditional scholarship has documented numerous Latin intertextual parallels, computational research remains constrained by the lack of standardized benchmark datasets. Aggregating these references is non-trivial, as the data is dispersed across commentaries, indices locorum (“back-of-the-book indexes”), and focused philological case studies. To address this gap, we curated a corpus of $\sim$ 172k text segments spanning works by multiple Latin authors (see Table 1) and a ground truth dataset of 545 confirmed intertextual links identified within these texts (see Figure 3).

3.1 Latin Corpus Curation

We divide the corpus into a separate Query Corpus and a Source Corpus. The former comprises works by the Late Antique authors Jerome and Lactantius, totaling $\sim$ 83k text segments. The latter, which serves as the retrieval target, consists of $\sim$ 88k segments drawn from ten canonical classical Latin authors: Cicero, Lucretius, Catullus, Virgil, Horace, Tibullus, Propertius, Ovid, Lucan, and Martial. A detailed breakdown of segment counts and token statistics for each author is presented in Table 1. The texts were aggregated from three primary digital repositories: Corpus Corporum, the Tesserae Project, and the OpenGreekandLatin Project. For a complete listing of the specific works, critical editions utilized, and their respective provenance, see the detailed breakdown in Appendix A (Table 2). Since we exclusively rely on publicly available texts to compile the corpus, we released the corpus files alongside the ground-truth dataset.

3.2 Ground Truth Construction

To construct our ground truth dataset of 545 confirmed intertextual links, we draw from two complementary sources.

First, we sourced 270 instances of references to Virgil and Cicero in Jerome’s Epistulae from the dataset (https://doi.org/10.11588/data/FVCULR) established by DBLP:journals/dco/SchroppWKRF24, which generally fall into three categories: verbatim quote, paraphrase, and allusion (see Figure 2). Notably, we do not retain all entries from the original dataset on a one-to-one basis. While DBLP:journals/dco/SchroppWKRF24 treat each poetic verse as a separate unit, our dataset operates at an approximate sentence level to ensure alignment between source and reused material. Consequently, we consolidated multiple entries into a single instance in cases where citations span two or more verses. Furthermore, our expert annotators rejected links from previous scholarship that were deemed controversial or lacking sufficient evidence of intertextuality.

Second, we augmented these explicitly attested references by adding 275 additional intertextual links. We first identify candidates through a rule-based n-gram matching approach published by DBLP:journals/dhq/SchroppKRF24 (https://github.com/MWittweiler/citation_analysis). Their pipeline identifies candidate intertextual pairs by scanning for shared, non-contiguous tokens within a defined window. To ensure meaningful matches, candidates are then refined through a cascade of filters that remove high-frequency stopwords, enforce part-of-speech constraints, and exclude generic collocations based on embedding similarity. Subsequently, the candidates are manually evaluated by domain experts, and only those confirmed to constitute meaningful intertextual links are retained in the dataset.

Figure 3: Distribution of confirmed references. Manually verified intertextual links in the annotated dataset by citing author (Jerome, Lactantius) and source author (including Virgil, Cicero, and others).

Author	Segments	Avg. Tokens	Min	Max	Std. Dev.
Query Corpus
Jerome	74,672	30.94	1	582	22.14
Lactantius	8,444	28.07	1	369	19.01
Total / Avg.	83,116	30.65	1	582	21.86
Source Corpus
Cicero	54,331	28.61	1	847	24.77
Ovid	14,096	25.88	1	396	18.81
Virgil	4,861	29.49	1	351	19.50
Martial	4,114	24.60	2	249	19.28
Lucan	2,952	31.24	2	243	21.80
Horace	2,353	32.39	2	193	24.09
Propertius	1,889	22.12	2	267	17.51
Lucretius	1,826	40.74	3	255	27.25
Catullus	809	27.34	2	149	22.00
Tibullus	788	26.09	2	251	20.66
Total / Avg.	88,019	28.30	1	847	23.20

Table 1: Corpus statistics by author. Breakdown of the dataset into Query (citing authors) and Source (referenced authors) corpora, showing the number of segments and token statistics for each author.

3.3 Annotation Process

Although the filtering process of the candidate generation pipeline removed many false positives, numerous candidates still involved common collocations (e.g., puncto temporis “in a moment”) and required manual exclusion. Therefore, candidates identified by the $n$-gram pipeline were annotated by a team of four experts in Latin literature (two pre-PhD and two post-PhD researchers), who assessed whether the lexical overlap constituted a meaningful intertextual reference or merely a coincidental overlap or common phrase. Conflicts in the annotation were resolved through group discussion. Our expert annotators used the following three criteria to guide their annotation in identifying meaningful intertextual links:

1. Use of Uncommon Vocabulary: If the lexical overlap consists of rare or marked words, the likelihood increases that these elements were deliberately borrowed.
2. Attested Frequency: A specialized Latin corpus database (http://clt.brepolis.net/llta/Search) served to help determine the frequency of the overlapping expression in Latin literature. Expressions appearing exclusively in the candidate source and target passages were treated as strong indicators of a unique intertextual relationship.
3. Conduit Function: Most importantly, if the cited expression contributes semantic, rhetorical, or cultural information from the source text that cannot be derived from the target passage in isolation, thereby implicitly enriching the interpretation, it is viewed as intertextual.

We release the complete dataset including the query texts, source corpus, and annotated intertextual links as a HuggingFace dataset (https://huggingface.co/collections/julian-schelb/datasets-for-latin-intertextuality-search).

4 Evaluation Framework

To benchmark intertextuality detection methods, we also developed an evaluation framework that serves as the basis for the baseline experiments in this study. Our motivation is to reframe the task as a segment-wise comparison of whole documents since this formulation is a more appropriate representation of the work of scholars analyzing historical texts than traditional information retrieval pipelines.

4.1 Task Definition

We define the task of intertextuality detection as the identification of directional dependencies between two texts: a query document (the chronologically later text) and a source document (the earlier text). We formulate this as a retrieval and alignment problem where the objective is to map each text segment in the query document to a candidate set of text segments in the source document. This mapping is inherently one-to-many; a single query sentence may correspond to zero, one, or multiple distinct segments in the source text, exhibiting either semantic equivalence or stylistic imitation.

4.2 Evaluation Metrics

Since we compare all text segments of the query and source documents, and given that many queries will correspond to no intertextual links, we require metrics that effectively account for the correct rejection of non-links (true negatives). We therefore employ error-based metrics normalized by the total number of sentence pairs ($N$): Segment-Misclassification Rate (SMR), which serves as a global error rate; Global False-Positive Rate (FPR), measuring spurious matches; and Global False-Negative Rate (FNR), measuring missed references. The formal definitions and equations for these metrics are provided in Appendix B.1.
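The following is a minimal sketch of how such pair-normalized error rates could be computed. The formal definitions are given in Appendix B.1; the formulas used here (SMR as the sum of false positives and false negatives over $N$, FPR and FNR as the respective counts over $N$) are an assumed reading of the description above, and the function name and segment identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (query_segment_id, source_segment_id), hypothetical IDs


def error_rates(predicted: Set[Pair], gold: Set[Pair],
                n_query: int, n_source: int) -> Dict[str, float]:
    """Error counts normalized by all N = n_query * n_source segment pairs.

    Assumed definitions (not taken from Appendix B.1):
    SMR = (FP + FN) / N, FPR = FP / N, FNR = FN / N.
    """
    n_pairs = n_query * n_source
    false_pos = len(predicted - gold)  # predicted links absent from the ground truth
    false_neg = len(gold - predicted)  # ground-truth links that were missed
    return {
        "SMR": (false_pos + false_neg) / n_pairs,
        "FPR": false_pos / n_pairs,
        "FNR": false_neg / n_pairs,
    }


# Toy example: 3 query segments, 2 source segments, so N = 6 pairs.
gold = {("q1", "s2"), ("q3", "s1")}
predicted = {("q1", "s2"), ("q2", "s2")}
rates = error_rates(predicted, gold, n_query=3, n_source=2)
```

Because every pair that is correctly left unlinked contributes to the denominator, a pipeline is rewarded for rejecting non-links, which is the property motivated above.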

4.3 Framework Features

The locisimiles package streamlines the end-to-end workflow by allowing users to load query and source documents split into text segments, which are then compared to generate a list of potential intertextual links. The library provides reference implementations for three core architectures: a classification-only pipeline, a retrieval-only pipeline, and a hybrid retrieve-and-rerank pipeline. Beyond these baselines, the framework is designed to be extensible, allowing researchers to implement and test custom approaches by adding their own pipelines. When a ground-truth dataset is provided, the package evaluates a given pipeline using both the error-based metrics described above, which are selected to reflect the reality of scholarly editorial work, and standard classification metrics. The software is released as a standard Python package that includes a graphical user interface (GUI) component for ease of use and is available as the open-source package locisimiles (https://github.com/julianschelb/locisimiles). See Appendix D for details on how to use the package.

5 Baseline Methods

To provide a best-effort baseline for the extraction of intertextualities, we frame the identification of intertextual links as an Information Retrieval (IR) task (see Figure 4). Given a query text segment from the author under investigation (e.g., Jerome), our system queries an index of source texts (e.g., Virgil) to identify potential allusions. The pipeline operates in two phases: first, a bi-encoder model performs dense retrieval to generate a candidate set of potential references per query; second, a classification model performs binary classification to re-rank the top-$k$ candidates. This approach allows us to filter vast amounts of text rapidly using precomputed embeddings while reserving computationally expensive comparisons using the classification model for the most promising candidates.
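The control flow of the two phases can be sketched as follows. This is an illustrative skeleton rather than the locisimiles implementation: the function names, the decision threshold of 0.5, and the token-overlap scorer standing in for both the bi-encoder and the classifier are assumptions made for the example, and the Latin strings are toy inputs.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_and_rerank(
    queries: Dict[str, str],
    sources: Dict[str, str],
    similarity: Callable[[str, str], float],        # stand-in for bi-encoder cosine similarity
    link_probability: Callable[[str, str], float],  # stand-in for the classification model
    k: int,
    threshold: float = 0.5,                         # assumed decision threshold
) -> List[Tuple[str, str, float]]:
    """Phase 1: keep the k most similar sources per query. Phase 2: classify each kept pair."""
    links = []
    for qid, qtext in queries.items():
        ranked = sorted(sources, key=lambda sid: similarity(qtext, sources[sid]), reverse=True)
        for sid in ranked[:k]:
            prob = link_probability(qtext, sources[sid])
            if prob >= threshold:
                links.append((qid, sid, prob))
    return links


def toy_overlap(a: str, b: str) -> float:
    """Jaccard token overlap; a toy scorer, not a language model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


queries = {"q1": "haesit vox in faucibus"}  # toy query text
sources = {"s1": "vox faucibus haesit", "s2": "arma virumque cano"}
links = retrieve_and_rerank(queries, sources, toy_overlap, toy_overlap, k=1)
```

Only the pairs that survive the first phase are passed to the second scorer, which is what keeps the expensive comparison confined to at most $k$ candidates per query.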

5.1 Information Retrieval

Unlike traditional methods that rely on sparse lexical overlap, we leverage language models to encode text segments into dense vector representations. Specifically, we treat text segments from the author under investigation (e.g., Jerome) as queries to be matched against an index of source texts (e.g., Virgil). We employ sentence transformer models to generate embeddings for both the query and source segments. During inference, the bi-encoder is used to obtain the cosine similarity between the query embedding and every source embedding in the database. The candidates are then ranked by their similarity scores, identifying the most likely intertextual references.
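As a small illustration of the ranking step, the sketch below scores every source embedding against one query embedding by cosine similarity and keeps the top $k$. The three-dimensional vectors are toy values standing in for sentence transformer outputs, and the segment identifiers and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_sources(query_vec: Sequence[float],
                 source_vecs: Dict[str, Sequence[float]],
                 k: int) -> List[Tuple[str, float]]:
    """Score every source embedding against the query and return the k best."""
    scored = [(sid, cosine(query_vec, vec)) for sid, vec in source_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings; real embeddings would come from the bi-encoder.
source_vecs = {
    "aen_1": [0.9, 0.1, 0.0],
    "georg_2": [0.1, 0.9, 0.2],
    "ecl_3": [0.7, 0.3, 0.1],
}
top = rank_sources([1.0, 0.0, 0.0], source_vecs, k=2)
```

In practice the source embeddings would be precomputed once and the scoring done with vectorized matrix operations; the loop form is used here only to keep the example dependency-free.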

Training setup.

To adapt the bi-encoder for this specific domain, we fine-tuned the model using Online Contrastive Loss to produce similar embeddings for pairs constituting an intertextual link and dissimilar embeddings for unrelated segments. Fine-tuning was performed over 4 epochs with a batch size of 32, using a learning rate of $2\times 10^{-5}$ and applying a weight decay of 0.01 to prevent overfitting. Consistent with state-of-the-art dense retrieval approaches DBLP:conf/acl/SuSKWHOYSZ023, we distinguish the asymmetric roles of text pairs by prepending the prefixes “Query: ” to target text segments and “Candidate: ” to source text segments prior to encoding.

5.2 Binary Classification

Following the initial retrieval stage, we employ a cross-encoder architecture to re-rank the top-$k$ candidates. Unlike the bi-encoder, which generates static embeddings for query and source segments independently, the cross-encoder processes the query and candidate as a single paired input DBLP:conf/emnlp/ReimersG19. This allows the self-attention mechanism to directly compare the two segments at the token level and capture interactions and dependencies between them.

Input construction.

To construct the input for the binary classifier, we concatenate the target sentence (query) and the retrieved source sentence (candidate), separated by the model’s specific special tokens (see Figure 4). To prevent either segment from dominating the input window, we truncate both the query and candidate segments to $50\%$ of the available token budget before concatenation.
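The truncation rule can be illustrated with a short sketch. It operates on pre-tokenized lists and uses "[CLS]" and "[SEP]" as placeholder special tokens; the actual tokens, and how many of them a given model inserts between a pair, depend on the tokenizer, so the reserved count of three is an assumption, as is the function name.

```python
from typing import List


def build_pair_input(query_tokens: List[str], candidate_tokens: List[str],
                     max_len: int, cls: str = "[CLS]", sep: str = "[SEP]") -> List[str]:
    """Concatenate query and candidate, each truncated to half of the token budget.

    The budget is max_len minus the special tokens (assumed here: one start
    token and two separators; real tokenizers may differ).
    """
    budget = max_len - 3
    half = budget // 2
    query_part = query_tokens[:half]          # at most 50% of the budget
    candidate_part = candidate_tokens[:half]  # at most 50% of the budget
    return [cls] + query_part + [sep] + candidate_part + [sep]


query = [f"q{i}" for i in range(10)]  # a long query segment
candidate = ["c0", "c1", "c2"]        # a short candidate segment
pair = build_pair_input(query, candidate, max_len=11)
```

Note that under this rule a short segment does not donate its unused share to the longer one: the long query above is still cut to half of the budget even though the candidate needs less.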

Training setup.

We fine-tuned the model to predict the probability that a given pair constitutes a valid intertextual link. The model was trained for 4 epochs with a batch size of 32, using the AdamW optimizer with a learning rate of $2\times 10^{-5}$ . The model minimizes the cross-entropy loss between the predicted similarity scores and the ground truth binary labels.

6 Experimental Setup

In this section, we utilize the dataset constructed in Section 3 and the evaluation framework described in Section 4 to establish baseline results for both information retrieval and classification approaches.

6.1 Dataset Split

To ensure statistical robustness, we employ $k$-fold cross-validation ($k=5$) on the set of 545 verified positive text segment pairs. For each training fold, we generate contrastive examples by augmenting the assigned positive pairs with $n$ negative samples. In the evaluation phase, we reconstruct a realistic retrieval scenario to test the model’s ability to distinguish genuine citations from background noise: the query document combines the fold’s held-out positive instances with a subset of the query corpus (totaling 937 segments), while the source document consists of 880 source segments.
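A minimal sketch of such a split is shown below. It partitions 545 placeholder pairs into five folds and augments the training positives with $n$ randomly drawn negatives per positive; random sampling is only one possible strategy and is chosen here for simplicity. The identifiers, seeds, and function names are hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (query_segment_id, source_segment_id), placeholder IDs


def make_folds(pairs: List[Pair], k: int = 5, seed: int = 0) -> List[List[Pair]]:
    """Shuffle the positive pairs and deal them into k folds of near-equal size."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def add_random_negatives(positives: List[Pair], source_ids: List[str],
                         n: int, seed: int = 0) -> List[Tuple[str, str, int]]:
    """Label positives 1 and add n random non-linked sources per positive, labeled 0."""
    rng = random.Random(seed)
    known = set(positives)
    examples = [(q, s, 1) for q, s in positives]
    for q, _ in positives:
        pool = [s for s in source_ids if (q, s) not in known]
        for s in rng.sample(pool, min(n, len(pool))):
            examples.append((q, s, 0))
    return examples


pairs = [(f"q{i}", f"s{i}") for i in range(545)]  # stand-ins for the 545 verified links
folds = make_folds(pairs)
held_out = folds[0]
train_pos = [p for fold in folds[1:] for p in fold]
train = add_random_negatives(train_pos, [f"s{i}" for i in range(545)], n=2)
```

With 545 positives and $k=5$, each held-out fold contains 109 pairs.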

6.2 Model Configuration and Ablations

To identify the optimal pipeline architecture, we conducted extensive independent evaluations of the dense retrieval and binary classification stages. Details on base model evaluation, negative sampling ablation studies, and hyperparameter configurations are provided in Appendix B.

•

Base Models: We evaluated a range of pre-trained language models to isolate the effects of domain specificity versus model size.
•

Negative Sampling: We compared three distinct training strategies (random pairs, random negatives, and hard negatives) while varying the positive-to-negative ratio from $1{:}1$ to $1{:}10$ to optimize the model’s discrimination capability.
•

Combined Pipeline: For the end-to-end pipeline, we varied the retrieval depth $k$ to identify the operating point that minimizes recall loss while maximizing computational efficiency.

7 Experimental Results

In this section, we present our quantitative findings.

7.1 Information Retrieval Results

We evaluated the impact of training data imbalance by varying the positive-to-negative sample ratio from $1{:}1$ to $1{:}10$. Performance improved consistently with more negative examples, and the $1{:}10$ ratio, the largest we tested, yielded the best results, suggesting potential gains from further increasing the negative sample proportion. Comparing base models showed that large multilingual models outperformed native Latin baselines. E5-large proved to be the most effective, retrieving approximately 61% of relevant references at Recall@10 (72% at Recall@100, 83% at Recall@1000). For a detailed evaluation of negative sampling and base models, see Appendix C.1.
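For reference, Recall@$k$ can be computed as in the sketch below. It assumes a micro-averaged definition, i.e., the share of all ground-truth references that appear among the top $k$ ranked sources of their query; whether the reported figures are micro- or macro-averaged is not stated here, and the identifiers are toy values.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Share of ground-truth (query, source) links found within the top-k ranking.

    Micro-averaged over all links (an assumption for this sketch).
    """
    hits = 0
    total = 0
    for query_id, relevant in gold.items():
        top_k = set(ranked.get(query_id, [])[:k])
        hits += len(relevant & top_k)
        total += len(relevant)
    return hits / total if total else 0.0


# Toy rankings: q1 has two true sources, q2 has one.
ranked = {"q1": ["s3", "s1", "s9", "s2"], "q2": ["s5", "s7", "s4"]}
gold = {"q1": {"s1", "s2"}, "q2": {"s4"}}
r2 = recall_at_k(ranked, gold, k=2)
r4 = recall_at_k(ranked, gold, k=4)
```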

7.2 Binary Classification Results

First, training with hard negatives ($\langle\text{qry},\text{sim}\rangle$) minimized error rates more effectively than random pairing. This approach enforces finer semantic distinctions, allowing the model to better distinguish genuine intertextuality from mere topical similarity. Second, we analyzed the impact of the imbalance in the training data by varying the ratio of positive-to-negative samples from $1{:}1$ to $1{:}10$. Consistent with our findings in the retrieval stage, increasing the volume of negative examples improves the robustness of the model in imbalanced scenarios. Finally, in terms of base model architectures, XLM-RoBERTa Large achieves an average F1-score of 0.5 along with the lowest average SMR of 35 misclassified links per 10,000 candidates. For detailed evaluation results, see Appendix C.2.

7.3 Retrieve and Rerank Results

We evaluate the full two-stage retrieve-and-rerank pipeline, which first generates $k$ candidates via embedding cosine similarity, followed by a binary classification stage to label pairs as Reference or No Reference. As a baseline, we consider a “retrieval only” approach where the top $k$ candidates are treated as positive predictions, simulating the workload of a scholar manually reviewing the top results. Experiments are averaged across 5 folds. For this experiment, we use a larger corpus of 937 query and 880 source segments ($\sim$ 109 ground-truth references per fold). As shown in Figure 5, the “retrieve-and-rerank” pipeline reduces false positives compared to the retrieval-only approach. At $k=5$, it achieves an F1-score of 0.66 versus 0.26, and maintains an F1 of 0.55 at $k=100$ where pure retrieval drops to 0.02. Crucially, the SMR remains orders of magnitude lower (0.0019 vs. 0.57 at $k=500$), indicating that the classifier rejects the vast majority of spurious candidates.

In practical terms, with a reasonable retrieval depth of $k=100$ , the retrieve-and-rerank pipeline recovers 79% of true intertextual references (85 out of 108). A scholar utilizing this system would need to manually review 780 candidate passages to verify the 85 genuine intertextualities, instead of reviewing the prohibitively large number of 93,700 cases with the retrieval-only pipeline (see Table 3). This corresponds to a 99% reduction in workload while retaining nearly 80% of the relevant data.
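The workload figures above follow from simple arithmetic, reproduced here as a check. All numbers are taken from the text; the variable names are illustrative.

```python
n_queries = 937        # query segments in the evaluation document
k = 100                # retrieval depth
rerank_reviews = 780   # candidates left for manual review after reranking
recovered = 85         # true references among them
total_refs = 108       # ground-truth references

# Retrieval only: every one of the top-k candidates per query must be reviewed.
retrieval_only_reviews = n_queries * k

reduction = 1 - rerank_reviews / retrieval_only_reviews
recall = recovered / total_refs

print(f"retrieval-only reviews: {retrieval_only_reviews}")
print(f"workload reduction: {reduction:.1%}")
print(f"references recovered: {recall:.1%}")
```

This prints 93700 reviews for the retrieval-only pipeline, a workload reduction of 99.2%, and 78.7% of references recovered, in line with the rounded 99% and 79% reported above.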

8 Discussion

Although the bi-encoder reliably retrieves long literal citations and thematic allusions with strong semantic cores, it struggles when minimal lexical reuse occurs in divergent contexts. This results in low similarity scores and therefore low rankings for valid candidates, causing them to be discarded before the classification stage. The classifier, by contrast, correctly identifies these low-overlap references if they pass the retrieval step. In general, larger multilingual models trained with higher negative sampling ratios achieve superior retrieval performance. For classification, we observe consistently high recall but low precision, suggesting that classification models struggle to distinguish meaningful intertextuality from the coincidental use of common phrases.

9 Conclusion

In this work, we introduced Loci Similes, a benchmark dataset for Latin intertextuality detection designed to evaluate the capacity of language models to capture semantic similarity beyond exact lexical matching. Our baseline experiments demonstrate that while dense retrieval effectively identifies long literal citations and thematic allusions, detecting subtle “two-word congruencies” remains challenging. Among the evaluated models, E5-large achieved the best retrieval and XLM-RoBERTa Large the best classification results. Overall, our findings suggest that language models offer a promising avenue for this task, but the primary challenge lies in distinguishing meaningful reuse from coincidental lexical overlap, which will require further advances and the development of dedicated architectures for the detection of intertextualities.

Future Work.

We aim to expand our dataset by incorporating existing collections of intertextual links in Latin literature (e.g., dexter_ldquodatabase_2024 and DBLP:journals/dco/SchroppWKRF24) to better enable models to capture wide ranges of stylistic deviations.

Limitations

Data Coverage.

Although our dataset comprises expert-verified positive pairs, it is not (and cannot be) exhaustive. The labeled dataset likely omits some valid references between the selected works. Consequently, it is possible that a few instances classified as “false positives” may represent genuine but undocumented intertextual links, potentially skewing the reported precision.

Labeling Ambiguity.

Defining what constitutes a “reference” in Classical Philology remains a substantial methodological challenge. The boundaries between literal citation, subtle allusion, and general thematic resonance are fluid and subject to differing scholarly definitions. This inherent ambiguity affects the consistency of manual labeling, as annotators may prioritize differing criteria for intertextuality. While the taxonomy in Figure 2 provides a useful analytical framework, it does not apply uniformly across all instances of intertextuality. Many cases exhibit characteristics of multiple categories, and only a minority can be unambiguously assigned to a single class.

Acknowledgments

This work was supported by the German Research Foundation (DFG) in the project “Zitieren als narrative Strategie. Eine digital-hermeneutische Untersuchung von Intertextualitätsphänomenen am Beispiel des Briefcorpus des Kirchenlehrers Hieronymus” [Grant ID: 382880410].

AI Usage Statement

Language model-based AI tools (Codex and GitHub Copilot) were used as coding assistants in the implementation and as writing assistants in drafting parts of the manuscript. The final version of the manuscript was written without AI input.

References

Appendix A Corpus Sources

The texts used to construct our corpus were aggregated from three digital repositories: Corpus Corporum (https://mlat.uzh.ch/), the Tesserae Project (https://github.com/tesserae/tesserae), and the OpenGreekandLatin Project. Table 2 details the specific editions used and Figure 13 shows some examples.

Author	Work	Source	Edition
Virgil	Aeneid	CC	Greenough (1900)
Virgil	Georgics	CC	Greenough (1900)
Virgil	Eclogues	CC	Greenough (1900)
Ovid	Amores	CC	Ehwald (1907)
Ovid	Ars Amatoria	CC	Ehwald (1907)
Ovid	Ex Ponto	CC	Wheeler (1939)
Ovid	Fasti	CC	Frazer (1933)
Ovid	Heroides	CC	Ehwald (1907)
Ovid	Ibis	CC	Merkel/Ehwald (1889)
Ovid	Medicamina	CC	Ehwald (1907)
Ovid	Metamorphoses	CC	Magnus (1892)
Ovid	Remedia Amoris	CC	Ehwald (1907)
Ovid	Tristia	CC	Wheeler (1939)
Martial	Epigrammata	CC	Heraeus (1925)
Lucretius	De Rerum Natura	CC	Martin (1934)
Lucan	Pharsalia	CC	Weise (1835)
Horace	Carmen Saeculare	CC	Shorey (1898)
Horace	Carmina	CC	Shorey (1919)
Horace	Ars Poetica	CC	Smart (1836)
Horace	Epistulae	CC	Fairclough (1929)
Horace	Epodes	CC	Vollmer (1912)
Horace	Saturae	CC	Smart (1836)
Catullus	Carmina	Tess	Merrill
Propertius	Elegiae	Tess	Mueller (1898)
Tibullus	Elegiae	Tess	Postgate (1915)
Cicero	Opera Omnia	Tess	Varia
Jerome	Epistulae	OGL	Hilberg (1910)
Jerome	Varia	CC	Patrologia Latina (1845)

Table 2: Corpus data sources. Abbreviations: CC (Corpus Corporum), Tess (Tesserae Project), and OGL (OpenGreekandLatin Project).

Appendix B Experimental Setup Details

Experiments on the classification model and the full retrieve-and-rerank pipeline utilize the directional comparison task between query and source documents, evaluated using 5-fold cross-validation on the 545 verified positive pairs.
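A 5-fold split over the 545 positive pairs can be sketched as follows. This is a minimal illustration using only the standard library; the function name `k_fold_indices`, the fixed seed, and the choice to split on pair indices (rather than, say, on query segments or with stratification) are assumptions and not taken from the released framework.

```python
import random


def k_fold_indices(n_items, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) index lists for a shuffled k-fold split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Distribute indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# 545 verified positive pairs, as in the benchmark; indices stand in for pairs.
splits = list(k_fold_indices(545, n_folds=5))
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)  # [109, 109, 109, 109, 109]
```

Each pair appears in exactly one test fold, so every positive pair is evaluated once across the five runs.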

B.1 Evaluation Metrics

Unlike standard information retrieval tasks that focus on ranking top-$k$ candidates, our objective is to classify the entire set of possible links between the query and source documents. Since many text segments in the query document do not have a true positive match in the source document, we require metrics that account for the correct rejection of non-links (true negatives). Consequently, we define $N=\text{TP}+\text{FP}+\text{FN}+\text{TN}$ as the total number of text segment pairs and introduce the following error-based metrics to align more closely with the practical constraints of philological workflows.

• Segment-Misclassification Rate (SMR): defined as the fraction of all query-source pairs that were misclassified. This serves as a global error rate. Values range from $0$ (perfect retrieval) to $1$ (complete failure).

$\text{SMR}=\frac{\text{FP}+\text{FN}}{N}$

• Global False-Positive Rate (FPR): defined as the share of the total dataset incorrectly predicted as links. A high FPR indicates a system prone to “over-generating” candidate links.

$\text{FPR}=\frac{\text{FP}}{N}$

• Global False-Negative Rate (FNR): defined as the share of the total dataset that contains true links missed by the system. A high FNR indicates that genuine intertextual references remain undiscovered.

$\text{FNR}=\frac{\text{FN}}{N}$

These metrics allow us to decompose the total error (SMR) into its constituent types (FPR and FNR), providing insights into whether the model is biased towards over-generation or under-retrieval. We calculate these metrics individually for each query segment and report the mean value averaged over all queries.
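The three definitions translate directly into code. The sketch below is an illustration rather than the package's implementation: the function name `global_error_rates` is hypothetical, and it operates on pooled confusion-matrix counts instead of averaging per query segment. The example counts are taken from the XLM-RoBERTa Large row of Table 6.

```python
def global_error_rates(tp, fp, fn, tn):
    """Return SMR, FPR and FNR, each normalized by N = TP + FP + FN + TN."""
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("at least one segment pair is required")
    return {
        "SMR": (fp + fn) / n,  # all misclassified pairs
        "FPR": fp / n,         # non-links predicted as links
        "FNR": fn / n,         # true links that were missed
    }


# Confusion-matrix counts from the XLM-RoBERTa Large row of Table 6.
rates = global_error_rates(tp=94, fp=211, fn=14, tn=63656)
print({name: round(value, 4) for name, value in rates.items()})
# {'SMR': 0.0035, 'FPR': 0.0033, 'FNR': 0.0002}
```

Because both error types share the denominator $N$, SMR is exactly the sum of FPR and FNR, which is what makes the decomposition described above possible.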

B.2 Base Model

To identify the most effective architectures for our intertextuality detection pipeline, we conducted a comparative analysis of pre-trained models for both pipeline stages.

Retrieval Models.

For the embedding stage, we evaluated the multilingual E5 family (Small, Base, and Large) (DBLP:journals/corr/abs-2402-05672), the Granite embedding models (107m and 278m) (DBLP:journals/corr/abs-2502-20204), and BGE-M3 (DBLP:journals/corr/abs-2402-03216). Additionally, we included SPhilBerta (DBLP:conf/acl/Riemenschneider23) to benchmark a domain-adapted model pre-trained on classical languages.

Classification Models.

For the rerank stage, we selected ten architectures to analyze the trade-off between multilingual generalization and domain-specific pre-training. Our multilingual baselines include the XLM-RoBERTa family (Base and Large) (DBLP:conf/acl/ConneauKGCWGGOZ20), ModernBERT (Base and Large) (DBLP:conf/acl/WarnerCCWHTGBLA25), and mmBERT (Small and Base) (DBLP:journals/corr/abs-2509-06888). To measure the impact of domain adaptation, we evaluated PhilBerta (DBLP:conf/acl/Riemenschneider23), LaBerta, and RoBERTa-Latin, which are pre-trained on Latin corpora. Finally, we included BERT-Romanian (DBLP:conf/emnlp/DumitrescuAP20) to test cross-lingual transfer from a related Romance language.

B.3 Negative Sampling Methods

The quality of dense cross-encoder models depends on the negative examples used during training. We evaluated three negative sampling strategies:

• Random pairs ($\langle\text{random},\text{random}\rangle$): Pairs are formed by selecting two completely disjoint segments at random from the corpus.

• Random negatives ($\langle\text{query},\text{random}\rangle$): For each positive query, we sample a negative candidate uniformly at random from the remaining corpus.

• Hard negatives ($\langle\text{query},\text{similar}\rangle$): We utilize a pre-trained embedding model to identify “hard negatives”, i.e., candidates that are semantically similar to the query but are not true intertextual references.
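The three strategies can be sketched in a few lines. This is a toy illustration under stated assumptions: all function names are hypothetical, the four-line corpus and the query are made up for the example and are not drawn from the benchmark, and a simple token-overlap score stands in for the pre-trained embedding model used to find hard negatives.

```python
import random


def token_overlap(a, b):
    """Jaccard overlap of lowercased tokens; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def random_pair(corpus, rng):
    """<random, random>: two distinct segments drawn from the corpus."""
    return tuple(rng.sample(corpus, 2))


def random_negative(query, true_sources, corpus, rng):
    """<query, random>: a uniformly sampled segment that is not a known source."""
    pool = [s for s in corpus if s != query and s not in true_sources]
    return query, rng.choice(pool)


def hard_negative(query, true_sources, corpus, similarity=token_overlap):
    """<query, similar>: the most similar segment that is not a known source."""
    pool = [s for s in corpus if s != query and s not in true_sources]
    return query, max(pool, key=lambda s: similarity(query, s))


# Toy data for illustration only.
corpus = [
    "arma virumque cano troiae qui primus ab oris",
    "arma gravi numero violentaque bella parabam",
    "tityre tu patulae recubans sub tegmine fagi",
    "in nova fert animus mutatas dicere formas",
]
query = "arma virumque cano"
true_sources = {corpus[0]}
rng = random.Random(0)

print(random_pair(corpus, rng))
print(random_negative(query, true_sources, corpus, rng))
print(hard_negative(query, true_sources, corpus))
```

In this toy setting the hard negative is the second line, which shares the word "arma" with the query without being its labeled source. This is the kind of coincidental overlap a classifier has to learn to reject.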

Appendix C Experimental Results

This section presents the detailed quantitative findings of our ablation studies, benchmarking the impact of negative sampling, model architecture, and hyperparameter tuning across both the retrieval and classification stages.

C.1 Information Retrieval Results

We evaluate the dense retrieval (bi-encoder) component using information retrieval metrics to ensure high-quality candidate filtering. As shown in Table 4 and Figure 6, larger multilingual models consistently achieve higher recall scores. Figure 7 further details this performance by displaying the recall scores for the best-performing model for each fold. Complementing this, Figure 8 indicates that increasing negative training ratios improves recall, while Figure 9 isolates the best-performing learning rate and epoch configuration.
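For reference, Recall@$k$ and MRR@$k$ can be computed as in the sketch below. It follows the standard definitions of both metrics; the function names and the toy rankings are illustrative assumptions, and the released framework may differ in details such as tie handling or the treatment of queries without relevant sources.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant sources that appear among the top-k candidates."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant source within the top-k, else 0."""
    relevant = set(relevant_ids)
    for rank, candidate in enumerate(ranked_ids[:k], start=1):
        if candidate in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Toy rankings: query id -> (candidates ordered by similarity, relevant ids).
runs = {
    "q1": (["s3", "s1", "s7"], {"s1"}),
    "q2": (["s2", "s9", "s4"], {"s4"}),
}

recall_2 = mean(recall_at_k(r, rel, 2) for r, rel in runs.values())
mrr_2 = mean(reciprocal_rank_at_k(r, rel, 2) for r, rel in runs.values())
print(recall_2, mrr_2)  # 0.5 0.25
```

Raising $k$ from 2 to 3 in this toy example brings the relevant source of `q2` into the candidate list, which illustrates why recall grows with retrieval depth while MRR changes only slightly.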

$k$	# Predictions		Retrieval Only				Retrieve+Rerank				FPR $\times 10^{4}$		SMR $\times 10^{4}$
$k$	Ret.	Rer.	TP	FP	FN	F1	TP	FP	FN	F1	Ret.	Rer.	Ret.	Rer.
5	1,817	4,685	81	4,595	27	0.26	73	118	35	0.66	55.82	1.45	56.15	1.88
10	1,817	9,370	88	9,266	20	0.15	78	182	30	0.66	112.57	2.22	112.82	2.59
20	1,817	18,740	91	18,616	17	0.08	80	275	28	0.62	226.16	3.35	226.37	3.69
50	1,817	46,850	95	46,674	13	0.04	83	474	25	0.58	567.02	5.77	567.18	6.08
100	1,817	93,700	100	93,439	8	0.02	85	695	23	0.55	1135.15	8.47	1135.25	8.75
500	1,817	468,500	108	467,592	0	0.00	88	1,548	20	0.49	5680.51	18.84	5680.52	19.08
1000	1,817	937,000	108	823,043	0	0.00	89	2,031	19	0.47	9998.68	24.71	9998.68	24.95

Table 3: Comparison between retrieval-only and retrieve-and-rerank. $k$ denotes the number of most similar candidates retrieved by the bi-encoder that are re-ranked using the classification model. Results are averaged across 5 folds. The re-ranking stage substantially reduces error rates (FPR, SMR), which are scaled by $\times 10^{4}$ for readability.

Model	Recall			MRR
Model	@10	@100	@1000	@10	@100	@1000
E5-small	0.528 $\pm$ 0.055	0.634 $\pm$ 0.035	0.756 $\pm$ 0.031	0.481 $\pm$ 0.058	0.485 $\pm$ 0.058	0.485 $\pm$ 0.058
E5-base	0.570 $\pm$ 0.034	0.673 $\pm$ 0.030	0.795 $\pm$ 0.036	0.509 $\pm$ 0.058	0.513 $\pm$ 0.057	0.513 $\pm$ 0.057
E5-large	0.609 $\pm$ 0.053	0.715 $\pm$ 0.039	0.826 $\pm$ 0.017	0.554 $\pm$ 0.043	0.557 $\pm$ 0.042	0.558 $\pm$ 0.042
Granite-107m	0.358 $\pm$ 0.055	0.451 $\pm$ 0.060	0.573 $\pm$ 0.075	0.298 $\pm$ 0.056	0.302 $\pm$ 0.056	0.302 $\pm$ 0.056
Granite-278m	0.427 $\pm$ 0.053	0.538 $\pm$ 0.052	0.650 $\pm$ 0.061	0.377 $\pm$ 0.055	0.381 $\pm$ 0.055	0.381 $\pm$ 0.055
SPhilBerta	0.468 $\pm$ 0.032	0.631 $\pm$ 0.030	0.778 $\pm$ 0.035	0.409 $\pm$ 0.045	0.415 $\pm$ 0.045	0.415 $\pm$ 0.045
BGE-M3	0.597 $\pm$ 0.032	0.698 $\pm$ 0.031	0.827 $\pm$ 0.024	0.543 $\pm$ 0.033	0.546 $\pm$ 0.033	0.547 $\pm$ 0.033

Table 4: Performance comparison of different base models for the dense retrieval task. All models were fine-tuned using the same negative sampling strategy. Metrics are reported on the evaluation set. FPR, FNR, and SMR denote global error rates normalized by the total number of pairs $N$. Results are averaged across 5 folds.

C.2 Classification Results

We evaluate the binary classification (cross-encoder) stage using standard classification metrics alongside our task-specific error-based metrics (FPR, FNR, SMR). Table 5 demonstrates that sampling non-matching negatives for each query segment ($\langle\text{qry},\text{rnd}\rangle$) minimizes global error rates more effectively than random pairing ($\langle\text{rnd},\text{rnd}\rangle$). Regarding architectures, Table 6 highlights the performance benefits of larger base models. Furthermore, Figure 10 shows that increasing negative sampling ratios improves robustness in imbalanced scenarios, while Figure 11 details the grid search over learning rates and epochs used to identify the optimal stability region for convergence.

Sampling Method	Classification Metrics				Global Error Rates			Confusion Matrix
Sampling Method	Prec.	Rec.	F1	Acc.	FPR	FNR	SMR	TP	FP	FN	TN
Random pairs $\langle\text{rnd},\text{rnd}\rangle$	0.05	0.91	0.09	0.94	0.0491	0.0002	0.0493	98	3149	10	60717
Random negatives $\langle\text{qry},\text{rnd}\rangle$	0.28	0.84	0.42	0.99	0.0050	0.0003	0.0053	91	317	17	63549
Hard negatives $\langle\text{qry},\text{sim}\rangle$	0.20	0.77	0.30	0.99	0.0096	0.0004	0.0099	84	605	24	63262
Mixed negatives $\langle\text{qry},\text{mix}\rangle$	0.26	0.85	0.35	0.80	0.2036	0.0003	0.2038	92	12553	16	51313

Table 5: Performance comparison of the binary classification model across different negative sampling strategies. Metrics are reported on the evaluation set. FPR, FNR, and SMR denote global error rates normalized by the total number of pairs $N$. Results are averaged across 5 folds.

Base Model	Classification Metrics				Global Error Rates			Confusion Matrix
Base Model	Prec.	Rec.	F1	Acc.	FPR	FNR	SMR	TP	FP	FN	TN
mmBERT Small	0.27	0.86	0.40	0.99	0.0058	0.0002	0.0061	93	372	15	63494
mmBERT Base	0.18	0.90	0.29	0.99	0.0110	0.0002	0.0111	98	705	10	63161
BERT-Romanian	0.21	0.82	0.33	0.99	0.0066	0.0003	0.0070	88	424	20	63442
PhilBerta	0.13	0.91	0.22	0.98	0.0191	0.0002	0.0192	99	1225	9	62641
LaBerta	0.05	0.79	0.09	0.96	0.0421	0.0004	0.0425	85	2679	23	61188
ModernBERT Large	0.27	0.85	0.39	0.99	0.0067	0.0002	0.0069	93	425	15	63441
ModernBERT Base	0.21	0.84	0.33	0.99	0.0082	0.0003	0.0085	91	522	17	63344
XLM-RoBERTa Large	0.36	0.87	0.50	1.00	0.0033	0.0002	0.0035	94	211	14	63656
XLM-RoBERTa Base	0.20	0.85	0.33	0.99	0.0072	0.0003	0.0075	92	460	16	63407
RoBERTa-Latin	0.00	0.47	0.01	0.62	0.3754	0.0009	0.3764	50	24003	58	39864

Table 6: Performance comparison of different pre-trained base models used in the binary classification stage. All models were fine-tuned using the same negative sampling strategy. Metrics are reported on the evaluation set. FPR, FNR, and SMR denote global error rates normalized by the total number of pairs $N$. Results are averaged across 5 folds.

C.3 Combined Pipeline Results

Table 3 compares the retrieval-only baseline with the retrieve-and-rerank pipeline at varying retrieval depths ($k$), relating performance gains to the number of required predictions.
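The role of the retrieval depth can be illustrated with a minimal two-stage sketch. It is not the package's implementation: the function names, the two-dimensional toy vectors, and the fixed classifier scores are assumptions made for the example, with a lookup table standing in for the fine-tuned cross-encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_and_rerank(query_vec, source_vecs, classify, k, threshold=0.5):
    """Return the source ids accepted after both pipeline stages."""
    # Stage 1 (retrieval): keep the k sources most similar to the query.
    ranked = sorted(
        source_vecs,
        key=lambda sid: cosine(query_vec, source_vecs[sid]),
        reverse=True,
    )
    candidates = ranked[:k]
    # Stage 2 (rerank): only these k candidates are scored by the classifier.
    return [sid for sid in candidates if classify(sid) >= threshold]


# Toy embeddings and stand-in classifier probabilities (hypothetical values).
source_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_scores = {"a": 0.9, "b": 0.2, "c": 0.8}
query_vec = [1.0, 0.0]

links_k2 = retrieve_and_rerank(query_vec, source_vecs, toy_scores.get, k=2)
links_k3 = retrieve_and_rerank(query_vec, source_vecs, toy_scores.get, k=3)
print(links_k2, links_k3)  # ['a'] ['a', 'c']
```

The classifier is invoked $k$ times per query, so the prediction budget grows linearly with the retrieval depth. A larger $k$ gives the classifier the chance to accept candidates that the bi-encoder ranked low, at the cost of more opportunities for false positives.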

Appendix D Python Package

To facilitate reproducibility and support the digital humanities community, we release the framework described in this paper as an open-source Python package. The locisimiles library (https://github.com/julianschelb/locisimiles) streamlines the detection of intertextual links by implementing a standardized “retrieve-and-rerank” pipeline and calculating the task-specific error metrics (SMR, FPR, FNR) defined in Section B.1.

D.1 Python API

The core functionality allows researchers to load custom query and source documents (in CSV format) and execute the detection pipeline using pre-trained models from the Hugging Face Hub.

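# NOTE: the import path below is an assumption; consult the package
# documentation for the exact module layout.
from locisimiles import ClassificationPipeline, Document, pretty_print

# 1. Load query and source documents (CSV files)
query_doc = Document("query.csv")
source_doc = Document("source.csv")

# 2. Initialize the pipeline with a pre-trained classification model
pipeline = ClassificationPipeline(
    classification_name="...",
)

# 3. Run the pipeline on the query/source pair
results = pipeline.run(
    query=query_doc,
    source=source_doc,
)

# 4. Display results
pretty_print(results)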

D.2 Graphical User Interface

To lower the barrier to entry, the package includes an optional Gradio-based GUI. This interface can be installed via the optional dependency group (pip install ".[gui]") and launched directly from the command line by executing locisimiles-gui.

Workflow.

The application workflow is organized into three sequential stages, as illustrated in Figure 12:

1. Data Upload: Users ingest custom query and source documents via CSV files.

2. Configuration: The pipeline is customized by selecting pre-trained models and tuning retrieval parameters (e.g., retrieval depth $k$ and classification confidence thresholds).

3. Result Exploration: The interactive dashboard presents query segments alongside retrieved source candidates, displaying cosine similarity and classification probability scores, with functionality to export confirmed matches.

Figure 13: Example references. Three instances of text reuse by Jerome included in the ground truth dataset.
