Reliable Evaluation and Benchmarks for Statement Autoformalization

Poiroux, Auguste; Weiss, Gail; Kunčak, Viktor; Bosselut, Antoine

arXiv:2406.07222v3 (cs)

[Submitted on 11 Jun 2024 (v1), last revised 29 Oct 2025 (this version, v3)]

Abstract: Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.

Comments:	Accepted to EMNLP 2025. New benchmarks released, see this https URL, this https URL, and this https URL. For code, see this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2406.07222 [cs.CL]
	(or arXiv:2406.07222v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.07222

Submission history

From: Auguste Poiroux
[v1] Tue, 11 Jun 2024 13:01:50 UTC (492 KB)
[v2] Tue, 11 Feb 2025 11:02:10 UTC (576 KB)
[v3] Wed, 29 Oct 2025 11:36:28 UTC (51 KB)
