A curated list of papers, datasets, and resources dedicated to Autoformalization and Neural Theorem Proving.
Scope: This repository strictly focuses on translating natural language mathematics into formal logic (e.g., Lean, Isabelle) and solving math problems using formal verification. General LLM reasoning is out of scope.
Figure from the paper 'Autoformalization in the Era of Large Language Models: A Survey'.
If you are new to this field, start here. These are the most influential works.
- [2025] Autoformalization in the Era of Large Language Models: A Survey [Paper]
- A comprehensive survey of autoformalization with LLMs, covering data, models, evaluation, and the role of formal verification for LLM outputs.
- [2026] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation [Paper]
- Proposes Difficulty-Aware GRPO and question reformulation techniques to dynamically balance training difficulty, offering a key methodology for enhancing mathematical reasoning capabilities.
- [2025] Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning [Paper]
- Introduces the “formal reasoning pattern” and a large RL-trained Lean 4 prover that achieves strong MiniF2F performance with high sample efficiency and clear scaling with model size.
- [2025] ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data [Paper] [📝Deep Dive]
- Proposes a teacher–student pipeline with Lean feedback to synthesize large NL–FL datasets and trains an ATLAS translator that achieves >90% pass@8 on MiniF2F.
- [2023] LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [Paper]
- Introduces LeanDojo, an open-source environment for interacting with Lean, and ReProver, a retrieval-augmented prover that selects premises from Mathlib to generate proofs.
- [2021] MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics [Paper] [📝Deep Dive]
- Establishes a unified, cross-system benchmark for formal Olympiad-level mathematics, serving as the primary testbed for evaluating autoformalization models.
- [2022] Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs [Paper] [📝Deep Dive]
- Presents the “Draft–Sketch–Prove” pipeline, where LLM-generated informal proofs are turned into formal sketches that significantly boost downstream prover success rates.
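To make the "sketch" idea concrete, below is a minimal Lean 4 illustration in the spirit of Draft–Sketch–Prove (the paper itself works in Isabelle, with automated tools closing the gaps). The theorem and lemma names are a toy example assuming Mathlib; they are not taken from the paper.

```lean
import Mathlib

-- Illustrative only: a formal "sketch" where each `have` mirrors one step of
-- the informal draft. In a DSP-style pipeline, the justification of each step
-- would be delegated to an automated prover rather than written by hand.
theorem sq_add_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := sq_nonneg a   -- informal step: "a square is nonnegative"
  have h2 : 0 ≤ b ^ 2 := sq_nonneg b   -- informal step: "a square is nonnegative"
  exact add_nonneg h1 h2               -- informal step: "a sum of nonnegative terms is nonnegative"
```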
Comprehensive overviews of the field.
- [2026] AI for Mathematics: Progress, Challenges, and Prospects [Paper]
- [2025] Autoformalization in the Era of Large Language Models: A Survey [Paper]
Aligned pairs of Natural Language (NL) and Formal Logic (FL).
- [2024] Herald: A Natural Language Annotated Lean 4 Dataset [Paper] [📝Deep Dive]
- [2021] MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics [Paper] [📝Deep Dive]
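For readers new to the format, an aligned NL–FL pair is essentially an informal statement coupled with its formal counterpart. The entry below is hypothetical (not taken from Herald or MiniF2F) and assumes Lean 4 with Mathlib; actual schemas and metadata vary by dataset.

```lean
import Mathlib

/-- NL: "Show that for every natural number n, n ^ 2 + n is even." -/
theorem n_sq_add_n_even (n : ℕ) : Even (n ^ 2 + n) := by
  sorry -- benchmark-style entries often ship the statement only, leaving the proof to the prover
```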
Datasets created or augmented by LLMs to overcome data scarcity.
- [2025] ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data [Paper] [📝Deep Dive]
Benchmarks designed to evaluate step-by-step logical reasoning and chain-of-thought capabilities.
- [2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning [Paper]
End-to-end systems that aim to generate complete proofs and verify them.
- [2025] Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning [Paper]
- [2024] PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition [Paper]
Models specialized in translating Natural Language (NL) to Formal Logic (FL).
- [2025] ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data [Paper]
- [2024] Herald: A Natural Language Annotated Lean 4 Dataset [Paper] [📝Deep Dive]
Training methodologies that leverage Reinforcement Learning (RL) and verifiable feedback from proof assistants to optimize reasoning capabilities.
- [2026] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation [Paper]
- [2026] From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for RL of Open-ended Generation [Paper]
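To make "verifiable feedback from proof assistants" concrete, here is a minimal Python sketch (not taken from any of the papers above) of the binary reward such RL pipelines optimize. It assumes a standalone Lean file with no external imports and the `lean` binary on PATH; real systems build against Mathlib (e.g., via `lake`) and interact through a REPL rather than one-shot compilation.

```python
import subprocess
import tempfile
from pathlib import Path

def lean_accepts(source: str) -> bool:
    """Return True iff Lean elaborates the file without errors and no `sorry` is used."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(source)
        result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
    # `sorry` only triggers a warning, so it is filtered out explicitly (crude string check).
    return result.returncode == 0 and "sorry" not in source

def verifiable_reward(statement: str, candidate_proof: str) -> float:
    """Binary reward for GRPO-style training: 1.0 if the candidate proof checks, else 0.0."""
    return 1.0 if lean_accepts(statement + "\n" + candidate_proof) else 0.0
```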
Methods that utilize proof planning (sketching) or search algorithms to navigate the solution space.
- [2026] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance LLMs [Paper]
- [2024] Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards [Paper]
- [2022] Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs [Paper] [📝Deep Dive]
Systems that improve solutions through compiler feedback or error messages.
- [2024] Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents [Paper]
Approaches that leverage code execution (simulation) or formal verification to enhance reasoning accuracy.
- [2025] KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment [Paper] [📝Deep Dive]
Methods that retrieve relevant theorems or premises from libraries (e.g., Mathlib).
- [2023] LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [Paper]
Collaborative agents specifically designed for theorem proving tasks.
- [2025] MASA: LLM-Driven Multi-Agent Systems for Autoformalization [Paper]
Metrics and methodologies to assess the quality of autoformalization and proving.
- Pass@k
- The standard metric: the probability that at least one of k generated samples solves the problem (as verified by the proof assistant); see the estimator formula after this list.
- Compilation Rate
- The percentage of generated formal code that successfully compiles without syntax or type errors. A crucial baseline metric for autoformalizers.
- Automated Equivalence Checking
- Using automated provers to verify whether the generated formal statement ($F$) is logically equivalent to the ground truth ($F_{gt}$), i.e., proving $F \leftrightarrow F_{gt}$.
- Back-Translation Quality (NL $\to$ FL $\to$ NL)
- Translating the generated formal code back into natural language and comparing it with the original text using NLP metrics (BLEU, ROUGE, BERTScore).
- Informal-to-Formal Alignment
- Measuring how well the steps in an informal proof map to the tactics in a formal proof.
- Premise Selection Accuracy
- Evaluating if the model retrieves the correct axioms and theorems (from libraries like Mathlib) required for the proof.
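For Pass@k, a commonly used unbiased estimator (following the HumanEval convention) draws $n \ge k$ samples per problem, counts the $c$ samples the proof assistant verifies, and averages

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right].$$

As a toy illustration of automated equivalence checking (the statements below are made up, not from any benchmark), the check amounts to proving a biconditional between the generated statement and the ground truth:

```lean
-- Hypothetical example: the generated statement differs syntactically from the
-- ground truth, but the biconditional between them is provable, so the
-- autoformalization is accepted as faithful.
theorem generated_iff_ground_truth :
    (∀ n : ℕ, n + n = 2 * n) ↔ (∀ n : ℕ, 2 * n = n + n) :=
  ⟨fun h n => (h n).symm, fun h n => (h n).symm⟩
```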
Research papers focusing on novel evaluation protocols and alignment checking.
- [2024] FormalAlign: Automated Alignment Evaluation for Autoformalization [Paper]
Proof Assistants and Environments.
- Lean 4
- Currently the most popular proof assistant for autoformalization research.
- Isabelle
- Coq
Contributions are welcome! Please create a pull request.
- Woojin Cho (Undergraduate Researcher @ SNU, Computer Science and Engineering)
- e-mail: woojincho.cs@gmail.com
- Kimina-Prover (2025): Exploration of Reinforcement Learning and scaling laws in Formal Reasoning.
- PutnamBench (2024): Evaluation of current LLMs on university-level math and the limits of Neural Theorem Provers.
- KELPS (2025): Deep dive into Iterative Refinement and semantic-syntactic alignment.
- Literature Review: Identifying latest papers on Agentic Workflows for formal verification.
- Benchmark Comparison: Comparative analysis of MiniF2F, ProofNet, and PutnamBench.
- Tooling Updates: Curating resources for Lean 4 environment setup and Mathlib4 integration.
- Reinforcement Learning: Surveying Reward Model designs specifically tailored for formal logic.
