A curated list of papers, datasets, and resources dedicated to Autoformalization and Neural Theorem Proving.
Scope: This repository strictly focuses on translating natural language mathematics into formal logic (e.g., Lean, Isabelle) and solving math problems using formal verification. General LLM reasoning is out of scope.
Figure from the paper 'Autoformalization in the Era of Large Language Models: A Survey'.
If you are new to this field, start here. These are the most influential works.
- [2025] Autoformalization in the Era of Large Language Models: A Survey [Paper]
- A comprehensive survey of autoformalization with LLMs, covering data, models, evaluation, and the role of formal verification for LLM outputs.
- [2026] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation [Paper]
- Proposes Difficulty-Aware GRPO and question reformulation techniques to dynamically balance training difficulty, offering a key methodology for enhancing mathematical reasoning capabilities.
- [2025] Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning [Paper]
- Introduces the “formal reasoning pattern” and a large RL-trained Lean 4 prover that achieves strong MiniF2F performance with high sample efficiency and clear scaling with model size.
- [2025] ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data [Paper] [📝Deep Dive]
- Proposes a teacher–student pipeline with Lean feedback to synthesize large NL–FL datasets and trains an ATLAS translator that achieves >90% pass@8 on MiniF2F.
- [2023] LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [Paper]
- Introduces LeanDojo, an open-source environment for interacting with Lean, and ReProver, a retrieval-augmented prover that selects premises from Mathlib to generate proofs.
- [2021] MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics [Paper] [📝Deep Dive]
- Establishes a unified, cross-system benchmark for formal Olympiad-level mathematics, serving as the primary testbed for evaluating autoformalization models.
- [2022] Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs [Paper] [📝Deep Dive]
- Presents the “Draft–Sketch–Prove” pipeline, where LLM-generated informal proofs are turned into formal sketches that significantly boost downstream prover success rates.
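To make the "sketch" idea concrete, below is a minimal Lean 4 illustration in the spirit of Draft–Sketch–Prove (the paper itself works in Isabelle, with automated tools closing the gaps). The theorem and lemma names are a toy example assuming Mathlib; they are not taken from the paper.

```lean
import Mathlib

-- Illustrative only: a formal "sketch" where each `have` mirrors one step of
-- the informal draft. In a DSP-style pipeline, the justification of each step
-- would be delegated to an automated prover rather than written by hand.
theorem sq_add_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := sq_nonneg a   -- informal step: "a square is nonnegative"
  have h2 : 0 ≤ b ^ 2 := sq_nonneg b   -- informal step: "a square is nonnegative"
  exact add_nonneg h1 h2               -- informal step: "a sum of nonnegative terms is nonnegative"
```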
Comprehensive overviews of the field.
- [2026] AI for Mathematics: Progress, Challenges, and Prospects [Paper]
- [2025] Autoformalization in the Era of Large Language Models: A Survey [Paper]
Aligned pairs of Natural Language (NL) and Formal Logic (FL).
- [2024] Herald: A Natural Language Annotated Lean 4 Dataset [Paper] [📝Deep Dive]
- [2021] MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics [Paper] [📝Deep Dive]
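For readers new to the format, an aligned NL–FL pair is essentially an informal statement coupled with its formal counterpart. The entry below is hypothetical (not taken from Herald or MiniF2F) and assumes Lean 4 with Mathlib; actual schemas and metadata vary by dataset.

```lean
import Mathlib

/-- NL: "Show that for every natural number n, n ^ 2 + n is even." -/
theorem n_sq_add_n_even (n : ℕ) : Even (n ^ 2 + n) := by
  sorry -- benchmark-style entries often ship the statement only, leaving the proof to the prover
```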
Datasets created or augmented by LLMs to overcome data scarcity.
- [2025] ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data [Paper] [📝Deep Dive]
Benchmarks designed to evaluate step-by-step logical reasoning and chain-of-thought capabilities.
- [2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning [Paper]
End-to-end systems that aim to generate complete proofs and verify them.
- [2025] Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning [Paper]
- [2024] PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition [Paper]
Models specialized in translating Natural Language (NL) to Formal Logic (FL).
- [2025] ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data [Paper]
- [2024] Herald: A Natural Language Annotated Lean 4 Dataset [Paper] [📝Deep Dive]
Training methodologies that leverage Reinforcement Learning (RL) and verifiable feedback from proof assistants to optimize reasoning capabilities.
- [2026] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation [Paper]
- [2026] From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for RL of Open-ended Generation [Paper]
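To make "verifiable feedback from proof assistants" concrete, here is a minimal Python sketch (not taken from any of the papers above) of the binary reward such RL pipelines optimize. It assumes a standalone Lean file with no external imports and the `lean` binary on PATH; real systems build against Mathlib (e.g., via `lake`) and interact through a REPL rather than one-shot compilation.

```python
import subprocess
import tempfile
from pathlib import Path

def lean_accepts(source: str) -> bool:
    """Return True iff Lean elaborates the file without errors and no `sorry` is used."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(source)
        result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
    # `sorry` only triggers a warning, so it is filtered out explicitly (crude string check).
    return result.returncode == 0 and "sorry" not in source

def verifiable_reward(statement: str, candidate_proof: str) -> float:
    """Binary reward for GRPO-style training: 1.0 if the candidate proof checks, else 0.0."""
    return 1.0 if lean_accepts(statement + "\n" + candidate_proof) else 0.0
```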
Methods that utilize proof planning (sketching) or search algorithms to navigate the solution space.
- [2026] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance LLMs [Paper]
- [2024] Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards [Paper]
- [2022] Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs [Paper] [📝Deep Dive]
Systems that improve solutions through compiler feedback or error messages.
- [2024] Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents [Paper]
Approaches that leverage code execution (simulation) or formal verification to enhance reasoning accuracy.
- [2025] KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment [Paper] [📝Deep Dive]
Methods that retrieve relevant theorems or premises from libraries (e.g., Mathlib).
- [2023] LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [Paper]
Collaborative agents specifically designed for theorem proving tasks.
- [2025] MASA: LLM-Driven Multi-Agent Systems for Autoformalization [Paper]
Metrics and methodologies to assess the quality of autoformalization and proving.
- Pass@k
- The standard metric: the probability that at least one of k generated samples solves the problem (as verified by the proof assistant); see the estimator formula after this list.
- Compilation Rate
- The percentage of generated formal code that successfully compiles without syntax or type errors. A crucial baseline metric for autoformalizers.
- Automated Equivalence Checking
- Using automated provers to verify whether the generated formal statement ($F$) is logically equivalent to the ground truth ($F_{gt}$), i.e., proving $F \leftrightarrow F_{gt}$.
- Back-Translation Quality (NL $\to$ FL $\to$ NL)
- Translating the generated formal code back into natural language and comparing it with the original text using NLP metrics (BLEU, ROUGE, BERTScore).
- Informal-to-Formal Alignment
- Measuring how well the steps in an informal proof map to the tactics in a formal proof.
- Premise Selection Accuracy
- Evaluating if the model retrieves the correct axioms and theorems (from libraries like Mathlib) required for the proof.
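For Pass@k, a commonly used unbiased estimator (following the HumanEval convention) draws $n \ge k$ samples per problem, counts the $c$ samples the proof assistant verifies, and averages

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right].$$

As a toy illustration of automated equivalence checking (the statements below are made up, not from any benchmark), the check amounts to proving a biconditional between the generated statement and the ground truth:

```lean
-- Hypothetical example: the generated statement differs syntactically from the
-- ground truth, but the biconditional between them is provable, so the
-- autoformalization is accepted as faithful.
theorem generated_iff_ground_truth :
    (∀ n : ℕ, n + n = 2 * n) ↔ (∀ n : ℕ, 2 * n = n + n) :=
  ⟨fun h n => (h n).symm, fun h n => (h n).symm⟩
```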
Research papers focusing on novel evaluation protocols and alignment checking.
- [2024] FormalAlign: Automated Alignment Evaluation for Autoformalization [Paper]
Proof Assistants and Environments.
- Lean 4
- Currently the most popular proof assistant for autoformalization research.
- Isabelle
- Coq
Contributions are welcome! Please create a pull request.
- Woojin Cho (Undergraduate Researcher @ SNU, Computer Science and Engineering)
- e-mail: woojincho.cs@gmail.com
- Kimina-Prover (2025): Exploration of Reinforcement Learning and scaling laws in Formal Reasoning.
- PutnamBench (2024): Evaluation of current LLMs on university-level math and the limits of Neural Theorem Provers.
- KELPS (2025): Deep dive into Iterative Refinement and semantic-syntactic alignment.
- Literature Review: Identifying latest papers on Agentic Workflows for formal verification.
- Benchmark Comparison: Comparative analysis of MiniF2F, ProofNet, and PutnamBench.
- Tooling Updates: Curating resources for Lean 4 environment setup and Mathlib4 integration.
- Reinforcement Learning: Surveying Reward Model designs specifically tailored for formal logic.
