Production pipeline: PDF → Markdown → Structured Data → Linked Records
- Stage 1: PDF → Markdown (Mistral OCR + Vision refinement)
- Stage 2: Markdown → Structured JSON (LLM extraction of 16 classes)
- Stage 3: Link records by foreign keys (optional)
```
python -m run_pipeline
```
Processes all PDFs in documents/ through all three stages. All output is saved to the output/ folder.

Single PDF with vision refinement:
```
python -m run_pipeline --input documents/ccc_dresden.pdf
```
OCR only (no vision refinement, faster):
```
python -m run_pipeline --no-vision
```
Single PDF + OCR only (fastest for testing):
```
python -m run_pipeline --input documents/ccc_dresden.pdf --no-vision
```
Skip the mapping stage:
```
python -m run_pipeline --no-mapping
```
Force chunked extraction (passed through to extraction):
```
python -m run_pipeline --chunking
```
Help:
```
python -m run_pipeline --help
```
Recommended first run:
```
# 1. Test a single file with OCR only (~10-15 min for 35 MB)
python -m run_pipeline --input documents/ccc_dresden.pdf --no-vision

# 2. If OK, run the full pipeline with vision (~25-35 min)
python -m run_pipeline --input documents/ccc_dresden.pdf

# 3. If working, process all PDFs
python -m run_pipeline
```
Output layout:
```
output/
├── pdf2markdown/   # Markdown files (TIMESTAMP_docname/)
├── extraction/     # Extracted JSON (all classes)
└── mapping/        # Linked records
```
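The flags above could be wired up with argparse along these lines; this is a hypothetical sketch for illustration, not the actual run_pipeline.py CLI:

```python
import argparse

def parse_args(argv=None):
    # Mirrors the run_pipeline flags described above (sketch; real CLI may differ).
    p = argparse.ArgumentParser(prog="run_pipeline")
    p.add_argument("--input", help="process a single PDF instead of all of documents/")
    p.add_argument("--no-vision", action="store_true", help="skip vision refinement")
    p.add_argument("--no-mapping", action="store_true", help="skip the mapping stage")
    p.add_argument("--chunking", action="store_true", help="force chunked extraction")
    return p.parse_args(argv)

args = parse_args(["--input", "documents/ccc_dresden.pdf", "--no-vision"])
```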
```
pip install -r requirements.txt
```
Create a .env file with API keys:
```
MISTRAL_API_KEY=sk-...
OPENROUTER_API_KEY=sk-...
```
This repo ships a Postgres service via docker-compose.yml for local DB testing:
```
docker compose up -d
```
The database credentials are defined in docker-compose.yml. Configure the app with:
```
DATABASE_URL=postgresql://pdf_user:pdf_pass@localhost:5432/pdf_converter
```
Run the tests:
```
pytest
```
Edit llm_config.yml:
```yaml
pdf2markdown:
  model: google/gemini-3-flash-preview
  temperature: 0.1
  ocr_model: mistral-ocr-latest

extraction:
  model: deepseek/deepseek-v3.2  # Best for tool calling
  temperature: 0.1

chunking:
  enabled: false
  auto_threshold_tokens: 300000
  chunk_size_tokens: 200000
  chunk_overlap_tokens: 10000
  boundary_mode: paragraph_or_sentence
  keep_tables_intact: true
  table_context_max_items: 0  # 0 = include all same-table rows; reduce to limit prompt size

mapping:
  model: google/gemini-3-flash-preview
  temperature: 0.1
```
Full pipeline examples:
```
# Single PDF with vision refinement
python -m run_pipeline --input documents/sample.pdf

# Single PDF without vision (faster)
python -m run_pipeline --input documents/sample.pdf --no-vision

# All PDFs, OCR only
python -m run_pipeline --no-vision

# Single PDF without mapping
python -m run_pipeline --input documents/sample.pdf --no-mapping
```
Stage 1 on its own:
```
# Single PDF with vision refinement
python -m pdf2markdown.pdf_to_markdown --input documents/sample.pdf

# Without vision (use "none", not an empty string)
python -m pdf2markdown.pdf_to_markdown --input documents/sample.pdf --vision-model none

# Batch processing
python -m pdf2markdown.pdf_to_markdown --input documents/ --pattern "*.pdf"

# Advanced options
python -m pdf2markdown.pdf_to_markdown --input large.pdf \
  --max-upload-bytes 5242880 \
  --vision-max-rounds 5 \
  --no-images
```
Stage 1 output:
```
output/pdf2markdown/TIMESTAMP_docname/
├── combined_markdown.md   # Final markdown for extraction
├── page-0001.md
├── images/
│   └── page-0001.jpeg
└── vision_diffs/
    └── page-0001-round-1.diff
```
```
# Extract all classes
python -m extraction.scripts.extract \
  --markdown pdf2markdown/output/TIMESTAMP_doc/combined_markdown.md

# Force chunking (still uses llm_config.yml sizes/thresholds)
python -m extraction.scripts.extract \
  --markdown pdf2markdown/output/TIMESTAMP_doc/combined_markdown.md \
  --chunking

# Specific classes only
python -m extraction.scripts.extract \
  --markdown path/to/combined_markdown.md \
  --class-names City CityAnnualStats Initiative

# Different model
python -m extraction.scripts.extract \
  --markdown path/to/combined_markdown.md \
  --model anthropic/claude-3.5-sonnet
```
The 16 extraction classes:
```
City CityAnnualStats ClimateCityContract Sector
EmissionRecord CityBudget BudgetFunding FundingSource
Initiative InitiativeStakeholder Indicator IndicatorValue
CityTarget InitiativeIndicator TefCategory InitiativeTef
```
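Each class maps to a Pydantic model in database/schemas.py. As a standalone illustration of the required-field checking the pipeline performs, here is a simplified dataclass stand-in (not the real schema):

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-in for the CityAnnualStats schema (illustrative only).
@dataclass
class CityAnnualStats:
    year: int                              # required
    population: Optional[int] = None
    populationDensity: Optional[int] = None
    notes: Optional[str] = None

def validate_rows(rows):
    """Return (valid, errors); mirrors the pipeline's required-field checks."""
    valid, errors = [], []
    for i, row in enumerate(rows):
        if "year" not in row:
            errors.append(f"row {i}: missing required field: year")
            continue
        valid.append(CityAnnualStats(**row))
    return valid, errors

valid, errors = validate_rows(
    [{"year": 2023, "population": 628718}, {"notes": "no year here"}]
)
```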
Example output:
```
// output/extraction/CityAnnualStats.json
[
  {
    "year": 2023,
    "population": 628718,
    "populationDensity": 2129,
    "notes": "As at 31.12.2023"
  },
  {
    "year": 2019,
    "notes": "Baseline year for GHG inventory"
  }
]
```
✅ Year Extraction - Properly extracts years from: "As at 31.12.2023" → 2023, "base year 2019" → 2019, "by 2030" → 2030
✅ Tool Calling - Uses function calls for structured output
✅ Validation - Pydantic models ensure data integrity
✅ Duplicate Detection - Skips duplicate records
✅ Error Reporting - Detailed logs show validation results
✅ Large Document Chunking - Auto-chunks Markdown above 300k tokens, preserves paragraph/sentence boundaries, and keeps tables intact (configured in llm_config.yml).
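The chunk_size_tokens / chunk_overlap_tokens settings in llm_config.yml can be sketched roughly as follows; this is a simplified illustration using a whitespace token estimate and paragraph boundaries, not the pipeline's actual implementation:

```python
def chunk_markdown(text, chunk_size=200_000, overlap=10_000):
    # Split on blank lines so chunks end at paragraph boundaries.
    paragraphs = text.split("\n\n")
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        tokens = len(para.split())  # crude token estimate
        if current and current_len + tokens > chunk_size:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward until the overlap budget is met.
            carried, carried_len = [], 0
            for p in reversed(current):
                carried_len += len(p.split())
                carried.insert(0, p)
                if carried_len >= overlap:
                    break
            current, current_len = carried, carried_len
        current.append(para)
        current_len += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```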
Link foreign keys between extracted records. Reads from output/extraction by default and writes to output/mapping.
```
# Link foreign keys (uses default input/output directories)
python -m mapping.scripts.mapping --apply

# Delete old mappings before re-running
python -m mapping.scripts.mapping --apply --delete-old

# With a custom model
python -m mapping.scripts.mapping --apply --model gpt-4

# Map a specific table only
python -m mapping.scripts.mapping --apply --only-table EmissionRecord

# Custom input/output directories
python -m mapping.scripts.mapping --apply \
  --input-dir extraction/output \
  --work-dir custom/mapping/dir

# Review mappings without applying
python -m mapping.scripts.mapping --review
```
- Input: output/extraction/ (extraction outputs)
- Output: output/mapping/ (linked records)
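To illustrate what the mapping stage does, here is a minimal hypothetical sketch that assigns a cityId foreign key to CityAnnualStats rows. The field names and single-city assumption are illustrative; the real logic lives in mapping/mappers/:

```python
import json
from pathlib import Path

def link_city_stats(extraction_dir, mapping_dir):
    """Hypothetical sketch: attach a cityId foreign key to each stats row."""
    extraction_dir, mapping_dir = Path(extraction_dir), Path(mapping_dir)
    cities = json.loads((extraction_dir / "City.json").read_text())
    stats = json.loads((extraction_dir / "CityAnnualStats.json").read_text())
    # A single-document run normally extracts exactly one City record.
    city_id = cities[0].get("id", 1)
    for row in stats:
        row["cityId"] = city_id
    mapping_dir.mkdir(parents=True, exist_ok=True)
    (mapping_dir / "CityAnnualStats.json").write_text(json.dumps(stats, indent=2))
```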
```
# 1. Extract from markdown
python -m extraction.scripts.extract \
  --markdown output/pdf2markdown/20260120_184105_ccc_leipzig/combined_markdown.md \
  --output-dir output/extraction \
  --overwrite

# 2. Map foreign keys (uses default dirs: output/extraction → output/mapping)
python -m mapping.scripts.mapping --apply --delete-old

# 3. Validate mappings
python -m mapping.scripts.mapping --review
```
Database loading:
```
# Validate only
python -m app.modules.db_insert.scripts.load_mapped_data --dry-run

# Insert after validation
python -m app.modules.db_insert.scripts.load_mapped_data
```
Requires DATABASE_URL (or DB_URL) in .env. Reports are written to output/db_load_reports/.

Test the DB connection:
```
python -m app.scripts.test_db_connection
```
Check row counts and sample rows:
```
python -m app.scripts.test_insert
```
Project layout:
```
project_root/
├── pdf2markdown/      # Stage 1: PDF → Markdown
├── extraction/        # Stage 2: Markdown → JSON
│   ├── prompts/       # LLM prompts by class
│   ├── tools/         # Tool definitions
│   ├── utils/         # Validation & parsing
│   ├── output/        # Extracted JSON files
│   └── extract.py     # Core logic
├── mapping/           # Stage 3: Link records
├── database/
│   └── schemas.py     # Pydantic schemas (16 classes)
├── documents/         # Input PDFs
├── tests/
├── llm_config.yml     # Model configuration
├── run_pipeline.py    # Full pipeline
├── requirements.txt
└── README.md
```
Stage by stage for a single document:
```
python -m pdf2markdown.pdf_to_markdown --input documents/my_city.pdf
python -m extraction.scripts.extract --markdown pdf2markdown/output/TIMESTAMP_my_city/combined_markdown.md
cat output/extraction/City.json
```
Full pipeline:
```
python -m run_pipeline
# Results in: output/pdf2markdown/, output/extraction/, output/mapping/
```
Extract a single class from existing markdown:
```
python -m extraction.scripts.extract --markdown existing.md --class-names CityAnnualStats
```
Python API, stage 1:
```python
from pathlib import Path
from mistralai import Mistral
from openai import OpenAI
from pdf2markdown.pdf_to_markdown import pdf_to_markdown_mistral

mistral = Mistral(api_key="sk-...")
vision = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")
output = pdf_to_markdown_mistral(
    pdf_path=Path("documents/sample.pdf"),
    output_root=Path("pdf2markdown/output"),
    client=mistral,
    vision_client=vision,
    vision_model="google/gemini-3-flash-preview",
)
```
Python API, stage 2:
```python
from pathlib import Path
from openai import OpenAI
from extraction.extract import run_class_extraction
from database.schemas import City

client = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")
run_class_extraction(
    client=client,
    model_name="deepseek/deepseek-v3.2",
    system_prompt="...",
    user_template="...",
    markdown_text=markdown_content,  # contents of a combined_markdown.md
    model_cls=City,
    output_dir=Path("extraction/output"),
)
```
| Issue | Solution |
|---|---|
| Missing Mistral API key | Set MISTRAL_API_KEY in .env |
| Vision refinement fails | Check OPENROUTER_API_KEY in .env |
| Missing required field: year | Markdown may lack year info, check extraction/debug_logs/ |
| OpenRouter API error | Verify API key has credits |
| Large PDF timeout | Use --max-upload-bytes 5242880 to split into 5MB chunks |
✅ Model Selection - Switched extraction to deepseek/deepseek-v3.2 for superior tool calling (was generating empty objects with google/gemini)
✅ Year Extraction - Enhanced CityAnnualStats.md prompt with explicit examples for extracting years from varied text patterns
✅ Error Messages - Improved validation feedback to show exactly which fields are missing and what data was received
```
PDF → [Mistral OCR] → Markdown + Images
        ↓
[2-Page Windows] → {image_left, markdown_left, image_right, markdown_right}
        ↓
[Vision Agent] → Tool calls → Edits
        ↓
Final Markdown ✓
```
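The 2-page windowing step can be sketched roughly as follows; build_windows and the flat file layout are assumptions for illustration (the real logic lives in pdf2markdown/pdf_to_markdown.py, where images sit under images/):

```python
from pathlib import Path

def build_windows(pages_dir):
    """Pair consecutive pages into 2-page windows for the vision agent."""
    pages = sorted(Path(pages_dir).glob("page-*.md"))
    windows = []
    for left, right in zip(pages, pages[1:]):
        windows.append({
            # Assumes page images sit next to the .md files for simplicity.
            "image_left": left.with_suffix(".jpeg").name,
            "markdown_left": left.name,
            "image_right": right.with_suffix(".jpeg").name,
            "markdown_right": right.name,
        })
    return windows
```

Sliding by one page means every page boundary is seen in some window, so edits can fix text split across pages.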
- New extraction class? Add a model to database/schemas.py and a prompt to extraction/prompts/
- Different PDF pipeline? Modify pdf2markdown/pdf_to_markdown.py
- Custom mapping? Edit mapping/mappers/
See LICENSE.md
Last Updated: January 16, 2026