
PDF Converter Pipeline

Production pipeline: PDF → Markdown → Structured Data → Linked Records

  • Stage 1: PDF → Markdown (Mistral OCR + Vision refinement)
  • Stage 2: Markdown → Structured JSON (LLM extraction of 16 classes)
  • Stage 3: Link records by foreign keys (optional)

Quick Start

Full Pipeline (Default)

python -m run_pipeline

Processes all PDFs in documents/ through all three stages. All output is saved to the output/ folder.

Command Options

Single PDF with vision refinement:

python -m run_pipeline --input documents/ccc_dresden.pdf

OCR only (no vision refinement; faster):

python -m run_pipeline --no-vision

Single PDF + OCR only (fastest for testing):

python -m run_pipeline --input documents/ccc_dresden.pdf --no-vision

Skip mapping stage:

python -m run_pipeline --no-mapping

Force chunked extraction (the flag is passed through to the extraction stage):

python -m run_pipeline --chunking

Help:

python -m run_pipeline --help

Typical Workflow

# 1. Test single file with OCR only (~10-15 min for 35MB)
python -m run_pipeline --input documents/ccc_dresden.pdf --no-vision

# 2. If OK, full pipeline with vision (~25-35 min)
python -m run_pipeline --input documents/ccc_dresden.pdf

# 3. If working, process all PDFs
python -m run_pipeline

Output Structure

output/
├── pdf2markdown/          # Markdown files (TIMESTAMP_docname/)
├── extraction/            # Extracted JSON (all classes)
└── mapping/               # Linked records

Setup

pip install -r requirements.txt

# Create .env with API keys
MISTRAL_API_KEY=sk-...
OPENROUTER_API_KEY=sk-...

Docker (Database)

This repo ships a Postgres service via docker-compose.yml for local DB testing.

docker compose up -d

The database credentials are defined in docker-compose.yml. Configure the app with:

DATABASE_URL=postgresql://pdf_user:pdf_pass@localhost:5432/pdf_converter
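
A quick connectivity check from Python, assuming SQLAlchemy and a Postgres driver are installed (the repo also ships app.scripts.test_db_connection for this):

from sqlalchemy import create_engine, text

# Same credentials as docker-compose.yml
engine = create_engine("postgresql://pdf_user:pdf_pass@localhost:5432/pdf_converter")
with engine.connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())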

Tests

pytest

Configuration

Edit llm_config.yml:

pdf2markdown:
  model: google/gemini-3-flash-preview
  temperature: 0.1
  ocr_model: mistral-ocr-latest

extraction:
  model: deepseek/deepseek-v3.2 # ⭐ Best for tool calling
  temperature: 0.1
  chunking:
    enabled: false
    auto_threshold_tokens: 300000
    chunk_size_tokens: 200000
    chunk_overlap_tokens: 10000
    boundary_mode: paragraph_or_sentence
    keep_tables_intact: true
    table_context_max_items: 0 # 0 = include all same-table rows; reduce to limit prompt size

mapping:
  model: google/gemini-3-flash-preview
  temperature: 0.1
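
For reference, a minimal sketch of reading these settings with PyYAML (an assumption; the pipeline may load the file differently):

import yaml

with open("llm_config.yml", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

print(config["extraction"]["model"])                # deepseek/deepseek-v3.2
print(config["extraction"]["chunking"]["enabled"])  # False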

Stage 1: PDF → Markdown

Examples (Using Pipeline Script)

# Single PDF with vision refinement
python -m run_pipeline --input documents/sample.pdf

# Single PDF without vision (faster)
python -m run_pipeline --input documents/sample.pdf --no-vision

# All PDFs, OCR only
python -m run_pipeline --no-vision

# Single PDF without mapping
python -m run_pipeline --input documents/sample.pdf --no-mapping

Direct Module Usage

# Single PDF with vision refinement
python -m pdf2markdown.pdf_to_markdown --input documents/sample.pdf

# Without vision (pass "none", not an empty string)
python -m pdf2markdown.pdf_to_markdown --input documents/sample.pdf --vision-model none

# Batch processing
python -m pdf2markdown.pdf_to_markdown --input documents/ --pattern "*.pdf"

# Advanced options
python -m pdf2markdown.pdf_to_markdown --input large.pdf \
  --max-upload-bytes 5242880 \
  --vision-max-rounds 5 \
  --no-images

Output

output/pdf2markdown/TIMESTAMP_docname/
├── combined_markdown.md       # Final markdown for extraction
├── page-0001.md
├── images/
│   └── page-0001.jpeg
└── vision_diffs/
    └── page-0001-round-1.diff

Stage 2: Extraction (Markdown → JSON)

Examples

# Extract all classes
python -m extraction.scripts.extract \
  --markdown pdf2markdown/output/TIMESTAMP_doc/combined_markdown.md

# Force chunking (still uses llm_config.yml sizes/thresholds)
python -m extraction.scripts.extract \
  --markdown pdf2markdown/output/TIMESTAMP_doc/combined_markdown.md \
  --chunking

# Specific classes only
python -m extraction.scripts.extract \
  --markdown path/to/combined_markdown.md \
  --class-names City CityAnnualStats Initiative

# Different model
python -m extraction.scripts.extract \
  --markdown path/to/combined_markdown.md \
  --model anthropic/claude-3.5-sonnet

Available Classes

City                  CityAnnualStats        ClimateCityContract    Sector
EmissionRecord        CityBudget             BudgetFunding          FundingSource
Initiative            InitiativeStakeholder  Indicator              IndicatorValue
CityTarget            InitiativeIndicator    TefCategory            InitiativeTef

Output Example

// output/extraction/CityAnnualStats.json
[
  {
    "year": 2023,
    "population": 628718,
    "populationDensity": 2129,
    "notes": "As at 31.12.2023"
  },
  {
    "year": 2019,
    "notes": "Baseline year for GHG inventory"
  }
]
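
A minimal sketch of re-validating extracted JSON against the repo's schemas, assuming database/schemas.py exposes a Pydantic v2 CityAnnualStats model:

import json
from pathlib import Path

from database.schemas import CityAnnualStats

records = json.loads(Path("output/extraction/CityAnnualStats.json").read_text(encoding="utf-8"))
stats = [CityAnnualStats.model_validate(record) for record in records]
print(f"{len(stats)} valid CityAnnualStats records")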

Key Features

✅ Year Extraction - Properly extracts years from: "As at 31.12.2023" → 2023, "base year 2019" → 2019, "by 2030" → 2030

✅ Tool Calling - Uses function calls for structured output

✅ Validation - Pydantic models ensure data integrity

✅ Duplicate Detection - Skips duplicate records

✅ Error Reporting - Detailed logs show validation results

✅ Large Document Chunking - Auto-chunks Markdown above 300k tokens, preserves paragraph/sentence boundaries, and keeps tables intact (configured in llm_config.yml); see the sketch below.
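
An illustrative sketch of overlap-preserving, paragraph-boundary chunking using the sizes from llm_config.yml; the repo's actual splitter lives in the extraction module and may differ. Token counts are approximated at ~4 characters per token:

def approx_tokens(s: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return len(s) // 4

def chunk_markdown(text: str, chunk_tokens: int = 200_000, overlap_tokens: int = 10_000) -> list[str]:
    """Pack paragraphs into ~chunk_tokens chunks, carrying ~overlap_tokens
    of trailing paragraphs forward into the next chunk."""
    chunks, current, size = [], [], 0
    fresh = False  # does `current` hold anything beyond carried overlap?
    for para in text.split("\n\n"):
        current.append(para)
        size += approx_tokens(para)
        fresh = True
        if size >= chunk_tokens:
            chunks.append("\n\n".join(current))
            # Rebuild the overlap window from the trailing paragraphs.
            carried, carried_size = [], 0
            for p in reversed(current):
                carried.insert(0, p)
                carried_size += approx_tokens(p)
                if carried_size >= overlap_tokens:
                    break
            current, size, fresh = carried, carried_size, False
    if fresh:
        chunks.append("\n\n".join(current))
    return chunks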


Stage 3: Mapping (Optional)

Link foreign keys between extracted records. Reads from output/extraction by default and writes to output/mapping.
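
Conceptually, mapping resolves parent/child references between the extracted JSON files. A purely illustrative example with hypothetical field names (the real mapper is LLM-driven, and the actual fields come from database/schemas.py):

# Before mapping: records extracted independently, no cross-references.
city = {"id": "city-001", "name": "Dresden"}
emission = {"year": 2023, "sectorName": "Transport", "co2eTonnes": 512000}

# After mapping: the child record carries its parent's key.
emission_linked = {**emission, "cityId": city["id"]}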

Quick Start

# Link foreign keys (uses default input/output directories)
python -m mapping.scripts.mapping --apply

# Delete old mappings before re-running
python -m mapping.scripts.mapping --apply --delete-old

# With custom model
python -m mapping.scripts.mapping --apply --model gpt-4

Advanced Usage

# Map specific table only
python -m mapping.scripts.mapping --apply --only-table EmissionRecord

# Custom input/output directories
python -m mapping.scripts.mapping --apply \
  --input-dir extraction/output \
  --work-dir custom/mapping/dir

# Review mappings without applying
python -m mapping.scripts.mapping --review

Defaults

  • Input: output/extraction/ (extraction outputs)
  • Output: output/mapping/ (linked records)

Full Workflow Example

# 1. Extract from markdown
python -m extraction.scripts.extract \
  --markdown output/pdf2markdown/20260120_184105_ccc_leipzig/combined_markdown.md \
  --output-dir output/extraction \
  --overwrite

# 2. Map foreign keys (uses default dirs: output/extraction → output/mapping)
python -m mapping.scripts.mapping --apply --delete-old

# 3. Validate mappings
python -m mapping.scripts.mapping --review

Stage 4: Load Into DB (Optional)

# Validate only
python -m app.modules.db_insert.scripts.load_mapped_data --dry-run

# Insert after validation
python -m app.modules.db_insert.scripts.load_mapped_data

Requires DATABASE_URL (or DB_URL) in .env. Reports are written to output/db_load_reports/.

Test the DB connection:

python -m app.scripts.test_db_connection

Check row counts and sample rows:

python -m app.scripts.test_insert

Project Structure

project_root/
├── pdf2markdown/              # Stage 1: PDF → Markdown
├── extraction/                # Stage 2: Markdown → JSON
│   ├── prompts/              # LLM prompts by class
│   ├── tools/                # Tool definitions
│   ├── utils/                # Validation & parsing
│   ├── output/               # Extracted JSON files
│   └── extract.py            # Core logic
├── mapping/                   # Stage 3: Link records
├── database/
│   └── schemas.py            # Pydantic schemas (16 classes)
├── documents/                # Input PDFs
├── tests/
├── llm_config.yml            # Model configuration
├── run_pipeline.py           # Full pipeline
├── requirements.txt
└── README.md

Typical Workflows

Single Document

python -m pdf2markdown.pdf_to_markdown --input documents/my_city.pdf
python -m extraction.scripts.extract --markdown pdf2markdown/output/TIMESTAMP_my_city/combined_markdown.md
cat output/extraction/City.json

Batch Processing

python -m run_pipeline
# Results in: output/pdf2markdown/, output/extraction/, output/mapping/

Test Specific Class

python -m extraction.scripts.extract --markdown existing.md --class-names CityAnnualStats

Python API

PDF to Markdown

from pathlib import Path
from mistralai import Mistral
from openai import OpenAI
from pdf2markdown.pdf_to_markdown import pdf_to_markdown_mistral

mistral = Mistral(api_key="sk-...")
vision = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")

output = pdf_to_markdown_mistral(
    pdf_path=Path("documents/sample.pdf"),
    output_root=Path("pdf2markdown/output"),
    client=mistral,
    vision_client=vision,
    vision_model="google/gemini-3-flash-preview",
)

Extract Data

from pathlib import Path

from openai import OpenAI
from extraction.extract import run_class_extraction
from database.schemas import City

client = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")

# Markdown produced by Stage 1
markdown_content = Path("path/to/combined_markdown.md").read_text(encoding="utf-8")

run_class_extraction(
    client=client,
    model_name="deepseek/deepseek-v3.2",
    system_prompt="...",
    user_template="...",
    markdown_text=markdown_content,
    model_cls=City,
    output_dir=Path("extraction/output"),
)

Troubleshooting

  • Missing Mistral API key → set MISTRAL_API_KEY in .env
  • Vision refinement fails → check OPENROUTER_API_KEY in .env
  • "Missing required field: year" → the Markdown may lack year info; check extraction/debug_logs/
  • OpenRouter API error → verify the API key has credits
  • Large PDF timeout → use --max-upload-bytes 5242880 to split uploads into 5MB chunks

Recent Fixes (January 2026)

✅ Model Selection - Switched extraction to deepseek/deepseek-v3.2 for superior tool calling (google/gemini was generating empty objects)

✅ Year Extraction - Enhanced CityAnnualStats.md prompt with explicit examples for extracting years from varied text patterns

✅ Error Messages - Improved validation feedback to show exactly which fields are missing and what data was received


Architecture Details

PDF → Markdown Flow

PDF → [Mistral OCR] → Markdown + Images
  ↓
[2-Page Windows] → {image_left, markdown_left, image_right, markdown_right}
  ↓
[Vision Agent] → Tool calls → Edits
  ↓
Final Markdown ✓
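
A hypothetical sketch of one 2-page refinement window; the function and message layout are illustrative, not the repo's actual API:

import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")

def refine_window(left_md: Path, right_md: Path, left_img: Path, right_img: Path) -> str:
    """Ask a vision model to reconcile OCR markdown against the page images."""
    def data_url(p: Path) -> str:
        return "data:image/jpeg;base64," + base64.b64encode(p.read_bytes()).decode()

    response = client.chat.completions.create(
        model="google/gemini-3-flash-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Fix OCR errors in this markdown using the two page images:\n\n"
                         + left_md.read_text() + "\n\n" + right_md.read_text()},
                {"type": "image_url", "image_url": {"url": data_url(left_img)}},
                {"type": "image_url", "image_url": {"url": data_url(right_img)}},
            ],
        }],
    )
    return response.choices[0].message.content

The real agent applies its edits through tool calls and may run multiple rounds (see vision_diffs/); this sketch returns corrected markdown directly for brevity.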

How to Extend

  1. New extraction class? Add a model to database/schemas.py and a prompt to extraction/prompts/ (see the sketch after this list)
  2. Different PDF pipeline? Modify pdf2markdown/pdf_to_markdown.py
  3. Custom mapping? Edit mapping/mappers/
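
For step 1, a hypothetical new class might look like this; the field names are illustrative and must match whatever the corresponding prompt in extraction/prompts/ asks the LLM for:

from typing import Optional

from pydantic import BaseModel

class AirQualityRecord(BaseModel):
    year: int
    pollutant: str                 # e.g. "PM2.5"
    annualMeanUgM3: float          # annual mean concentration in µg/m³
    notes: Optional[str] = None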

License

See LICENSE.md

Last Updated: January 16, 2026
