
PDF Converter Pipeline

Production pipeline: PDF → Markdown → Structured Data → Linked Records

  • Stage 1: PDF → Markdown (Mistral OCR + Vision refinement)
  • Stage 2: Markdown → Structured JSON (LLM extraction of 16 classes)
  • Stage 3: Link records by foreign keys (optional)

Quick Start

Full Pipeline (Default)

python -m run_pipeline

Processes all PDFs in documents/ through all three stages. All output is saved to the output/ folder.

Command Options

Single PDF with vision refinement:

python -m run_pipeline --input documents/ccc_dresden.pdf

OCR only (no vision refinement; faster):

python -m run_pipeline --no-vision

Single PDF + OCR only (fastest for testing):

python -m run_pipeline --input documents/ccc_dresden.pdf --no-vision

Skip mapping stage:

python -m run_pipeline --no-mapping

Force chunked extraction (the flag is passed through to the extraction stage):

python -m run_pipeline --chunking

Help:

python -m run_pipeline --help

Typical Workflow

# 1. Test single file with OCR only (~10-15 min for 35MB)
python -m run_pipeline --input documents/ccc_dresden.pdf --no-vision

# 2. If OK, full pipeline with vision (~25-35 min)
python -m run_pipeline --input documents/ccc_dresden.pdf

# 3. If working, process all PDFs
python -m run_pipeline

Output Structure

output/
├── pdf2markdown/          # Markdown files (TIMESTAMP_docname/)
├── extraction/            # Extracted JSON (all classes)
└── mapping/               # Linked records

Setup

pip install -r requirements.txt

# Create .env with API keys
MISTRAL_API_KEY=sk-...
OPENROUTER_API_KEY=sk-...

Docker (Database)

This repo ships a Postgres service via docker-compose.yml for local DB testing.

docker compose up -d

The database credentials are defined in docker-compose.yml. Configure the app with:

DATABASE_URL=postgresql://pdf_user:pdf_pass@localhost:5432/pdf_converter
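
A quick connectivity check from Python, assuming SQLAlchemy and a Postgres driver are installed (the repo also ships app.scripts.test_db_connection for this):

from sqlalchemy import create_engine, text

# Same credentials as docker-compose.yml
engine = create_engine("postgresql://pdf_user:pdf_pass@localhost:5432/pdf_converter")
with engine.connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())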

Tests

pytest

Configuration

Edit llm_config.yml:

pdf2markdown:
  model: google/gemini-3-flash-preview
  temperature: 0.1
  ocr_model: mistral-ocr-latest

extraction:
  model: deepseek/deepseek-v3.2 # ⭐ Best for tool calling
  temperature: 0.1
  chunking:
    enabled: false
    auto_threshold_tokens: 300000
    chunk_size_tokens: 200000
    chunk_overlap_tokens: 10000
    boundary_mode: paragraph_or_sentence
    keep_tables_intact: true
    table_context_max_items: 0 # 0 = include all same-table rows; reduce to limit prompt size

mapping:
  model: google/gemini-3-flash-preview
  temperature: 0.1
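
For reference, a minimal sketch of reading these settings with PyYAML (an assumption; the pipeline may load the file differently):

import yaml

with open("llm_config.yml", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

print(config["extraction"]["model"])                # deepseek/deepseek-v3.2
print(config["extraction"]["chunking"]["enabled"])  # False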

Stage 1: PDF → Markdown

Examples (Using Pipeline Script)

# Single PDF with vision refinement
python -m run_pipeline --input documents/sample.pdf

# Single PDF without vision (faster)
python -m run_pipeline --input documents/sample.pdf --no-vision

# All PDFs, OCR only
python -m run_pipeline --no-vision

# Single PDF without mapping
python -m run_pipeline --input documents/sample.pdf --no-mapping

Direct Module Usage

# Single PDF with vision refinement
python -m pdf2markdown.pdf_to_markdown --input documents/sample.pdf

# Without vision (pass "none", not an empty string)
python -m pdf2markdown.pdf_to_markdown --input documents/sample.pdf --vision-model none

# Batch processing
python -m pdf2markdown.pdf_to_markdown --input documents/ --pattern "*.pdf"

# Advanced options
python -m pdf2markdown.pdf_to_markdown --input large.pdf \
  --max-upload-bytes 5242880 \
  --vision-max-rounds 5 \
  --no-images

Output

output/pdf2markdown/TIMESTAMP_docname/
├── combined_markdown.md       # Final markdown for extraction
├── page-0001.md
├── images/
│   └── page-0001.jpeg
└── vision_diffs/
    └── page-0001-round-1.diff

Stage 2: Extraction (Markdown → JSON)

Examples

# Extract all classes
python -m extraction.scripts.extract \
  --markdown pdf2markdown/output/TIMESTAMP_doc/combined_markdown.md

# Force chunking (still uses llm_config.yml sizes/thresholds)
python -m extraction.scripts.extract \
  --markdown pdf2markdown/output/TIMESTAMP_doc/combined_markdown.md \
  --chunking

# Specific classes only
python -m extraction.scripts.extract \
  --markdown path/to/combined_markdown.md \
  --class-names City CityAnnualStats Initiative

# Different model
python -m extraction.scripts.extract \
  --markdown path/to/combined_markdown.md \
  --model anthropic/claude-3.5-sonnet

Available Classes

City                  CityAnnualStats        ClimateCityContract    Sector
EmissionRecord        CityBudget             BudgetFunding          FundingSource
Initiative            InitiativeStakeholder  Indicator              IndicatorValue
CityTarget            InitiativeIndicator    TefCategory            InitiativeTef

Output Example

// output/extraction/CityAnnualStats.json
[
  {
    "year": 2023,
    "population": 628718,
    "populationDensity": 2129,
    "notes": "As at 31.12.2023"
  },
  {
    "year": 2019,
    "notes": "Baseline year for GHG inventory"
  }
]
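
A minimal sketch of re-validating extracted JSON against the repo's schemas, assuming database/schemas.py exposes a Pydantic v2 CityAnnualStats model:

import json
from pathlib import Path

from database.schemas import CityAnnualStats

records = json.loads(Path("output/extraction/CityAnnualStats.json").read_text(encoding="utf-8"))
stats = [CityAnnualStats.model_validate(record) for record in records]
print(f"{len(stats)} valid CityAnnualStats records")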

Key Features

✅ Year Extraction - Properly extracts years from: "As at 31.12.2023" → 2023, "base year 2019" → 2019, "by 2030" → 2030

✅ Tool Calling - Uses function calls for structured output

✅ Validation - Pydantic models ensure data integrity

✅ Duplicate Detection - Skips duplicate records

✅ Error Reporting - Detailed logs show validation results

✅ Large Document Chunking - Auto-chunks Markdown above 300k tokens, preserves paragraph/sentence boundaries, and keeps tables intact (configured in llm_config.yml); see the sketch below.
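
An illustrative sketch of overlap-preserving, paragraph-boundary chunking using the sizes from llm_config.yml; the repo's actual splitter lives in the extraction module and may differ. Token counts are approximated at ~4 characters per token:

def approx_tokens(s: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return len(s) // 4

def chunk_markdown(text: str, chunk_tokens: int = 200_000, overlap_tokens: int = 10_000) -> list[str]:
    """Pack paragraphs into ~chunk_tokens chunks, carrying ~overlap_tokens
    of trailing paragraphs forward into the next chunk."""
    chunks, current, size = [], [], 0
    fresh = False  # does `current` hold anything beyond carried overlap?
    for para in text.split("\n\n"):
        current.append(para)
        size += approx_tokens(para)
        fresh = True
        if size >= chunk_tokens:
            chunks.append("\n\n".join(current))
            # Rebuild the overlap window from the trailing paragraphs.
            carried, carried_size = [], 0
            for p in reversed(current):
                carried.insert(0, p)
                carried_size += approx_tokens(p)
                if carried_size >= overlap_tokens:
                    break
            current, size, fresh = carried, carried_size, False
    if fresh:
        chunks.append("\n\n".join(current))
    return chunks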


Stage 3: Mapping (Optional)

Link foreign keys between extracted records. Reads from output/extraction by default and writes to output/mapping.
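
Conceptually, mapping resolves parent/child references between the extracted JSON files. A purely illustrative example with hypothetical field names (the real mapper is LLM-driven, and the actual fields come from database/schemas.py):

# Before mapping: records extracted independently, no cross-references.
city = {"id": "city-001", "name": "Dresden"}
emission = {"year": 2023, "sectorName": "Transport", "co2eTonnes": 512000}

# After mapping: the child record carries its parent's key.
emission_linked = {**emission, "cityId": city["id"]}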

Quick Start

# Link foreign keys (uses default input/output directories)
python -m mapping.scripts.mapping --apply

# Delete old mappings before re-running
python -m mapping.scripts.mapping --apply --delete-old

# With custom model
python -m mapping.scripts.mapping --apply --model gpt-4

Advanced Usage

# Map specific table only
python -m mapping.scripts.mapping --apply --only-table EmissionRecord

# Custom input/output directories
python -m mapping.scripts.mapping --apply \
  --input-dir extraction/output \
  --work-dir custom/mapping/dir

# Review mappings without applying
python -m mapping.scripts.mapping --review

Defaults

  • Input: output/extraction/ (extraction outputs)
  • Output: output/mapping/ (linked records)

Full Workflow Example

# 1. Extract from markdown
python -m extraction.scripts.extract \
  --markdown output/pdf2markdown/20260120_184105_ccc_leipzig/combined_markdown.md \
  --output-dir output/extraction \
  --overwrite

# 2. Map foreign keys (uses default dirs: output/extraction → output/mapping)
python -m mapping.scripts.mapping --apply --delete-old

# 3. Validate mappings
python -m mapping.scripts.mapping --review

Stage 4: Load Into DB (Optional)

# Validate only
python -m app.modules.db_insert.scripts.load_mapped_data --dry-run

# Insert after validation
python -m app.modules.db_insert.scripts.load_mapped_data

Requires DATABASE_URL (or DB_URL) in .env. Reports are written to output/db_load_reports/.

Test the DB connection:

python -m app.scripts.test_db_connection

Check row counts and sample rows:

python -m app.scripts.test_insert

Project Structure

project_root/
├── pdf2markdown/              # Stage 1: PDF → Markdown
├── extraction/                # Stage 2: Markdown → JSON
│   ├── prompts/              # LLM prompts by class
│   ├── tools/                # Tool definitions
│   ├── utils/                # Validation & parsing
│   ├── output/               # Extracted JSON files
│   └── extract.py            # Core logic
├── mapping/                   # Stage 3: Link records
├── database/
│   └── schemas.py            # Pydantic schemas (16 classes)
├── documents/                # Input PDFs
├── tests/
├── llm_config.yml            # Model configuration
├── run_pipeline.py           # Full pipeline
├── requirements.txt
└── README.md

Typical Workflows

Single Document

python -m pdf2markdown.pdf_to_markdown --input documents/my_city.pdf
python -m extraction.scripts.extract --markdown pdf2markdown/output/TIMESTAMP_my_city/combined_markdown.md
cat output/extraction/City.json

Batch Processing

python -m run_pipeline
# Results in: output/pdf2markdown/, output/extraction/, output/mapping/

Test Specific Class

python -m extraction.scripts.extract --markdown existing.md --class-names CityAnnualStats

Python API

PDF to Markdown

from pathlib import Path
from mistralai import Mistral
from openai import OpenAI
from pdf2markdown.pdf_to_markdown import pdf_to_markdown_mistral

mistral = Mistral(api_key="sk-...")
vision = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")

output = pdf_to_markdown_mistral(
    pdf_path=Path("documents/sample.pdf"),
    output_root=Path("pdf2markdown/output"),
    client=mistral,
    vision_client=vision,
    vision_model="google/gemini-3-flash-preview",
)

Extract Data

from pathlib import Path

from openai import OpenAI
from extraction.extract import run_class_extraction
from database.schemas import City

client = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")

# Markdown produced by Stage 1
markdown_content = Path("path/to/combined_markdown.md").read_text(encoding="utf-8")

run_class_extraction(
    client=client,
    model_name="deepseek/deepseek-v3.2",
    system_prompt="...",
    user_template="...",
    markdown_text=markdown_content,
    model_cls=City,
    output_dir=Path("extraction/output"),
)

Troubleshooting

  • Missing Mistral API key → set MISTRAL_API_KEY in .env
  • Vision refinement fails → check OPENROUTER_API_KEY in .env
  • "Missing required field: year" → the Markdown may lack year info; check extraction/debug_logs/
  • OpenRouter API error → verify the API key has credits
  • Large PDF timeout → use --max-upload-bytes 5242880 to split uploads into 5MB chunks

Recent Fixes (January 2026)

✅ Model Selection - Switched extraction to deepseek/deepseek-v3.2 for superior tool calling (google/gemini was generating empty objects)

✅ Year Extraction - Enhanced CityAnnualStats.md prompt with explicit examples for extracting years from varied text patterns

✅ Error Messages - Improved validation feedback to show exactly which fields are missing and what data was received


Architecture Details

PDF → Markdown Flow

PDF → [Mistral OCR] → Markdown + Images
  ↓
[2-Page Windows] → {image_left, markdown_left, image_right, markdown_right}
  ↓
[Vision Agent] → Tool calls → Edits
  ↓
Final Markdown ✓
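
A hypothetical sketch of one 2-page refinement window; the function and message layout are illustrative, not the repo's actual API:

import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://openrouter.ai/api/v1")

def refine_window(left_md: Path, right_md: Path, left_img: Path, right_img: Path) -> str:
    """Ask a vision model to reconcile OCR markdown against the page images."""
    def data_url(p: Path) -> str:
        return "data:image/jpeg;base64," + base64.b64encode(p.read_bytes()).decode()

    response = client.chat.completions.create(
        model="google/gemini-3-flash-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Fix OCR errors in this markdown using the two page images:\n\n"
                         + left_md.read_text() + "\n\n" + right_md.read_text()},
                {"type": "image_url", "image_url": {"url": data_url(left_img)}},
                {"type": "image_url", "image_url": {"url": data_url(right_img)}},
            ],
        }],
    )
    return response.choices[0].message.content

The real agent applies its edits through tool calls and may run multiple rounds (see vision_diffs/); this sketch returns corrected markdown directly for brevity.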

How to Extend

  1. New extraction class? Add a model to database/schemas.py and a prompt to extraction/prompts/ (see the sketch after this list)
  2. Different PDF pipeline? Modify pdf2markdown/pdf_to_markdown.py
  3. Custom mapping? Edit mapping/mappers/
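
For step 1, a hypothetical new class might look like this; the field names are illustrative and must match whatever the corresponding prompt in extraction/prompts/ asks the LLM for:

from typing import Optional

from pydantic import BaseModel

class AirQualityRecord(BaseModel):
    year: int
    pollutant: str                 # e.g. "PM2.5"
    annualMeanUgM3: float          # annual mean concentration in µg/m³
    notes: Optional[str] = None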

License

See LICENSE.md

Last Updated: January 16, 2026
