PDF document structure recovery system using parser-OCR cross-validation.
DocStruct extracts structured content from PDF documents by combining two independent analysis paths: a parser track that analyzes embedded text and fonts, and an OCR track that processes rendered page images. The fusion engine merges both hypotheses, resolving conflicts and assigning confidence scores.
- Block Type Classification: Automatically detects and classifies text, tables, figures, and math equations
- Dual-Track Analysis: Parser-based and OCR-based layout hypotheses
- Confidence Scoring: Each element tagged with provenance (parser/ocr/fused) and confidence (0-1)
- Multiple Export Formats:
  - JSON: Structured data with full metadata
  - Markdown: Text with embedded images for tables/figures
  - TXT: Plain text with block type annotations
  - HTML: Interactive debug viewer
- LaTeX OCR: Extracts mathematical equations as LaTeX using pix2tex
- Rust 1.93.0+
- Python 3.12+
- poppler-utils (pdfinfo, pdftotext, pdftoppm)
- tesseract 5.3+
```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt install poppler-utils tesseract-ocr

# Install Python dependencies
pip install -r requirements.txt

# Build
cargo build --release
```

```bash
./target/release/docstruct <input.pdf> --out <output_dir> --dpi 200
```

```
output_dir/
├── document.json              # Structured data (all blocks, lines, spans)
├── document.md                # Markdown with embedded images
├── document.txt               # Plain text with block markers
├── page_001.md                # Per-page markdown
├── figures/
│   └── page_NNN_TYPE__NN.png  # Extracted images
└── debug/
    ├── page_001.html          # Interactive debug viewer
    └── page_001.png           # Rendered page image
```
- Parser Track: Extract text positions from PDF internal structure
- OCR Track: Render pages to images, detect blocks, classify types, run OCR
- Fusion: Align blocks, compare content, resolve conflicts, assign confidence
- Export: Generate JSON, Markdown, TXT, and HTML outputs
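The fusion step can be illustrated with a toy sketch in Python (the actual fusion engine lives in the Rust `fusion/` module; the IoU threshold and the equal weighting of spatial overlap and text similarity below are assumptions, not the real scoring formula):

```python
from difflib import SequenceMatcher

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def fuse(parser_block, ocr_block, iou_threshold=0.5):
    """Align one parser block with one OCR block and score the fusion."""
    overlap = iou(parser_block["bbox"], ocr_block["bbox"])
    if overlap < iou_threshold:
        # No spatial agreement: the hypotheses stay separate.
        return None
    sim = SequenceMatcher(None, parser_block["text"], ocr_block["text"]).ratio()
    return {
        "bbox": parser_block["bbox"],  # parser geometry is usually tighter
        "text": parser_block["text"],  # embedded text beats OCR when present
        "source": "fused",
        "confidence": round(0.5 * overlap + 0.5 * sim, 2),
    }
```

When the two tracks agree perfectly, both terms are 1.0 and the fused block gets full confidence; disagreement in either geometry or content lowers the score.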
The OCR bridge classifies blocks based on visual features:
- Math: Pattern matching (∫∑∏∂∇, Greek letters, function names) + symbol density
- Figure: High edge density (>0.08) for graphics and diagrams
- Table: Grid structure detection (horizontal/vertical line density)
- Text: Default classification
All coordinates are in page pixel space based on the rendering DPI (default 200).
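Since PDF user space is defined as 72 points per inch, converting between PDF points and page pixels is a single scale factor (illustrative helpers, not part of the codebase):

```python
def points_to_pixels(pt, dpi=200):
    """PDF user space is 72 points/inch, so scale by dpi / 72."""
    return pt * dpi / 72.0

def pixels_to_points(px, dpi=200):
    """Inverse conversion back to PDF points."""
    return px * 72.0 / dpi
```

At the default 200 DPI, one inch of page (72 points) maps to 200 pixels.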
```json
{
  "pages": [
    {
      "page_idx": 0,
      "class": "digital",
      "width": 1000,
      "height": 1400,
      "blocks": [
        {
          "type": "TextBlock",
          "bbox": {"x0": 10.0, "y0": 20.0, "x1": 400.0, "y1": 80.0},
          "lines": [{"spans": [...]}],
          "confidence": 0.85,
          "source": "fused"
        },
        {
          "type": "MathBlock",
          "bbox": {"x0": 50.0, "y0": 100.0, "x1": 300.0, "y1": 150.0},
          "latex": "\\int_{0}^{\\infty} e^{-x} dx",
          "confidence": 0.72,
          "source": "ocr"
        }
      ]
    }
  ]
}
```

```
src/
  core/        # Geometry, data models, confidence scoring
  parser/      # PDF text extraction and layout analysis
  ocr/         # Image rendering, OCR bridge, layout building
  fusion/      # Hypothesis alignment and conflict resolution
  export/      # JSON, Markdown, TXT, HTML exporters
ocr/bridge/    # Python OCR integration (Tesseract, pix2tex)
test/          # Test documents
docs/          # Architecture and implementation documentation
```
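The exported `document.json` can be consumed directly by downstream tools; a minimal sketch (field names follow the schema example above; `math_blocks` is a hypothetical helper, not part of the tool):

```python
import json

def math_blocks(path):
    """Collect all LaTeX equations from a DocStruct document.json."""
    with open(path) as f:
        doc = json.load(f)
    return [
        (page["page_idx"], block["latex"])
        for page in doc["pages"]
        for block in page["blocks"]
        if block["type"] == "MathBlock"
    ]
```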
The HTML debug viewer provides interactive visualization:
- Color-coded blocks by type (text/table/figure/math)
- Click blocks to view parser text, OCR text, confidence, and similarity scores
- Toggle between parser, OCR, and fused hypotheses
Key parameters in `ocr/bridge/ocr_bridge.py`:

```python
detect_blocks(
    min_area=2000,          # Minimum block area in pixels
    merge_kernel=(15, 10),  # Morphological kernel for block merging
)
```

Classification thresholds:
- Math: `symbol_density > 0.2` or 2+ pattern matches
- Figure: `edge_density > 0.08` and `area > 50000`
- Table: `h_density > 0.01` and `v_density > 0.01`
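Put together, the thresholds above amount to a small decision function (a sketch; the priority order math > figure > table > text is an assumption about how the bridge breaks ties):

```python
def classify_block(symbol_density, pattern_matches, edge_density, area,
                   h_density, v_density):
    """Apply the documented thresholds, assuming math > figure > table priority."""
    if symbol_density > 0.2 or pattern_matches >= 2:
        return "math"
    if edge_density > 0.08 and area > 50000:
        return "figure"
    if h_density > 0.01 and v_density > 0.01:
        return "table"
    return "text"  # default classification
```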
```bash
# Unit tests
cargo test

# Integration test
./target/release/docstruct test/test_document.pdf --out test_output --dpi 200
```

MIT