DocStruct

PDF document structure recovery system using parser-OCR cross-validation.

Overview

DocStruct extracts structured content from PDF documents by combining two independent analysis paths: a parser track that analyzes embedded text and fonts, and an OCR track that processes rendered page images. The fusion engine merges both hypotheses, resolving conflicts and assigning confidence scores.

Features

Block Type Classification: Automatically detects and classifies text, tables, figures, and math equations
Dual-Track Analysis: Parser-based and OCR-based layout hypotheses
Confidence Scoring: Each element is tagged with provenance (parser/ocr/fused) and a confidence score (0-1)
LaTeX OCR: Extracts mathematical equations as LaTeX using pix2tex
Multiple Export Formats:
- JSON: Structured data with full metadata
- Markdown: Text with embedded images for tables/figures
- TXT: Plain text with block type annotations
- HTML: Interactive debug viewer

Installation

Requirements

Rust 1.93.0+
Python 3.12+
poppler-utils (pdfinfo, pdftotext, pdftoppm)
tesseract 5.3+

Setup

# Install system dependencies (Ubuntu/Debian)
sudo apt install poppler-utils tesseract-ocr

# Install Python dependencies
pip install -r requirements.txt

# Build
cargo build --release

Usage

./target/release/docstruct <input.pdf> --out <output_dir> --dpi 200
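
To process several PDFs in one go, the command above can be wrapped in a small script. This is a minimal sketch, not part of DocStruct: `build_command` and `run_batch` are hypothetical helper names, and the script assumes the release binary sits at the path produced by the build step. Only the `--out` and `--dpi` flags shown above are used.

```python
import subprocess
from pathlib import Path

# Assumption: binary location as produced by `cargo build --release`.
DOCSTRUCT_BIN = "./target/release/docstruct"


def build_command(pdf_path, out_dir, dpi=200):
    """Return the argv list for one run, using the same flags as the usage line."""
    return [DOCSTRUCT_BIN, str(pdf_path), "--out", str(out_dir), "--dpi", str(dpi)]


def run_batch(input_dir, output_root, dpi=200):
    """Run DocStruct once per PDF, writing each result to its own subdirectory.

    Hypothetical helper: one output directory per input file, named after
    the PDF's stem.
    """
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        out_dir = Path(output_root) / pdf.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if the run exits non-zero.
        subprocess.run(build_command(pdf, out_dir, dpi), check=True)
```

Calling `run_batch("pdfs", "results")` would then produce `results/<name>/document.json` and the other output files for every PDF in `pdfs/`.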

Output Files

output_dir/
├── document.json         # Structured data (all blocks, lines, spans)
├── document.md          # Markdown with embedded images
├── document.txt         # Plain text with block markers
├── page_001.md          # Per-page markdown
├── figures/
│   └── page_NNN_TYPE__NN.png  # Extracted images
└── debug/
    ├── page_001.html    # Interactive debug viewer
    └── page_001.png     # Rendered page image

Architecture

Pipeline

Parser Track: Extract text positions from PDF internal structure
OCR Track: Render pages to images, detect blocks, classify types, run OCR
Fusion: Align blocks, compare content, resolve conflicts, assign confidence
Export: Generate JSON, Markdown, TXT, and HTML outputs
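
The fusion step can be illustrated with a short sketch. It is not DocStruct's actual implementation: it assumes blocks are aligned by bounding-box overlap (IoU), that content is compared with `difflib.SequenceMatcher`, and that each block carries a flattened `text` field. The function names, the 0.5 IoU cutoff, and the 0.5 placeholder confidence for unmatched blocks are all hypothetical.

```python
from difflib import SequenceMatcher


def iou(a, b):
    """Intersection-over-union of two bboxes given as dicts with x0, y0, x1, y1."""
    ix = max(0.0, min(a["x1"], b["x1"]) - max(a["x0"], b["x0"]))
    iy = max(0.0, min(a["y1"], b["y1"]) - max(a["y0"], b["y0"]))
    inter = ix * iy
    area_a = (a["x1"] - a["x0"]) * (a["y1"] - a["y0"])
    area_b = (b["x1"] - b["x0"]) * (b["y1"] - b["y0"])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(parser_blocks, ocr_blocks, min_iou=0.5):
    """Greedily pair parser and OCR blocks, then tag provenance and confidence."""
    fused = []
    used = set()  # indices of OCR blocks already paired
    for p in parser_blocks:
        best_j, best = None, 0.0
        for j, o in enumerate(ocr_blocks):
            if j in used:
                continue
            score = iou(p["bbox"], o["bbox"])
            if score > best:
                best_j, best = j, score
        if best_j is not None and best >= min_iou:
            used.add(best_j)
            # Assumption: text similarity between the two tracks is used
            # directly as the confidence of the fused block.
            sim = SequenceMatcher(None, p["text"], ocr_blocks[best_j]["text"]).ratio()
            fused.append({"bbox": p["bbox"], "text": p["text"],
                          "source": "fused", "confidence": round(sim, 2)})
        else:
            # No OCR counterpart: keep the parser block with a placeholder score.
            fused.append({**p, "source": "parser", "confidence": 0.5})
    for j, o in enumerate(ocr_blocks):
        if j not in used:
            # OCR-only block, e.g. content with no embedded text layer.
            fused.append({**o, "source": "ocr", "confidence": 0.5})
    return fused
```

Greedy matching keeps the sketch short. A production aligner would more likely solve the pairing globally and handle one-to-many splits between the two tracks.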

Block Classification

The OCR bridge classifies blocks based on visual features:

Math: Pattern matching (∫∑∏∂∇, Greek letters, function names) + symbol density
Figure: High edge density (>0.08) over a large area (>50000 pixels), typical of graphics and diagrams
Table: Grid structure detection (horizontal and vertical line density both >0.01)
Text: Default classification

Coordinate System

All coordinates are in page pixel space based on the rendering DPI (default 200).
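
Since PDF user space is measured in points (1/72 inch), converting between PDF points and page pixels is a matter of scaling by the DPI. The helpers below are a minimal sketch with hypothetical names. Whether DocStruct applies exactly this scaling internally, and how it handles the PDF's bottom-left origin, is not stated here and is left as an assumption.

```python
POINTS_PER_INCH = 72.0  # PDF default user space unit


def points_to_pixels(value_pt, dpi=200):
    """Scale a length in PDF points to pixels at the given rendering DPI."""
    return value_pt * dpi / POINTS_PER_INCH


def pixels_to_points(value_px, dpi=200):
    """Inverse of points_to_pixels."""
    return value_px * POINTS_PER_INCH / dpi


# A US Letter page is 612 x 792 points; at 200 DPI that is 1700 x 2200 pixels.
letter_px = (points_to_pixels(612), points_to_pixels(792))
```

Bounding boxes produced at one DPI are therefore not directly comparable with output rendered at another; rescale by the ratio of the two DPI values first.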

Document Schema

{
  "pages": [
    {
      "page_idx": 0,
      "class": "digital",
      "width": 1000,
      "height": 1400,
      "blocks": [
        {
          "type": "TextBlock",
          "bbox": {"x0": 10.0, "y0": 20.0, "x1": 400.0, "y1": 80.0},
          "lines": [{"spans": [...]}],
          "confidence": 0.85,
          "source": "fused"
        },
        {
          "type": "MathBlock",
          "bbox": {"x0": 50.0, "y0": 100.0, "x1": 300.0, "y1": 150.0},
          "latex": "\\int_{0}^{\\infty} e^{-x} dx",
          "confidence": 0.72,
          "source": "ocr"
        }
      ]
    }
  ]
}
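
As an illustration of consuming this schema, the sketch below lists blocks whose confidence falls under a threshold so they can be reviewed by hand. It relies only on the fields shown above (`pages`, `page_idx`, `blocks`, `type`, `confidence`, `source`). The function name and the 0.8 threshold are hypothetical, and the inline sample stands in for a real `document.json`.

```python
import json

# Trimmed-down sample mirroring the schema above; in practice, load
# output_dir/document.json with json.load() instead.
SAMPLE = json.loads("""
{
  "pages": [
    {
      "page_idx": 0,
      "blocks": [
        {"type": "TextBlock", "confidence": 0.85, "source": "fused"},
        {"type": "MathBlock", "confidence": 0.72, "source": "ocr"}
      ]
    }
  ]
}
""")


def low_confidence_blocks(document, threshold=0.8):
    """Return (page_idx, type, source) for every block below the threshold."""
    hits = []
    for page in document.get("pages", []):
        for block in page.get("blocks", []):
            if block.get("confidence", 0.0) < threshold:
                hits.append((page["page_idx"], block["type"], block["source"]))
    return hits


flagged = low_confidence_blocks(SAMPLE)
```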

Project Structure

src/
  core/           # Geometry, data models, confidence scoring
  parser/         # PDF text extraction and layout analysis
  ocr/            # Image rendering, OCR bridge, layout building
  fusion/         # Hypothesis alignment and conflict resolution
  export/         # JSON, Markdown, TXT, HTML exporters
ocr/bridge/       # Python OCR integration (Tesseract, pix2tex)
test/             # Test documents
docs/             # Architecture and implementation documentation

Debug Viewer

The HTML debug viewer provides interactive visualization:

Color-coded blocks by type (text/table/figure/math)
Click blocks to view parser text, OCR text, confidence, and similarity scores
Toggle between parser, OCR, and fused hypotheses

Configuration

Key parameters in ocr/bridge/ocr_bridge.py:

detect_blocks(
    min_area=2000,           # Minimum block area in pixels
    merge_kernel=(15, 10)    # Morphological kernel for block merging
)

Classification thresholds:

Math: symbol_density > 0.2 or 2+ pattern matches
Figure: edge_density > 0.08 and area > 50000
Table: h_density > 0.01 and v_density > 0.01
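
Read as a decision procedure, the thresholds above might combine as in the sketch below. The function name, the argument list, and the order in which the rules are tried (math, then figure, then table, falling back to text) are assumptions made for illustration; the real logic lives in ocr/bridge/ocr_bridge.py and may differ.

```python
def classify_block(symbol_density=0.0, pattern_matches=0, edge_density=0.0,
                   area=0, h_density=0.0, v_density=0.0):
    """Map precomputed block features to a block type using the listed thresholds."""
    # Math: dense in math symbols, or at least two math pattern matches.
    if symbol_density > 0.2 or pattern_matches >= 2:
        return "math"
    # Figure: many edges over a large region.
    if edge_density > 0.08 and area > 50000:
        return "figure"
    # Table: both horizontal and vertical ruling lines are present.
    if h_density > 0.01 and v_density > 0.01:
        return "table"
    # Text is the default classification.
    return "text"
```

For example, `classify_block(edge_density=0.1, area=60000)` returns `"figure"`, while the same edge density on a small block of area 1000 falls through to the table and text rules.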

Testing

# Unit tests
cargo test

# Integration test
./target/release/docstruct test/test_document.pdf --out test_output --dpi 200

License

MIT
