Header-like text labeled as text in PDF output #2920

anilkumar3313 · 2026-01-26T15:07:21Z

In some PDFs (e.g., SEC filings), visually header-like lines are labeled as text instead of section_header, and there’s no hierarchy between headers.

Is this expected behavior for PDFs? Any recommended way to detect header hierarchy or promote such lines to headers?

dosubot[bot] (bot) · 2026-01-26T15:11:43Z

Hi @anilkumar3313! I'm Dosu and I’m helping the docling team.

Yes, this is expected behavior for PDFs. Unlike DOCX or HTML, PDFs usually lack explicit, machine-readable structure for headers—so visually header-like lines are often labeled as plain text or as SECTION_HEADER without any hierarchy. Docling’s layout model relies on visual cues and model predictions, but if a line isn’t confidently recognized as a header, it will be labeled as TEXT, and all detected headers default to a single level with no hierarchy. There’s no built-in configuration or heuristic in Docling to promote such lines to headers or infer header hierarchy out of the box [source].

To detect header hierarchy or promote header-like lines to headers in PDFs, the recommended approach is to use the community package docling-hierarchical-pdf. It works as a post-processing step, inferring heading levels using heuristics like font size, boldness, numbering, and PDF bookmarks if available. For PDFs with a table of contents or bookmarks, you can also extract the TOC using PyMuPDF and use it to reconstruct the hierarchy in post-processing [source].

Other options include custom post-processing based on regex or font/style analysis, or using LLM-based re-ranking on the Markdown output, but these require document-specific tuning and aren’t generalizable [source].
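
As a rough illustration of the regex/style option, here is a minimal sketch that flags header-like lines from text alone, for cases where font metrics are unavailable. The function name `looks_like_header`, the word limit, and the rules themselves are assumptions for illustration, not part of Docling; they would need tuning per document set.

```python
import re

# Hypothetical helper: the rules and thresholds below are illustrative
# assumptions, not Docling behavior.
NUMBERED = re.compile(r"^(\d+(\.\d+)*\.?|[A-Z]\.)\s+\S")
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}

def looks_like_header(line: str, max_words: int = 12) -> bool:
    """Guess whether a line of text reads like a heading."""
    text = line.strip()
    words = text.split()
    if not words or len(words) > max_words:
        return False  # empty, or too long to be a heading
    if text.startswith("(") and text.endswith(")"):
        return False  # parentheticals such as "(Unaudited)"
    if text[-1] in ".,;":
        return False  # sentence-like punctuation at the end
    if NUMBERED.match(text):
        return True   # "2.1 Revenue recognition", "A. Overview"
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False  # pure numbers, e.g. a table cell
    if all(c.isupper() for c in letters):
        return True   # ALL-CAPS line
    # Title Case: every significant word starts with a capital letter.
    significant = [w for w in words
                   if w[0].isalpha() and w.lower() not in SMALL_WORDS]
    return bool(significant) and all(w[0].isupper() for w in significant)
```

A check like this over-triggers on short capitalized lines (statement titles, table captions), so it is best treated as a candidate generator whose output is filtered afterwards.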

If you need more targeted advice, sharing details about your workflow or sample PDFs would help narrow down the best approach.

anilkumar3313 · 2026-01-26T15:42:41Z

@dosu,
Thanks for the detailed explanation — that clarifies the expected behavior for PDFs.

To share more context: I'm working with SEC 10-K / 10-Q filings (including draft versions), and my goal is to generate a machine-readable JSON + Markdown representation with proper document hierarchy, suitable for downstream retrieval and analytics.

Current workflow

Parse the PDF using Docling to obtain layout blocks (TEXT, tables, images, etc.)
Post-process the output to reconstruct section and subsection hierarchy

Target output structure

section
├─ sub_section
│ ├─ elements: [text, tables, bullet points, images]
│ ├─ associated footnotes
│ ├─ start_page
│ └─ end_page

Key requirements

Preserve section order exactly as it appears in the input PDF
Promote visually header-like lines currently labeled as TEXT
Treat standardized SEC “Item X / Item XA” headings as top-level sections
Group tables, figures, and footnotes under the nearest logical section

Additional constraint
Some draft 10-K / 10-Q PDFs do not contain bookmarks or a Table of Contents, so I need a robust fallback strategy when no explicit TOC structure is available.

Based on this, I wanted to confirm:

Is docling-hierarchical-pdf the recommended approach for inferring section/subsection hierarchy for SEC filings, especially when Docling output does not expose font metrics?
In the absence of TOC/bookmarks, is using SEC “Item” headings as H1 anchors + heuristics (numbering, layout, typography) the intended fallback approach?
When grouping blocks under sections and computing start_page / end_page, are there specific Docling fields (e.g., bbox overlap, reading order, provenance) you recommend prioritizing?

dosubot[bot] bot Jan 26, 2026

Yes, docling-hierarchical-pdf is the recommended way to infer section and subsection hierarchy for SEC filings parsed with Docling, especially when font metrics aren’t exposed. It’s designed as a post-processing step and works with both scanned and text-based PDFs. It uses heuristics like font size, boldness, numbering, and bookmarks if available, but crucially, it also works when those cues are missing or unreliable—making it a good fit for draft 10-K/10-Qs without TOC or bookmarks. This approach is widely used for SEC filings and similar documents where structure must be reconstructed from layout and content cues rather than explicit metadata [source, source, source].

When TOC/bookmarks are missing, using SEC “Item” headings (like “Item 1”, “Item 1A”, etc.) as H1 anchors, combined with heuristics based on numbering, layout, and typography, is the intended and community-endorsed fallback. docling-hierarchical-pdf does this out of the box, promoting standardized SEC headings and grouping content accordingly. Regex-based chunkers or custom scripts can supplement this if you need to tune for edge cases, but most users find the package’s defaults sufficient for SEC filings [source, source].
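
To make the "Item headings as H1 anchors" idea concrete, here is a minimal sketch of a regex matcher for SEC "Item" and "Part" headings. It is independent of docling-hierarchical-pdf; the function name `sec_anchor`, the return shape, and the `max_title_words` guard are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns for standardized SEC headings such as
# "Item 1A. Risk Factors" or "PART II". The dash class includes
# en and em dashes via \u escapes.
ITEM_RE = re.compile(
    r"^\s*item\s+(\d{1,2}[A-C]?)\b\s*[.:\-\u2013\u2014]?\s*(.*)$",
    re.IGNORECASE,
)
PART_RE = re.compile(
    r"^\s*part\s+(IV|I{1,3})\b\s*[.:\-\u2013\u2014]?\s*(.*)$",
    re.IGNORECASE,
)

def sec_anchor(line: str, max_title_words: int = 12):
    """Return (kind, identifier, title) for an Item/Part heading, else None.

    max_title_words is an assumed guard against body sentences that merely
    begin with "Item 1 ...".
    """
    for kind, pattern in (("item", ITEM_RE), ("part", PART_RE)):
        match = pattern.match(line)
        if match:
            identifier = match.group(1).upper()
            title = match.group(2).strip()
            if len(title.split()) > max_title_words:
                return None
            return (kind, identifier, title)
    return None
```

For example, `sec_anchor("Item 1A. Risk Factors")` yields `("item", "1A", "Risk Factors")`, while a parenthetical like "(Unaudited)" yields `None`. Cross-references in body text ("see Item 7") only match when they start the line, so checking the block's position or length is still advisable.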

For grouping blocks under sections and computing start_page/end_page, prioritize these Docling fields:

prov (provenance): Each block (text, table, image, etc.) carries provenance entries with page number and bounding box (bbox). Use these to determine where a section starts and ends by aggregating the provenance of all child elements.
bbox (bounding box): Use for spatial grouping—blocks that are close or adjacent in layout are likely to belong together. The BoundingBox class provides methods for overlap, intersection, and union, which are useful for grouping and for computing section boundaries.
Parent/children tree: After running docling-hierarchical-pdf, the DoclingDocument’s tree structure will reflect the inferred hierarchy. Section order is preserved by the order of items in the tree, and grouping is managed by parent/child relationships.
Reading order: Provenance and bbox are also used by Docling’s reading order visualizer and post-processing tools to ensure elements are sequenced as in the original PDF [source, source].
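
The grouping and page-range logic described above can be sketched as follows. This is a minimal illustration, assuming items arrive in reading order and each carries a `prov` list whose entries have a `page_no` attribute (as Docling items do); the `Section` class, `group_by_section`, and the `is_header` callback are hypothetical names, and the demo uses stand-in objects rather than a real DoclingDocument.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Optional

FRONT_MATTER = "(front matter)"

@dataclass
class Section:
    title: str
    items: list = field(default_factory=list)
    start_page: Optional[int] = None
    end_page: Optional[int] = None

def pages_of(item):
    """Page numbers from an item's provenance entries (empty if none)."""
    return [p.page_no for p in (getattr(item, "prov", None) or [])]

def group_by_section(items, is_header):
    """Group reading-ordered items under the most recent header and
    track the page span of each section, header included."""
    sections = [Section(FRONT_MATTER)]
    for item in items:
        if is_header(item):
            sections.append(Section(item.text))
        else:
            sections[-1].items.append(item)
        current = sections[-1]
        for page in pages_of(item):
            current.start_page = page if current.start_page is None else min(current.start_page, page)
            current.end_page = page if current.end_page is None else max(current.end_page, page)
    if not sections[0].items:
        sections.pop(0)  # nothing appeared before the first header
    return sections

# Demo with stand-in objects (not real Docling items).
def make_item(text, page, header=False):
    return SimpleNamespace(text=text, header=header,
                           prov=[SimpleNamespace(page_no=page)])

demo_items = [
    make_item("Item 1. Business", 3, header=True),
    make_item("We make widgets.", 3),
    make_item("More detail.", 4),
    make_item("Item 1A. Risk Factors", 5, header=True),
    make_item("Risks include...", 6),
]
sections = group_by_section(demo_items, is_header=lambda it: it.header)
```

With a real document, the items would come from `doc.iterate_items()` (which yields `(item, level)` pairs), and `is_header` would combine the label check with whatever Item/Part filtering is in use.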

Typical workflow:

Parse the PDF with Docling to get a DoclingDocument.
Run docling-hierarchical-pdf’s postprocessor (e.g., ResultPostprocessor(result).process()) to infer hierarchy and update the DoclingDocument in-place.
Export to Markdown or JSON—the hierarchy, section grouping, and page boundaries will now be reflected in the output.

If you want to experiment with alternatives, madin is another tool for imposing hierarchy on Markdown, but for SEC filings and Docling output, docling-hierarchical-pdf is the most robust and widely used solution [source, source].

Let me know if you want a code example for integrating docling-hierarchical-pdf into your pipeline!

anilkumar3313 Jan 26, 2026
Author

@dosu Thanks — I’m using docling-hierarchical-pdf, but on 10-Q/10-K it often promotes financial statement titles and unit lines as headings (e.g., “CONSOLIDATED STATEMENTS OF INCOME”, “(Dollars in thousands)”, “(Unaudited)”, “(continued)”), which creates a messy hierarchy.
Is there a recommended way to post-filter / override the inferred headers from docling-hierarchical-pdf?
Also, for SEC specifically, would you recommend treating Item/Part as the only H1 anchors and then demoting “statement titles” into content elements, with Notes as H2 only under a Notes context? Any best-practice examples/repos you can point to?

dosubot[bot] bot Jan 26, 2026

Yes, over-promotion of financial statement titles and unit lines as headings is a known issue with docling-hierarchical-pdf—there’s no built-in config to suppress these, so post-processing is the standard approach for SEC filings. The best practice is to treat only standardized “Item” or “Part” headings (like “Item 1”, “Item 1A”, “Part II”) as H1 anchors, demote statement titles (e.g., “CONSOLIDATED STATEMENTS OF INCOME”, “(Dollars in thousands)”, “(Unaudited)”, “(continued)”) to content, and treat Notes as H2 only under a Notes context. This is typically done with a custom script after running docling-hierarchical-pdf [source, source].

Here’s a code pattern for post-filtering in Python using the DoclingDocument API:

import re
from copy import deepcopy

# Example patterns for unwanted headings
UNWANTED_HEADER_PATTERNS = [
    r"CONSOLIDATED STATEMENTS? OF .*",
    r"\(Dollars in thousands\)",
    r"\(Unaudited\)",
    r"\(continued\)",
    # Add more as needed
]

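def is_unwanted_header(text):
    # True when the whole (stripped) line matches one of the patterns, case-insensitively.
    return any(re.fullmatch(pat, text.strip(), re.IGNORECASE) for pat in UNWANTED_HEADER_PATTERNS)

def filter_headers(doc):
    # Work on a copy so the original DoclingDocument is left untouched.
    doc_copy = deepcopy(doc)
    to_remove = []
    # Collect first and delete afterwards, so the tree is not mutated while iterating.
    for item, _ in doc_copy.iterate_items():
        if getattr(item, "label", None) == "section_header" and is_unwanted_header(getattr(item, "text", "")):
            to_remove.append(item)
    # Caution: deleting drops the header text entirely, and may also drop any
    # child items nested under it. Demote instead if that content matters.
    doc_copy.delete_items(node_items=to_remove)
    return doc_copy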

# Usage:
# doc = ...  # your DoclingDocument after docling-hierarchical-pdf
# doc_clean = filter_headers(doc)

This approach safely removes unwanted headers and keeps the document tree valid [source].

If you want to demote a header (e.g., turn it into a plain text block instead of deleting), you can use replace_item to swap the SectionHeaderItem for a TextItem at the same position.

For best-practice hierarchy: treat only “Item/Part” as H1, demote statement titles, and handle Notes as H2 under a Notes context. This logic is usually implemented in the same post-processing script, using regex to match “Item X” as H1 and “Note” as H2 only within the Notes section [source].
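
The level rules in the paragraph above can be sketched as a small state machine over candidate heading strings. This is an illustration under stated assumptions, not an established recipe: `assign_levels` and all patterns are hypothetical, the "Notes to ... Financial Statements" heading is assumed to open the Notes context at level 2, and any other unmatched heading is assumed to default to level 2.

```python
import re

# Hypothetical patterns; extend them for the filings at hand.
ANCHOR_RE = re.compile(
    r"^\s*(item\s+\d{1,2}[A-C]?\b|part\s+(?:IV|I{1,3})\b)", re.IGNORECASE)
NOTES_START_RE = re.compile(
    r"^\s*notes\s+to\b.*financial\s+statements", re.IGNORECASE)
NOTE_RE = re.compile(r"^\s*note\s+\d+\b", re.IGNORECASE)
DEMOTE_RE = re.compile(
    r"^\s*((?:condensed\s+)?consolidated\s+.*|\(.*\))\s*$", re.IGNORECASE)

def assign_levels(headings):
    """Map each candidate heading to 1, 2, or None (None = demote to content)."""
    levels = []
    in_notes = False
    for text in headings:
        if ANCHOR_RE.match(text):
            in_notes = False          # a new Item/Part closes the Notes context
            levels.append(1)
        elif NOTES_START_RE.match(text):
            in_notes = True           # assumed: this heading opens the context
            levels.append(2)
        elif NOTE_RE.match(text):
            levels.append(2 if in_notes else None)
        elif DEMOTE_RE.match(text):
            levels.append(None)       # statement titles, unit lines, "(continued)"
        else:
            levels.append(2)          # assumed default for other headings
    return levels
```

Headings mapped to `None` are the ones to turn back into plain text blocks; the remaining levels can then be written onto the corresponding section headers.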

There’s no public repo with a full SEC-specific pipeline, but you’ll find practical code and discussion in these issues: #287, #2335, #2037, #1023. If you want to further refine Markdown output, the madin library can impose hierarchical headings post-export.

Let me know if you want a more detailed example for demoting headers or handling Notes!
