Header-like text labeled as text in PDF output #2920
Replies: 2 comments 3 replies
Hi @anilkumar3313! I'm Dosu and I'm helping the docling team.

Yes, this is expected behavior for PDFs. Unlike DOCX or HTML, PDFs usually lack explicit, machine-readable structure for headers, so visually header-like lines are often labeled as plain text, or as SECTION_HEADER without any hierarchy. Docling's layout model relies on visual cues and model predictions: if a line isn't confidently recognized as a header, it is labeled as TEXT, and all detected headers default to a single level. There is no built-in configuration or heuristic in Docling to promote such lines to headers or to infer header hierarchy out of the box [source].

To detect header hierarchy or promote header-like lines to headers in PDFs, the recommended approach is the community package docling-hierarchical-pdf. It works as a post-processing step, inferring heading levels from heuristics such as font size, boldness, numbering, and PDF bookmarks if available. For PDFs with a table of contents or bookmarks, you can also extract the TOC using PyMuPDF and use it to reconstruct the hierarchy in post-processing [source].

Other options include custom post-processing based on regex or font/style analysis, or LLM-based re-ranking on the Markdown output, but these require document-specific tuning and do not generalize across document types [source].

If you need more targeted advice, sharing details about your workflow or sample PDFs would help narrow down the best approach.
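As a rough illustration of the regex-based post-processing option, here is a minimal sketch in plain Python. It does not use Docling's API: the input is assumed to be a list of `(label, text)` pairs that you have already pulled out of the converted document in reading order, and `LEVEL_PATTERNS`, `FALLBACK_LEVEL`, `infer_level`, and `build_outline` are hypothetical names chosen for this example. The patterns target SEC-style "PART I" / "Item 1A." headings and would need tuning for other documents.

```python
import re
from typing import Optional

# Hypothetical patterns for SEC-style headings; tune them per document set.
LEVEL_PATTERNS = [
    (1, re.compile(r"^PART\s+[IVX]+\b", re.IGNORECASE)),
    (2, re.compile(r"^ITEM\s+\d+[A-C]?\b", re.IGNORECASE)),
]
# Assumed level for items already labeled as headers that match no pattern.
FALLBACK_LEVEL = 3


def infer_level(text: str) -> Optional[int]:
    """Return a heading level if the text matches a known pattern, else None."""
    stripped = text.strip()
    for level, pattern in LEVEL_PATTERNS:
        if pattern.match(stripped):
            return level
    return None


def build_outline(items):
    """Nest (label, text) pairs into a tree of sections.

    Lines matching a pattern are promoted to headers even if labeled "text";
    other lines are attached as body text to the nearest open section.
    """
    root = {"title": None, "level": 0, "text": [], "children": []}
    stack = [root]
    for label, text in items:
        level = infer_level(text)
        if level is None and label == "section_header":
            level = FALLBACK_LEVEL
        if level is None:
            stack[-1]["text"].append(text)
            continue
        # Close sections at the same or a deeper level before opening a new one.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"title": text.strip(), "level": level, "text": [], "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```

The resulting nested dictionary can be serialized with `json.dumps`, or walked to emit Markdown headings with one `#` per level.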
@dosu, to share more context: I am working with SEC 10-K / 10-Q filings (including draft versions), and my goal is to generate a machine-readable JSON + Markdown representation with proper document hierarchy, suitable for downstream retrieval and analytics.
Current workflow
Target output structure section
Key requirements
Additional constraint
Based on this, I wanted to confirm:
In some PDFs (e.g., SEC filings), visually header-like lines are labeled as text instead of section_header, and there’s no hierarchy between headers.
Is this expected behavior for PDFs? Any recommended way to detect header hierarchy or promote such lines to headers?