Integrate docling-hierarchical-pdf into docling #2591

krrome · 2025-11-06T20:43:44Z

krrome
Nov 6, 2025

Hi all,

I would like to continue my effort to integrate https://github.com/krrome/docling-hierarchical-pdf into docling, which would resolve some open issues in this repo. From the conversations I have had with @PeterStaar-IBM, I understood that the integration would also be in your interest.

I'll start the conversation by pointing out why I didn't already open a PR:

the header level inference based on PDF-bookmarks/table of contents (PDF-metadata) requires a method that extracts this metadata from the PDF. I think the PDF library you are using for parsing would also be able to do that, but I wanted to check with you first before I attempt a PR for the docling-parse repo.
the header level inference based on styles and font size (bounding boxes) performs clustering to clean up slight variations in size, so it is necessary to have the full, parsed document first before any header level can be assigned
the header level inference based on numbering styles also requires the whole document to be parsed first, as it has to estimate (simple threshold) whether the document is likely to have header numbering.
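To make the last two points concrete, here is a minimal sketch of why both heuristics need the whole document up front. It is not the code from docling-hierarchical-pdf: the function names, the gap-based clustering, and the `tolerance` and `min_fraction` defaults are assumptions chosen for illustration.

```python
import re
from typing import List, Optional

# Matches a leading section number such as "1", "2.3" or "2.3.1." (illustrative pattern).
NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def cluster_font_sizes(sizes: List[float], tolerance: float = 0.5) -> List[int]:
    """Map each header font size to a level (1 = largest size cluster)."""
    clusters: List[List[float]] = []
    for size in sorted(set(sizes), reverse=True):
        # Sizes closer than `tolerance` to the previous one join its cluster.
        if clusters and clusters[-1][-1] - size <= tolerance:
            clusters[-1].append(size)
        else:
            clusters.append([size])
    level_of = {s: i + 1 for i, cluster in enumerate(clusters) for s in cluster}
    return [level_of[s] for s in sizes]


def numbering_levels(
    titles: List[str], min_fraction: float = 0.5
) -> Optional[List[Optional[int]]]:
    """Return levels from numbering depth, or None if numbering looks absent."""
    matches = [NUMBERED.match(t) for t in titles]
    numbered = sum(m is not None for m in matches)
    # Simple threshold over ALL headers: this is why the full document is needed.
    if not titles or numbered / len(titles) < min_fraction:
        return None
    return [m.group(1).count(".") + 1 if m else None for m in matches]
```

Both functions take every header of the document at once, so neither can run page by page while the document is still being assembled.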

In the current PDF pipelines (standard and VLM) I couldn't really find "the right spot" to integrate my code, so I went for a "postprocessor". The downside is that, once header level inference is done, I have to walk through the whole document and reassign doc-item parents, which, I guess, risks messing up the document structure. It also doesn't seem very neat and tidy.
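The re-parenting step can be sketched with a stack of open headers. This is only an illustration of the idea, not the postprocessor's actual code: `Node` is a hypothetical stand-in for a doc item and is not a docling-core type.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a document item (not a docling-core class)."""
    text: str
    level: Optional[int] = None  # inferred header level; None for body items
    children: List["Node"] = field(default_factory=list)


def build_tree(items: List[Node]) -> Node:
    """Re-parent a flat, reading-ordered item list according to header levels."""
    root = Node("root", level=0)
    stack = [root]  # currently open headers, outermost first
    for item in items:
        if item.level is None:
            # Body items attach to the innermost open header.
            stack[-1].children.append(item)
            continue
        # Close headers at the same or a deeper level than the new one.
        while stack[-1].level >= item.level:
            stack.pop()
        stack[-1].children.append(item)
        stack.append(item)
    return root
```

Because every item is re-attached in a single pass, any structure the pipeline had built before (for example, items already grouped under a parent) is overwritten, which is the risk described above.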

I had once proposed a solution/hack that added a processing step to the standard pipeline after layout processing to apply the header levels from PDF metadata, but that didn't seem like a clean approach either, and it would have to be solved separately for the VLM pipeline.

I am looking forward to hearing your thoughts. I'm sure you have more/better ideas :)

PeterStaar-IBM · 2025-11-07T04:28:31Z

PeterStaar-IBM
Nov 7, 2025
Maintainer

@krrome Thanks for picking this up! Here is my suggestion:

Look at the ReadingOrderModel class:

Definition: link
Application: link

I think we could just add a new class HierarchyModel in ./docling/models/hierarchy.py

class HierarchyModel:

    def __call__(self, conv_res: ConversionResult) -> DoclingDocument:
        ...

and then apply it after applying the reading order model:

conv_res.document = self.reading_order_model(conv_res)
conv_res.document = self.hierarchy_model(conv_res)

PS: I am open to a name other than HierarchyModel.

FYI: @cau-git

1 reply

krrome Nov 9, 2025
Author

Thank you @PeterStaar-IBM for your quick response. The proposed solution is similar to a hack that I proposed here: #287 (comment). The problem I saw with it is that the StandardPdfPipeline is not the only way to convert PDFs to DoclingDocuments.
I am not familiar enough with all the optimisation options and "supported" ways to modify pipelines, so I'll need input from the docling team on that. I am aware of

docling/docling/pipeline/standard_pdf_pipeline.py

Line 371 in c21327c

class StandardPdfPipeline(ConvertPipeline):

and

docling/docling/pipeline/vlm_pipeline.py

Line 51 in c21327c

class VlmPipeline(PaginatedPipeline):

as ways to turn PDF files into DoclingDocuments using docling. Am I missing something?

If pipelines are modified individually, the inference of header hierarchy has to be included explicitly in each pipeline. From my point of view this makes sense if it is also consistent with the vision you have for document pipelines. I would then separate the inference part entirely from the integration into the pipelines to ensure modularity.
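One way to picture that separation, as a rough sketch only: the inference is a plain function with no pipeline dependencies, and a thin callable adapts it to the `__call__(conv_res)` shape suggested above. `Doc` and `Result` are hypothetical stand-ins for DoclingDocument and ConversionResult, and `infer_by_indent` is a toy strategy invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for DoclingDocument."""
    headers: List[str] = field(default_factory=list)
    levels: List[int] = field(default_factory=list)


@dataclass
class Result:
    """Hypothetical stand-in for ConversionResult."""
    document: Doc


# Pipeline-independent inference: header texts in, levels out.
LevelInference = Callable[[List[str]], List[int]]


def infer_by_indent(headers: List[str]) -> List[int]:
    """Toy strategy: two leading spaces per nesting level."""
    return [1 + (len(h) - len(h.lstrip(" "))) // 2 for h in headers]


class HierarchyModel:
    """Thin adapter that a pipeline could call after reading order."""

    def __init__(self, infer: LevelInference) -> None:
        self.infer = infer

    def __call__(self, conv_res: Result) -> Doc:
        doc = conv_res.document
        doc.levels = self.infer(doc.headers)
        return doc
```

With this split, a pipeline only needs the one adapter call, while the inference strategies can be tested and swapped without touching any pipeline code.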

Regarding extracting the PDF-ToC I just realised that I must have missed the fact that docling parse actually does have the functionality to extract ToC - I will obviously try to use that one instead of my current implementation.
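Once a ToC is available, assigning levels is essentially a lookup against the detected headers. The sketch below assumes the ToC arrives as `(level, title)` pairs; that shape is an assumption for illustration and not the format docling parse returns.

```python
import re
from typing import Dict, List, Optional, Tuple


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so ToC titles match parsed headers."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def levels_from_toc(
    headers: List[str], toc: List[Tuple[int, str]]
) -> List[Optional[int]]:
    """Return the ToC level for each header, or None when it is not listed."""
    lookup: Dict[str, int] = {}
    for level, title in toc:
        # Keep the first occurrence if a title appears more than once.
        lookup.setdefault(normalize(title), level)
    return [lookup.get(normalize(h)) for h in headers]
```

Headers that come back as `None` would still need one of the other heuristics (font size or numbering) as a fallback.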

PeterStaar-IBM · 2025-11-10T13:00:47Z

PeterStaar-IBM
Nov 10, 2025
Maintainer

@krrome I would only put the level updates on the StandardPdfPipeline. In principle, we want the VLM to predict the right level.


krrome · 2025-11-11T06:04:22Z

krrome
Nov 11, 2025
Author

Ok, I understand. I will at least start with the implementation for the StandardPdfPipeline. I recently pushed the HRDOC dataset through the VLM pipeline with default settings and will check whether applying header inference improves the conversion results or whether the currently available smoldocling VLM already predicts the right levels. I get that it is an asymptotic goal that VLMs will extract all hierarchy and text correctly on their own.

Of course you have to decide in the end what code and functionality you want to integrate into your codebase, I am just proposing different options.

I should have time to start work on the implementation by the end of this week.


krrome · 2025-11-24T19:26:40Z

krrome
Nov 24, 2025
Author

I have now finally found the time to finish a first draft of how I propose to integrate the hierarchy inference directly into the reading order model: #2676
I will keep working on it (fixing tests, extending to full functionality), but I would appreciate feedback now, since the draft already shows how I propose to integrate the full functionality.
