Recognize, capture, and correct underlying PDF parsing issues #180

@iross

The COSMOS stack depends on PDFMiner to parse text and layout information from each PDF, but it does not handle upstream parsing failures gracefully. For example, xDD document 645bdbb714b4ac75a2c0e11d appears to be a scanned PDF with a potentially invalid embedded text layer. The COSMOS pipeline fails silently on it, because it simply swallows the errors raised during page parsing:

...
13:52:27.594712 parse pdf ./input/645bdbb714b4ac75a2c0e11d.pdf
Traceback (most recent call last):
  File "make_parquet.py", line 763, in <module>
    stats = main_process(pdf_dir, page_info_dir, out_dir)
  File "make_parquet.py", line 714, in main_process
    meta, limit = parse_pdf(filename)
  File "/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
    interpreter.process_page(page)
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 854, in render_content
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 869, in execute
    name = keyword_name(obj).decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
root@2bdf8f4ccadc:/ingestion# rm input/

(output from running make_parquet.py in the ingest image)
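
The final line of the traceback can be reproduced in isolation, which helps confirm what kind of input triggers it. The sketch below assumes the offending token in the page's content stream contains the single byte 0x85; the actual token in this document has not been inspected.

```python
# Decoding a non-ASCII byte with the 'ascii' codec raises the same error
# that PDFMiner hits when it decodes a keyword from the content stream.
try:
    b"\x85".decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```

Any byte at or above 0x80 in a keyword would fail the same way, which is consistent with a corrupt or non-standard text layer.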

We should:

  • Propagate this exception upward so that it's clear why a PDF is failing
  • Attempt an in-place correction -- in this case, running Tesseract to embed a usable text layer in the PDF itself.
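
A minimal sketch of both steps, not the actual COSMOS code. `PDFParseError`, `parse_pages`, `ocr_in_place`, and `parse_with_repair` are hypothetical names introduced for illustration. The repair step assumes the `ocrmypdf` command-line tool (which drives Tesseract) is installed in the ingest image; that is an assumption, and any other OCR wrapper could be substituted.

```python
import shutil
import subprocess
from pathlib import Path


class PDFParseError(Exception):
    """Hypothetical wrapper that records which PDF and page failed."""

    def __init__(self, pdf_path, page_number, cause):
        super().__init__(f"{pdf_path}: page {page_number} failed to parse: {cause!r}")
        self.pdf_path = pdf_path
        self.page_number = page_number


def parse_pages(pdf_path, pages, process_page):
    """Process each page, re-raising any failure with file and page context."""
    for number, page in enumerate(pages, start=1):
        try:
            process_page(page)  # e.g. interpreter.process_page in pdf_extractor.py
        except Exception as exc:
            # Chain the original error so the PDFMiner traceback is preserved.
            raise PDFParseError(pdf_path, number, exc) from exc


def ocr_in_place(pdf_path):
    """Rebuild the text layer with ocrmypdf (assumed installed), then overwrite."""
    tmp = Path(pdf_path).with_suffix(".ocr.pdf")
    # --force-ocr rasterizes every page and discards the existing text layer.
    subprocess.run(["ocrmypdf", "--force-ocr", str(pdf_path), str(tmp)], check=True)
    shutil.move(str(tmp), str(pdf_path))


def parse_with_repair(pdf_path, parse, repair=ocr_in_place):
    """Call parse(pdf_path); on PDFParseError, repair once and retry."""
    try:
        return parse(pdf_path)
    except PDFParseError:
        repair(pdf_path)
        return parse(pdf_path)  # a second failure propagates to the caller
```

With this shape, a caller such as `main_process` would either get a parsed result or a `PDFParseError` naming the file and page, rather than a silently skipped document. Retrying only once keeps a permanently broken PDF from looping.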
