Recognize, capture, and correct underlying PDF parsing issues

The COSMOS stack depends on PDFMiner to parse information and text from each PDF, but it does not gracefully handle upstream parsing errors. For example, xDD document `645bdbb714b4ac75a2c0e11d` appears to be a scanned PDF with a potentially invalid embedded text layer. The COSMOS pipeline fails silently on it, because it simply swallows errors raised during page parsing:
```
...
13:52:27.594712 parse pdf ./input/645bdbb714b4ac75a2c0e11d.pdf
Traceback (most recent call last):
  File "make_parquet.py", line 763, in <module>
    stats = main_process(pdf_dir, page_info_dir, out_dir)
  File "make_parquet.py", line 714, in main_process
    meta, limit = parse_pdf(filename)
  File "/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
    interpreter.process_page(page)
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 854, in render_contents
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 869, in execute
    name = keyword_name(obj).decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
root@2bdf8f4ccadc:/ingestion# rm input/
```
(output from running `make_parquet.py` in the ingest image)

We should:
- [ ] Propagate this exception upward so that it's clear why a PDF is failing
- [ ] Try to do an in-place correction -- in this case, running Tesseract to embed a useful text layer at the PDF level.
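
A minimal sketch of how the two items above could fit together, under stated assumptions. `PdfParseError`, `parse_with_repair`, and `ocr_in_place` are hypothetical names, not existing COSMOS code. `parse_fn` stands in for whatever callable does the parsing (for example `parse_pdf` from the traceback). The repair step assumes the `ocrmypdf` command-line tool, which drives Tesseract, is installed in the ingest image; the issue only mentions Tesseract, so this is a swapped-in choice, and it may not cope with every malformed file.

```python
import logging
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger(__name__)


class PdfParseError(RuntimeError):
    """Hypothetical exception: records which PDF failed and why."""

    def __init__(self, pdf_path: Path, cause: BaseException):
        super().__init__(
            f"failed to parse {pdf_path}: {type(cause).__name__}: {cause}"
        )
        self.pdf_path = pdf_path


def ocr_in_place(pdf_path: Path) -> None:
    """Replace the PDF's text layer with fresh OCR output.

    Assumption: the `ocrmypdf` CLI is available on PATH.
    """
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf not found on PATH")
    tmp_path = pdf_path.with_name(pdf_path.stem + ".ocr.pdf")
    # --force-ocr rasterizes each page and discards any existing text layer.
    subprocess.run(
        ["ocrmypdf", "--force-ocr", str(pdf_path), str(tmp_path)],
        check=True,
    )
    tmp_path.replace(pdf_path)  # overwrite the original only after success


def parse_with_repair(
    pdf_path: Path,
    parse_fn: Callable[[Path], T],
    repair_fn: Optional[Callable[[Path], None]] = ocr_in_place,
) -> T:
    """Parse a PDF; on failure, try one repair pass, then raise with context."""
    try:
        return parse_fn(pdf_path)
    except Exception as first_error:
        if repair_fn is None:
            raise PdfParseError(pdf_path, first_error) from first_error
        log.warning("parse failed for %s (%r); attempting repair",
                    pdf_path, first_error)
    try:
        repair_fn(pdf_path)
        return parse_fn(pdf_path)
    except Exception as second_error:
        # Surface the failure instead of swallowing it.
        raise PdfParseError(pdf_path, second_error) from second_error
```

A caller such as `main_process` could catch `PdfParseError` per file, record the message in its stats, and continue with the next PDF, so a single bad document is reported without stopping the run.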

Recognize, capture, and correct underlying PDF parsing issues #180

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Recognize, capture, and correct underlying PDF parsing issues #180

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions