Open
Description
The COSMOS stack depends on PDFMiner to parse metadata and text from each PDF, but it does not handle upstream failures gracefully. For example, xDD document 645bdbb714b4ac75a2c0e11d appears to be a scanned PDF with a possibly invalid embedded text layer. The COSMOS pipeline fails silently on it, because it simply swallows errors raised during page parsing:
...
13:52:27.594712 parse pdf ./input/645bdbb714b4ac75a2c0e11d.pdf
Traceback (most recent call last):
  File "make_parquet.py", line 763, in <module>
    stats = main_process(pdf_dir, page_info_dir, out_dir)
  File "make_parquet.py", line 714, in main_process
    meta, limit = parse_pdf(filename)
  File "/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
    interpreter.process_page(page)
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 854, in render_content
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.8/dist-packages/pdfminer/pdfinterp.py", line 869, in execute
    name = keyword_name(obj).decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
root@2bdf8f4ccadc:/ingestion# rm input/
(output from running make_parquet.py in the ingest image)
We should:
- Propagate this exception upward so that it's clear why a PDF is failing
- Try an in-place correction -- in this case, running Tesseract to embed a usable text layer in the PDF itself.
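A rough sketch of how both points could fit together, not the actual COSMOS code. The names `PdfParseError`, `process_pages`, `ocr_in_place`, and `parse_with_repair` are hypothetical, and the OCR step assumes the `ocrmypdf` command-line tool (which drives Tesseract) is installed in the image; the issue itself only names Tesseract.

```python
import os
import subprocess


class PdfParseError(RuntimeError):
    """Hypothetical error type that records which file and page failed."""

    def __init__(self, pdf_path, page_number, cause):
        super().__init__(f"{pdf_path}: failed on page {page_number}: {cause!r}")
        self.pdf_path = pdf_path
        self.page_number = page_number


def process_pages(pdf_path, pages, process_page):
    """Run process_page on each page and re-raise failures with context."""
    for number, page in enumerate(pages, start=1):
        try:
            process_page(page)
        except Exception as exc:
            # Chain the original exception so the pdfminer traceback is kept.
            raise PdfParseError(pdf_path, number, exc) from exc


def ocr_in_place(pdf_path):
    """Replace the PDF's text layer via OCR (assumes the ocrmypdf CLI)."""
    tmp_path = pdf_path + ".ocr.pdf"
    # --force-ocr rasterizes every page and discards the existing text layer.
    subprocess.run(["ocrmypdf", "--force-ocr", pdf_path, tmp_path], check=True)
    os.replace(tmp_path, pdf_path)


def parse_with_repair(pdf_path, parse, repair=ocr_in_place):
    """Parse once; on a parse error, repair the file and retry a single time."""
    try:
        return parse(pdf_path)
    except PdfParseError:
        repair(pdf_path)
        # A second failure is not caught here, so it reaches the caller.
        return parse(pdf_path)
```

In `parse_pdf`, `process_page` would presumably be `interpreter.process_page` and `pages` the iterator from pdfminer's `PDFPage.get_pages(fp)`. Retrying only once keeps a PDF that OCR cannot fix from looping, and the chained exception makes the failing page visible in the logs.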