A user-friendly example for a scanned multipage PDF needed #67

@FilipDominec

Thank you for this interesting project, which seems to fit my needs exactly, but so far I could not make it work. In the README.md, there is an example command line, but its use is far from straightforward.

Running just recode_pdf --from-pdf scan.pdf --out-pdf TEST.pdf without any hOCR file throws a confusing AttributeError: 'NoneType' object has no attribute 'seek'. I reinstalled three different versions before coming here to report it as a bug.

Then I found another line in the README.md stating that "It is not possible to recode/compress a PDF without hOCR files". This is a crucial piece of information, but it is somewhat hidden. It is also not easy to find out how to generate the required file.

A Google search suggested that I can use tesseract scan.tif scan hocr to generate an hOCR file from a TIF. This would help for a single TIF file, but apparently tesseract does not accept PDF input.
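One possible workaround is to rasterize the PDF into one TIFF per page first and then run the tesseract command above on each page. The following is only a sketch under assumptions: it assumes the pdftoppm tool from poppler-utils is installed, the helper names (rasterize_command, hocr_command, pdf_to_hocr_pages) are hypothetical, and the output extension (.hocr) is what recent tesseract versions write and may differ on older ones. It yields one hOCR file per page; merging them into the single file that recode_pdf expects is not covered here.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm call that writes one TIFF per page as <prefix>-<n>.tif."""
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf_path), str(prefix)]


def hocr_command(tif_path):
    """Build the call from the issue text: tesseract page.tif page hocr."""
    tif_path = Path(tif_path)
    # The second argument is the output base name, without an extension.
    return ["tesseract", str(tif_path), str(tif_path.with_suffix("")), "hocr"]


def pdf_to_hocr_pages(pdf_path, workdir):
    """Rasterize every page of pdf_path and OCR each page to its own hOCR file.

    Assumption: pdftoppm and tesseract are both available on PATH.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, workdir / "page"), check=True)

    hocr_files = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for tif in sorted(workdir.glob("page-*.tif")):
        subprocess.run(hocr_command(tif), check=True)
        # Assumed output name; older tesseract versions may use another extension.
        hocr_files.append(tif.with_suffix(".hocr"))
    return hocr_files
```

Calling pdf_to_hocr_pages("scan.pdf", "work") would leave the page images and their hOCR files side by side in the work directory.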

I suggest that

  1. The README should contain a minimal working example for an ordinary computer-savvy user who followed the installation instructions and just wants to try recoding a scanned PDF file.
  2. The scripts should check for the hOCR file and, if it is missing, print a sensible message about it (and possibly how to generate it).
  3. If possible, such an hOCR file could even be auto-generated on the fly whenever it is not provided by the user.
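To illustrate suggestion 2, a check along the following lines could run before any recoding starts. This is a hypothetical sketch, not the project's actual code: the function name hocr_problem and the message wording are assumptions, and it only shows the idea of replacing the AttributeError with a readable hint.

```python
from pathlib import Path


def hocr_problem(hocr_path):
    """Return a readable message if hocr_path is unusable, otherwise None.

    Hypothetical helper: the name and the wording are illustrative only.
    """
    if hocr_path is None:
        return (
            "No hOCR file was given. Recoding a PDF requires one; "
            "for a single image it can be generated with: "
            "tesseract scan.tif scan hocr"
        )
    if not Path(hocr_path).is_file():
        return f"hOCR file not found: {hocr_path}"
    return None


if __name__ == "__main__":
    # Example: the situation from this issue, where no hOCR file was passed.
    message = hocr_problem(None)
    if message is not None:
        print(message)
```

A script could call such a check right after argument parsing and exit with the message instead of failing later with a traceback.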
