JPG and PNG images in the PDF #84

@tulas75

Initial Checks

  • I confirm that I'm on the latest version

Description

There seems to be a problem when a PDF file contains both PNG and JPG images: in that case the PNGs do not appear to be detected. An example PDF file is attached below.
op.pdf
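
To confirm which image encodings a PDF such as op.pdf actually contains before comparing against the parser output, a rough standard-library scan can help. This is only a sketch and not part of open-parse: the helper name `image_filters` is hypothetical, and it relies on the general PDF convention that JPEG data is usually stored with the `DCTDecode` filter while PNG images are typically re-encoded as `FlateDecode` streams. It ignores inline images and indirect `/Filter` references, and it says nothing about why the parser might skip an image.

```python
import re
from collections import Counter

# Crude patterns over raw PDF bytes; not a real PDF parser.
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
FILTER_RE = re.compile(rb"/Filter\s*\[?\s*/([A-Za-z0-9]+)")


def image_filters(pdf_bytes):
    """Count the first /Filter name of each image XObject found in the bytes."""
    counts = Counter()
    for match in OBJ_RE.finditer(pdf_bytes):
        # Keep only the dictionary part, before any binary stream data.
        header = match.group(1).split(b"stream", 1)[0]
        if not IMAGE_RE.search(header):
            continue
        found = FILTER_RE.search(header)
        counts[found.group(1).decode("ascii") if found else "none"] += 1
    return counts


# Hand-written stand-in for a PDF body: one JPEG-style and one Flate-style image.
sample = (
    b"1 0 obj\n<< /Type /XObject /Subtype /Image /Filter /DCTDecode /Length 4 >>\n"
    b"stream\nabcd\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /XObject /Subtype /Image /Filter /FlateDecode /Length 4 >>\n"
    b"stream\nabcd\nendstream\nendobj\n"
    b"3 0 obj\n<< /Type /Page >>\nendobj\n"
)
print(image_filters(sample))
```

On a real file, the bytes could be read with `open("op.pdf", "rb").read()` and passed to the same function.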

Here is the output of the sample code:

a8813dd5-1d88-4c42-9a0f-9a8d7149d3b6
['text']
List of metropolitan areas in Europe
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

6f2a69c7-33a3-414e-aeba-20a03b78fc68
['text']
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

e43c1deb-d81e-4685-a773-89ee5368f65e
['image', 'text']
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Number of chunks: 3

Example Code

import json

import openparse

basic_doc_path = "op.pdf"

# Parse tables with the PyMuPDF backend and emit them as markdown.
parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "pymupdf",
        "table_output_format": "markdown",
    }
)
parsed_basic_doc = parser.parse(basic_doc_path)

# Round-trip through JSON to get plain dicts for each node.
chunks = json.loads(parsed_basic_doc.model_dump_json())
for node in chunks["nodes"]:
    print(node["node_id"])
    print(node["variant"])
    print(node["text"])

print("Number of chunks:", len(parsed_basic_doc.nodes))
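
To make the symptom easier to see at a glance, the variants in the `nodes` list of `chunks` can be tallied. The following is a minimal sketch: `count_variants` is a hypothetical helper, and the sample data is a hand-written stand-in that mirrors the variants printed in the output above rather than a real parse result. Since each node seems to list a variant at most once, this counts nodes that contain an image, not the images themselves.

```python
from collections import Counter


def count_variants(nodes):
    """Tally how often each variant label appears across dumped nodes."""
    counts = Counter()
    for node in nodes:
        # Each node's 'variant' is a list such as ['text'] or ['image', 'text'].
        counts.update(node.get("variant", []))
    return counts


# Stand-in for chunks["nodes"], reduced to the field used here.
sample_nodes = [
    {"variant": ["text"]},
    {"variant": ["text"]},
    {"variant": ["image", "text"]},
]
print(count_variants(sample_nodes))
```

With the real data, `count_variants(chunks["nodes"])` would give the same kind of summary for the parsed document.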

Python, open-parse & OS Version

python_version: 3.11.2
             operating_system: Linux
                   os_version: 6.1.0-27-amd64
           open-parse version: 0.7.0
                 install path: /home/tulas/Projects/tmpop/env/lib/python3.11/site-packages/openparse
               python version: 3.11.2 (main, Sep 14 2024, 03:00:30) [GCC 12.2.0]
                     platform: Linux-6.1.0-27-amd64-x86_64-with-glibc2.36
             related packages: pydantic-2.9.2 PyMuPDF-1.24.13 tokenizers-0.19.1

Labels

bug (Something isn't working)
