-
Notifications
You must be signed in to change notification settings - Fork 140
Description
Initial Checks
- I confirm that I'm on the latest version
Description
It seems that there's a problem if the pdf file contains both pngs and jpgs. In these case it seems that pngs cannot be detected. Here's below an example pdf file.
op.pdf
Here's below the output of the sample code
a8813dd5-1d88-4c42-9a0f-9a8d7149d3b6
['text']
List of metropolitan areas in Europe
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
6f2a69c7-33a3-414e-aeba-20a03b78fc68
['text']
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
e43c1deb-d81e-4685-a773-89ee5368f65e
['image', 'text']
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Number of chunks: 3
Example Code
import openparse
import json
basic_doc_path = "op.pdf"
parser = openparse.DocumentParser(
table_args={
"parsing_algorithm": "pymupdf",
"table_output_format": "markdown"
}
)
parsed_basic_doc = parser.parse(basic_doc_path)
chunks = parsed_basic_doc.model_dump_json()
chunks = json.loads(chunks)
for node in chunks['nodes']:
print(node['node_id'])
print(node['variant'])
print(node['text'])
print('Number of chunks:', len(parsed_basic_doc.nodes))Python, open-parse & OS Version
python_version: 3.11.2
operating_system: Linux
os_version: 6.1.0-27-amd64
open-parse version: 0.7.0
install path: /home/tulas/Projects/tmpop/env/lib/python3.11/site-packages/openparse
python version: 3.11.2 (main, Sep 14 2024, 03:00:30) [GCC 12.2.0]
platform: Linux-6.1.0-27-amd64-x86_64-with-glibc2.36
related packages: pydantic-2.9.2 PyMuPDF-1.24.13 tokenizers-0.19.1