fix: remove duplicate characters caused by fake bold rendering in PDFs #4215

Open
bittoby wants to merge 13 commits into Unstructured-IO:main from bittoby:fix/remove-pdf-bold-text-duplication

Conversation

@bittoby bittoby commented Jan 28, 2026

Closes #3864

Summary

  • Fixes an issue where bold text in PDFs is extracted with duplicate characters (e.g., "BOLD" → "BBOOLLDD")
  • Some PDF generators simulate bold by rendering each character twice at slightly offset positions
  • Added character-level deduplication based on position proximity to detect and remove these duplicates

Problem

When extracting text from certain PDFs, bold text appears duplicated:

# Before fix
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"

Solution

Added character-level deduplication that:

  • Compares consecutive characters' text content and position
  • Removes a duplicate when the same character appears again within 3 pixels of the previous one (configurable)
  • Preserves spaces and other non-character elements (LTAnno objects)
# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
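
The deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Char` class and the `dedupe_fake_bold` function are hypothetical names, `Char` stands in for a positioned PDF character, and the strict `<` comparison against the threshold is an assumption. Handling of spaces and LTAnno objects is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical stand-in for a positioned PDF character."""
    text: str
    x0: float
    y0: float


def dedupe_fake_bold(chars: List[Char], threshold: float = 3.0) -> List[Char]:
    """Drop a character that repeats the previously kept one at nearly the same position."""
    if threshold <= 0:
        return list(chars)  # a threshold of 0 disables deduplication
    kept: List[Char] = []
    for ch in chars:
        prev = kept[-1] if kept else None
        if (
            prev is not None
            and ch.text == prev.text
            and abs(ch.x0 - prev.x0) < threshold  # assumption: strict comparison
            and abs(ch.y0 - prev.y0) < threshold
        ):
            continue  # same glyph drawn twice with a small offset: skip the copy
        kept.append(ch)
    return kept


# Each letter of "BOLD" drawn twice, the second copy shifted by 0.5 units.
fake_bold = [
    Char(c, x + dx, 100.0)
    for x, c in zip((0.0, 10.0, 20.0, 30.0), "BOLD")
    for dx in (0.0, 0.5)
]
print("".join(c.text for c in dedupe_fake_bold(fake_bold)))  # prints "BOLD"
```

Comparing positions as well as text is what keeps genuine double letters intact: the two "L"s in "HELLO" sit a full character width apart, so they fall outside the threshold and both survive.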

Configuration

# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
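
For illustration, a setting like this could be read on the Python side roughly as below. Only the variable name, the 3.0 default, and the "0 disables" behavior come from the description above. The helper name `read_char_duplicate_threshold` is hypothetical, and falling back to the default on unparsable input and clamping negative values to 0 are both assumptions.

```python
import os
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 3.0  # default stated in the PR description


def read_char_duplicate_threshold(environ: Optional[Mapping[str, str]] = None) -> float:
    """Return the deduplication threshold; 0.0 means deduplication is disabled."""
    env = os.environ if environ is None else environ
    raw = env.get("PDF_CHAR_DUPLICATE_THRESHOLD")
    if raw is None or raw.strip() == "":
        return DEFAULT_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD  # assumption: ignore unparsable values
    return max(value, 0.0)  # assumption: treat negative values as "disabled"


print(read_char_duplicate_threshold({"PDF_CHAR_DUPLICATE_THRESHOLD": "5.0"}))  # prints 5.0
```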

@bittoby
Author

bittoby commented Jan 28, 2026

@badGarnet Could you please review this PR? Thanks!

@badGarnet
Collaborator

> @badGarnet Could you please review this PR? Thanks!

Thanks for contributing! I would suggest finding an example PDF that has this kind of issue and adding a test that uses it. The code reads fine to me, but it would be good to test on an actual file.

@bittoby
Author

bittoby commented Jan 30, 2026

@badGarnet
I added an example PDF (example-docs/pdf/fake-bold-sample.pdf) and a test script (diagnose_fake_bold.py) for diagnosing fake bold text.
Please review and test again. Thank you.

Comment on lines 281 to 284
assert len(text_with_dedup) <= len(text_no_dedup), (
    f"Deduplicated text ({len(text_with_dedup)} chars) should not be longer "
    f"than non-deduplicated text ({len(text_no_dedup)} chars)"
)
Collaborator

A better assert would:

  • check the exact expected text length
  • check that text_no_dedup contains duplicated characters (like "bboolldd") and that text_with_dedup contains the normal text (like "bold")
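
The stronger assertions suggested here might look something like the sketch below. The helper name `check_dedup_output` and the sample strings are hypothetical; in the real test the two texts would come from partitioning the sample PDF with deduplication disabled and enabled, and the expected length would be taken from that file.

```python
def check_dedup_output(
    text_no_dedup: str, text_with_dedup: str, word: str, expected_length: int
) -> None:
    """Assert that `word` appears doubled without dedup and clean with dedup."""
    doubled = "".join(ch * 2 for ch in word)  # "BOLD" -> "BBOOLLDD"
    assert doubled in text_no_dedup, f"expected {doubled!r} in the raw text"
    assert word in text_with_dedup, f"expected {word!r} in the deduplicated text"
    assert doubled not in text_with_dedup, f"{doubled!r} survived deduplication"
    assert len(text_with_dedup) == expected_length, (
        f"expected {expected_length} chars, got {len(text_with_dedup)}"
    )


# Illustrative strings only; they are not taken from the sample PDF.
check_dedup_output("BBOOLLDD text", "BOLD text", word="BOLD", expected_length=9)
```

Unlike the `<=` length comparison, these checks fail if deduplication silently does nothing or removes too much.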

@@ -0,0 +1,69 @@
"""Diagnostic script to verify fake-bold PDF deduplication is working."""
Collaborator

A test against the new file is good enough; we don't need to add a script to the root directory for this case.

@bittoby
Author

bittoby commented Feb 2, 2026

@badGarnet Thanks for your feedback. I've updated the PR. Could you please review it again and confirm that it now matches what you asked for? Thanks again!

@bittoby bittoby force-pushed the fix/remove-pdf-bold-text-duplication branch from 29d32e5 to 355e925 Compare February 3, 2026 17:48
@bittoby
Author

bittoby commented Feb 4, 2026

Hi @badGarnet. I've updated everything. I hope you can merge this when you have a sec.

@badGarnet badGarnet enabled auto-merge February 5, 2026 16:50
auto-merge was automatically disabled February 5, 2026 17:39

Head branch was pushed to by a user without write access

@bittoby
Author

bittoby commented Feb 5, 2026

@badGarnet Thanks for the approval. Could you merge the PR?

@bittoby
Author

bittoby commented Feb 5, 2026

Sorry for tagging you again, @badGarnet. I ran into a linting error, so I updated the code and pushed a new commit. Could you please review it again and merge? Thanks.

Collaborator

@badGarnet badGarnet left a comment

Please update the changelog and move your entry to the appropriate section; please also bump the version number.

@bittoby
Author

bittoby commented Feb 6, 2026

I updated the changelog and bumped the version number.

@bittoby bittoby requested a review from badGarnet February 6, 2026 00:40
CHANGELOG.md Outdated
- **Add `group_elements_by_parent_id` utility function**: Groups elements by their `parent_id` metadata field for easier document hierarchy traversal (fixes #1489)

### Fixes
- **Fix duplicate characters in PDF bold text extraction**: Some PDFs render bold text by drawing each character twice at slightly offset positions, causing text like "BOLD" to be extracted as "BBOOLLDD". Added character-level deduplication based on position proximity. Configurable via `PDF_CHAR_DUPLICATE_THRESHOLD` environment variable (default: 3.0 pixels, set to 0 to disable) (fixes #3864).
Collaborator

delete?

@@ -1 +1 @@
## 0.18.35-dev0
Collaborator

please bump here as well

@bittoby
Author

bittoby commented Feb 6, 2026

Sorry, @badGarnet - I misunderstood. I’ve now updated CHANGELOG.md and bumped the version correctly. Could you please check again? Thanks for taking a look.

@bittoby bittoby requested a review from badGarnet February 6, 2026 16:30
@bittoby
Author

bittoby commented Feb 6, 2026

@badGarnet Thanks for approving. 👍 Could you merge this PR?

Development

Successfully merging this pull request may close these issues.

bug/Bold characters get repeated while extracting

2 participants