Skip to content

rework numeric tokenizer hot path#1104

Merged
BobLd merged 1 commit intomasterfrom
improve-numeric-tokenizer-performance
Jul 25, 2025
Merged

rework numeric tokenizer hot path#1104
BobLd merged 1 commit intomasterfrom
improve-numeric-tokenizer-performance

Conversation

@EliotJones
Copy link
Member

the existing numeric tokenizer involved allocations and string parsing. since the number formats in pdf files are fairly predictable we can improve this substantially

this ends up being somewhere between 2-3 times faster in my benchmarks on a subset of numeric data from PDF files:

| Method | Mean      | Error     | StdDev    |
|------- |----------:|----------:|----------:|
| PigOld | 13.684 us | 0.2333 us | 0.2182 us |
| PigNew |  5.963 us | 0.0990 us | 0.0877 us |

| Method | Mean      | Error     | StdDev    |
|------- |----------:|----------:|----------:|
| PigOld | 14.806 us | 0.2098 us | 0.1962 us |
| PigNew |  6.230 us | 0.1205 us | 0.1127 us |

Based on tracing for opening ~350 documents the tokenize method here was called approximately 16.2 million times so is definitely a hot path. Real world performance impact may be different.

the existing numeric tokenizer involved allocations and string parsing. since
the number formats in pdf files are fairly predictable we can improve this
substantially
@EliotJones EliotJones requested a review from BobLd July 25, 2025 02:21
@EliotJones
Copy link
Member Author

EliotJones commented Jul 25, 2025

@BobLd I'm planning a 0.1.11 full release as follows:

@BobLd
Copy link
Collaborator

BobLd commented Jul 25, 2025

@EliotJones thanks a lot for this change, looks great!

Regarding 0.1.11, we are on the same page - agreed with your plan.

@BobLd BobLd merged commit 85fc63d into master Jul 25, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants