Optimize performance of blocklist filtering and checking by using Regex#17
Optimize performance of blocklist filtering and checking by using Regex#174kimov merged 6 commits intosqids:mainfrom
Conversation
9559ba4 to
8b1d826
Compare
src/Sqids.php
Outdated
| if ($id == $word) { | ||
| return true; | ||
| } | ||
| } elseif (preg_match('/~[0-9]+~/', (string) $word)) { |
There was a problem hiding this comment.
This regex is not matching anything as the words never contain the tilde ~ char.
src/Sqids.php
Outdated
|
|
||
| protected MathInterface $math; | ||
|
|
||
| protected ?string $blocklist = null; |
There was a problem hiding this comment.
changing the type of a protected property (from array to ?string is a BC break. I don't know what is the Backward Compatiblity policy of this project (I got here because of your toot about performance improvements, by curiosity) but it might make sense to be care about this.
There was a problem hiding this comment.
You are right, I reverted to use an other property name and leave this one. Even if I don't see any reason this class would be extended. It should be final, it has an interface.
There was a problem hiding this comment.
You're right @stof, we don't want to introduce any breaking changes.
There was a problem hiding this comment.
If anyone extends the class and updates $this->blocklist after calling the constructor, then this change will not be used later in the id generator.
But this is already a misuse, as it bypasses the filter.
|
@GromNaN I gotta admit, I'm a bit confused by this PR. A few questions:
|
dc0198b to
67d951e
Compare
Removing the words invalid chars from the blocklist is an optimization, not a feature. Even after optimization (step 1), this task is more costly during class initialization, for an insignificant benefit when an ID is generated.
If I've missed anything, it's that it's not covered by the tests. My understanding is that you block any generated short ID that contains a blocked word. Whether the id "starts with", "ends with" or "is equals to" the word is covered by the "contains" verification done using the regex.
That was expected. I rebased the PR. |
|
You're right, the tests for blocklist logic should cover more scenarios. I'm not sure just the contains logic would mimic what the spec does. Here's one recent discussion about this: sqids/sqids-javascript#30 Another example is: only if the word is bigger than 3 chars and it contains numbers and it starts with or ends with blocked word, then we block it. So id like |
|
Currently, there is a Lines 818 to 819 in 9390a85 The regex match is always Lines 814 to 816 in 9390a85 This condition on id or word smaller that 3 is not tested. If you find a test, I could fix the code. |
|
Thanks for pointing out the PS: Updating tests for PHP is also not as straightforward since they're the same across all implementations. |
|
Added more tests here (explanation for the tests is in the spec). Also, changed the regex for ints as discussed previously. If anybody sees anything else worth testing in @GromNaN Could you run this PR against the new tests? |
ac5e827 to
e335cb0
Compare
|
Thanks for the tests. I updated the regex to use start |
|
Thank you @GromNaN, that was fast work :) Merged! |
|
It was a good collaboration. I think this implementation using regex can improve performance in other languages too, but I wouldn't do it. |
|
Indeed. I'm going to wait a few days before publishing the new version. I know we have a breaking change with that Thanks for all the contributions, it was a lot and much appreciated. I see you're a PHP expert. I gave you write access to the repo in case you notice something, and I'm too slow to respond (that way I'm not blocking you). |
A single call to
preg_matchcan replace a lot of lines of code, and is executed in optimized C code instead of PHP.1.
Blocklist filterThe regex/^[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]+$/itell if all the characters of the blocked work are in the alphabet.(cancelled because I removed the blocklist filter in 3).
2. Check if the ID contains a blocked word
A regex is generated with all the blocked words
/(1d10t|b0ob)/iand run with in case-insensitive mode.3. Apply leet transformation to blocked word
The blocklist if full of leet variations of the same words. Using regex, we can check directly for alternative way of writting the same word.
/(ahole)/ibecomes/(ah[oO][l1]e)/ito checkah0le,aho1e,ah01eand all other case variations.Benchmark
PHPBench code
In
phpbench.json{ "$schema": "./vendor/phpbench/phpbench/phpbench.schema.json", "runner.bootstrap": "vendor/autoload.php", "runner.file_pattern": "*Bench.php", "runner.path": "tests", "runner.iterations": 3 }In
tests/SqidsBench.phpBefore
After 2
After 3
After rebase on #18