I spent some time today benchmarking the library and building a toy/useless Markdown parser with it, so here's some miscellaneous feedback after having interacted some more with the library. Feel free to close this and perhaps open standalone issues for the parts you think are worth addressing.
For the Markdown parser thing I was trying to write a matcher that matched headings, and I had some problems with that:
- I wanted to get the leading hashes, trailing hashes, and the content in between as individual strings in the tag to be transformed. That immediately implies I have to use multiple regexes, because capturing groups within a single regex are lost in the tag. Perhaps this should be changed somehow, as that would be quite powerful.
- Not being able to use a single regex in this scenario also means I can't use `\1`, `\2`, etc. to reference other capturing groups. In my headings scenario the trailing hashes should really be considered trailing hashes only if there are exactly as many of them as leading hashes; otherwise they should be considered part of the body. This can't quite be expressed cleanly with the current system, because the first capturing group/matcher can't be referenced.
  - Addressing the first issue would kind of address this too.
- Another option would be to extend the DSL with support for `\[0-9]` references, which in this case would mean referencing the 1st, 2nd, ... 9th whole sub-matcher.
  - Perhaps both options should be implemented, I'm not sure.
- Continuing with the same scenario, there's another issue once the standalone regex gets broken up:
  - Standalone: `${/(#{1,6} )(.+?)( #{1,6})/}`
  - Broken-up: `${/#{1,6} /} ${/.+?/} ${/ #{1,6}/}`
  - I forgot to take note of what the issue was exactly (🤦‍♂️), but unless I'm misremembering, those two expressions don't quite match the same things, because the lazy modifier in the broken-up version doesn't behave the same way.
- Custom regex flags are discarded. It would be nice in some scenarios to be able to write a case-insensitive regex, or a regex where `^` and `$` match the start and end of the line respectively, for example.
- The DSL to me looks kind of like a reimplementation of a subset of the regex language, so perhaps it should be extended a bit to match it more closely; for example, quantifiers like `{1,3}` should perhaps be supported too.
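To illustrate the backreference point with plain JavaScript RegExps (outside of RegHex, since the library can't express this today), a `\1` inside a single pattern can encode the "trailing hashes must repeat the leading hashes exactly" constraint:

```javascript
// Plain-RegExp sketch (not RegHex syntax): match an ATX-style heading where
// the optional trailing hashes must repeat the leading hashes exactly.
const heading = /^(#{1,6}) (.+?)(?: \1)?$/;

// Trailing "##" repeats the leading "##", so it's stripped from the body.
const ok = "## Title ##".match(heading);
console.log(ok[1]); // "##"
console.log(ok[2]); // "Title"

// Trailing "###" doesn't repeat "##" exactly, so via the backreference
// it ends up being treated as part of the body instead.
const mismatch = "## Title ###".match(heading);
console.log(mismatch[2]); // "Title ###"
```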
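The laziness mismatch between the standalone and broken-up versions can also be reproduced with plain RegExps: inside the combined pattern the trailing group forces `.+?` to keep expanding, but matched on its own it stops at the first character:

```javascript
// Combined: the trailing ` #{1,6}` forces the lazy `.+?` to expand.
const combined = "## Title ##".match(/(#{1,6} )(.+?)( #{1,6})/);
console.log(combined[2]); // "Title"

// Split apart: nothing follows `.+?` in its own regex, so it happily
// matches a single character and leaves the rest for the next matcher.
const alone = "Title ##".match(/.+?/);
console.log(alone[0]); // "T"
```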
Now about performance, perhaps the more interesting part of the feedback.
From what I've seen, every ~atomic thing the library does is pretty fast, so there shouldn't be any meaningful micro-optimizations available. The root issue seems to be that the library spends too much time on the wrong alternations.
Some actual real numbers first so that the rest of the feedback sounds less crazy:
- I've been benchmarking the library with this.zip. Basically there's a parser that matches against a subset of JavaScript, and it is asked to match a bunch of expressions in a loop.
- Making this 14-character diff to a single matcher of the parser cut the time it took to run the benchmark by ~40%:

```diff
- = $`${LogicalORExpression} ${_} ${TernaryOperatorTrue} ${Expression} ${TernaryOperatorFalse} ${Expression}`; // Slow
+ = $`${/(?=.*\?)/} ${LogicalORExpression} ${_} ${TernaryOperatorTrue} ${Expression} ${TernaryOperatorFalse} ${Expression}`; // Fast
```
- Basically this is what's happening:
- This rule matches a ternary expression.
- Most expressions in the benchmark aren't ternary expressions.
- Still, those expressions could be the boolean condition of a ternary expression.
- So the parser parses the entire thing, until it realizes the required "?" character for the ternary expression can't be found.
- So it backtracks and eventually matches the entire thing again with another matcher.
- That "again" is where the almost-doubled cost comes from, hence the changed rule.
- The only thing the changed rule does is check whether the "?" character is present before running any of the other gazillion matchers.
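The guard itself is just a zero-width check, so the idea can be sketched with plain RegExps. Here `expensiveTernaryMatch` is a hypothetical stand-in for the chain of sub-matchers, not RegHex's actual machinery:

```javascript
// A cheap zero-width guard: fails immediately when the input can't
// possibly contain a ternary expression, skipping the expensive work.
const guard = /(?=.*\?)/;

// Hypothetical stand-in for the expensive ternary matcher.
function expensiveTernaryMatch(input) {
  // ...imagine a gazillion sub-matchers running here...
  return /^(.+?)\s*\?\s*(.+?)\s*:\s*(.+)$/.exec(input);
}

function matchTernary(input) {
  if (!guard.test(input)) return null; // bail out early: no "?" present
  return expensiveTernaryMatch(input);
}

console.log(matchTernary("a + b"));        // null, guard failed fast
console.log(matchTernary("a ? b : c")[2]); // "b"
```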
That's kind of the root of the performance problems with RegHex parsers, in my opinion. If I had to guess, with enough sophistication some parsers could become 100x faster or more just by going down branches/alternations more intelligently.
At a high level, RegHex parsers look to me kinda like CPUs: individual patterns are like instructions, each alternation is a branch, etc. It should follow then that the same optimizations used for CPUs could/should be used for RegHex. I know next to nothing about that really, but just to mention some potential things that crossed my mind:
- In my ternary expression matcher above there is a series of instructions that must all succeed for the entire thing to be considered a match, but not all instructions are the same performance-wise: checking whether the string contains a "?" is waaaaay faster than checking whether it starts with something that could be the boolean condition of the ternary operator. The fastest checks should be performed first.
- This optimization specifically could be performed at the parser level with a lookahead; that works, but it's kinda ugly.
- Another, more general approach would be to analyze the regexes, probably at build time so that how long it takes doesn't matter, and extract subsets of the automata that must match under every branch and are easy to check for, like the presence of "?" and ":", in that order, which my ternary expression matcher requires in the input string.
- Next, perhaps a branch predictor could be added: the order in which alternations are tried matters for performance, and if the 9th alternation in a set of 10 alternations is the one that matched the most in the past, perhaps it should be tried first, and most of the time we can skip failing to match the first 8 alternations altogether.
- This could be pretty tricky to optimize for automatically, because you need to know for which chunks of the alternations array the order doesn't matter. Maybe some API could be exposed to just move the problem to the user, like an "isParallelizable" third argument to the match function or something.
- This is kinda out there, but in some sense RegHex took the individual automata I wrote (~the matchers) and built a larger one by putting the smaller ones in a tree (~the final parser). Currently, if I don't add the lookahead for "?", the parser goes down the branch for the ternary expression, matches a whole lot of stuff, then realizes the required "?" character can't be found and goes back to the starting point to take another branch. It might be interesting to note that there are multiple branches of the tree here that look exactly the same for a while, so they should kind of get merged together into a prefix tree. That way, once the ternary expression ultimately failed to match, RegHex wouldn't go back to the very root of the tree but would take the nearest unexplored branch first instead, which in this particular case would lead to finding a match almost immediately (e.g. "so far that was an OR expression -> there are no more characters left in the input string -> done").
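A minimal sketch of the branch-predictor idea (all names here are hypothetical, not RegHex's actual API): keep a hit counter per alternation and try the historically most successful ones first. Note this is only safe for chunks of the alternations array where order doesn't affect which alternation matches, which is exactly the "isParallelizable" caveat above:

```javascript
// Hypothetical sketch: wrap a list of matcher functions so that the
// ones that matched most often in the past are tried first.
function predictedAlternation(matchers) {
  const hits = new Array(matchers.length).fill(0);
  const order = matchers.map((_, i) => i);

  return function match(input) {
    // Try alternations in descending order of past hit counts.
    order.sort((a, b) => hits[b] - hits[a]);
    for (const i of order) {
      const result = matchers[i](input);
      if (result !== null) {
        hits[i]++;
        return result;
      }
    }
    return null;
  };
}

// Toy matchers standing in for RegHex sub-matchers.
const matchNumber = (s) => (/^\d+$/.test(s) ? "number" : null);
const matchIdentifier = (s) => (/^[a-z]+$/i.test(s) ? "identifier" : null);

const expression = predictedAlternation([matchNumber, matchIdentifier]);
console.log(expression("foo")); // "identifier"
console.log(expression("42"));  // "number"
```

After a few calls dominated by identifiers, `matchIdentifier` would be tried first and the failing `matchNumber` attempts would be skipped on the hot path.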
Depending on how many of these fancy optimizations you're willing to spend time on, perhaps a seriously fast JS parser could be written on top of RegHex 🤔. That'd be really cool.
Sorry for the long post, hopefully there's some useful feedback in here.