[feat](query_v2) Add PrefixQuery, PhrasePrefixQuery and UnionPostings support #60701
zzzxl1993 wants to merge 1 commit into apache:master
run buildall
TPC-H: Total hot run time: 29932 ms
TPC-DS: Total hot run time: 190898 ms
ClickBench: Total hot run time: 28.51 s

run buildall
TPC-H: Total hot run time: 30923 ms
TPC-DS: Total hot run time: 188542 ms
ClickBench: Total hot run time: 28.34 s
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Force-pushed from 75af3e8 to b3d60df.

run buildall
TPC-H: Total hot run time: 28990 ms
TPC-DS: Total hot run time: 185779 ms
    }

    auto bit_set = std::make_shared<BitSetScorer>(doc_bitset);
    auto const_score = std::make_shared<ConstScoreScorer<BitSetScorerPtr>>(std::move(bit_set));
Why use ConstScoreScorer here? Is scoring not supported?
Prefix queries don't support scoring in the traditional sense. They expand to multiple terms (e.g., "pre" matches "prefix", "prepare", "present", etc.), and scoring each expanded term individually doesn't make semantic sense - the user is searching for a prefix pattern, not independent terms. That's why I use ConstScoreScorer to give all matching documents the same constant score. This is standard for prefix/fuzzy/wildcard queries.
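To make the constant-score idea concrete, here is a minimal sketch of a wrapper that reports one fixed score for every matching document. It is an assumption-laden illustration, not the Doris implementation: `SortedDocSet`, `ConstScoreSketch`, `collect_const_scores`, and the `kTerminated` sentinel are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sentinel meaning "no more documents" (not taken from the Doris sources).
constexpr uint32_t kTerminated = std::numeric_limits<uint32_t>::max();

// Illustrative doc set: iterates over a sorted list of matching doc ids.
class SortedDocSet {
public:
    explicit SortedDocSet(std::vector<uint32_t> docs) : _docs(std::move(docs)) {}

    uint32_t doc() const { return _pos < _docs.size() ? _docs[_pos] : kTerminated; }

    uint32_t advance() {
        if (_pos < _docs.size()) {
            ++_pos;
        }
        return doc();
    }

private:
    std::vector<uint32_t> _docs;
    std::size_t _pos = 0;
};

// Wraps any doc set and returns the same score for every match, so the
// expanded terms behind a prefix never contribute individual weights.
template <typename DocSetPtr>
class ConstScoreSketch {
public:
    explicit ConstScoreSketch(DocSetPtr inner, float score = 1.0F)
            : _inner(std::move(inner)), _score(score) {
        assert(_inner != nullptr);
    }

    uint32_t doc() const { return _inner->doc(); }
    uint32_t advance() { return _inner->advance(); }
    float score() const { return _score; }

private:
    DocSetPtr _inner;
    float _score;
};

// Drains the scorer and returns (doc id, score) pairs in doc-id order.
inline std::vector<std::pair<uint32_t, float>> collect_const_scores(std::vector<uint32_t> docs) {
    ConstScoreSketch<std::shared_ptr<SortedDocSet>> scorer(
            std::make_shared<SortedDocSet>(std::move(docs)));
    std::vector<std::pair<uint32_t, float>> out;
    for (uint32_t d = scorer.doc(); d != kTerminated; d = scorer.advance()) {
        out.emplace_back(d, scorer.score());
    }
    return out;
}
```

In this sketch a document matched via "prefix" and one matched via "prepare" end up with the same score, which is the behavior described in the reply above.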
        }
    }

    uint32_t advance() override {
Would a priority-queue (min-heap) optimization for advance() be worthwhile?
I considered using a priority queue. Based on our dataset testing, the current linear scan performs well for typical query patterns. The simplicity and cache-friendly nature of linear scan outweigh the theoretical O(log n) advantage of heaps in practice. If profiling shows this becomes a bottleneck in the future, we can optimize with a hybrid approach (e.g., use heap when n > threshold).
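For readers following the trade-off, a linear-scan union over sorted posting lists might look roughly like the sketch below. It assumes each list is sorted and free of duplicates; `UnionPostingsSketch`, `drain_union`, and `kUnionTerminated` are hypothetical names for this example and do not mirror the actual UnionPostings code in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sentinel meaning "no more documents".
constexpr uint32_t kUnionTerminated = std::numeric_limits<uint32_t>::max();

// Union of several sorted posting lists using a linear scan over the list heads.
class UnionPostingsSketch {
public:
    explicit UnionPostingsSketch(std::vector<std::vector<uint32_t>> lists)
            : _lists(std::move(lists)), _pos(_lists.size(), 0) {
        for (const auto& list : _lists) {
            assert(std::is_sorted(list.begin(), list.end()));
        }
        _doc = find_min();
    }

    uint32_t doc() const { return _doc; }

    // Step every list that currently sits on _doc, then rescan for the new minimum.
    // Each call costs O(n) in the number of lists; a min-heap would pay O(log n)
    // per stepped list instead, at the price of extra bookkeeping.
    uint32_t advance() {
        if (_doc == kUnionTerminated) {
            return _doc;
        }
        for (std::size_t i = 0; i < _lists.size(); ++i) {
            if (head(i) == _doc) {
                ++_pos[i];
            }
        }
        _doc = find_min();
        return _doc;
    }

private:
    uint32_t head(std::size_t i) const {
        return _pos[i] < _lists[i].size() ? _lists[i][_pos[i]] : kUnionTerminated;
    }

    uint32_t find_min() const {
        uint32_t best = kUnionTerminated;
        for (std::size_t i = 0; i < _lists.size(); ++i) {
            best = std::min(best, head(i));
        }
        return best;
    }

    std::vector<std::vector<uint32_t>> _lists;
    std::vector<std::size_t> _pos;
    uint32_t _doc = kUnionTerminated;
};

// Drains the union and returns the deduplicated doc ids in ascending order.
inline std::vector<uint32_t> drain_union(std::vector<std::vector<uint32_t>> lists) {
    UnionPostingsSketch postings(std::move(lists));
    std::vector<uint32_t> out;
    for (uint32_t d = postings.doc(); d != kUnionTerminated; d = postings.advance()) {
        out.push_back(d);
    }
    return out;
}
```

The heads live in contiguous vectors, which is the cache-friendliness argument made above; whether that beats a heap depends on how many terms a prefix expands to, so the threshold for a hybrid would have to come from profiling.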
    }

    // Only prefix term, no phrase terms - fall back to a plain prefix query.
    PrefixQuery prefix_query(_context, std::move(_field), std::move(_prefix.value().second));
The std::move is intentional here - both code paths in weight() (line 59 and line 80) consume these members and return immediately, so the object won't be used after this call. This avoids unnecessary copies of the string data. But I can change to copy if you think it's clearer.
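One way to make this "consume the members once" contract visible in the type system is an rvalue-qualified method. The sketch below is only an illustration of that option, not what the PR does: `PhrasePrefixSketch`, `PrefixSpec`, `into_prefix_spec`, and `make_spec` are hypothetical names, and the `int` in the pair stands in for whatever position type the real `_prefix` member carries.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for the query that receives the moved strings.
struct PrefixSpec {
    std::string field;
    std::string prefix;
};

class PhrasePrefixSketch {
public:
    PhrasePrefixSketch(std::string field, std::optional<std::pair<int, std::string>> prefix)
            : _field(std::move(field)), _prefix(std::move(prefix)) {}

    // The trailing && means this can only be called on an rvalue, which documents
    // that the object is consumed: _field and the prefix string are moved out and
    // left in a valid but unspecified state.
    PrefixSpec into_prefix_spec() && {
        assert(_prefix.has_value());
        return PrefixSpec{std::move(_field), std::move(_prefix.value().second)};
    }

private:
    std::string _field;
    std::optional<std::pair<int, std::string>> _prefix;
};

// Builds the sketch object and immediately consumes it, avoiding string copies.
inline PrefixSpec make_spec(std::string field, std::string prefix) {
    PhrasePrefixSketch query(std::move(field), std::make_pair(0, std::move(prefix)));
    return std::move(query).into_prefix_spec();
}
```

With this shape, calling the method on a named object without `std::move` fails to compile, so accidental reuse after the move is caught early. Copying instead is the simpler alternative if the object might be queried again.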
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None