Main 5.0.6 rebase SAI fixes #2251
Open
adelapena wants to merge 1569 commits into mck/main-5.0.6-rebase from
Conversation
…nTest (#1664) ### What is the issue `PendingAntiCompactionTest.testRetriesTimeout` is flaky on `main`, failing with a timeout reported in the junit test task. riptano/cndb#13524 ### What does this PR fix and why was it fixed `PendingAntiCompactionTest.testRetriesTimeout`, and a couple of other tests, call `Future.get` without any timeout. This change adds a 30-second timeout to prevent an indefinite wait on the future, and reports the failure in the failing test rather than in the junit test task. (Rebase of commit c11b523)
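The pattern behind the fix is simple; here is a minimal, self-contained sketch (the helper name is illustrative, not the actual test code, though the 30-second bound mirrors the description above):

```java
import java.util.concurrent.*;

public class FutureTimeoutDemo
{
    // Illustrative helper mirroring the fix: bound the wait instead of
    // calling Future.get() with no timeout, which can block forever.
    static <T> T getWithTimeout(Future<T> future) throws Exception
    {
        return future.get(30, TimeUnit.SECONDS); // throws TimeoutException instead of hanging
    }

    public static void main(String[] args) throws Exception
    {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> result = pool.submit(() -> "done");
        System.out.println(getWithTimeout(result)); // completes well within the bound
        pool.shutdown();
    }
}
```

With a timeout, a hung future surfaces as a `TimeoutException` in the failing test itself rather than as a junit task timeout.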
(Rebase of commit f87d214)
(Rebase of commit a48eca9)
…tricsTest (Rebase of commit 5d98a96)
(Rebase of commit 12f9fa9)
### What is the issue This is a follow-up test for riptano/cndb#13696. ### What does this PR fix and why was it fixed The fix was already merged, but this confirms that without the fix, collections are affected by the NoSuchElementException bug. (Rebase of commit 88f08e2)
Relates to riptano/cndb#13766 Brings in several improvements. CNDB PR: riptano/cndb#13789 (Rebase of commit 14cbbc6)
### What is the issue Didn't create an issue as this is just a new test. ### What does this PR fix and why was it fixed We implicitly rely on this and are exposing this as a feature to our end users, so we want to confirm that the stop words are the expected stop words. (Rebase of commit 27ac572)
Fixes: riptano/cndb#13718 We had adopted this version of Lucene, datastax/lucene@5ea8bb4, in order to support our custom modifications of HNSW on top of Lucene. We now use https://github.com/datastax/jvector for vector search and no longer need a custom build. I propose we use 9.8.0 since that is the closest release to the one we have been using. CNDB test PR: riptano/cndb#13757 (Rebase of commit 3d5d7d6)
This introduces a `PrefetchingRebufferer` that, when enabled, prefetches (reads from the underlying reader) a configurable number of chunks in advance, so it only works on top of rebufferer factories that work with chunks, i.e. not uncompressed mmap. Prefetching must first be enabled globally by setting `-Dcassandra.read_prefetching_size_kb` to the desired value. With that, and assuming the disk access mode allows it (meaning it is not uncompressed mmap), prefetching is applied to reads that are "clearly" sequential. As of this patch, this mostly means the sstable "scanners" and `SortedStringTableCursor`, so compaction, SAI index building and tools that read sstables fully (scrub, verifier) will benefit from it. Since rebufferers are synchronous, prefetching is done through a fixed thread pool. The number of threads in that pool can be set with `-Dcassandra.read_prefetching_threads` (it defaults to the number of "processors"). The `-Dcassandra.read_prefetching_window` property can also be set to define how often prefetching is re-triggered: by default, when less than half of the `-Dcassandra.read_prefetching_size_kb` value remains prefetched, prefetching is triggered again. See riptano/cndb#13639 for additional motivation. I'll note that this patch is adapted from the similar DSE behavior. However, there is a fair bit of modification, since the DSE version relied on the asynchronous nature of rebufferers there (it also had a few behaviors I didn't fully understand and didn't dig into when they felt specific to DSE). One point worth mentioning is that the DSE version relied on the ability of the underlying channel to create batches when multiple chunks are prefetched. This is something that could have advantages and that we could try to add back over time, but it's a lot harder to see how to do that in the non-async world of C*.
Typically, in DSE, since all the prefetches were initiated on the current thread (without blocking it), building the "batch" was relatively easy, but that doesn't translate to this patch. Not to mention that currently the underlying channels have no batching options, and those would have to be added (in a way that trickles down to the channel implementation within CNDB). Anyway, as of this patch, chunks are prefetched in parallel but with no batching optimization. This does mean the `window` parameter is probably not as useful as in DSE, but I've kept it nonetheless as it's not exactly adding complexity. In the context of compaction in CNDB, though, I think this means it would be ideal if we could use a relatively large "chunk size" for the readers: we read the full file anyway, so there is no waste in using a large-ish chunk size, and it would provide batching for prefetching "for free" (in the sense that if we want to prefetch, say, 128kb in advance, we only need 2 chunks with a 64kb chunk size, but a lot more with a 4kb chunk size). --------- Co-authored-by: Branimir Lambov <branimir.lambov@datastax.com> (Rebase of commit c57974a)
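The mechanism can be illustrated with a toy sketch: a fixed pool reads a window of chunks ahead of a sequential consumer. All names here are illustrative; this is not the real `PrefetchingRebufferer` API.

```java
import java.util.*;
import java.util.concurrent.*;

public class PrefetchSketch
{
    // While the reader consumes chunk i, a fixed pool loads up to
    // prefetchDepth chunks ahead of the current position.
    static List<String> readSequentially(int chunks, int prefetchDepth) throws Exception
    {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        Map<Integer, Future<String>> inFlight = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < chunks; i++)
        {
            // Re-trigger prefetching for the window ahead of the current position.
            for (int j = i; j <= Math.min(i + prefetchDepth, chunks - 1); j++)
            {
                int chunk = j;
                inFlight.computeIfAbsent(chunk, c -> pool.submit(() -> "chunk-" + chunk)); // simulated chunk read
            }
            out.add(inFlight.remove(i).get()); // consume in order; later chunks are already loading
        }
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(readSequentially(4, 2)); // [chunk-0, chunk-1, chunk-2, chunk-3]
    }
}
```

Each chunk is fetched on its own pool thread with no batching, which is the trade-off discussed above.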
…intervals in Interval.AsymetricOrdering are Comparable, and the SSTableReader mock in BaseCompactionStrategyTest violated the Comparator general contract, raising IllegalArgumentException: Comparison method violates its general contract! The root cause was that the objects being compared in Interval.minOrdering were not the same object by '==', but the i1.data.compareTo(i2.data) result was 0, thus violating the comparator contract. Copilot's analysis was that Mockito mocks do not have a meaningful compareTo implementation, so their comparison is not stable or meaningful. This fix changes BaseCompactionStrategyTest.mockSSTable so that getId returns unique values, and provides an explicit comparator that compares using getId. (Rebase of commit 390a5ce)
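A minimal illustration of the fix: giving each mock a unique id and comparing by that id yields a total ordering that satisfies the comparator contract. `FakeSSTable` is a hypothetical stand-in for the mocked SSTableReader, not the real class.

```java
import java.util.*;

public class ComparatorContractDemo
{
    // Hypothetical stand-in for the mocked SSTableReader: each instance gets a unique id.
    record FakeSSTable(int id) {}

    // Mirrors the fix: an explicit comparator over unique ids gives a total,
    // stable ordering, so sorting cannot trip the comparator-contract check.
    static final Comparator<FakeSSTable> BY_ID = Comparator.comparingInt(FakeSSTable::id);

    static List<FakeSSTable> sorted(List<FakeSSTable> tables)
    {
        List<FakeSSTable> copy = new ArrayList<>(tables);
        copy.sort(BY_ID);
        return copy;
    }

    public static void main(String[] args)
    {
        List<FakeSSTable> tables = new ArrayList<>();
        for (int i = 9; i >= 0; i--)
            tables.add(new FakeSSTable(i));
        System.out.println(sorted(tables).get(0).id()); // smallest id first: 0
    }
}
```

The broken case was the opposite: `compareTo` returning 0 for objects that are not `==`, which is exactly what a default Mockito mock does.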
…DRA-20189, but due to rebase conflicts the implementation was overwritten by CC SAI code. (Rebase of commit ce68642)
…ASSANDRA-20100, but due to rebase conflicts the implementation was overwritten by CC SAI code. (Rebase of commit 655714f)
(Rebase of commit da9aab6)
(Rebase of commit d122c73)
…essions, changing the default value of StorageAttachedIndexOptions.prioritize_over_legacy_index to 'true'. The prioritize_over_legacy_index option was introduced in 5.0.4 by CASSANDRA-20334 with a default of 'false'. (Rebase of commit d5896b9)
…1707) ### What is the issue Failing `EarlyOpenCachingTest` due to incorrect assignment of data file length. ### What does this PR fix and why was it fixed Passes the correct lengths to the early open sstable construction. This lets `SimpleSSTableScanner` correctly finish iteration. This was already fixed in CC (as part of CNDB-9104). (Rebase of commit 6e759fa)
…ild heap (#1693) ### What is the issue Fixes: riptano/cndb#13689 ### What does this PR fix and why was it fixed This PR utilizes the NodeQueue::pushMany method to decrease the time complexity required to build the NodeQueue from `O(n log(n))` to `O(n)`. This is likely only significant for sufficiently large hybrid queries. For example, we have seen cases of the search producing 400k rows, which means that we do 400k insertions into these NodeQueue objects. (Rebase of commit 0f7e07b)
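The complexity difference can be seen with the JDK's own `PriorityQueue`, whose collection constructor performs a single O(n) heapify pass, analogous to `NodeQueue::pushMany`, while inserting elements one by one costs O(n log n):

```java
import java.util.*;

public class HeapBuildDemo
{
    // O(n) bulk build: the collection constructor heapifies in one pass
    // instead of performing one sift-up per element.
    static PriorityQueue<Integer> buildHeap(Collection<Integer> scores)
    {
        return new PriorityQueue<>(scores);
    }

    public static void main(String[] args)
    {
        // Roughly the scale mentioned above: hundreds of thousands of rows.
        List<Integer> scores = new Random(42).ints(400_000, 0, 1_000_000).boxed().toList();

        PriorityQueue<Integer> oneByOne = new PriorityQueue<>();
        oneByOne.addAll(scores);                         // O(n log n): sift-up per element
        PriorityQueue<Integer> bulk = buildHeap(scores); // O(n): single heapify pass

        System.out.println(oneByOne.peek().equals(bulk.peek())); // true: same minimum either way
    }
}
```

The resulting heap order is identical; only the build cost differs, which is why the gain only shows up for large hybrid queries.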
Fixes riptano/cndb#13822 CNDB test PR: riptano/cndb#13826

SAI's DiskANN/JVector engine currently **always searches with pruning enabled**, trading recall for latency. That is fine for most workloads, but:

* **Threshold / bounded ANN queries** can lose matches when pruning exits early.
* Performance-testing users need an easy way to turn pruning off to measure the recall/latency curve.

This patch introduces a per-query **`use_pruning`** option so users and operators can choose the trade-off that suits them.

```cql
WITH ann_options = {'use_pruning': true|false}
```

*When omitted we fall back to the node-level default (see below).*

* The cluster-wide default is controlled by the JVM system property:
```
-Dcassandra.sai.jvector.use_pruning_default=<true|false>
```
exposed as `V3OnDiskFormat.JVECTOR_USE_PRUNING_DEFAULT`.
* The property defaults to `true`, preserving existing behaviour.
* The value must be the literal `true` or `false` (case-insensitive).
* Unknown ANN option keys continue to raise `InvalidRequestException`.
* `Orderer` computes `usePruning` from the `use_pruning` value if it is not null, or the JVM default if it is null, and passes it down to **all** `graph.search()` calls.
* Threshold / bounded ANN queries always pass `use_pruning = false` because correctness > latency on those paths (this is a net-new change, but it is very minor and may have no impact on those queries depending on the jvector implementation).
* We added one flag bit to `ANNOptions` serialization; older nodes ignore unknown bits, so mixed-version clusters are fine (though they do throw an exception for unknown settings).

Tests:
* Parsing, validation and transport round-trips (`ANNOptionsTest`).
* Distributed smoke (`ANNOptionsDistributedTest`).
* Recall regression for pruning vs no-pruning (`VectorSiftSmallTest.ensureDisablingPruningIncreasesRecall`).

(Rebase of commit 23be2e7)
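The resolution rule for the option can be sketched as follows (simplified signatures and class name; the real logic lives in `Orderer` and `V3OnDiskFormat`):

```java
public class PruningOptionDemo
{
    // The per-query use_pruning value wins; a null value falls back to the
    // node-level default (simplified stand-in for the Orderer logic).
    static boolean resolveUsePruning(Boolean perQueryValue, boolean nodeDefault)
    {
        return perQueryValue != null ? perQueryValue : nodeDefault;
    }

    public static void main(String[] args)
    {
        // Mirrors the documented system property; defaults to true when unset.
        boolean nodeDefault = Boolean.parseBoolean(
            System.getProperty("cassandra.sai.jvector.use_pruning_default", "true"));
        System.out.println(resolveUsePruning(null, nodeDefault));  // node default (true unless the property is set)
        System.out.println(resolveUsePruning(false, nodeDefault)); // per-query override: false
    }
}
```

Threshold and bounded ANN paths bypass this resolution entirely and pass `false`, per the correctness-over-latency note above.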
(Rebase of commit 47f2c2a)
…1705) ### What is the issue Fixes: riptano/cndb#13625 ### What does this PR fix and why was it fixed When `rerank_k` is non-positive, we will skip reranking where possible. A minor note: this change is not strictly backwards compatible because of the original implementation. However, it will fail quickly if the coordinator accepts the value but the writer does not, and since this is only a query-time parameter, I propose that we move forward and accept the net-new value that was previously disallowed. (Rebase of commit 717971d)
…ation (#1709) Fixes: riptano/cndb#13847 We already make use of the `Token#getLongValue` method in SAI, which only works for the Murmur partitioner, so I propose that we add a symmetric method that allows us to create tokens without converting a long to a byte buffer and back to a long. (Rebase of commit 68a92da)
The number of rows stored in an SSTable's SAI index was incorrect for analyzed indexes and for indexes on collections, as it was counting posting lists for each term. In non-analyzed, non-collection indexes there is one term per row, while in an analyzed index there can be many terms in a row. This led to an incorrect row count. Fixes the counting of rows stored in the SSTable SAI index, plus minor fixes of code warnings in the affected files. (Rebase of commit 29ef34e)
The test for the KD-tree posting list was meant to run a query that performs skips and then verify that the skip metric has been bumped up. Unfortunately there were several problems with it:
* the query didn't do skips because the clause intersection got optimized out by the SAI optimizer
* the test didn't inspect the metric value, but only checked that the metric was updated; so essentially it boiled down to checking whether the index was used
* the metric value incorrectly recorded queries with no skips, because the histogram was configured to not allow zeroes

In addition, the usage of the index depends on the version of the optimizer, which made the test behave differently between the DC and EC index versions. Now that we disable the optimizer for this test, that difference is gone. Fixes riptano/cndb#13830 (Rebase of commit 17b4ad7)
The new request failure reason should be used when the on-disk format of an index in a replica doesn't support a requested feature. Also, rename org.apache.cassandra.index.sai.disk.format.Version.latest() to Version.current(), and add separate Version.CURRENT and Version.LATEST constants. (Rebase of commit 523435a)
(Rebase of commit dce2e0c)
(Rebase of commit 895ade3)
…clustering columns in descending order - port CASSANDRA-20295 (Rebase of commit f10abed)
…llback (#2124) riptano/cndb#15990
- What: Eliminates unnecessary array allocations in PathUtils.atomicMoveWithFallback() by pre-allocating CopyOption[] arrays as static constants.
- Why: JFR profiling showed this method was a memory allocation hotspot during RemoteFileCache preloading. Each call to Files.move() with varargs parameters created new arrays, causing excessive GC pressure.

(Rebase of commit e054ecf)
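The allocation-avoidance pattern looks roughly like this (method and constant names are illustrative, not the exact PathUtils code):

```java
import java.io.IOException;
import java.nio.file.*;

public class AtomicMoveDemo
{
    // Varargs calls like Files.move(a, b, StandardCopyOption.ATOMIC_MOVE)
    // allocate a fresh CopyOption[] on every invocation; these shared
    // constants are allocated once.
    private static final CopyOption[] ATOMIC = { StandardCopyOption.ATOMIC_MOVE };
    private static final CopyOption[] REPLACE = { StandardCopyOption.REPLACE_EXISTING };

    // Sketch of the fallback shape: try an atomic move, fall back to a
    // plain replacing move where the filesystem doesn't support it.
    static void atomicMoveWithFallback(Path from, Path to) throws IOException
    {
        try
        {
            Files.move(from, to, ATOMIC);
        }
        catch (AtomicMoveNotSupportedException e)
        {
            Files.move(from, to, REPLACE);
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path dir = Files.createTempDirectory("move-demo");
        Path src = Files.writeString(dir.resolve("a.txt"), "payload");
        atomicMoveWithFallback(src, dir.resolve("b.txt"));
        System.out.println(Files.exists(dir.resolve("b.txt"))); // true
    }
}
```

Since `Files.move(Path, Path, CopyOption...)` accepts an array directly, the call sites are unchanged except for passing the constant.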
…ting lists (#2112) Fixes: riptano/cndb#15919 Test PR: riptano/cndb#15949

In the original implementation for #820, we introduced the `PrimaryKeyMapIterator` to iterate all primary keys in an sstable and then do an anti-join against the result of an equality query. That design works, but requires additional, unnecessary disk reads to get primary keys. There are two possible solutions:

1. We can use row ids (either sstable or segment) to take the complement of the resulting posting lists. This would be the most performant, since it avoids object allocations. The main issue with this solution is that it is much more complicated to implement and had unaddressed edge cases.
2. We can use `primaryKeyFromRowId`, which takes primary key bounds and then uses a row id when rows are from the same sstable. This is worse than solution 1 because it creates an object per key and requires comparing sstable ids before comparing sstable row ids, but it is a significant improvement over the current solution, which hits disk to load the primary key.

When testing on my local machine and reviewing the JMH benchmarks, I can see that the current solution is about 16x worse than solution (2) and 32x worse than the optimal solution (1). Given that the benchmarks in question are highly specific to the use case, I do not think we have sufficient motivation to introduce the exceedingly complex solution (1). Note that the ideal way to implement solution 1 with much less complexity would be to convert posting lists into a single iterator of sstable row ids and then take their complement. (Rebase of commit 9dd4c28)
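The anti-join idea, taking the complement of a posting list over a row-id space, can be sketched independently of SAI's types (all names here are illustrative):

```java
import java.util.*;

public class ComplementDemo
{
    // Given sorted row ids that matched the equality query, emit the row ids
    // in [0, maxRowId] that did NOT match: the complement of the posting list.
    static List<Integer> complement(int[] matches, int maxRowId)
    {
        List<Integer> out = new ArrayList<>();
        int m = 0;
        for (int rowId = 0; rowId <= maxRowId; rowId++)
        {
            if (m < matches.length && matches[m] == rowId)
                m++;            // rowId matched the query: skip it
            else
                out.add(rowId); // rowId did not match: part of the complement
        }
        return out;
    }

    public static void main(String[] args)
    {
        System.out.println(complement(new int[]{ 1, 3 }, 5)); // [0, 2, 4, 5]
    }
}
```

The fixed implementation's advantage is that this walk happens over row ids (mapped to keys lazily via `primaryKeyFromRowId`) instead of loading every primary key from disk.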
…ility on all coordinators. Rename ambiguous waitForIndexqueryable() to waitForIndexQueryableOnFirstNode() and add coordinator-aware alternatives (#2127) (Rebase of commit 5d94ed4)

ninjafix – CNDB-16177: CNDB-15968: Fix flaky SAI tests by ensuring index queryability on all coordinators. Rename ambiguous waitForIndexqueryable() to waitForIndexQueryableOnFirstNode() and add coordinator-aw… squash to a74ea99 / 5d94ed4
Adds coordinator CAS read/write latencies (Rebase of commit c32e5af)
…#2176) ### What is the issue After CNDB-16249, hints are written with the storage-compatible messaging version (e.g., VERSION_40), but dispatch was still using the peer's negotiated version (VERSION_DS_20). This mismatch forced hints to be decoded and re-encoded on dispatch, which fails when the tenant is unassigned because tables are unknown. ### What does this PR fix and why was it fixed Cap the dispatch version to the storage compatibility mode's version so hints can use the encoded path (no deserialization) when the versions match. (Rebase of commit 7d8f28b)
…compatibility mode (#2179) ### What is the issue Fix version mismatch error when dispatching encoded hints in storage compatibility mode ### What does this PR fix and why was it fixed Remove version checks in HintMessage.Encoded serialization that prevented dispatching hints written at storage compatibility version (e.g., 12) when the peer connection uses a different messaging version (e.g., 110). UUID and VInt serialization are version-independent, and hint bytes are written verbatim, so the version check was unnecessarily restrictive. (Rebase of commit c36fbcf)
ULID-based SSTable ID generation can fail with an NPE when generating a new ID. The root cause is that the underlying ULID generator can return an empty Optional when the clock is moved backwards to before the previously generated ID, or in certain rare overflow conditions when timestamps collide. If it's our first time through the generation loop, we prematurely exit with a null newVal. Top of the error stack:
```
java.lang.NullPointerException
at org.apache.cassandra.utils.TimeUUID.approximateFromULID(TimeUUID.java:58)
at org.apache.cassandra.io.sstable.ULIDBasedSSTableId.<init>(ULIDBasedSSTableId.java:52)
at org.apache.cassandra.io.sstable.ULIDBasedSSTableId$Builder.lambda$generator$0(ULIDBasedSSTableId.java:129)
```
This can cause a flush to fail. The fix is to continue looping until newVal gets a value. The loop can spin until the corrected time catches up to the time of the most recently used ULID generation ID; this should be a short duration in a healthy cluster without large time corrections from sync. Tests are added in ULIDBasedSSTableIdGeneratorTest, and a package-protected constructor is introduced for ULIDBasedSSTableIdGeneratorTest.testGeneratorRetryOnEmptyOptional(). Cassandra applicability: upstream doesn't have ULIDBasedSSTableId (and won't, because of CASSANDRA-17048). (Rebase of commit e9afd3a)
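The shape of the fix, keep looping until the generator yields a value, can be sketched as follows (the generator here is a hypothetical stand-in for the real ULID generator, which returns empty when the clock has moved backwards):

```java
import java.util.Optional;
import java.util.function.Supplier;

public class RetryOnEmptyDemo
{
    // Mirrors the fix: loop until the generator produces a value, instead of
    // exiting the first pass with a null newVal and hitting an NPE later.
    static long generate(Supplier<Optional<Long>> generator)
    {
        Optional<Long> newVal = Optional.empty();
        while (newVal.isEmpty())
            newVal = generator.get(); // spins until the clock catches up
        return newVal.get();
    }

    public static void main(String[] args)
    {
        // Simulated generator: fails twice (clock behind), then succeeds.
        int[] attempts = { 0 };
        long id = generate(() -> ++attempts[0] < 3 ? Optional.empty() : Optional.of(42L));
        System.out.println(id + " after " + attempts[0] + " attempts"); // 42 after 3 attempts
    }
}
```

In a healthy cluster the spin is short, since the corrected time only needs to catch up to the most recently used generation timestamp.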
…ly (#2186) ### What is the issue FSReadError extends IOError, so it was not caught by the catch block in filterCommitLogFiles(). This caused replayer failures when a remote commit log file was listed but no longer accessible. ### What does this PR fix and why was it fixed Only NoSuchFileException is handled gracefully; other IOErrors (such as corruption) are re-thrown to fail the replay as intended. (Rebase of commit a51cdcc)
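The fixed error handling follows this shape (a simplified sketch, not the actual filterCommitLogFiles() code; in Cassandra, FSReadError extends IOError):

```java
import java.io.File;
import java.io.IOError;
import java.nio.file.NoSuchFileException;

public class ReplayFilterDemo
{
    // Only a missing file is tolerated; any other IOError (e.g. corruption,
    // or an FSReadError in the real code) propagates and fails the replay.
    static boolean isReadable(File file)
    {
        try
        {
            // Stand-in for probing the remote commit log file.
            if (!file.exists())
                throw new IOError(new NoSuchFileException(file.getPath()));
            return true;
        }
        catch (IOError e)
        {
            if (e.getCause() instanceof NoSuchFileException)
                return false; // listed but no longer accessible: skip gracefully
            throw e;          // anything else: re-throw to fail the replay
        }
    }

    public static void main(String[] args)
    {
        System.out.println(isReadable(new File("/definitely/not/there"))); // false
    }
}
```

Catching `IOError` rather than `IOException` is the key point: `FSReadError` is an `Error`, so an `IOException` catch block never sees it.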
(Rebase of commit 3dccb8c)
…d check (#2213) The NoSpamLogger clock override was checking that accesses came from a single expected thread. This failed when internal threads accessed it during truncate in cleanupRepairTables(). Removed the thread check, since the purpose of the override is deterministic time, not thread enforcement. (Rebase of commit 94e3546)
PartitionAwarePrimaryKeyMap implemented an overcomplicated `ceiling` method calling `exactRowIdOrInvertedCeiling`. This commit simplifies PartitionAwarePrimaryKeyMap.ceiling to use the corresponding correct method from the reader directly. This can be seen as a follow-up to https://github.com/datastax/cassandra/pull/1096/files#diff-c5011580ab9b0d99d9e504570c4cccb152221d3dbe62c8a956e83fce9070b380 Rebase notes:
- commit 48b55fb was dropped because it only contains lint/refactor changes and should instead be upstreamed
(Rebase of commit e5c5d07)
The rationale behind this upgrade is to remediate CVE-2025-67735. BoringSSL has been upgraded as well, for consistency with 4.1.130.Final. CNDB PR: riptano/cndb#16481 (Rebase of commit 2e251a1)
The current version of `lz4-java` (1.8.0) has known vulnerabilities, namely [CVE-2025-12183](https://www.cve.org/CVERecord?id=CVE-2025-12183) and [CVE-2025-66566](https://www.cve.org/CVERecord?id=CVE-2025-66566). This patch updates the lz4-java library from 1.8.0 to 1.10.2. As of lz4-java 1.8.1, the artifact coordinates changed from `org.lz4` to `at.yawk.lz4`, so this patch also updates the artifact coordinates. (Rebase of commit af4abfb)
… wrong endianness to be squashed into XXX
CNDB-16593 → https://github.com/riptano/cndb/issues/16593

cherry-picked into @adelapena, feel free to add commits if/as they come
63d841e to 32e9a3d



### What is the issue
Multiple SAI tests are failing, likely due to rebase issues.
### What does this PR fix and why was it fixed
Fix SAI tests failing due to rebase issues.