🛡️ Sentry: Harden hybrid search against NaN and DoS#1013
🛡️ Sentry: Harden hybrid search against NaN and DoS#1013
Conversation
- `src/query/hybrid.rs`: Added `MAX_HYBRID_K` limit (100,000) and filtered NaNs from `traverse_and_rank`. - `src/query/semantic_pathfinding.rs`: Handled NaN similarity in `calculate_semantic_cost` by using max cost. - Added regression tests for NaN handling and `k` limit enforcement. Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
Summary of ChangesHello @madmax983, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness and security of the system's search and pathfinding functionalities. It introduces mechanisms to prevent denial-of-service attacks by limiting resource consumption and ensures stable operation when encountering invalid or corrupted vector data, specifically floating-point Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request effectively hardens the hybrid search and semantic pathfinding functionalities against potential DoS attacks and data corruption issues. It introduces a limit on the k parameter in traverse_and_rank to prevent memory exhaustion and adds robust handling for NaN similarity scores in both traverse_and_rank and SemanticPathfinder. The accompanying tests are comprehensive and validate the new hardening measures. The feedback regarding duplicated logic for NaN handling in semantic_pathfinding.rs remains a valid point for refactoring to improve maintainability.
| let semantic_cost = if let Some(emb) = target_embedding { | ||
| 1.0 - cosine_similarity(emb, query_embedding)? | ||
| let sim = cosine_similarity(emb, query_embedding)?; | ||
| // SENTRY: Handle NaN similarity by assigning high cost | ||
| if sim.is_nan() { | ||
| 1.0 | ||
| } else { | ||
| 1.0 - sim | ||
| } | ||
| } else { | ||
| 1.0 // High cost if no embedding | ||
| }; |
There was a problem hiding this comment.
There's a duplication of logic for calculating semantic cost here and in the calculate_semantic_cost function. Both now handle NaN similarity scores by assigning a high cost, but they do so in separate implementations. This could lead to maintenance issues if one is updated and the other is forgotten.
To improve maintainability and reduce code duplication, consider refactoring this logic into a shared helper function. This function could take an Option< &[f32]> embedding and the query embedding, and return the calculated cost. Both find_path_at_time and calculate_semantic_cost could then call this helper.
For example:
fn calculate_semantic_cost_from_embedding(embedding: Option< &[f32]>, query: &[f32]) -> Result<f32> {
if let Some(emb) = embedding {
let sim = cosine_similarity(emb, query)?;
if sim.is_nan() {
Ok(1.0)
} else {
Ok(1.0 - sim)
}
} else {
Ok(1.0)
}
}Then find_path_at_time could be simplified to:
let semantic_cost = calculate_semantic_cost_from_embedding(target_embedding, query_embedding)?;And calculate_semantic_cost could also be refactored to use this helper.
🚀 Benchmark ResultsBenchmarks have been run for this PR. 📊 Performance Summary (Top 10)
Full benchmark results available in workflow artifacts |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## trunk #1013 +/- ##
==========================================
+ Coverage 92.03% 92.05% +0.02%
==========================================
Files 126 126
Lines 72697 72994 +297
==========================================
+ Hits 66904 67197 +293
- Misses 5793 5797 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ace42cb22f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if sim.is_nan() { | ||
| return Ok(1.0); |
There was a problem hiding this comment.
Validate query embeddings before masking NaN similarities
The new sim.is_nan() fallback in calculate_semantic_cost turns invalid query vectors into a normal cost (1.0) instead of surfacing an error. Because sim can be NaN when the caller-provided query contains NaN/Inf values, find_path (and the analogous temporal branch) can now return an apparently valid path for malformed input, producing misleading semantic results rather than failing fast. Please validate query_embedding at the API boundary (or return an error when similarity is NaN) so this guard only handles corrupted stored node embeddings.
Useful? React with 👍 / 👎.
Ran `cargo fmt` to resolve CI linting failures in `src/query/hybrid.rs` and `src/query/semantic_pathfinding.rs`. Verified with `cargo clippy`. Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
🚀 Benchmark ResultsBenchmarks have been run for this PR. 📊 Performance Summary (Top 10)
Full benchmark results available in workflow artifacts |
Hardened the hybrid search (
traverse_and_rank) and semantic pathfinding logic against invalid vector data (NaNs) and excessive resource consumption.Changes:
traverse_and_ranknow validateskagainstMAX_HYBRID_K(100,000) to prevent OOM DoS.traverse_and_rankfilters out candidates withNaNsimilarity scores (which can occur from corrupted stored data) to prevent undefined sorting behavior.SemanticPathfinder::calculate_semantic_costnow returns a safe fallback cost (1.0) whencosine_similarityreturnsNaN, preventing priority queue corruption.src/query/hybrid.rsandsrc/query/semantic_pathfinding.rsverifying these fixes, including tests that bypass standard validation to simulate data corruption.PR created automatically by Jules for task 4029855929144920958 started by @madmax983