Fix ephemeral cache parent path deduplication (INDEX-012)#3018
Fix ephemeral cache parent path deduplication (INDEX-012)#3018
Conversation
- Updated the `EphemeralIndexCache` to support indexing scope, allowing differentiation between current and recursive indexing. - Modified methods to create and mark indexing complete with scope parameters, improving cache management and preventing redundant scans. - Refactored path resolution logic in `EphemeralIndex` to handle symlinks and ensure accurate entry retrieval. - Adjusted related components to utilize the new indexing scope, enhancing overall indexing efficiency and accuracy.
PR SummaryMedium Risk Overview
Callers ( Written by Cursor Bugbot for commit fe7d1a5. Configure here. |
| .as_ref() | ||
| .map_or(false, |c| c.starts_with(existing_path)); | ||
|
|
||
| if covered { |
There was a problem hiding this comment.
One edge case: covered will also be true when path == existing_path, which would skip reindexing the same root. Might be worth treating only strict parents as “covered”.
| if covered { | |
| let is_same_path = path == *existing_path | |
| || canonical.as_ref().map_or(false, |c| c == existing_path); | |
| if covered && !is_same_path { |
| scope: IndexScope, | ||
| ) -> Arc<TokioRwLock<EphemeralIndex>> { | ||
| let indexed = self.indexed_paths.read(); | ||
| let canonical = path.canonicalize().ok(); |
There was a problem hiding this comment.
Minor perf nit: canonicalize() here is a filesystem hit and happens even when there are no recursive entries. You could compute it lazily inside the loop only when needed (or only when scope/existing entries make it relevant).
|
|
||
| // If this is a recursive scan, subsume child paths | ||
| if scope == IndexScope::Recursive { | ||
| indexed.retain(|existing, _| !existing.starts_with(path) || existing == path); |
There was a problem hiding this comment.
Symlink variant corner case: retain only checks starts_with(path), so a child key stored under the symlink form (/Users/...) won’t be subsumed when the recursive root is canonical (/System/Volumes/Data/...), and vice-versa. If you want this dedup to be robust, consider normalizing keys (store canonical form) before retain/insert for recursive scopes.
| // Preserve subdirectories that were explicitly browsed AND still exist | ||
| if child_node.is_directory() && indexed_paths.contains(&child_path) { | ||
| // Preserve subdirectories that were explicitly browsed AND still exist | ||
| if child_node.is_directory() && indexed_paths.contains_key(&child_path) { |
There was a problem hiding this comment.
Since indexed_paths is now scope-aware, I think this should probably only preserve explicitly browsed subdirs (Current), not any tracked key.
| if child_node.is_directory() && indexed_paths.contains_key(&child_path) { | |
| if child_node.is_directory() | |
| && matches!( | |
| indexed_paths.get(&child_path), | |
| Some(crate::ops::indexing::IndexScope::Current) | |
| ) | |
| { |
|
|
||
| // Preserve subdirectories that were explicitly browsed AND still exist | ||
| if child_node.is_directory() && indexed_paths.contains(&child_path) { | ||
| // Preserve subdirectories that were explicitly browsed AND still exist |
There was a problem hiding this comment.
This block looks mis-indented inside the filter_map closure (might just need a cargo fmt).
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| ); | ||
| return self.index.clone(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Self-coverage in create_for_indexing breaks volume re-indexing
High Severity
The recursive parent coverage check in create_for_indexing uses path.starts_with(existing_path) which returns true when both paths are equal. When re-indexing an already recursively-indexed volume, this causes a silent early return — the path is never moved from indexed_paths to in_progress. But callers in action.rs unconditionally proceed to call clear_for_reindex (destroying existing data) and dispatch a new job. During re-indexing, the path still appears "indexed" (causing queries to attempt using emptied data) and has no is_indexing protection against concurrent duplicate scans.
Additional Locations (1)
| } | ||
| } | ||
|
|
||
| None |
There was a problem hiding this comment.
Duplicated path coverage logic across three methods
Medium Severity
get_for_path, is_indexed, and create_for_indexing each implement nearly identical path coverage logic: exact match → canonicalize → iterate recursive parents. This triplicated code already shows divergence from get_for_search which additionally canonicalizes the indexed paths — a difference that may be an unintended inconsistency. Extracting a shared helper like is_path_covered accepting the locked map would eliminate this duplication and risk of future drift.


Summary
Fixes INDEX-012: Prevents redundant ephemeral scans when browsing subdirectories of already-indexed volumes, particularly on macOS where symlinks cause path mismatches.
Problem
When a volume is recursively indexed (e.g.
/System/Volumes/Data), browsing a subdirectory like/Users/jamespinewould trigger a redundant shallow scan because:get_for_search()canonicalizes and finds the index ✅list_directory()does rawpath_index.get(path)lookup - arena has/System/Volumes/Data/Users/jamespinebut query is/Users/jamespine→ returnsNone❌create_for_indexing("/Users/jamespine"), spawns redundant jobThis created duplicate entries in
indexed_pathsand wasted work re-scanning already-indexed content.Changes
1. Symlink-aware path resolution in EphemeralIndex
resolve_entry_id()helper that tries raw path first (fast path), then canonicalizes (slow path)list_directory,get_entry_ref,get_entry_uuid, etc.) to use symlink-aware lookup/Users→/System/Volumes/Data/Users2. Scope-aware parent-path deduplication in EphemeralIndexCache
indexed_pathsfromHashSet<PathBuf>toHashMap<PathBuf, IndexScope>create_for_indexing()checks if path is already covered by recursive parent (with canonicalization)mark_indexing_complete()removes subsumed child paths when recursive scan completesis_indexed()andget_for_path()check recursive parent coverage with symlink resolution3. Updated callers to pass IndexScope
volumes/index/action.rs- passes volume indexing scopedirectory_listing.rs- passesIndexScope::Currentfor directory browsingjob.rs- passes job config scope4. Updated
clear_directory_childrensignatureHashSet<PathBuf>toHashMap<PathBuf, IndexScope>for consistencyTest Results
test_parent_path_coverage- recursive parent covers childrentest_shallow_browse_no_parent_coverage- Current scope doesn't cover childrentest_no_redundant_scan_under_volume- no-op for covered childtest_volume_subsumes_child_paths- child removal on recursive completiontest_symlink_path_resolution- symlink resolution on macOSImpact
After this fix:
indexed_pathswhen browsing under indexed volumes/Users→/System/Volumes/Data/Users)Related
Made with Cursor