Skip to content

Fix ephemeral cache parent path deduplication (INDEX-012)#3018

Open
jamiepine wants to merge 1 commit intomainfrom
ephemeral-cache-dedup
Open

Fix ephemeral cache parent path deduplication (INDEX-012)#3018
jamiepine wants to merge 1 commit intomainfrom
ephemeral-cache-dedup

Conversation

@jamiepine
Copy link
Member

Summary

Fixes INDEX-012: Prevents redundant ephemeral scans when browsing subdirectories of already-indexed volumes, particularly on macOS where symlinks cause path mismatches.

Problem

When a volume is recursively indexed (e.g. /System/Volumes/Data), browsing a subdirectory like /Users/jamespine would trigger a redundant shallow scan because:

  1. get_for_search() canonicalizes and finds the index ✅
  2. list_directory() does raw path_index.get(path) lookup - arena has /System/Volumes/Data/Users/jamespine but query is /Users/jamespine → returns None
  3. Code falls through, calls create_for_indexing("/Users/jamespine"), spawns redundant job

This created duplicate entries in indexed_paths and wasted work re-scanning already-indexed content.

Changes

1. Symlink-aware path resolution in EphemeralIndex

  • Added resolve_entry_id() helper that tries raw path first (fast path), then canonicalizes (slow path)
  • Updated all read-path methods (list_directory, get_entry_ref, get_entry_uuid, etc.) to use symlink-aware lookup
  • Resolves macOS symlink issue where /Users/System/Volumes/Data/Users

2. Scope-aware parent-path deduplication in EphemeralIndexCache

  • Changed indexed_paths from HashSet<PathBuf> to HashMap<PathBuf, IndexScope>
  • Tracks whether each path was indexed recursively or just current-level
  • create_for_indexing() checks if path is already covered by recursive parent (with canonicalization)
  • mark_indexing_complete() removes subsumed child paths when recursive scan completes
  • is_indexed() and get_for_path() check recursive parent coverage with symlink resolution

3. Updated callers to pass IndexScope

  • volumes/index/action.rs - passes volume indexing scope
  • directory_listing.rs - passes IndexScope::Current for directory browsing
  • job.rs - passes job config scope

4. Updated clear_directory_children signature

  • Changed from HashSet<PathBuf> to HashMap<PathBuf, IndexScope> for consistency

Test Results

  • 12/12 tests passed in ephemeral cache test suite
  • No linter errors detected
  • Clippy passed with no warnings
  • ✅ Added 5 new tests:
    • test_parent_path_coverage - recursive parent covers children
    • test_shallow_browse_no_parent_coverage - Current scope doesn't cover children
    • test_no_redundant_scan_under_volume - no-op for covered child
    • test_volume_subsumes_child_paths - child removal on recursive completion
    • test_symlink_path_resolution - symlink resolution on macOS

Impact

After this fix:

  • No duplicate entries in indexed_paths when browsing under indexed volumes
  • No redundant scans triggered by symlink path mismatches
  • Properly distinguishes between shallow (Current) and recursive indexing scope
  • Works correctly on macOS with APFS symlinks (/Users/System/Volumes/Data/Users)

Related

  • Task: INDEX-012
  • Parent: INDEX-000 (Hybrid Indexing Architecture)

Made with Cursor

- Updated the `EphemeralIndexCache` to support indexing scope, allowing differentiation between current and recursive indexing.
- Modified methods to create and mark indexing complete with scope parameters, improving cache management and preventing redundant scans.
- Refactored path resolution logic in `EphemeralIndex` to handle symlinks and ensure accurate entry retrieval.
- Adjusted related components to utilize the new indexing scope, enhancing overall indexing efficiency and accuracy.
@cursor
Copy link

cursor bot commented Feb 7, 2026

PR Summary

Medium Risk
Touches core ephemeral indexing cache semantics (path coverage, canonicalization, and job completion bookkeeping), which could impact directory listing freshness and when indexing jobs are dispatched.

Overview
Fixes redundant ephemeral indexing under already-recursively-indexed roots by making the ephemeral index and cache scope- and symlink-aware.

EphemeralIndexCache now tracks indexed_paths as path -> IndexScope, treats children of Recursive paths as covered (including canonical/symlink forms), skips create_for_indexing() when a recursive parent already covers the target, and subsumes/removes redundant child entries when a recursive scan completes. EphemeralIndex adds symlink-aware path lookups via resolve_entry_id() and routes directory listing / entry reads through it so queries like macOS /Users/... hit entries stored under /System/Volumes/Data/....

Callers (directory_listing, volume indexing action, and indexer job completion) are updated to pass the correct IndexScope, and cache tests are expanded to cover parent coverage, shallow vs recursive behavior, deduplication, subsumption, and macOS symlink resolution.

Written by Cursor Bugbot for commit fe7d1a5. Configure here.

.as_ref()
.map_or(false, |c| c.starts_with(existing_path));

if covered {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One edge case: covered will also be true when path == existing_path, which would skip reindexing the same root. Might be worth treating only strict parents as “covered”.

Suggested change
if covered {
let is_same_path = path == *existing_path
|| canonical.as_ref().map_or(false, |c| c == existing_path);
if covered && !is_same_path {

scope: IndexScope,
) -> Arc<TokioRwLock<EphemeralIndex>> {
let indexed = self.indexed_paths.read();
let canonical = path.canonicalize().ok();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor perf nit: canonicalize() here is a filesystem hit and happens even when there are no recursive entries. You could compute it lazily inside the loop only when needed (or only when scope/existing entries make it relevant).


// If this is a recursive scan, subsume child paths
if scope == IndexScope::Recursive {
indexed.retain(|existing, _| !existing.starts_with(path) || existing == path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Symlink variant corner case: retain only checks starts_with(path), so a child key stored under the symlink form (/Users/...) won’t be subsumed when the recursive root is canonical (/System/Volumes/Data/...), and vice-versa. If you want this dedup to be robust, consider normalizing keys (store canonical form) before retain/insert for recursive scopes.

// Preserve subdirectories that were explicitly browsed AND still exist
if child_node.is_directory() && indexed_paths.contains(&child_path) {
// Preserve subdirectories that were explicitly browsed AND still exist
if child_node.is_directory() && indexed_paths.contains_key(&child_path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since indexed_paths is now scope-aware, I think this should probably only preserve explicitly browsed subdirs (Current), not any tracked key.

Suggested change
if child_node.is_directory() && indexed_paths.contains_key(&child_path) {
if child_node.is_directory()
&& matches!(
indexed_paths.get(&child_path),
Some(crate::ops::indexing::IndexScope::Current)
)
{


// Preserve subdirectories that were explicitly browsed AND still exist
if child_node.is_directory() && indexed_paths.contains(&child_path) {
// Preserve subdirectories that were explicitly browsed AND still exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block looks mis-indented inside the filter_map closure (might just need a cargo fmt).

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

);
return self.index.clone();
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-coverage in create_for_indexing breaks volume re-indexing

High Severity

The recursive parent coverage check in create_for_indexing uses path.starts_with(existing_path) which returns true when both paths are equal. When re-indexing an already recursively-indexed volume, this causes a silent early return — the path is never moved from indexed_paths to in_progress. But callers in action.rs unconditionally proceed to call clear_for_reindex (destroying existing data) and dispatch a new job. During re-indexing, the path still appears "indexed" (causing queries to attempt using emptied data) and has no is_indexing protection against concurrent duplicate scans.

Additional Locations (1)

Fix in Cursor Fix in Web

}
}

None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated path coverage logic across three methods

Medium Severity

get_for_path, is_indexed, and create_for_indexing each implement nearly identical path coverage logic: exact match → canonicalize → iterate recursive parents. This triplicated code already shows divergence from get_for_search which additionally canonicalizes the indexed paths — a difference that may be an unintended inconsistency. Extracting a shared helper like is_path_covered accepting the locked map would eliminate this duplication and risk of future drift.

Additional Locations (2)

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant