Open
Description
Summary
When a cached _dataset_index.json exists in a parent cache directory, it takes precedence over the correct split-specific index. This causes the wrong files to be processed.
Current Behavior
# User has:
cache/tusz_mmap/_dataset_index.json # Old global index with 7364 files
cache/tusz_mmap/eval/ # Empty, wants to build eval index
# Running build-cache for eval...
python -m src build-cache \
--data-dir data_ext4/tusz/edf/eval \
--cache-dir cache/tusz_mmap/eval
# PROBLEM: Code finds parent's index and uses it instead of building fresh!
Root Cause
datasets.py:75-86 checks for _dataset_index.json but may find a stale one from a different scope:
index_cache_path = self.cache_dir / "_dataset_index.json"
if index_cache_path.exists():
    # Loads cached index without verifying it matches current data_dir
    with open(index_cache_path) as f:
        cached_index = json.load(f)
    cached_files = [Path(p).name for p in cached_index["files"]]
    current_files = [p.name for p in self.edf_files]
    if cached_files == current_files:  # May match if filenames happen to overlap!
        self._index_map = cached_index["index_map"]
The issue is compounded by #5: if --split doesn't scope correctly and a global index exists, the wrong index is loaded.
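To illustrate why the name-only comparison is unsafe, here is a minimal sketch. The paths are hypothetical (not taken from the actual dataset) and only assume that two splits can contain a recording with the same filename:

```python
from pathlib import Path

# Hypothetical paths: the same recording name under two different splits.
train_files = [Path("/data/tusz/edf/train/rec_001.edf")]
eval_files = [Path("/data/tusz/edf/eval/rec_001.edf")]

# The check used in the snippet above: compare bare filenames only.
names_match = [p.name for p in train_files] == [p.name for p in eval_files]

# A stricter check: compare the full paths.
paths_match = [str(p) for p in train_files] == [str(p) for p in eval_files]

print(names_match)  # True: the name-only check accepts the stale index
print(paths_match)  # False: full paths tell the two scopes apart
```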
Expected Behavior
- Index should be tied to the specific data directory, not just cache directory
- Should verify absolute paths, not just filenames
- Should rebuild if data_dir doesn't match
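The expected behavior above could look roughly like the sketch below. It is not the current code in datasets.py: the `data_dir` field in the JSON and the helper names `save_index` and `load_index_if_valid` are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional, Sequence

INDEX_NAME = "_dataset_index.json"


def save_index(cache_dir: Path, data_dir: Path,
               edf_files: Sequence[Path], index_map: dict) -> None:
    """Write the index together with the data_dir it was built from."""
    payload = {
        "data_dir": str(data_dir.resolve()),  # assumed extra field
        "files": [str(p.resolve()) for p in edf_files],  # absolute paths
        "index_map": index_map,
    }
    cache_dir.mkdir(parents=True, exist_ok=True)
    with open(cache_dir / INDEX_NAME, "w") as f:
        json.dump(payload, f)


def load_index_if_valid(cache_dir: Path, data_dir: Path,
                        edf_files: Sequence[Path]) -> Optional[dict]:
    """Return the cached index_map, or None if the index must be rebuilt."""
    index_cache_path = cache_dir / INDEX_NAME
    if not index_cache_path.exists():
        return None
    with open(index_cache_path) as f:
        cached = json.load(f)
    # Reject an index that was built from a different data directory.
    if cached.get("data_dir") != str(data_dir.resolve()):
        return None
    # Compare resolved absolute paths, not bare filenames.
    if cached.get("files") != [str(p.resolve()) for p in edf_files]:
        return None
    return cached.get("index_map")
```

A caller would rebuild and call `save_index` whenever `load_index_if_valid` returns `None`. An old index without the `data_dir` field is also rejected, so stale files get replaced on the next build.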
Workaround
Manually delete stale index before building:
rm cache/tusz_mmap/_dataset_index.json
python -m src build-cache ...
Proposed Fix
- Include data_dir hash in index cache key, OR
- Store data_dir in index and verify on load, OR
- Use split-scoped index paths (e.g., _dataset_index_{split}.json)
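As a rough sketch of the first option (hashing data_dir into the cache key), something like the following could work. The helper name `index_path_for`, the 12-character digest length, and the resulting filename pattern are assumptions, not existing code:

```python
import hashlib
from pathlib import Path


def index_path_for(cache_dir: Path, data_dir: Path) -> Path:
    """Derive an index filename that is unique per resolved data_dir."""
    resolved = str(data_dir.resolve())
    # Short, stable digest of the absolute data directory path.
    digest = hashlib.sha256(resolved.encode("utf-8")).hexdigest()[:12]
    return cache_dir / f"_dataset_index_{digest}.json"
```

With this scheme, an index built for one data directory is never picked up for another one, even if both share a cache directory. The trade-off is that moving the dataset on disk changes the key and forces a rebuild.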
Files
src/brain_brr/data/datasets.py:74-86 - Index loading logic
Priority
High - Silent incorrect behavior, can cause training on wrong data
Related
- --split parameter in build-cache doesn't scope to correct subdirectory #5 (--split parameter confusion)
- build-cache CLI doesn't actually build NPY cache files #4 (build-cache doesn't build NPY)
Labels
bug, data-pipeline