Stale _dataset_index.json takes precedence over fresh index for split #6

@The-Obstacle-Is-The-Way

Description

Summary

When a cached _dataset_index.json exists in a parent cache directory, it takes precedence over the correct split-specific index. This causes the wrong files to be processed.

Current Behavior

# User has:
cache/tusz_mmap/_dataset_index.json  # Old global index with 7364 files
cache/tusz_mmap/eval/                 # Empty, wants to build eval index

# Running build-cache for eval...
python -m src build-cache \
  --data-dir data_ext4/tusz/edf/eval \
  --cache-dir cache/tusz_mmap/eval

# PROBLEM: Code finds parent's index and uses it instead of building fresh!

Root Cause

datasets.py:75-86 checks for _dataset_index.json but never verifies that the index was built from the current data_dir, so it can load a stale one from a different scope:

index_cache_path = self.cache_dir / "_dataset_index.json"
if index_cache_path.exists():
    # Loads cached index without verifying it matches current data_dir
    with index_cache_path.open() as f:
        cached_index = json.load(f)
    # Compares bare filenames only, so the containing directories are ignored
    cached_files = [Path(p).name for p in cached_index["files"]]
    current_files = [p.name for p in self.edf_files]
    if cached_files == current_files:  # May match if filenames happen to overlap!
        self._index_map = cached_index["index_map"]

The issue is compounded by #5: if --split doesn't scope correctly and a global index exists, the wrong index is loaded.
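
To illustrate why the filename-only comparison is unsafe, here is a minimal standalone sketch. The paths and filenames are hypothetical and chosen only to show the collision; they are not taken from the actual dataset layout.

```python
from pathlib import Path

# Hypothetical file lists: two different directories that happen to
# contain identically named recordings.
train_files = [Path("/data/tusz/edf/train/s001.edf"), Path("/data/tusz/edf/train/s002.edf")]
eval_files = [Path("/data/tusz/edf/eval/s001.edf"), Path("/data/tusz/edf/eval/s002.edf")]


def names(paths):
    """Mirror the current check: keep only the final path component."""
    return [p.name for p in paths]


# The name-based comparison treats the two file sets as identical...
same_by_name = names(train_files) == names(eval_files)
# ...while comparing full paths tells them apart.
same_by_path = [str(p) for p in train_files] == [str(p) for p in eval_files]

print(same_by_name, same_by_path)  # True False
```

With a check like the first one, an index built for one directory would be accepted for the other.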

Expected Behavior

  • The index should be tied to the specific data directory, not just the cache directory
  • Loading should verify absolute paths, not just filenames
  • The index should be rebuilt if its data_dir doesn't match the current one
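
The expectations above could be checked with something like the following sketch. The function name `cached_index_is_valid` and the `"data_dir"` field are assumptions for illustration; the existing index format is only known to contain `"files"` and `"index_map"`.

```python
from pathlib import Path
from typing import Sequence


def cached_index_is_valid(cached_index: dict, data_dir: Path, edf_files: Sequence[Path]) -> bool:
    """Return True only if the index was built for this data_dir and file set."""
    # "data_dir" is a hypothetical field. An index written without it
    # (for example by an older version) is treated as stale.
    stored_dir = cached_index.get("data_dir")
    if stored_dir is None or Path(stored_dir).resolve() != data_dir.resolve():
        return False

    # Compare resolved absolute paths instead of bare filenames.
    cached = [str(Path(p).resolve()) for p in cached_index.get("files", [])]
    current = [str(p.resolve()) for p in edf_files]
    return cached == current
```

If the function returns False, the caller would ignore the cached index and rebuild it from the current data_dir.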

Workaround

Manually delete the stale index before building:

rm cache/tusz_mmap/_dataset_index.json
python -m src build-cache ...

Proposed Fix

  1. Include data_dir hash in index cache key, OR
  2. Store data_dir in index and verify on load, OR
  3. Use split-scoped index paths (e.g., _dataset_index_{split}.json)
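
As a rough sketch of options 1 and 3, the index filename could be derived from the data directory or from the split name. Both helper names and the 12-character digest length are assumptions, not existing code in the repository.

```python
import hashlib
from pathlib import Path


def hashed_index_path(cache_dir: Path, data_dir: Path) -> Path:
    """Option 1: key the index file on a hash of the resolved data_dir."""
    digest = hashlib.sha256(str(data_dir.resolve()).encode("utf-8")).hexdigest()[:12]
    return cache_dir / f"_dataset_index_{digest}.json"


def split_index_path(cache_dir: Path, split: str) -> Path:
    """Option 3: key the index file on the split name."""
    return cache_dir / f"_dataset_index_{split}.json"
```

The hashed variant works for any data_dir, including ones that don't correspond to a named split. The split-scoped variant is easier to read on disk but relies on the split being passed through correctly, which ties it to #5.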

Files

  • src/brain_brr/data/datasets.py:74-86 - Index loading logic

Priority

High: silent incorrect behavior that can cause training on the wrong data

Related

Labels

bug, data-pipeline
