feat: add LazyCategoricalDtype for lazy categorical columns#2288

Open
katosh wants to merge 21 commits into scverse:main from settylab:feat/lazy-categorical-dtype

Conversation

@katosh katosh commented Jan 8, 2026

feat: add LazyCategoricalDtype for lazy categorical columns

Summary

Add LazyCategoricalDtype extending pd.CategoricalDtype with lazy loading support for categorical columns in lazy AnnData. This enables efficient access to categorical metadata without loading all categories into memory.

import anndata as ad

lazy_adata = ad.experimental.read_lazy("large_dataset.h5ad")
dtype = lazy_adata.obs["cell_type"].dtype  # LazyCategoricalDtype

# Cheap metadata access (no I/O)
dtype.n_categories     # 100000
dtype.ordered          # False

# Partial reads (efficient)
dtype.head_categories()     # first 5 categories
dtype.head_categories(10)   # first 10 categories
dtype.tail_categories()     # last 5 categories
dtype.tail_categories(10)   # last 10 categories

# Full load (cached after first access)
dtype.categories       # pd.Index with all categories

Motivation

When working with lazy AnnData objects containing many categories (e.g., 100k+ cell IDs as categories), loading all categories just to display a preview or check metadata is inefficient. This is particularly important for:

  1. repr/HTML display - showing category info without triggering full loads
  2. Data exploration - quickly inspecting category names
  3. Memory efficiency - avoiding unnecessary allocations

API Design

Following Ilan's suggestion, the API uses familiar pandas naming conventions:

| Property/Method | Returns | Behavior |
| --- | --- | --- |
| .categories | pd.Index | Full load, cached (standard pandas) |
| .ordered | bool | Standard pandas |
| .n_categories | int | Cheap metadata access |
| .head_categories(n=5) | np.ndarray | First n categories (partial read) |
| .tail_categories(n=5) | np.ndarray | Last n categories (partial read) |

The head/tail naming follows pandas DataFrame.head()/DataFrame.tail() conventions.

Implementation Details

  • LazyCategoricalDtype extends pd.CategoricalDtype to maintain compatibility
  • Categories are loaded lazily on first .categories access and cached
  • head_categories/tail_categories use read_elem_partial for efficient partial reads
  • Works with both zarr and h5ad backends
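
The lazy-loading pattern in the bullets above can be sketched without anndata or pandas. This is a minimal illustration under stated assumptions, not the PR's implementation: `FakeStore` and `LazyCategories` are hypothetical stand-ins, and the read counter exists only to make the laziness visible.

```python
from functools import cached_property


class FakeStore:
    """Hypothetical stand-in for an on-disk category array."""

    def __init__(self, values):
        self._values = list(values)
        self.items_read = 0  # counts elements pulled from "disk"

    @property
    def shape(self):
        # Shape is metadata: reading it touches no elements.
        return (len(self._values),)

    def read(self, start, stop):
        chunk = self._values[start:stop]
        self.items_read += len(chunk)
        return chunk


class LazyCategories:
    """Sketch of a dtype-like object that defers loading its categories."""

    def __init__(self, store):
        self._store = store

    @property
    def n_categories(self):
        return self._store.shape[0]

    @cached_property
    def categories(self):
        # Full load; cached_property stores the result in self.__dict__.
        return self._store.read(0, self.n_categories)

    def head_categories(self, n=5):
        return self._store.read(0, min(n, self.n_categories))

    def tail_categories(self, n=5):
        total = self.n_categories
        return self._store.read(max(total - n, 0), total)


store = FakeStore(f"cat_{i}" for i in range(1000))
lazy = LazyCategories(store)
print(lazy.n_categories, store.items_read)        # 1000 0
print(lazy.head_categories(3), store.items_read)  # ['cat_0', 'cat_1', 'cat_2'] 3
print("categories" in lazy.__dict__)              # False
```

Because `cached_property` writes into the instance `__dict__`, checking `"categories" in lazy.__dict__` is enough to tell whether the full load has happened.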

Benchmark Results

Tested with 100k categories (median of 5 runs):

| Method | H5AD | Zarr |
| --- | --- | --- |
| n_categories | 0.05 ms | 0.11 ms |
| head_categories(10) | 0.19 ms | 8.82 ms |
| categories (full) | 30.32 ms | 19.19 ms |

Speedups vs full load:

| Method | H5AD | Zarr |
| --- | --- | --- |
| n_categories | 621x | 168x |
| head_categories(10) | 160x | 2.2x |

Note: zarr speedup for partial reads is limited because categories are currently written without explicit chunking.
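
Each speedup is the full-load time divided by the method's time. The snippet below recomputes the ratios from the rounded timings in the table; it gives roughly 606x and 160x for H5AD and 174x and 2.2x for Zarr. The small differences from the reported 621x and 168x presumably come from the reported ratios being computed on unrounded measurements, which is an assumption rather than something stated here.

```python
# Timings in milliseconds, copied from the benchmark table (rounded values).
timings_ms = {
    "H5AD": {"n_categories": 0.05, "head_categories(10)": 0.19, "categories (full)": 30.32},
    "Zarr": {"n_categories": 0.11, "head_categories(10)": 8.82, "categories (full)": 19.19},
}

for backend, t in timings_ms.items():
    full = t["categories (full)"]
    for method in ("n_categories", "head_categories(10)"):
        # Speedup = time of the full load / time of the cheaper access path.
        print(f"{backend:5s} {method:20s} {full / t[method]:.1f}x")
```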

codecov bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 97.05882% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.64%. Comparing base (4376302) to head (edb04fc).
⚠️ Report is 6 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/anndata/experimental/backed/_lazy_arrays.py | 97.05% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2288      +/-   ##
==========================================
- Coverage   86.74%   84.64%   -2.10%     
==========================================
  Files          46       46              
  Lines        7204     7289      +85     
==========================================
- Hits         6249     6170      -79     
- Misses        955     1119     +164     

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/anndata/experimental/backed/_lazy_arrays.py | 93.67% <97.05%> (+1.94%) ⬆️ |

... and 11 files with indirect coverage changes

katosh added 3 commits January 8, 2026 13:29
The merge code checks `dtype == "category"` which requires
LazyCategoricalDtype to handle string comparison in __eq__.
…-dtype

# Conflicts:
#	src/anndata/experimental/backed/_lazy_arrays.py
#	tests/lazy/test_read.py
@katosh katosh marked this pull request as ready for review January 8, 2026 15:08
@ilan-gold ilan-gold left a comment

Thanks! Looking good!


arr = self._get_categories_array()
total = self.n_categories
return read_elem_partial(arr, indices=slice(0, min(n, total)))

If arr is just a {H5,Zarr}Array, just use their raw slicing methods

Suggested change
return read_elem_partial(arr, indices=slice(0, min(n, total)))
return arr[0:min(n, total)]

@katosh katosh Jan 8, 2026

Just to check that this is what you want: Raw slicing can end up returning encoded byte strings while users might expect to receive the decoded strings.

import h5py
import tempfile
import numpy as np
from anndata._io.specs.registry import read_elem_partial

# Create HDF5 file with string data
with tempfile.NamedTemporaryFile(suffix='.h5') as f:
    with h5py.File(f.name, 'w') as h5:
        # Store strings (HDF5 stores as bytes internally)
        h5.create_dataset('categories', data=['Cat_000', 'Cat_001', 'Cat_002'])

    with h5py.File(f.name, 'r') as h5:
        arr = h5['categories']

        # DIRECT SLICING: Returns bytes
        direct_result = arr[:2]
        print(f"Direct slice: {direct_result}")
        # Output: [b'Cat_000' b'Cat_001']
        print(f"Type: {type(direct_result[0])}")
        # Output: <class 'bytes'>

        # read_elem_partial: Returns decoded strings
        partial_result = read_elem_partial(arr, indices=slice(0, 2))
        print(f"read_elem_partial: {partial_result}")
        # Output: ['Cat_000' 'Cat_001']
        print(f"Type: {type(partial_result[0])}")
        # Output: <class 'str'>

Justification: read_elem_partial handles:

  • HDF5 byte-to-string decoding
  • Various string encodings (vlen strings, fixed-length)
  • Nullable string arrays with masks
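
The difference comes down to whether the stored bytes are decoded before they are returned. Here is a plain-Python sketch with no h5py involved; `decode_categories` is a hypothetical helper, not an anndata function, and UTF-8 is assumed as the encoding.

```python
def decode_categories(raw, encoding="utf-8"):
    """Decode bytes entries to str and pass str entries through unchanged."""
    return [v.decode(encoding) if isinstance(v, bytes) else v for v in raw]


raw_slice = [b"Cat_000", b"Cat_001"]  # what a raw slice may look like
decoded = decode_categories(raw_slice)

print(raw_slice == ["Cat_000", "Cat_001"])  # False: bytes never compare equal to str
print(decoded == ["Cat_000", "Cat_001"])    # True
```

The first comparison shows why returning undecoded values would surprise users who look categories up by their string names.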

Comment on lines 195 to 206
if self.__categories is not None:
    return np.asarray(self.__categories[-n:])

if self._categories_array is None:
    return np.array([])

from anndata._io.specs.registry import read_elem_partial

arr = self._get_categories_array()
total = self.n_categories
start = max(total - n, 0)
return read_elem_partial(arr, indices=slice(start, total))

Duplicated with head_categories - please deduplicate

- Use @cached_property for categories (cleaner than manual caching)
- Simplify cache detection to "categories" in self.__dict__
- Remove _cached_n_categories double caching (use shape[0] directly)
- Rename _categories_array to _categories_elem (reflects group case)
- Extract _read_partial_categories helper to deduplicate head/tail
- Add ZarrGroup | H5Group to type annotation (code handles it)

katosh commented Jan 8, 2026

Thanks for the thorough review! I've implemented most of your suggestions:

Implemented:

  • @cached_property for categories - cleaner than manual caching
  • "categories" in self.__dict__ for cache detection
  • Removed _cached_n_categories double caching - now uses shape[0] directly
  • Renamed _categories_array → _categories_elem
  • Extracted _read_partial_categories helper to deduplicate head/tail logic
  • Added ZarrGroup | H5Group to type annotation (you were right - if the code handles it, types should reflect that)

Kept for now with justification:

  • read_elem_partial instead of direct slicing - required for HDF5 byte-to-string decoding (direct slicing returns b'Cat_000' instead of 'Cat_000')
  • None support in type annotation - kept for API completeness/defensive programming, though I confirmed it's never used in practice (even empty categoricals write an empty array, not None)
  • name property - essential for dtype == "category" comparison in merge.py (CI failed without it)
  • __hash__ method - required for sets/dicts (e.g., collecting unique dtypes, @lru_cache functions)
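
The `__hash__` point rests on a general Python rule: a class that defines `__eq__` without also defining `__hash__` gets `__hash__` set to `None`, even when its parent class is hashable. The sketch below shows the rule with hypothetical `Base`, `EqOnly`, and `EqAndHash` classes; it says nothing about how the PR computes its hash.

```python
class Base:
    def __init__(self, key):
        self.key = key

    def __eq__(self, other):
        return isinstance(other, Base) and self.key == other.key

    def __hash__(self):
        return hash(self.key)


class EqOnly(Base):
    # Overriding __eq__ alone implicitly sets __hash__ to None.
    def __eq__(self, other):
        return super().__eq__(other)


class EqAndHash(Base):
    def __eq__(self, other):
        return super().__eq__(other)

    # Restoring the parent's hash keeps instances usable in sets and dicts.
    __hash__ = Base.__hash__


print(EqOnly.__hash__ is None)                # True
print(len({EqAndHash("a"), EqAndHash("a")}))  # 1
```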

Comment on lines 214 to 217
@property
def name(self) -> str:
    """String identifier for this dtype."""
    return "category"

I think there is no code overriding the existing name on CategoricalDtype from which we inherit. I assume these lines work whether or not you have this property here or not because self.name should still be defined.

Comment on lines 424 to 437
from anndata.experimental.backed._lazy_arrays import LazyCategoricalDtype

categories = ["a", "b", "c"]
adata = AnnData(
    X=np.zeros((3, 2)),
    obs=pd.DataFrame({"cat": pd.Categorical(categories)}),
)

path = tmp_path / "test.zarr"
adata.write_zarr(path)

lazy = read_lazy(path)
dtype = lazy.obs["cat"].dtype
assert isinstance(dtype, LazyCategoricalDtype)

I don't think you need to go through AnnData for most of these tests: anndata.io.write_elem can handle writing a categorical, read_elem can return the in-memory one, and read_elem_lazy will give you a CategoricalArray (although I think one test that embeds this inside the AnnData object and then checks that read_lazy(path).to_memory() == in_memory_adata is enough, and is good to have). Then you could reuse the categorical fixture you create :)

katosh added 2 commits January 9, 2026 15:34
- Remove `name` property (inherited from CategoricalDtype)
- Remove `None` support from type annotations and guards
- Simplify `categories` property to use `read_elem` uniformly
- Unify `head_categories`/`tail_categories` into `_get_categories_slice` helper
- Keep `bool(ordered)` - required because HDF5 returns np.bool_
- Refactor tests to use `write_elem`/`read_elem_lazy` directly
- Update equality check for `None` categories comparison

katosh commented Jan 9, 2026

Thanks for the thorough second review! I've addressed most of your suggestions. Here's a summary:

Implemented

  • Removed name property - You were right, it's inherited from CategoricalDtype as a class attribute. I had mistakenly thought it might be reset like __hash__ when defining __eq__, but that's not the case.

  • Removed None support - Removed from type annotations and all associated guards. The __eq__ method now returns False when comparing to a CategoricalDtype with None categories.

  • Simplified categories property - Now just return pd.Index(read_elem(self._categories_elem)). You were right that read_elem handles both zarr and h5 uniformly.

  • Refactored tests to use write_elem/read_elem_lazy - Most unit tests now work at the element level. Added a _write_categorical_zarr() helper for creating test fixtures.

Implemented slightly differently

  • head_categories/tail_categories refactor - I am not entirely sure what you mean. I refactored both to use a single _get_categories_slice method with a from_end flag that switches between the two cases internally, while keeping the public API unchanged. Let me know if you'd prefer a different approach!

  • Integration test - I avoided the round trip through AnnData in most tests and only made a single test_lazy_categorical_roundtrip_via_anndata integration test, which covers the full workflow including read_lazy(path).to_memory() == original_adata. It also verifies dtype caching and ordered categoricals through the AnnData path.
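
The head/tail unification from the first bullet can be sketched as a single function that computes the slice bounds. `categories_slice` is a hypothetical name and signature chosen for illustration; only the `from_end` idea comes from the comment above.

```python
def categories_slice(total, n, *, from_end=False):
    """Return the slice selecting the first or last n of total items."""
    n = max(0, min(n, total))  # clamp so oversized or negative n is safe
    if from_end:
        # Explicit bounds avoid the seq[-0:] pitfall, which selects everything.
        return slice(total - n, total)
    return slice(0, n)


cats = [f"cat_{i}" for i in range(10)]
print(cats[categories_slice(len(cats), 3)])                 # ['cat_0', 'cat_1', 'cat_2']
print(cats[categories_slice(len(cats), 3, from_end=True)])  # ['cat_7', 'cat_8', 'cat_9']
print(cats[categories_slice(len(cats), 0, from_end=True)])  # []
```

Computing the bounds from the known total keeps both cases as plain forward slices, which is the form a partial read from disk needs.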

Not yet implemented

  • bool(ordered) removal - I kept bool(ordered) because HDF5 returns np.bool_ instead of Python bool:

    >>> with h5py.File('test.h5', 'r') as f:
    ...     ordered = f['cat'].attrs['ordered']
    ...     print(type(ordered))
    <class 'numpy.bool'>

    While np.bool_ works in most contexts, normalizing to Python bool ensures consistent behavior for hashing and serialization. That said, if you'd prefer to remove it and handle np.bool_ downstream or trust it works fine, I'm happy to change it!

Let me know if you'd like any adjustments to the approach.

Comment on lines 200 to 207
if not isinstance(other, pd.CategoricalDtype):
    return False
# Compare with regular CategoricalDtype - need to load categories
if self.ordered != other.ordered:
    return False
if other.categories is None:
    return False  # LazyCategoricalDtype always has categories
return self.categories.equals(other.categories)

Considering how much more extensive the base implementation is, I think we should just get our specialized checks out of the way fast and then fall back to that: https://github.com/pandas-dev/pandas/blob/v2.3.3/pandas/core/dtypes/dtypes.py#L401

Comment on lines 177 to 181
def __repr__(self) -> str:
    if "categories" in self.__dict__:
        # Fully loaded - use standard repr
        return f"CategoricalDtype(categories={self.categories!r}, ordered={self.ordered})"
    return f"LazyCategoricalDtype(n_categories={self.n_categories}, ordered={self.ordered})"

Maybe the repr should always show categories, but just the first n for some nice-seeming n?

Comment on lines 405 to 406
cat_group = _write_categorical_zarr(tmp_path, cat)
lazy_cat = read_elem_lazy(cat_group)

This is closer to what I want, but I think the paradigm should be "write once, read many", i.e., write the fixture once (scope="session") and then have a fixture that does read_elem_lazy every time. You probably need one or two different fixtures (maybe for ordered and not, and one or two other things), but I don't think every test (as appears here) needs its own special underlying categories array written to disk. You even have a few "fixtures" (the pd.Categorical at the beginning of each test, which I would like to become a proper pytest.fixture) that are completely identical.

@ilan-gold

Every review, fewer comments, getting there, thanks :)

…ixtures

- Simplify __eq__ to defer to pandas base implementation after fast paths:
  1. Same Python object (identity check)
  2. Same on-disk location (avoids loading categories when comparing
     dtypes from the same file opened multiple times)
- Update __repr__ to always show categories (truncated for large n):
  small: LazyCategoricalDtype(categories=['a', 'b', 'c'])
  large: LazyCategoricalDtype(categories=['a', ..., 'z'], n=100)
- Extract _N_CATEGORIES_REPR_SHOW constant to module level
- Refactor tests to use session-scoped fixtures (write once, read many)
  instead of creating new categoricals in each test

katosh commented Jan 12, 2026

Thanks for the review! I am glad to get this polished. Addressed all three points:

1. __eq__ simplification: now defers to super().__eq__() for pandas edge cases. Added two fast paths to avoid loading categories:

  • Same Python object (is check)
  • New: same on-disk location check. I discovered that zarr/h5py arrays already compare equal by location (not content), so we just use arr1 == arr2 directly. This avoids loading categories when comparing dtypes from the same file opened multiple times.
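
The "fast paths first, then defer to the base class" structure can be sketched as follows. `BaseDtype`, `EagerDtype`, and `LazyDtype` are hypothetical stand-ins for the pandas and anndata classes, and the `location` tuple is an assumed substitute for the array-handle comparison described above; the `loads` counter only makes category loading observable.

```python
class BaseDtype:
    """Hypothetical stand-in for the base class with the full equality logic."""

    def __eq__(self, other):
        if not isinstance(other, BaseDtype):
            return NotImplemented
        return self.categories == other.categories  # touches the contents


class EagerDtype(BaseDtype):
    def __init__(self, categories):
        self.categories = list(categories)


class LazyDtype(BaseDtype):
    def __init__(self, location, loader):
        self.location = location  # assumed: identifies the on-disk array
        self._loader = loader
        self.loads = 0

    @property
    def categories(self):
        self.loads += 1  # every access simulates a read from disk
        return self._loader()

    def __eq__(self, other):
        if self is other:  # fast path 1: same Python object
            return True
        if isinstance(other, LazyDtype) and self.location == other.location:
            return True  # fast path 2: same on-disk location
        return super().__eq__(other)  # everything else: base implementation


a = LazyDtype(("data.h5ad", "obs/cell_type"), lambda: ["x", "y"])
b = LazyDtype(("data.h5ad", "obs/cell_type"), lambda: ["x", "y"])
c = EagerDtype(["x", "y"])
print(a == b, a.loads, b.loads)  # True 0 0
print(a == c, a.loads)           # True 1
```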

2. __repr__ always shows categories, truncated for large counts:

LazyCategoricalDtype(categories=['a', 'b', 'c'])
LazyCategoricalDtype(categories=['cat_0', 'cat_1', 'cat_2', '...', 'cat_97', 'cat_98', 'cat_99'], n=100)

Moved constant to module level as _N_CATEGORIES_REPR_SHOW.
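
A truncated repr of this shape can be sketched as below. `preview_repr` and `N_SHOW` are hypothetical names, the cutoff of `2 * n_show` is an assumption, and the `ordered` flag is left out for brevity; a lazy implementation would fetch the head and tail through partial reads instead of slicing an in-memory list.

```python
N_SHOW = 3  # categories shown at each end (the value quoted in this thread so far)


def preview_repr(categories, n_show=N_SHOW):
    """Build a repr that shows all categories, or only the head and tail."""
    total = len(categories)
    if total <= 2 * n_show:
        return f"LazyCategoricalDtype(categories={list(categories)!r})"
    # List unpacking joins head, an ellipsis marker, and tail in one literal.
    shown = [*categories[:n_show], "...", *categories[-n_show:]]
    return f"LazyCategoricalDtype(categories={shown!r}, n={total})"


print(preview_repr(["a", "b", "c"]))
# LazyCategoricalDtype(categories=['a', 'b', 'c'])
print(preview_repr([f"cat_{i}" for i in range(100)]))
# LazyCategoricalDtype(categories=['cat_0', 'cat_1', 'cat_2', '...', 'cat_97', 'cat_98', 'cat_99'], n=100)
```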

3. Test fixtures: refactored to a session-scoped "write once, read many" pattern with 5 reusable fixtures.


Edit: Additional testing improvements after further review:

Verified arr1 == arr2 location-based comparison behavior:

  • Investigated h5py and zarr source code to confirm equality is location-based, not content-based
  • h5py: compares HDF5 object IDs via self.id == other.id (source)
  • zarr 3.x: uses dataclass-generated __eq__ comparing StorePath (URL string comparison)
  • Both return True for same location (even from different open() calls), False for different files with same content

Parametrized all LazyCategoricalDtype tests for both backends:

  • Refactored fixtures with helper functions for writing categorical data to zarr/h5ad
  • Created session-scoped path fixtures for each category type and backend
  • Created parametrized store fixtures that automatically test both zarr and h5ad

…ality

- Fix RUF005: use list unpacking [*head, "...", *tail]
- Remove _same_disk_location helper - zarr/h5py arrays already compare
  equal by on-disk location, not content
Verify that comparing two dtypes from the same file (opened twice)
uses the fast path and doesn't load categories.
Replace the previous same-location equality test with a more rigorous
parametrized test that covers both zarr and h5py backends.

The new test uses `unittest.mock.patch.object` to patch `__getitem__`
on the underlying category arrays to raise `AssertionError` if called.
This proves that both backends use location-based equality comparison
that doesn't read array contents:

- h5py: compares HDF5 object IDs (file number + object number)
- zarr 3.x: compares StorePath (URL string comparison via dataclass)

The previous test only verified our `LazyCategoricalDtype.categories`
cache wasn't populated, which doesn't prove the storage layer didn't
load data internally.
Refactor categorical test fixtures to support both backends:
- Add helper functions for writing categorical data to zarr/h5ad
- Create path fixtures for each category type and backend (session-scoped)
- Create parametrized store fixtures that test both zarr and h5ad

All LazyCategoricalDtype tests now run for both backends, increasing
test coverage from 12 to 24 tests:
- test_lazy_categorical_dtype_n_categories[zarr/h5ad]
- test_lazy_categorical_dtype_head_tail_categories[zarr/h5ad]
- test_lazy_categorical_dtype_categories_caching[zarr/h5ad]
- test_lazy_categorical_dtype_ordered[zarr/h5ad]
- test_lazy_categorical_dtype_repr[zarr-zarr/zarr-h5ad/h5ad-zarr/h5ad-h5ad]
- test_lazy_categorical_dtype_equality[zarr/h5ad]
- test_lazy_categorical_dtype_equality_no_load[zarr/h5ad]
- test_lazy_categorical_dtype_hash[zarr/h5ad]
- test_lazy_categorical_dtype_n_categories_from_cache[zarr/h5ad]
- test_lazy_categorical_dtype_name[zarr/h5ad]
- test_lazy_categorical_dtype_inequality_with_none_categories[zarr/h5ad]
…tion

Consolidate redundant tests and add proper verification for lazy behavior:

1. Merged n_categories tests:
   - test_lazy_categorical_dtype_n_categories now verifies:
     - Metadata-only access (categories not loaded)
     - Cache behavior after categories are loaded
   - Removed redundant test_lazy_categorical_dtype_n_categories_from_cache

2. Improved head_tail_categories test:
   - Added verification that partial reads don't load all categories
   - Each head/tail call now checks "categories" not in __dict__

3. Consolidated equality test:
   - Merged test_lazy_categorical_dtype_name (trivial 1-assertion test)
   - Merged test_lazy_categorical_dtype_inequality_with_none_categories
   - Now tests name property and None-categories edge case

Test count reduced from 24 to 18 while improving coverage quality:
- Tests now verify lazy behavior claims, not just return values
- Removed redundant test code without losing coverage

katosh commented Jan 20, 2026

@ilan-gold if you like, I could also address #2296 in this PR by setting a default chunk size of 10,000 for category arrays at

_writer.write_elem(
    g,
    "categories",
    v.categories.to_numpy(),
    dataset_kwargs=dataset_kwargs,
)

by implementing

categories = v.categories.to_numpy()
cat_kwargs = dataset_kwargs
if len(categories) > 10_000 and "chunks" not in dataset_kwargs:
    cat_kwargs = dict(dataset_kwargs, chunks=(10_000,))
_writer.write_elem(g, "categories", categories, dataset_kwargs=cat_kwargs)

This would increase the benefit of this PR for zarr stores.

@ilan-gold ilan-gold left a comment

I will need to look at the tests in a bit, but in general they are still a little too repetitive. What is the difference between small, medium, large, and 50? Why not parametrize by ordered and n_obs?

)

# Number of categories to show at head/tail in LazyCategoricalDtype repr
_N_CATEGORIES_REPR_SHOW = 3

I don't even think pandas is this aggressive - It seems they use 10 so let's go with that

@katosh (Contributor Author)

Done. Note that this will add up to a total of 20 previewed categories.

katosh and others added 3 commits January 23, 2026 11:09
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Reduce repetition in categorical test fixtures by using a config-driven
factory pattern instead of separate fixture groups for each category size.

Changes:
- Replace 15 individual fixtures with 3 generated fixtures + 1 data fixture
- Consolidate n50 and n100 into single n100 config (serves both use cases)
- Use `_make_cat_fixture()` factory for zarr/h5ad parametrization
- Update tests to use new fixture names (cat_n3_store, cat_n100_store)

Addresses review feedback about fixture repetitiveness.

katosh commented Jan 23, 2026

Hi @ilan-gold,

Thanks for the continued review feedback! I've addressed your comments about the test fixtures being too repetitive.

Test fixture consolidation (ac1cab52)

Refactored the categorical fixtures from 15 individual fixtures to a config-driven factory pattern:

_CAT_CONFIGS = [
    ("n3", 3, False, ["a", "b", "c"]),      # basic tests, equality, hashing
    ("n100", 100, False, None),              # truncation, n_categories, head/tail
    ("ordered", 3, True, ["low", "medium", "high"]),
]

This follows the "write once, read many" pattern you suggested - data is written once per session via cat_data_paths, then _make_cat_fixture() generates store fixtures that open fresh handles for each test.

I also consolidated n50 and n100 into just n100 since it serves both the head/tail testing and truncation testing use cases.

Improved equality_no_load test (edb04fc2)

Switched from patching __getitem__ to patching read_elem:

  • __getitem__ on zarr/h5py arrays can't be reliably patched (C-level methods)
  • read_elem is the actual function called to load categories

Also added a positive control within the same test that verifies comparison with pd.CategoricalDtype does trigger read_elem, proving the patch approach works.
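
The "patch the loader and add a positive control" idea can be sketched with the standard library alone. `Loader.read_all` and `same_dtype` are hypothetical stand-ins for `read_elem` and the dtype comparison; the string locations are made up for illustration.

```python
from unittest import mock


class Loader:
    """Hypothetical stand-in for the function that loads categories."""

    @staticmethod
    def read_all(location):
        return ["a", "b", "c"]


def same_dtype(loc_a, loc_b):
    # Fast path: identical location, so no contents are read.
    if loc_a == loc_b:
        return True
    return Loader.read_all(loc_a) == Loader.read_all(loc_b)


# wraps= keeps the real behaviour while recording every call.
with mock.patch.object(Loader, "read_all", wraps=Loader.read_all) as spy:
    assert same_dtype("f.h5ad/obs/cat", "f.h5ad/obs/cat")
    spy.assert_not_called()  # fast path did not load anything

    assert same_dtype("f.h5ad/obs/cat", "g.h5ad/obs/cat")
    assert spy.call_count == 2  # positive control: the spy does see loads
```

Without the positive control, a patch that silently failed to intercept the loader would make the "not called" assertion pass for the wrong reason.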

Note on force pushes

I made a few force pushes while iterating on the test improvements - apologies for the noise. The history should be clean now with just the two commits above on top of the previous work.

Let me know if there's anything else you'd like me to address!

@katosh katosh force-pushed the feat/lazy-categorical-dtype branch 2 times, most recently from d5ee71f to 2fddb8c Compare January 23, 2026 21:33
- Switch from patching __getitem__ to patching read_elem (more reliable)
- Add positive control: comparison with pd.CategoricalDtype triggers read_elem
- This proves both that the optimization works AND that the patch detects loads
@katosh katosh force-pushed the feat/lazy-categorical-dtype branch from 2fddb8c to edb04fc Compare January 23, 2026 21:34
Development

Successfully merging this pull request may close these issues.

feat: Efficient category count and partial loading for lazy AnnData

2 participants