feat: add LazyCategoricalDtype for lazy categorical columns #2288
katosh wants to merge 21 commits into scverse:main from

Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main    #2288    +/-  ##
========================================
- Coverage   86.74%   84.64%   -2.10%
========================================
  Files          46       46
  Lines        7204     7289      +85
========================================
- Hits         6249     6170      -79
- Misses        955     1119     +164
```
The merge code checks `dtype == "category"`, which requires `LazyCategoricalDtype` to handle string comparison in `__eq__`.
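For reference, pandas' stock `CategoricalDtype` already answers `dtype == "category"` via its `name` attribute, so any `__eq__` override has to preserve that path; a minimal check:

```python
import pandas as pd

# pandas compares a CategoricalDtype to a string via its `name` ("category"),
# so an overriding __eq__ must keep this behavior for merge code to work.
dt = pd.CategoricalDtype(categories=["a", "b"])
assert dt == "category"
assert dt != "int64"
```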
…-dtype

# Conflicts:
#	src/anndata/experimental/backed/_lazy_arrays.py
#	tests/lazy/test_read.py
```python
arr = self._get_categories_array()
total = self.n_categories
return read_elem_partial(arr, indices=slice(0, min(n, total)))
```
If arr is just a {H5,Zarr}Array, just use their raw slicing methods
```diff
- return read_elem_partial(arr, indices=slice(0, min(n, total)))
+ return arr[0:min(n, total)]
```
Just to check that this is what you want: raw slicing can end up returning encoded byte strings, while users might expect to receive the decoded strings.
```python
import tempfile

import h5py
import numpy as np

from anndata._io.specs.registry import read_elem_partial

# Create HDF5 file with string data
with tempfile.NamedTemporaryFile(suffix=".h5") as f:
    with h5py.File(f.name, "w") as h5:
        # Store strings (HDF5 stores them as bytes internally)
        h5.create_dataset("categories", data=["Cat_000", "Cat_001", "Cat_002"])

    with h5py.File(f.name, "r") as h5:
        arr = h5["categories"]

        # DIRECT SLICING: returns bytes
        direct_result = arr[:2]
        print(f"Direct slice: {direct_result}")   # [b'Cat_000' b'Cat_001']
        print(f"Type: {type(direct_result[0])}")  # <class 'bytes'>

        # read_elem_partial: returns decoded strings
        partial_result = read_elem_partial(arr, indices=slice(0, 2))
        print(f"read_elem_partial: {partial_result}")  # ['Cat_000' 'Cat_001']
        print(f"Type: {type(partial_result[0])}")      # <class 'str'>
```

Justification: `read_elem_partial` handles:

- HDF5 byte-to-string decoding
- various string encodings (vlen strings, fixed-length)
- nullable string arrays with masks
```python
if self.__categories is not None:
    return np.asarray(self.__categories[-n:])
```

```python
if self._categories_array is None:
    return np.array([])
```

```python
from anndata._io.specs.registry import read_elem_partial
```

```python
arr = self._get_categories_array()
total = self.n_categories
start = max(total - n, 0)
return read_elem_partial(arr, indices=slice(start, total))
```
Duplicated with head_categories - please deduplicate
- Use `@cached_property` for categories (cleaner than manual caching)
- Simplify cache detection to `"categories" in self.__dict__`
- Remove `_cached_n_categories` double caching (use `shape[0]` directly)
- Rename `_categories_array` to `_categories_elem` (reflects group case)
- Extract `_read_partial_categories` helper to deduplicate head/tail
- Add `ZarrGroup | H5Group` to type annotation (code handles it)
Thanks for the thorough review! I've implemented most of your suggestions: Implemented:
Kept for now with justification:
```python
@property
def name(self) -> str:
    """String identifier for this dtype."""
    return "category"
```
I think there is no code overriding the existing `name` on `CategoricalDtype`, from which we inherit. I assume these lines work whether or not you have this property here, because `self.name` should still be defined.
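A quick check of that claim (pandas only; `MyLazyDtype` is a throwaway name, not the PR's class):

```python
import pandas as pd

class MyLazyDtype(pd.CategoricalDtype):
    pass  # no `name` override

# `name` is inherited from CategoricalDtype, so the property is redundant
assert MyLazyDtype(categories=["x"]).name == "category"
```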
```python
from anndata.experimental.backed._lazy_arrays import LazyCategoricalDtype

categories = ["a", "b", "c"]
adata = AnnData(
    X=np.zeros((3, 2)),
    obs=pd.DataFrame({"cat": pd.Categorical(categories)}),
)

path = tmp_path / "test.zarr"
adata.write_zarr(path)

lazy = read_lazy(path)
dtype = lazy.obs["cat"].dtype
assert isinstance(dtype, LazyCategoricalDtype)
```
I don't think you need to go through `AnnData` for most of these tests: `anndata.io.write_elem` can handle writing a categorical, `read_elem` can return the in-memory one, and `read_elem_lazy` will give you a `CategoricalArray`. (I think one test that embeds this inside an `AnnData` object and then checks that `read_lazy(path).to_memory() == in_memory_adata` is enough.) Then you could reuse the categorical fixture you create :)
- Remove `name` property (inherited from CategoricalDtype)
- Remove `None` support from type annotations and guards
- Simplify `categories` property to use `read_elem` uniformly
- Unify `head_categories`/`tail_categories` into `_get_categories_slice` helper
- Keep `bool(ordered)` - required because HDF5 returns np.bool_
- Refactor tests to use `write_elem`/`read_elem_lazy` directly
- Update equality check for `None` categories comparison
Thanks for the thorough second review! I've addressed most of your suggestions. Here's a summary: Implemented
Implemented slightly differently
Not yet implemented
Let me know if you'd like any adjustments to the approach.
```python
if not isinstance(other, pd.CategoricalDtype):
    return False
# Compare with regular CategoricalDtype - need to load categories
if self.ordered != other.ordered:
    return False
if other.categories is None:
    return False  # LazyCategoricalDtype always has categories
return self.categories.equals(other.categories)
```
Considering how much more extensive the base implementation is, I think we should just get our specialized checks out of the way fast and then fall back to it: https://github.com/pandas-dev/pandas/blob/v2.3.3/pandas/core/dtypes/dtypes.py#L401
```python
def __repr__(self) -> str:
    if "categories" in self.__dict__:
        # Fully loaded - use standard repr
        return f"CategoricalDtype(categories={self.categories!r}, ordered={self.ordered})"
    return f"LazyCategoricalDtype(n_categories={self.n_categories}, ordered={self.ordered})"
```
Maybe the repr should always show categories, but just the first n for some nice-seeming n?
tests/lazy/test_read.py (outdated)
```python
cat_group = _write_categorical_zarr(tmp_path, cat)
lazy_cat = read_elem_lazy(cat_group)
```
This is closer to what I want, but I think the paradigm should be "write once, read many": write the fixture once (`scope="session"`) and then have a fixture that does `read_elem_lazy` every time. You probably need one or two different fixtures (maybe for ordered and not, and one or two other things), but I don't think every test needs its own special underlying categories array written to disc, as appears here. You even have a few "fixtures" (the `pd.Categorical` at the beginning of each test, which I would like to become a proper `pytest.fixture`) that are completely identical.
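A sketch of that fixture layout; the fixture names and the commented-out anndata calls are illustrative assumptions, not the PR's actual test code:

```python
import pandas as pd
import pytest

@pytest.fixture(scope="session")
def cat_zarr_path(tmp_path_factory):
    """Write the categorical to disk exactly once per test session."""
    cat = pd.Categorical(["a", "b", "c"])  # shared underlying data
    path = tmp_path_factory.mktemp("data") / "cat.zarr"
    # write_elem(zarr.open(str(path), mode="w"), "cat", cat)  # assumed write step
    return path

@pytest.fixture
def lazy_cat(cat_zarr_path):
    """Cheap per-test read of the shared on-disk categorical."""
    # return read_elem_lazy(zarr.open(str(cat_zarr_path)))  # assumed read step
    return cat_zarr_path
```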
Every review, fewer comments, getting there, thanks :)
…ixtures
- Simplify __eq__ to defer to pandas base implementation after fast paths:
1. Same Python object (identity check)
2. Same on-disk location (avoids loading categories when comparing
dtypes from the same file opened multiple times)
- Update __repr__ to always show categories (truncated for large n):
small: LazyCategoricalDtype(categories=['a', 'b', 'c'])
large: LazyCategoricalDtype(categories=['a', ..., 'z'], n=100)
- Extract _N_CATEGORIES_REPR_SHOW constant to module level
- Refactor tests to use session-scoped fixtures (write once, read many)
instead of creating new categoricals in each test
Thanks for the review! I am glad to get this polished. Addressed all three points:

1.
2. Moved constant to module level as
3. Test fixtures: Refactored to session-scoped "write once, read many" pattern with 5 reusable fixtures.

Edit: Additional testing improvements after further review: Verified

Parametrized all LazyCategoricalDtype tests for both backends:
…ality

- Fix RUF005: use list unpacking `[*head, "...", *tail]`
- Remove `_same_disk_location` helper - zarr/h5py arrays already compare equal by on-disk location, not content
Verify that comparing two dtypes from the same file (opened twice) uses the fast path and doesn't load categories.
Replace the previous same-location equality test with a more rigorous parametrized test that covers both zarr and h5py backends. The new test uses `unittest.mock.patch.object` to patch `__getitem__` on the underlying category arrays to raise `AssertionError` if called. This proves that both backends use location-based equality comparison that doesn't read array contents: - h5py: compares HDF5 object IDs (file number + object number) - zarr 3.x: compares StorePath (URL string comparison via dataclass) The previous test only verified our `LazyCategoricalDtype.categories` cache wasn't populated, which doesn't prove the storage layer didn't load data internally.
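The mechanism can be illustrated stand-alone with a toy backed array; everything below is a stand-in, and only the `mock.patch.object` pattern mirrors the test described above:

```python
from unittest import mock

import numpy as np

class FakeStoredArray:
    """Stand-in for a backed array; __getitem__ plays the role of a disk read."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __getitem__(self, key):
        return self._data[key]

def location_based_eq(a, b):
    # Sketch: compare by identity/location only, never by contents.
    return a is b

arr = FakeStoredArray(["a", "b", "c"])

# Patch __getitem__ to fail loudly if any comparison reads array contents.
with mock.patch.object(FakeStoredArray, "__getitem__",
                       side_effect=AssertionError("categories were read!")):
    assert location_based_eq(arr, arr)  # fast path: no read happens

# Positive control: the patch really does detect reads.
with mock.patch.object(FakeStoredArray, "__getitem__",
                       side_effect=AssertionError("categories were read!")):
    try:
        arr[0]
        read_detected = False
    except AssertionError:
        read_detected = True
assert read_detected
```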
Refactor categorical test fixtures to support both backends: - Add helper functions for writing categorical data to zarr/h5ad - Create path fixtures for each category type and backend (session-scoped) - Create parametrized store fixtures that test both zarr and h5ad All LazyCategoricalDtype tests now run for both backends, increasing test coverage from 12 to 24 tests: - test_lazy_categorical_dtype_n_categories[zarr/h5ad] - test_lazy_categorical_dtype_head_tail_categories[zarr/h5ad] - test_lazy_categorical_dtype_categories_caching[zarr/h5ad] - test_lazy_categorical_dtype_ordered[zarr/h5ad] - test_lazy_categorical_dtype_repr[zarr-zarr/zarr-h5ad/h5ad-zarr/h5ad-h5ad] - test_lazy_categorical_dtype_equality[zarr/h5ad] - test_lazy_categorical_dtype_equality_no_load[zarr/h5ad] - test_lazy_categorical_dtype_hash[zarr/h5ad] - test_lazy_categorical_dtype_n_categories_from_cache[zarr/h5ad] - test_lazy_categorical_dtype_name[zarr/h5ad] - test_lazy_categorical_dtype_inequality_with_none_categories[zarr/h5ad]
…tion
Consolidate redundant tests and add proper verification for lazy behavior:
1. Merged n_categories tests:
- test_lazy_categorical_dtype_n_categories now verifies:
- Metadata-only access (categories not loaded)
- Cache behavior after categories are loaded
- Removed redundant test_lazy_categorical_dtype_n_categories_from_cache
2. Improved head_tail_categories test:
- Added verification that partial reads don't load all categories
- Each head/tail call now checks "categories" not in __dict__
3. Consolidated equality test:
- Merged test_lazy_categorical_dtype_name (trivial 1-assertion test)
- Merged test_lazy_categorical_dtype_inequality_with_none_categories
- Now tests name property and None-categories edge case
Test count reduced from 24 to 18 while improving coverage quality:
- Tests now verify lazy behavior claims, not just return values
- Removed redundant test code without losing coverage
@ilan-gold if you like, I could also address #2296 in this PR by setting a default chunk size of 10,000 for category arrays at anndata/src/anndata/_io/specs/methods.py (lines 1107 to 1112 in c6f6f54) by implementing:

```python
categories = v.categories.to_numpy()
cat_kwargs = dataset_kwargs
if len(categories) > 10_000 and "chunks" not in dataset_kwargs:
    cat_kwargs = dict(dataset_kwargs, chunks=(10_000,))
_writer.write_elem(g, "categories", categories, dataset_kwargs=cat_kwargs)
```

This would increase the benefit of this PR for zarr stores.
ilan-gold left a comment:
I will need to look at the tests in a bit, but in general they are still a little too repetitive. What is the difference between small, medium, large, and 50? Why not parametrize by ordered and n_obs?
```python
# Number of categories to show at head/tail in LazyCategoricalDtype repr
_N_CATEGORIES_REPR_SHOW = 3
```
I don't even think pandas is this aggressive - it seems they use 10, so let's go with that.
Done. Note that this will add up to a total of 20 previewed categories.
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Reduce repetition in categorical test fixtures by using a config-driven factory pattern instead of separate fixture groups for each category size.

Changes:
- Replace 15 individual fixtures with 3 generated fixtures + 1 data fixture
- Consolidate n50 and n100 into single n100 config (serves both use cases)
- Use `_make_cat_fixture()` factory for zarr/h5ad parametrization
- Update tests to use new fixture names (cat_n3_store, cat_n100_store)

Addresses review feedback about fixture repetitiveness.
Hi @ilan-gold, Thanks for the continued review feedback! I've addressed your comments about the test fixtures being too repetitive. Test fixture consolidation (
Force-pushed from d5ee71f to 2fddb8c
- Switch from patching `__getitem__` to patching `read_elem` (more reliable)
- Add positive control: comparison with `pd.CategoricalDtype` triggers `read_elem`
- This proves both that the optimization works AND that the patch detects loads
Force-pushed from 2fddb8c to edb04fc
feat: add `LazyCategoricalDtype` for lazy categorical columns

Summary

Add `LazyCategoricalDtype`, extending `pd.CategoricalDtype` with lazy loading support for categorical columns in lazy AnnData. This enables efficient access to categorical metadata without loading all categories into memory.

Motivation
When working with lazy AnnData objects containing many categories (e.g., 100k+ cell IDs as categories), loading all categories just to display a preview or check metadata is inefficient. This is particularly important for:
API Design

Following Ilan's suggestion, the API uses familiar pandas naming conventions:

| Member | Returns |
| --- | --- |
| `.categories` | `pd.Index` |
| `.ordered` | `bool` |
| `.n_categories` | `int` |
| `.head_categories(n=5)` | `np.ndarray` |
| `.tail_categories(n=5)` | `np.ndarray` |

The `head`/`tail` naming follows the pandas `DataFrame.head()`/`DataFrame.tail()` conventions.

Implementation Details

- `LazyCategoricalDtype` extends `pd.CategoricalDtype` to maintain compatibility
- `.categories` access is lazy and cached
- `head_categories`/`tail_categories` use `read_elem_partial` for efficient partial reads

Benchmark Results
Tested with 100k categories (median of 5 runs); the table compared `n_categories`, `head_categories(10)`, and a full `categories` load, with speedups vs full load reported for `n_categories` and `head_categories(10)` (timing values elided).

Note: zarr speedup for partial reads is limited because categories are currently written without explicit chunking.