Consider removing ensure_min_samples and directly returning fully masked arrays #104

@aazuspan

Context

When applying ufuncs to chunked feature arrays with skip_nodata=True, the ufunc can easily end up being called with an empty array if a chunk contains only NoData. That's a problem for sklearn estimator methods, which validate their inputs with ensure_min_samples=1 by default and fail on empty inputs. To avoid that, we added a corresponding ensure_min_samples parameter that temporarily pads input arrays with dummy values up to ensure_min_samples before passing them to the ufunc. That works, but the logic is somewhat complex, requires a handful of validation checks, and is currently limited to reshaped 2D data.
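To make the failure mode concrete, here is a minimal NumPy-only sketch. Everything in it is an assumption for illustration: `predict_like` is a hypothetical stand-in for an sklearn estimator method that rejects empty input, `NODATA` is an arbitrary fill value, rows are treated as NoData if any feature is masked, and `apply_with_padding` is a simplified version of the dummy-fill idea, not the actual implementation.

```python
import numpy as np

NODATA = -9999.0  # assumed NoData value, for illustration only


def predict_like(X):
    """Hypothetical stand-in for an estimator method that needs >= 1 sample."""
    if X.shape[0] < 1:
        raise ValueError(f"Got {X.shape[0]} samples, but at least 1 is required.")
    return X.sum(axis=1, keepdims=True)


def apply_with_padding(func, chunk, ensure_min_samples=1, nodata_output=np.nan):
    """Simplified dummy-fill: pad valid rows, call func, discard padded rows."""
    # Assumption: a sample is NoData if any of its features is NoData.
    valid = ~(chunk == NODATA).any(axis=1)
    X = chunk[valid]
    n_valid = X.shape[0]

    # Pad with dummy rows so func never sees fewer than ensure_min_samples.
    n_pad = max(ensure_min_samples - n_valid, 0)
    if n_pad:
        X = np.vstack([X, np.zeros((n_pad, chunk.shape[1]))])

    # Keep only the rows computed from real samples.
    result = func(X)[:n_valid]

    # The output shape is inferred from the (possibly empty) partial result.
    out = np.full((chunk.shape[0], result.shape[1]), nodata_output)
    out[valid] = result
    return out


empty_chunk = np.full((4, 3), NODATA)  # a chunk that contains only NoData
masked = apply_with_padding(predict_like, empty_chunk)
```

Calling `predict_like` directly on the zero-row array left after masking `empty_chunk` would raise, while the padded call returns an output filled entirely with `nodata_output`. Note that the function still runs once on dummy data just to discover the output shape.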

Proposed change

While ensure_min_samples supports arbitrary minimum sample counts, the only practical usage that I'm aware of is ensure_min_samples=1 (i.e. don't pass empty arrays). That case could be handled with a simpler, faster approach: skip the ufunc call entirely and just return a fully masked output array. Currently we use the partial result of the ufunc call to construct the output array, so we would instead need to use output_sizes, output_dims, output_dtypes, and nodata_output to construct a correctly shaped and filled output, but I think that should be feasible.
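A rough sketch of what building the output from metadata alone might look like, using plain NumPy. The helper name `fully_masked_outputs` is hypothetical, and the way the four parameters are structured here (a list of dimension names per output, a mapping from dimension name to size, one dtype per output, and a leading sample dimension) is an assumption that may not match how they are actually stored.

```python
import numpy as np


def fully_masked_outputs(n_samples, output_dims, output_sizes, output_dtypes,
                         nodata_output):
    """Build one NoData-filled array per output without calling the ufunc.

    Hypothetical helper. Assumes each output has a leading sample dimension
    followed by the dimensions named in the matching output_dims entry.
    """
    outputs = []
    for dims, dtype in zip(output_dims, output_dtypes):
        # Look up the size of every non-sample dimension by name.
        shape = (n_samples, *(output_sizes[dim] for dim in dims))
        outputs.append(np.full(shape, nodata_output, dtype=dtype))
    return tuple(outputs)


# Example: a two-output function, called on a chunk of 4 all-NoData samples.
first, second = fully_masked_outputs(
    n_samples=4,
    output_dims=[["variable"], ["k"]],
    output_sizes={"variable": 2, "k": 5},
    output_dtypes=[np.float64, np.int32],
    nodata_output=-1,
)
```

One thing this makes visible: the fill value has to be representable in every output dtype (NaN cannot fill an integer array, for example), which is a check the partial-result approach currently gets for free.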

Since the current solution works, this is probably only worth pursuing if (1) there's a meaningful performance improvement and/or the current 2D dimensionality constraint becomes a limitation elsewhere, and (2) we can't think of other cases where a ufunc would require more than one sample (although you could always set skip_nodata=False as a last resort).

@grovduck, let me know if you see any limitations I'm missing if we were to ditch ensure_min_samples in favor of this approach. This should be pretty low priority, but I wanted to record it as a possibility while I was thinking about it.

Labels: enhancement
