Description
Context
When applying ufuncs to chunked feature arrays with skip_nodata=True, you could easily end up calling the ufunc with an empty array if a chunk contains only NoData. That's a problem for sklearn estimator methods that validate their inputs with ensure_min_samples=1 by default and fail with empty inputs. To avoid that issue, we added a corresponding ensure_min_samples parameter that temporarily fills input arrays with dummy values up to ensure_min_samples before passing them to the ufunc. That works, but the logic is somewhat complex, requires a handful of different validation checks, and is currently constrained to reshaped 2D data.
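The padding workaround described above can be sketched roughly as follows. This is a minimal NumPy-only illustration, not the project's actual implementation: apply_with_padding and strict_row_sum are hypothetical names, the NoData test (a sample is masked if any feature equals nodata_input) is an assumption, and strict_row_sum only stands in for an sklearn method that rejects empty input.

```python
import numpy as np


def strict_row_sum(X):
    """Stand-in for an estimator method that refuses empty input."""
    if X.shape[0] < 1:
        raise ValueError("Found array with 0 sample(s)")
    return X.sum(axis=1)


def apply_with_padding(func, X, nodata_input, nodata_output, ensure_min_samples=1):
    """Hypothetical helper: call func on valid rows of a 2D (samples, features) array."""
    X = np.asarray(X)
    # Assumption: a sample is NoData if any of its features equals nodata_input
    valid = ~(X == nodata_input).any(axis=1)
    X_valid = X[valid]
    n_valid = X_valid.shape[0]

    # Pad with dummy rows so func always sees at least ensure_min_samples rows
    n_pad = max(ensure_min_samples - n_valid, 0)
    if n_pad:
        dummy = np.zeros((n_pad, X.shape[1]), dtype=X.dtype)
        X_valid = np.vstack([X_valid, dummy])

    # Call func, then discard the results that came from the dummy rows
    result = np.asarray(func(X_valid))[:n_valid]

    # The output shape and dtype are taken from the partial result
    out = np.full((X.shape[0],) + result.shape[1:], nodata_output, dtype=result.dtype)
    out[valid] = result
    return out


all_nodata = np.full((3, 2), -1.0)
mixed = np.array([[1.0, 2.0], [-1.0, -1.0], [3.0, 4.0]])

out_empty = apply_with_padding(strict_row_sum, all_nodata, -1, -9999)
out_mixed = apply_with_padding(strict_row_sum, mixed, -1, -9999)
```

Note that even for a chunk with no valid samples, func is still called once on dummy data just to learn the output shape and dtype.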
Proposed change
While ensure_min_samples supports arbitrary minimum samples, the only practical usage that I'm aware of is ensure_min_samples=1 (i.e. don't pass empty arrays). That case could be handled with a simpler, faster approach: skip the ufunc call entirely and return a fully masked output array. To do that, we would need to use output_sizes, output_dims, output_dtypes, and nodata_output to construct a correctly shaped and filled output, since we currently construct the output array from the partial result of the ufunc call, but I think that should be feasible.
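A rough sketch of that idea, assuming the feature dimension is last and that output_dims, output_sizes, and output_dtypes follow conventions similar to xarray.apply_ufunc (a list of core dimension names per output, a mapping of dimension name to size, and one dtype per output). The function names and the calling convention for func (returning a tuple with one array per output) are hypothetical, not the existing API:

```python
import numpy as np


def apply_skipping_empty(func, chunk, nodata_input, nodata_output,
                         output_dims, output_sizes, output_dtypes):
    """Hypothetical helper: preallocate masked outputs and only call func if needed."""
    # Assumption: features are on the last axis; all other axes index samples
    valid = ~(chunk == nodata_input).any(axis=-1)
    sample_shape = chunk.shape[:-1]

    # Build each output purely from metadata, without any partial ufunc result
    outputs = tuple(
        np.full(sample_shape + tuple(output_sizes[d] for d in dims),
                nodata_output, dtype=dtype)
        for dims, dtype in zip(output_dims, output_dtypes)
    )

    # Skip the call entirely when the chunk contains no valid samples
    if valid.any():
        results = func(chunk[valid])  # each result has shape (n_valid, *core_shape)
        for out, res in zip(outputs, results):
            out[valid] = res
    return outputs


calls = []


def strict_stats(X):
    """Stand-in for a ufunc that rejects empty input; returns (sum, mean) per sample."""
    calls.append(X.shape)
    if X.shape[0] < 1:
        raise ValueError("Found array with 0 sample(s)")
    return (np.stack([X.sum(axis=1), X.mean(axis=1)], axis=1),)


spec = dict(output_dims=[["stat"]], output_sizes={"stat": 2},
            output_dtypes=[np.float64])

chunk = np.full((2, 3, 4), -1.0)  # (y, x, features), all NoData
(skipped,) = apply_skipping_empty(strict_stats, chunk, -1, -9999, **spec)
n_calls_after_skip = len(calls)

chunk[0, 0] = 1.0  # one valid sample
(partial,) = apply_skipping_empty(strict_stats, chunk, -1, -9999, **spec)
```

Because the output is allocated up front from the metadata, nothing in this sketch depends on the input being reshaped to 2D, though whether that carries over to the real code path is untested.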
Since the current solution works, this is probably only worth pursuing if it brings a meaningful performance improvement or if the current 2D dimensionality constraint becomes a limitation elsewhere. It also assumes we can't think of other cases where a ufunc would require more than one sample, although you could always set skip_nodata=False as a last resort.
@grovduck, let me know if you see any limitations I'm not thinking of if we were to ditch ensure_min_samples in favor of this approach. This should probably be pretty low priority since the current solution works, but I wanted to record it as a possibility while I was thinking about it.