BUG: Pandas converts nullable int to float, even when this loses data#63925
BUG: Pandas converts nullable int to float, even when this loses data#63925kjmin622 wants to merge 15 commits intopandas-dev:mainfrom
Conversation
This comment was marked as spam.
This comment was marked as spam.
|
@aaron-seq Thank you for the review. As you suggested, adding a However, all tests are currently passing. Only one test failed, but it is caused by #63936 and is unrelated to the current code changes. |
|
@aaron-seq do not post AI generated pull request reviews again. Please review our AI policy. Similar contributions in the future may lead to a ban. |
Thanks for this, will note this when contributing in future |
f29bebb to
5ebba13
Compare
Instead of adding preserve_dtype, the map function was modified to return the same dtype as before. |
pandas/core/arrays/masked.py
Outdated
| try: | ||
| return type(self)._from_sequence(result, dtype=self.dtype) | ||
| except (ValueError, TypeError): | ||
| return result |
There was a problem hiding this comment.
This code will preserve the type if it can be preserved.
There was a problem hiding this comment.
Agreed we should return a masked array here, but if the user returns floats we should not convert them back to integers even if they have no fractional value. E.g.
ser = Series([1, 2, 3], dtype="Int64")
result = ser.apply(lambda x: 3.0)should result in Float64. Just use self._from_sequence I think.
There was a problem hiding this comment.
This can use self._cast_pointwise_result instead of _from_sequence, I think. That latter function was exactly added to be used in those kind of situations
xref #62164
@rhshadrach Thank you for your review. I reflected them. |
There was a problem hiding this comment.
I'm positive on preserving masked EAs in map/apply, things like this would be useful especially when making NumPy-nullable the default. But this can be a large break for current users. I'm thinking this could need a deprecation instead. Perhaps if we are going to make a feature flag for NumPy-nullable as a default this would go behind it?
pandas/core/arrays/masked.py
Outdated
| try: | ||
| return type(self)._from_sequence(result, dtype=self.dtype) | ||
| except (ValueError, TypeError): | ||
| return result |
There was a problem hiding this comment.
Agreed we should return a masked array here, but if the user returns floats we should not convert them back to integers even if they have no fractional value. E.g.
ser = Series([1, 2, 3], dtype="Int64")
result = ser.apply(lambda x: 3.0)should result in Float64. Just use self._from_sequence I think.
doc/source/whatsnew/v3.0.1.rst
Outdated
| - Fixed a bug in :func:`col` where unary operators (``-``, ``+``, ``abs``) were not supported (:issue:`63939`) | ||
| - Fixed a bug in the :func:`comparison_op` raising a ``TypeError`` for zerodim | ||
| subclasses of ``np.ndarray`` (:issue:`63205`) | ||
| - Fixed bug in :meth:`Series.apply` and :meth:`Series.map` where nullable integer dtypes were converted to float, causing precision loss for large integers (:issue:`63903`) |
There was a problem hiding this comment.
I believe this is not a regression, note should be in 3.1.0. Also need to note map/apply are now preserving NumPy-nullable EAs.
I lean towards "treat this as a bugfix" since "preserve dtype backend" is a mostly-consistent policy. But not a super-strong opinion. |
doc/source/whatsnew/v3.1.0.rst
Outdated
|
|
||
| Other enhancements | ||
| ^^^^^^^^^^^^^^^^^^ | ||
| - :meth:`Series.apply` and :meth:`Series.map` now preserve nullable (masked) extension array dtypes where appropriate; e.g. when the result is float, the output dtype is ``Float64`` rather than being cast back to the input dtype (:issue:`63903`). |
There was a problem hiding this comment.
I think this can be removed and...
doc/source/whatsnew/v3.1.0.rst
Outdated
|
|
||
| ExtensionArray | ||
| ^^^^^^^^^^^^^^ | ||
| - Fixed bug in :meth:`Series.apply` and :meth:`Series.map` where nullable integer dtypes were converted to float, causing precision loss for large integers (:issue:`63903`). |
There was a problem hiding this comment.
...add a little detail here.
| - Fixed bug in :meth:`Series.apply` and :meth:`Series.map` where nullable integer dtypes were converted to float, causing precision loss for large integers (:issue:`63903`). | |
| - Fixed bug in :meth:`Series.apply` and :meth:`Series.map` where nullable integer dtypes were converted to float, causing precision loss for large integers; now the nullable dtype will be preserved (:issue:`63903`). |
pandas/core/arrays/masked.py
Outdated
| mapper, | ||
| na_action=na_action, | ||
| ) | ||
| if isinstance(result, np.ndarray): |
There was a problem hiding this comment.
I think this should always be a NumPy array; can you check? Change this to assert isinstance(result, np.ndarray) and see that tests still pass.
There was a problem hiding this comment.
@rhshadrach I confirmed that the test passes even when I add assert isinstance(result, np.ndarray). Thank you.
|
@jbrockmendel @mroeschke - plan to merge in a few days if you want to take a look. |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.AGENTS.md.Summary
Fix precision loss when using
Series.apply()orSeries.map()on nullable integer dtypes (Int64,UInt64, etc.) withNonevalues.Problem
When applying a function to a Series with nullable integer dtype containing NA values, the data was being converted to
float64, causing precision loss for large integers that exceed float64's integer precision limit (2^53 ≈ 9×10^15).Before: 10000000000000002 (wrong - precision lost)
After: 10000000000000003 (correct)
Solution
Modified
BaseMaskedArray.map()to:to_numpy(dtype=object, na_value=pd.NA)instead ofto_numpy()to preserve integer values_cast_pointwise_result()to restore the appropriate nullable dtype