-
-
Notifications
You must be signed in to change notification settings - Fork 19.6k
fix: incorrect fastparquet after multiple parquets #64016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: incorrect fastparquet after multiple parquets #64016
Conversation
|
Please review my PR. |
|
@mroeschke Could you please review this PR? I would appreciate your feedback |
| # Workaround for fastparquet index restoration issue | ||
| # If pandas metadata indicates index columns, handle them manually | ||
| if (pandas_metadata and | ||
| 'index_columns' in pandas_metadata and | ||
| pandas_metadata['index_columns'] and | ||
| len(pandas_metadata['index_columns']) == 1): | ||
|
|
||
| index_col_name = pandas_metadata['index_columns'][0] | ||
|
|
||
| # Read all columns including the index column as regular data | ||
| if hasattr(path, 'seek'): | ||
| path.seek(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep the workaround strictly single‑level only. If pandas_metadata["index_columns"] has more than 1 entry, skip this path and fall back to normal to_pandas to avoid MultiIndex regressions.
| df_with_index_col = parquet_file.to_pandas( | ||
| columns=columns, filters=filters, index=False, **kwargs | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we force index=False but still pass through columns= — if the user filtered columns, the index column may be missing.
We should ensure index columns are added to the columns list before this call.
| # Check if the index column is present in the data | ||
| if index_col_name in df_with_index_col.columns: | ||
| # Extract the index values and set them as the DataFrame index | ||
| index_values = df_with_index_col[index_col_name] | ||
| df_without_index_col = df_with_index_col.drop(columns=[index_col_name]) | ||
| df_without_index_col.index = index_values | ||
| # Preserve the original index name behavior (None for unnamed indexes) | ||
| df_without_index_col.index.name = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unconditionally drops the index name. We should preserve the original index name if available (e.g., from pandas_metadata["index_names"]) or leave the existing name intact.
| expected = df.copy() | ||
| check_round_trip(df, temp_file, fp, expected=expected) | ||
|
|
||
| def test_bytesio_index_preservation(self, fp): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you parametrize the tests for better readability
| def test_bytesio_index_preservation(self, fp): | ||
| # GH #64007 - fastparquet incorrectly deserializes DataFrame indexes | ||
| # when multiple parquet files are written to separate BytesIO streams | ||
| import io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move the test to test_fastparquet.py to keep engine‑specific coverage grouped there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add new regression tests for (MultiIndex, columns=) also
|
As I mentioned on the issue, this seems to be a bug in fastparquet itself. So I am not convinced that we should workaround for it in pandas, instead of fixing it upstream in fastparquet. |
|
yeah that upstream is the right place for this. Pandas-side workaround is indeed risky. It introduced regressions for MultiIndex (causing crashes) and broke columns= filtering (dropping the index). I'll look into opening a PR on fastparquet to fix the root cause. |
|
Thanks for the PR, but as discussed pandas should avoid workarounds in third party dependency bugs so closing |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.AGENTS.md.Issue #64007: Index incorrectly de-serialised by fastparquet after multiple parquets written to different io.BytesIO streams
Problem Description
When using the fastparquet engine to write multiple DataFrames to separate
io.BytesIOstreams and then reading them back, the DataFrame indexes get swapped or corrupted between the different streams.Root Cause
The issue was in fastparquet's pandas metadata handling when reading from BytesIO streams. Fastparquet has state contamination between different ParquetFile instances, causing index values to be mixed up between different streams.
Before and After
Before (Broken):
After (Fixed):
Closes #64007