fix: incorrect fastparquet after multiple parquets #64016

bittoby · 2026-02-04T03:00:21Z

Closes BUG: Index is incorrectly de-serialised by fastparquet after multiple parquets written to different io.BytesIO streams with DataFrame.to_parquet #64007
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
I have reviewed and followed all the contribution guidelines
If I used AI to develop this pull request, I prompted it to follow AGENTS.md.

Issue #64007: Index incorrectly de-serialised by fastparquet after multiple parquets written to different io.BytesIO streams

Problem Description

When using the fastparquet engine to write multiple DataFrames to separate io.BytesIO streams and then reading them back, the DataFrame indexes get swapped or corrupted between the different streams.

Root Cause

The issue was in fastparquet's pandas metadata handling when reading from BytesIO streams. Fastparquet has state contamination between different ParquetFile instances, causing index values to be mixed up between different streams.

Before and After

Before (Broken):

# DataFrame 1: index=[1, 2, 3, 4, 5] → result index=[2, 3, 4, 5, 6] ❌
# DataFrame 2: index=[2, 3, 4, 5, 6] → result index=[1, 2, 3, 4, 5] ❌

After (Fixed):

# DataFrame 1: index=[1, 2, 3, 4, 5] → result index=[1, 2, 3, 4, 5] ✅
# DataFrame 2: index=[2, 3, 4, 5, 6] → result index=[2, 3, 4, 5, 6] ✅

Closes #64007

bittoby · 2026-02-05T13:35:29Z

Please review my PR.

bittoby · 2026-02-06T20:14:42Z

@mroeschke Could you please review this PR? I would appreciate your feedback.

sanrishi · 2026-02-07T14:44:28Z

pandas/io/parquet.py

+                # Workaround for fastparquet index restoration issue
+                # If pandas metadata indicates index columns, handle them manually
+                if (pandas_metadata and 
+                    'index_columns' in pandas_metadata and 
+                    pandas_metadata['index_columns'] and
+                    len(pandas_metadata['index_columns']) == 1):
+
+                    index_col_name = pandas_metadata['index_columns'][0]
+
+                    # Read all columns including the index column as regular data
+                    if hasattr(path, 'seek'):
+                        path.seek(0)


Keep the workaround strictly single-level. If pandas_metadata["index_columns"] has more than one entry, skip this path and fall back to the normal to_pandas call to avoid MultiIndex regressions.
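
A minimal sketch of such a guard, assuming pandas_metadata is the decoded "pandas" key-value metadata as a plain dict. The helper name has_single_index_column is hypothetical and not part of pandas or fastparquet. It also assumes that an index_columns entry can be a dict (a RangeIndex descriptor) rather than a column name, in which case the workaround should not apply either:

```python
from typing import Any, Mapping, Optional


def has_single_index_column(pandas_metadata: Optional[Mapping[str, Any]]) -> bool:
    """Return True only when the metadata names exactly one stored index column.

    Hypothetical helper for illustration: any other shape (no metadata, no
    index, several index levels, or a dict descriptor instead of a column
    name) should take the normal to_pandas path.
    """
    if not pandas_metadata:
        return False
    index_columns = pandas_metadata.get("index_columns") or []
    # Exactly one entry, and it must be a column name (str), not a dict
    # describing an index that was never written as a column.
    return len(index_columns) == 1 and isinstance(index_columns[0], str)
```

With a check like this, the MultiIndex case never enters the manual index restoration branch.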

sanrishi · 2026-02-07T14:45:39Z

pandas/io/parquet.py

+                        df_with_index_col = parquet_file.to_pandas(
+                            columns=columns, filters=filters, index=False, **kwargs
+                        )


Here we force index=False but still pass through columns=. If the user filtered columns, the index column may be missing from the result.

We should ensure index columns are added to the columns list before this call.
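
One way this could look, as a sketch only: columns_including_index is a hypothetical helper, and it assumes that columns=None means "read every column", so nothing needs to be added in that case:

```python
from typing import List, Optional, Sequence


def columns_including_index(
    columns: Optional[Sequence[str]], index_columns: Sequence[str]
) -> Optional[List[str]]:
    """Return the user's column selection extended with the index columns.

    Hypothetical helper for illustration. None is passed through unchanged
    because it already selects all columns, index columns included.
    """
    if columns is None:
        return None
    selected = list(columns)  # copy so the caller's list is not mutated
    for name in index_columns:
        if name not in selected:
            selected.append(name)
    return selected
```

The result would be passed as columns= to the to_pandas call, so the index column is always available to set as the index afterwards.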

sanrishi · 2026-02-07T14:46:01Z

pandas/io/parquet.py

+                        # Check if the index column is present in the data
+                        if index_col_name in df_with_index_col.columns:
+                            # Extract the index values and set them as the DataFrame index
+                            index_values = df_with_index_col[index_col_name]
+                            df_without_index_col = df_with_index_col.drop(columns=[index_col_name])
+                            df_without_index_col.index = index_values
+                            # Preserve the original index name behavior (None for unnamed indexes)
+                            df_without_index_col.index.name = None


This unconditionally drops the index name. We should preserve the original index name if available (e.g., from pandas_metadata["index_names"]) or leave the existing name intact.
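
A sketch of how the name could be recovered, under stated assumptions: each entry in pandas_metadata["columns"] is assumed to carry a "name" (the original label, None for an unnamed index) and a "field_name" (the stored column name), and unnamed index levels are assumed to be stored as __index_level_N__. resolve_index_name is a hypothetical helper, not an existing pandas function:

```python
import re
from typing import Any, Mapping, Optional

# Assumed placeholder pattern for index levels that had no name when written.
_UNNAMED_INDEX = re.compile(r"^__index_level_\d+__$")


def resolve_index_name(
    pandas_metadata: Mapping[str, Any], index_col_name: str
) -> Optional[str]:
    """Look up the original index name for a stored index column."""
    for column in pandas_metadata.get("columns", []):
        # Fall back to "name" when an entry has no "field_name" key.
        if column.get("field_name", column.get("name")) == index_col_name:
            return column.get("name")
    # No matching entry: keep the stored name unless it is a placeholder.
    if _UNNAMED_INDEX.match(index_col_name):
        return None
    return index_col_name
```

The returned value would replace the hard-coded None assigned to index.name, so a named index keeps its name and an unnamed one stays unnamed.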

sanrishi · 2026-02-07T14:46:59Z

pandas/tests/io/test_parquet.py

        expected = df.copy()
        check_round_trip(df, temp_file, fp, expected=expected)

+    def test_bytesio_index_preservation(self, fp):


Could you parametrize the tests for better readability?

sanrishi · 2026-02-07T14:51:14Z

pandas/tests/io/test_parquet.py

+    def test_bytesio_index_preservation(self, fp):
+        # GH #64007 - fastparquet incorrectly deserializes DataFrame indexes
+        # when multiple parquet files are written to separate BytesIO streams
+        import io


Could you move the test to test_fastparquet.py to keep engine-specific coverage grouped there?

sanrishi

Also add new regression tests for the MultiIndex and columns= cases.

jorisvandenbossche · 2026-02-09T08:14:47Z

As I mentioned on the issue, this seems to be a bug in fastparquet itself. So I am not convinced that we should work around it in pandas instead of fixing it upstream in fastparquet.

sanrishi · 2026-02-09T08:55:44Z

@jorisvandenbossche

Yeah, upstream is the right place for this.

A pandas-side workaround is indeed risky. It introduced regressions for MultiIndex (causing crashes) and broke columns= filtering (dropping the index).

I'll look into opening a PR on fastparquet to fix the root cause.

mroeschke · 2026-02-09T16:58:46Z

Thanks for the PR, but as discussed, pandas should avoid workarounds for bugs in third-party dependencies, so closing.

fix: incorrect fastparquet after multiple parquets

05089a8

sanrishi reviewed Feb 7, 2026

mroeschke closed this Feb 9, 2026

bittoby deleted the fix-de-serialized-fastparquet-multiple branch February 9, 2026 17:44
