
Conversation

bittoby commented Feb 4, 2026

Issue #64007: Index incorrectly de-serialised by fastparquet after multiple parquets written to different io.BytesIO streams

Problem Description

When using the fastparquet engine to write multiple DataFrames to separate io.BytesIO streams and then reading them back, the DataFrame indexes get swapped or corrupted between the different streams.
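A minimal reproduction sketch of this scenario (column values are illustrative; the indexes match the Before/After example below; fastparquet must be installed):

import io
import pandas as pd

# Two DataFrames whose only difference is the (non-default) integer index
df1 = pd.DataFrame({"a": [10, 20, 30, 40, 50]}, index=[1, 2, 3, 4, 5])
df2 = pd.DataFrame({"a": [10, 20, 30, 40, 50]}, index=[2, 3, 4, 5, 6])

# Write each DataFrame to its own in-memory stream with the fastparquet engine
buf1, buf2 = io.BytesIO(), io.BytesIO()
df1.to_parquet(buf1, engine="fastparquet")
df2.to_parquet(buf2, engine="fastparquet")

# Read both streams back; before the fix the indexes came back swapped
buf1.seek(0)
buf2.seek(0)
result1 = pd.read_parquet(buf1, engine="fastparquet")
result2 = pd.read_parquet(buf2, engine="fastparquet")
print(result1.index.tolist())  # expected [1, 2, 3, 4, 5]
print(result2.index.tolist())  # expected [2, 3, 4, 5, 6]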

Root Cause

The issue is in fastparquet's pandas metadata handling when reading from BytesIO streams: state leaks between different ParquetFile instances, causing index values to be mixed up across streams.

Before and After

Before (Broken):

# DataFrame 1: index=[1, 2, 3, 4, 5] → result index=[2, 3, 4, 5, 6] ❌
# DataFrame 2: index=[2, 3, 4, 5, 6] → result index=[1, 2, 3, 4, 5] ❌

After (Fixed):

# DataFrame 1: index=[1, 2, 3, 4, 5] → result index=[1, 2, 3, 4, 5] ✅
# DataFrame 2: index=[2, 3, 4, 5, 6] → result index=[2, 3, 4, 5, 6] ✅

Closes #64007


bittoby commented Feb 5, 2026

Please review my PR.


bittoby commented Feb 6, 2026

@mroeschke Could you please review this PR? I would appreciate your feedback.

Comment on lines +410 to +421
# Workaround for fastparquet index restoration issue
# If pandas metadata indicates index columns, handle them manually
if (pandas_metadata and
        'index_columns' in pandas_metadata and
        pandas_metadata['index_columns'] and
        len(pandas_metadata['index_columns']) == 1):

    index_col_name = pandas_metadata['index_columns'][0]

    # Read all columns including the index column as regular data
    if hasattr(path, 'seek'):
        path.seek(0)

Keep the workaround strictly single-level: if pandas_metadata["index_columns"] has more than one entry, skip this path and fall back to the normal to_pandas call to avoid MultiIndex regressions.
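A rough sketch of the guard being suggested; the helper name is hypothetical and only illustrates the single-level check (a dict entry in index_columns would typically indicate stored RangeIndex metadata rather than a named index column):

def _single_index_column(pandas_metadata):
    # Hypothetical helper: return the lone index column name, or None.
    # None tells the caller to fall back to the normal to_pandas path
    # (MultiIndex, RangeIndex metadata, or no pandas metadata at all).
    index_columns = (pandas_metadata or {}).get("index_columns", [])
    if len(index_columns) == 1 and isinstance(index_columns[0], str):
        return index_columns[0]
    return None

The workaround body would then only run when this returns a string; everything else would go through the existing parquet_file.to_pandas(...) call unchanged.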

Comment on lines +425 to +427
df_with_index_col = parquet_file.to_pandas(
    columns=columns, filters=filters, index=False, **kwargs
)

Here we force index=False but still pass through columns= — if the user filtered columns, the index column may be missing.

We should ensure index columns are added to the columns list before this call.
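One way to do that, sketched as a hypothetical helper (names are illustrative):

def _columns_including_index(columns, index_col_name):
    # Hypothetical helper: make sure the stored index column is part of the
    # column selection passed to to_pandas; None means "read all columns".
    if columns is None:
        return None
    if index_col_name in columns:
        return list(columns)
    return list(columns) + [index_col_name]

The to_pandas call quoted above would then receive columns=_columns_including_index(columns, index_col_name) rather than the raw user selection.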

Comment on lines +429 to +436
# Check if the index column is present in the data
if index_col_name in df_with_index_col.columns:
    # Extract the index values and set them as the DataFrame index
    index_values = df_with_index_col[index_col_name]
    df_without_index_col = df_with_index_col.drop(columns=[index_col_name])
    df_without_index_col.index = index_values
    # Preserve the original index name behavior (None for unnamed indexes)
    df_without_index_col.index.name = None

This unconditionally drops the index name. We should preserve the original index name if available (e.g., from pandas_metadata["index_names"]) or leave the existing name intact.
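A sketch of how the original name could be recovered, assuming the usual pandas parquet metadata layout in which each entry of pandas_metadata["columns"] records both the stored "field_name" (e.g. "__index_level_0__") and the original pandas "name" (None for unnamed indexes); the helper is hypothetical:

def _original_index_name(pandas_metadata, index_col_name):
    # Hypothetical helper: look up the pandas-level name of a stored index
    # column; returns None when the index was unnamed or no entry matches.
    for col in pandas_metadata.get("columns", []):
        if col.get("field_name", col.get("name")) == index_col_name:
            return col.get("name")
    return None

The last line of the quoted block above would then assign _original_index_name(pandas_metadata, index_col_name) instead of always assigning None.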

        expected = df.copy()
        check_round_trip(df, temp_file, fp, expected=expected)

    def test_bytesio_index_preservation(self, fp):

Could you parametrize the tests for better readability?
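A possible parametrized shape, sketched against the pandas test-suite conventions (the fp fixture supplying the fastparquet engine name and pandas._testing are assumptions carried over from the existing tests):

import io
import pandas as pd
import pandas._testing as tm
import pytest

@pytest.mark.parametrize(
    "indexes",
    [
        ([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]),
        ([2, 3, 4, 5, 6], [1, 2, 3, 4, 5]),
    ],
)
def test_bytesio_index_preservation(fp, indexes):
    # GH 64007: indexes must not leak between separate BytesIO streams
    frames, buffers = [], []
    for index in indexes:
        df = pd.DataFrame({"col": range(len(index))}, index=index)
        buf = io.BytesIO()
        df.to_parquet(buf, engine=fp)
        frames.append(df)
        buffers.append(buf)
    for df, buf in zip(frames, buffers):
        buf.seek(0)
        result = pd.read_parquet(buf, engine=fp)
        tm.assert_frame_equal(result, df)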

Comment on lines +1378 to +1381
    def test_bytesio_index_preservation(self, fp):
        # GH #64007 - fastparquet incorrectly deserializes DataFrame indexes
        # when multiple parquet files are written to separate BytesIO streams
        import io

Could you move the test to test_fastparquet.py to keep engine-specific coverage grouped there?

sanrishi (Contributor) left a comment

Please also add regression tests covering MultiIndex and the columns= keyword; a sketch follows.
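A sketch of the requested coverage, again assuming the existing fp fixture and pandas._testing helpers; the exact frames are illustrative:

import io
import pandas as pd
import pandas._testing as tm

def test_bytesio_multiindex_round_trip(fp):
    # A MultiIndex must survive a BytesIO round trip untouched
    index = pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1)], names=["k1", "k2"]
    )
    df = pd.DataFrame({"val": [1.0, 2.0, 3.0]}, index=index)
    buf = io.BytesIO()
    df.to_parquet(buf, engine=fp)
    buf.seek(0)
    result = pd.read_parquet(buf, engine=fp)
    tm.assert_frame_equal(result, df)

def test_bytesio_columns_keeps_index(fp):
    # Selecting a subset of columns must not drop the stored index
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}, index=[10, 20, 30])
    buf = io.BytesIO()
    df.to_parquet(buf, engine=fp)
    buf.seek(0)
    result = pd.read_parquet(buf, engine=fp, columns=["a"])
    tm.assert_frame_equal(result, df[["a"]])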

@jorisvandenbossche
Member

As I mentioned on the issue, this seems to be a bug in fastparquet itself, so I am not convinced that we should work around it in pandas rather than fixing it upstream in fastparquet.


sanrishi commented Feb 9, 2026

@jorisvandenbossche

Yeah, upstream is the right place for this.

The pandas-side workaround is indeed risky: it introduced regressions for MultiIndex (causing crashes) and broke columns= filtering (dropping the index).

I'll look into opening a PR on fastparquet to fix the root cause.

@mroeschke
Member

Thanks for the PR, but as discussed, pandas should avoid working around bugs in third-party dependencies, so closing.

mroeschke closed this Feb 9, 2026
bittoby deleted the fix-de-serialized-fastparquet-multiple branch February 9, 2026 17:44