fix: handle mixed-type columns from Excel in subventions pipeline #523
Merged
cyrilledaily merged 1 commit into main on Feb 24, 2026
Excel files can produce columns with mixed types (e.g. strings and datetime objects in the same column) which crash parquet serialization. This notably prevented ingestion of Antibes' subventions XLS file and at least 2 other datasets.

- Add _coerce_object_columns_to_str step in TopicAggregator to cast mixed-type object columns to str before parquet write
- Make _normalise_column_name robust to non-string column names (e.g. datetime objects auto-detected by Excel in header cells)

Co-authored-by: Cursor <cursoragent@cursor.com>
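The first bullet of the commit message can be illustrated with a minimal sketch. The PR's actual `_coerce_object_columns_to_str` is not shown on this page, so the function name `coerce_mixed_object_columns_to_str`, the "more than one Python type among non-null values" detection rule, and the choice to leave nulls untouched are all assumptions made for illustration.

```python
from datetime import datetime

import pandas as pd


def coerce_mixed_object_columns_to_str(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where mixed-type object columns are cast to str.

    Hypothetical stand-in for the PR's `_coerce_object_columns_to_str`;
    the detection rule below is an assumption, not the merged code.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if series.dtype != object:
            continue  # numeric, datetime64, etc. are already homogeneous
        # Collect the Python types actually present among non-null cells.
        types_present = {type(value) for value in series.dropna()}
        if len(types_present) <= 1:
            continue  # a single type serializes without ambiguity
        # Cast every non-null cell to str, leaving nulls as they are.
        out[col] = series.map(lambda v: v if pd.isna(v) else str(v))
    return out


# A column shaped like the one described for the Antibes file:
# a date range kept as text next to a cell parsed as a datetime.
df = pd.DataFrame(
    {
        "datesPeriodeVersement": ["2025-04-03/2025-09-09", datetime(2025, 4, 3)],
        "montant": [100.0, 250.0],
    }
)
clean = coerce_mixed_object_columns_to_str(df)
print(clean["datesPeriodeVersement"].tolist())
```

Casting to string trades away the parsed date in exchange for a column that has a single type, which is what a columnar format needs; any downstream date parsing would then have to handle both the range and the single-date forms as text.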
Verification Results ✅
Pipeline run completed successfully with this fix applied. The Antibes XLS file now normalizes correctly.
Antibes subventions ingested
Other datasets rescued by this fix
No regressions
Summary
- Excel files can produce columns with mixed types (e.g. strings and `datetime` objects in the same `datesPeriodeVersement` column), which crash parquet serialization during the subventions ETL pipeline
- Add `_coerce_object_columns_to_str` step in `TopicAggregator` that detects mixed-type object columns and casts them to string before parquet write
- Make `_normalise_column_name` robust to non-string column names (e.g. datetime objects auto-detected by Excel in header cells), rescuing 3 additional datasets

Root cause analysis
The Antibes XLS file has a `datesPeriodeVersement` column where some cells contain date ranges as strings ("2025-04-03/2025-09-09") while others contain single dates that Excel auto-converts to `datetime` objects. This mixed-type column causes pyarrow to fail when writing to parquet.

Files changed
- `back/scripts/datasets/topic_aggregator.py`: Add `_coerce_object_columns_to_str` static method and wire it into `_normalize_frame`
- `back/scripts/utils/dataframe_operation.py`: Make `_normalise_column_name` accept non-string inputs via `str()` cast

Test plan
- … `norm.parquet` for affected files) and verify Antibes subventions appear in `subventions.parquet`
- … `datetime.datetime has no attribute 'lower'` error now parse correctly

Made with Cursor
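The `'lower'` error named in the test plan is the kind raised when a string method is called on a header cell that Excel parsed as a date. A minimal sketch of the `str()` cast fix follows. The real `_normalise_column_name` is not shown in this PR, so the function name `normalise_column_name` and the specific rules (strip, lowercase, collapse non-word runs to underscores) are assumptions; only the `str()` cast mirrors the described change.

```python
import re
from datetime import datetime
from typing import Any


def normalise_column_name(name: Any) -> str:
    """Normalise a header cell to a lowercase, underscore-separated name.

    Hypothetical stand-in for `_normalise_column_name`: the str() cast
    is the point, the remaining rules are illustrative assumptions.
    """
    # Casting first means datetime, int, or float headers no longer
    # fail on string-only methods such as .lower().
    text = str(name).strip().lower()
    # Collapse every run of non-word characters into one underscore.
    return re.sub(r"\W+", "_", text).strip("_")


# A text header and a header that Excel auto-detected as a date.
print(normalise_column_name("  Nom du Bénéficiaire "))
print(normalise_column_name(datetime(2025, 4, 3)))
```

Casting at the top of the function keeps the fix local: callers can keep passing whatever the Excel reader produced for header cells.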