
fix: handle mixed-type columns from Excel in subventions pipeline#523

Merged
cyrilledaily merged 1 commit into main from fix/excel-mixed-types-subventions
Feb 24, 2026

@cyrilledaily
Collaborator

Summary

  • Excel files can produce columns with mixed types (e.g. strings and datetime objects in the same datesPeriodeVersement column), which crash parquet serialization during the subventions ETL pipeline
  • Concrete impact: this fix unblocks ingestion of Antibes' subventions (30 conventions from their "Données essentielles des conventions de subvention" XLS file published on data.gouv.fr) and at least 2 other datasets currently failing with the same error
  • Adds a _coerce_object_columns_to_str step in TopicAggregator that detects mixed-type object columns and casts them to string before parquet write
  • Makes _normalise_column_name robust to non-string column names (e.g. datetime objects auto-detected by Excel in header cells), rescuing 3 additional datasets

Root cause analysis

The Antibes XLS file has a datesPeriodeVersement column where some cells contain date ranges as strings ("2025-04-03/2025-09-09") while others contain single dates that Excel auto-converts to datetime objects. This mixed-type column causes pyarrow to fail when writing to parquet:

("Expected bytes, got a 'datetime.datetime' object", 
 'Conversion failed for column dates_periode_versement with type object')
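
The failure mode and the coercion can be sketched with pandas alone. This is a minimal illustration, not the code merged in this PR: the function name `coerce_mixed_object_columns_to_str`, the detection rule (more than one Python type among the non-null values), and the choice to leave missing values untouched are all assumptions.

```python
import datetime as dt

import pandas as pd


def coerce_mixed_object_columns_to_str(frame: pd.DataFrame) -> pd.DataFrame:
    """Cast object columns holding more than one Python type to str.

    Hypothetical stand-in for the PR's _coerce_object_columns_to_str.
    """
    out = frame.copy()
    for col in out.columns:
        if out[col].dtype != object:
            continue  # typed columns (float, int, datetime64...) are left alone
        non_null = out[col].dropna()
        if non_null.map(type).nunique() > 1:
            # Assumption: keep missing values missing, stringify everything else.
            out[col] = out[col].map(lambda v: v if pd.isna(v) else str(v))
    return out


# Same shape as the Antibes column: a date-range string next to a datetime.
df = pd.DataFrame(
    {
        "dates_periode_versement": [
            "2025-04-03/2025-09-09",
            dt.datetime(2025, 4, 3),
            None,
        ],
        "montant": [1000.0, 2500.0, 300.0],
    }
)
fixed = coerce_mixed_object_columns_to_str(df)
```

After the call, every non-null value in `dates_periode_versement` is a plain string, so the column no longer mixes `str` and `datetime.datetime` values when it is handed to the parquet writer.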

Files changed

| File | Change |
| --- | --- |
| back/scripts/datasets/topic_aggregator.py | Add _coerce_object_columns_to_str static method and wire it into _normalize_frame |
| back/scripts/utils/dataframe_operation.py | Make _normalise_column_name accept non-string inputs via str() cast |
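
For the second change, the effect of the `str()` cast on header cells can be sketched as below. Only the cast itself comes from this PR. The function name `normalise_column_name` and the camelCase-to-snake_case rules are assumptions, chosen to match the `datesPeriodeVersement` to `dates_periode_versement` mapping visible in the error message; the real `_normalise_column_name` may apply different rules.

```python
import datetime as dt
import re


def normalise_column_name(name: object) -> str:
    """Return a snake_case column name for any header value.

    Hypothetical sketch of a normaliser that tolerates non-string headers.
    """
    # The str() cast is the part described in the PR: without it, a datetime
    # header fails with "'datetime.datetime' object has no attribute 'lower'".
    text = str(name).strip()
    # Assumed rule: insert "_" at lowercase/digit -> uppercase boundaries.
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text)
    # Assumed rule: collapse any run of non-alphanumeric characters into "_".
    text = re.sub(r"[^0-9A-Za-z]+", "_", text)
    return text.strip("_").lower()


headers = ["datesPeriodeVersement", dt.datetime(2024, 1, 1), 2023]
normalised = [normalise_column_name(h) for h in headers]
```

With these assumed rules, the three headers become `dates_periode_versement`, `2024_01_01_00_00_00` and `2023`.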

Test plan

  • Re-run the subventions ETL pipeline (clearing cached norm.parquet for affected files) and verify Antibes subventions appear in subventions.parquet
  • Verify no regression on existing subventions data (same row count for other collectivities)
  • Check that the 3 datasets that previously failed with "'datetime.datetime' object has no attribute 'lower'" now parse correctly
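
The row-count check in the second item could be scripted along these lines. This is a sketch under assumptions: the helper `row_count_changes` and the `collectivite` column are hypothetical, and in practice the two frames would be loaded from the subventions.parquet files produced before and after the fix.

```python
import pandas as pd


def row_count_changes(before: pd.DataFrame, after: pd.DataFrame, key: str) -> pd.DataFrame:
    """List the groups whose row count differs between two pipeline outputs."""
    counts = (
        pd.concat(
            [
                before.groupby(key).size().rename("before"),
                after.groupby(key).size().rename("after"),
            ],
            axis=1,
        )
        .fillna(0)  # a group missing on one side counts as 0 rows
        .astype(int)
    )
    return counts[counts["before"] != counts["after"]]


# Toy data: existing collectivities are unchanged, Antibes is newly ingested.
before = pd.DataFrame({"collectivite": ["Lyon", "Lyon", "Paris"]})
after = pd.DataFrame({"collectivite": ["Lyon", "Lyon", "Paris", "Antibes", "Antibes"]})
changes = row_count_changes(before, after, key="collectivite")
```

On a real run, the only rows expected in `changes` are the newly rescued collectivities. Any existing collectivity appearing there would signal a regression.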

Made with Cursor

Excel files can produce columns with mixed types (e.g. strings and
datetime objects in the same column) which crash parquet serialization.
This notably prevented ingestion of Antibes' subventions XLS file
and at least 2 other datasets.

- Add _coerce_object_columns_to_str step in TopicAggregator to cast
  mixed-type object columns to str before parquet write
- Make _normalise_column_name robust to non-string column names
  (e.g. datetime objects auto-detected by Excel in header cells)

Co-authored-by: Cursor <cursoragent@cursor.com>
@cyrilledaily cyrilledaily merged commit 8433fb5 into main Feb 24, 2026
2 checks passed
@cyrilledaily
Collaborator Author

Verification Results ✅

Pipeline run completed successfully with this fix applied. The Antibes XLS file now normalizes correctly.

Antibes subventions ingested

| Metric | Value |
| --- | --- |
| Rows ingested | 17 |
| Years covered | 2022–2024 |
| Total montant | 2,100,633 € |
| Subventions score | E → D |
| Global score | B (unchanged; driven by marchés A) |

Other datasets rescued by this fix

The _coerce_object_columns_to_str fix also rescued 2 other datasets that previously failed with the same datetime.datetime mixed-type error in datesPeriodeVersement.

No regressions

  • Total enriched subventions: 915,690 rows (up from ~900K)
  • Score distribution unchanged for existing collectivities
  • Key collectivities (Bordeaux, Lyon, Paris, Rouen, Rennes) verified unchanged
