Skip to content

[BACK] Télécharge le siren#49

Merged
cgoudet merged 42 commits intomainfrom
back_add_sirene
Feb 27, 2025
Merged

[BACK] Télécharge le siren#49
cgoudet merged 42 commits intomainfrom
back_add_sirene

Conversation

@cgoudet
Copy link
Collaborator

@cgoudet cgoudet commented Feb 22, 2025

No description provided.

@cgoudet cgoudet marked this pull request as draft February 22, 2025 12:44
@acornet acornet changed the base branch from main to add_elus February 24, 2025 14:30
Copy link
Contributor

@acornet acornet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good, but I think we should hold on using polars until we have a very strong reason to do so, and here I don't think we do.

I am a bit worried that mixing pandas and polars is going to make it harder to people to contribute. now they would have to master both.
here I think we could do equally well in pandas, so let's stick to it?

if data_file.exists():
self._logger.info("Found OFGL data on disk, loading it.")
return pd.read_csv(data_file, sep=";")
return pd.read_csv(data_file, sep=";", dtype={"siren": "str"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't do this before writing the parquet instead?
we could prob add this to config under communities.ofgl.dtype

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is legacy code. This is precisely why I prefer parquet instead of CSV.

Are you OK for me to change this format?

pd.concat(dataframes, axis=0, ignore_index=True)
.astype({"SIREN": str})
.assign(
SIREN=lambda df: df["SIREN"].str.replace(".0", "").str.zfill(9),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are those .0 really present in the raw data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Base automatically changed from add_elus to main February 24, 2025 21:05
@cgoudet
Copy link
Collaborator Author

cgoudet commented Feb 24, 2025

looking good, but I think we should hold on using polars until we have a very strong reason to do so, and here I don't think we do.

I am a bit worried that mixing pandas and polars is going to make it harder to people to contribute. now they would have to master both. here I think we could do equally well in pandas, so let's stick to it?

The Sirene dataset has 27M lines and the final version takes already 1.5G in memory. I'm not sure that people with low memory will be able pass this step with the memory optimization polars provide.

@cgoudet cgoudet marked this pull request as ready for review February 25, 2025 21:03
@cgoudet cgoudet merged commit 17e3dc8 into main Feb 27, 2025
2 checks passed
@cgoudet cgoudet deleted the back_add_sirene branch February 27, 2025 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants