acornet
left a comment
Looking good, but I think we should hold off on using polars until we have a very strong reason to adopt it, and here I don't think we do.
I am a bit worried that mixing pandas and polars will make it harder for people to contribute, since they would now have to master both.
Here I think we could do equally well in pandas, so let's stick to it?
  if data_file.exists():
      self._logger.info("Found OFGL data on disk, loading it.")
-     return pd.read_csv(data_file, sep=";")
+     return pd.read_csv(data_file, sep=";", dtype={"siren": "str"})
Shouldn't we do this before writing the parquet instead?
We could probably add this to the config under communities.ofgl.dtype.
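A rough sketch of the config idea, assuming the config is available as a nested dict. The `config` dict and the `load_ofgl` helper are hypothetical; only the `communities.ofgl.dtype` key path comes from the comment above.

```python
import io

import pandas as pd

# Hypothetical config layout; only the communities.ofgl.dtype path
# comes from the review comment.
config = {"communities": {"ofgl": {"dtype": {"siren": "str"}}}}


def load_ofgl(source, config):
    """Read the OFGL CSV with the column dtypes declared in the config."""
    dtype = config["communities"]["ofgl"].get("dtype", {})
    return pd.read_csv(source, sep=";", dtype=dtype)


# Tiny in-memory stand-in for the real file (invented data).
sample = io.StringIO("siren;nom\n012345678;Commune A\n987654321;Commune B\n")
df = load_ofgl(sample, config)
```

Declaring `siren` as a string at read time keeps leading zeros, so the same dtype mapping could be applied once, before the parquet file is written.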
There was a problem hiding this comment.
This is legacy code, and it is precisely why I prefer parquet over CSV.
Are you OK with me changing this format?
pd.concat(dataframes, axis=0, ignore_index=True)
.astype({"SIREN": str})
.assign(
    SIREN=lambda df: df["SIREN"].str.replace(".0", "").str.zfill(9),
Are those .0 suffixes really present in the raw data?
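For context, a minimal sketch of how a `.0` suffix can appear even when the raw file has none: pandas reads an integer column that contains a missing value as float64, and casting it to `str` then produces values like `'12345678.0'`. The sample data is invented, and whether this is what happens in the actual pipeline is an assumption.

```python
import io

import pandas as pd

# Invented sample: the second row has an empty SIREN cell.
raw = io.StringIO("SIREN;nom\n12345678;A\n;B\n213456789;C\n")

# Without an explicit dtype, the empty cell forces the column to float64,
# so the string cast carries a trailing ".0".
as_float = pd.read_csv(raw, sep=";")
as_text = as_float["SIREN"].astype(str)

# Cleanup after the fact: an anchored regex strips only a trailing ".0".
cleaned = as_text.str.replace(r"\.0$", "", regex=True).str.zfill(9)

# Reading the column as a string from the start avoids the suffix entirely.
raw.seek(0)
as_str = pd.read_csv(raw, sep=";", dtype={"SIREN": str})
siren = as_str["SIREN"].str.zfill(9)
```

Note that `str.replace(".0", "")` is a literal replacement in pandas 2.x, where `regex` defaults to `False`. In older versions the default was `regex=True`, so the same pattern would match any character followed by `0`.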
The Sirene dataset has 27M rows, and the final version already takes 1.5 GB in memory. I'm not sure that people with low memory will be able to pass this step without the memory optimization polars provides.
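To put numbers on the memory question, here is a hedged sketch of the usual pandas-side levers: read only the needed columns, declare compact dtypes, and parse in chunks. The `read_in_chunks` helper, the column names, and the sample data are hypothetical. Chunking only bounds the parser's working memory; the concatenated frame still has to fit in RAM.

```python
import io

import pandas as pd


def read_in_chunks(source, usecols, dtype, chunksize):
    """Read a large CSV piece by piece, keeping only the needed columns."""
    chunks = pd.read_csv(source, usecols=usecols, dtype=dtype, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)


# Invented stand-in for the real file: 1000 rows, one unused column.
rows = "".join(f"{i:09d},PME,x\n" for i in range(1000))
sample = io.StringIO("siren,categorie,autre\n" + rows)

df = read_in_chunks(
    sample,
    usecols=["siren", "categorie"],
    dtype={"siren": str, "categorie": "category"},
    chunksize=250,
)

# Deep memory usage in bytes, including the string payloads.
total_bytes = int(df.memory_usage(deep=True).sum())
```

Comparing `total_bytes` for the pandas and polars versions of this step on the real Sirene file would show whether the gap is large enough to justify the extra dependency.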