acornet
left a comment
Thanks so much :) a few remarks.
back/scripts/utils/datagouv_api.py
Outdated
"""
Fetch information about all resources of a given dataset.
"""
save_filename = (savedir or Path(".")) / f"dataset_{dataset_id}.parquet"
There was a problem hiding this comment.
Open question for later: how do we handle updates (new files)?
There was a problem hiding this comment.
I think we'll need a system of flags, both global and step-specific, to force the download again.
For this file it's easy: we can just remove the old files. For other content, though, we may want to combine old and new content. This will indeed need a discussion later on.
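A minimal sketch of the flag idea, to make the later discussion concrete. It assumes one global switch plus a set of step names to force. `should_force`, `fetch_if_needed`, `force_all` and `force_steps` are hypothetical names, not part of this PR, and it only covers the easy "replace the old file" case, not the merge of old and new content.

```python
from pathlib import Path
from typing import Callable, FrozenSet


def should_force(step: str, force_all: bool = False,
                 force_steps: FrozenSet[str] = frozenset()) -> bool:
    """Combine the global flag with the step-specific ones (hypothetical)."""
    return force_all or step in force_steps


def fetch_if_needed(target: Path, download: Callable[[], bytes],
                    force: bool = False) -> bool:
    """Write download() to target unless a cached copy already exists.

    Returns True when a download actually happened.
    """
    if target.exists() and not force:
        return False  # reuse the cached file
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(download())  # replaces any older file
    return True
```

A step would then call something like `fetch_if_needed(path, download, force=should_force("elus", force_all, force_steps))`.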
def _run_elus(self):
    elus = ElusWorkflow(self.source_folder)
    elus.fetch_raw_datasets()
There was a problem hiding this comment.
strange that the client needs to worry about this IMO
There was a problem hiding this comment.
Do you mean that this should not be the responsibility of the workflow manager, or that it should only call .run() and hide the 2 steps in there?
There was a problem hiding this comment.
I fixed it following point 2.
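For illustration, the shape being discussed (the client only calls `.run()`) could be sketched as below. `ElusWorkflowSketch` and `_combine_datasets` are hypothetical stand-ins; the step bodies only record that they ran, they are not the code from this PR.

```python
from typing import List


class ElusWorkflowSketch:
    """Hypothetical stand-in for ElusWorkflow; steps only record calls."""

    def __init__(self) -> None:
        self.steps_done: List[str] = []

    def _fetch_raw_datasets(self) -> None:
        # Placeholder for the download step.
        self.steps_done.append("fetch")

    def _combine_datasets(self) -> None:
        # Placeholder for the step that merges the sub-datasets.
        self.steps_done.append("combine")

    def run(self) -> None:
        """Single public entry point; the step order stays internal."""
        self._fetch_raw_datasets()
        self._combine_datasets()
```

With this shape, `_run_elus` would reduce to building the workflow and calling `run()`.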
resources = dataset_resources(self.dataset_id, savedir=self.data_folder)
combined = []
for _, resource in tqdm(resources.iterrows()):
    df = pd.read_parquet(self.raw_data_folder / f"{resource['resource_id']}.parquet")
There was a problem hiding this comment.
I think the path of the resource is part of fetch_raw_datasets's API, so it looks like it should return an array of file paths? Having both fetch_raw_datasets and its client re-compute the same path seems wrong.
Also, it seems wrong to read back something that we already had in memory, unless we are dealing with a crazy scale. Did you do this to reduce the memory footprint? What's the scale here?
There was a problem hiding this comment.
In this case it was a bit overkill to write and re-read, as we concat through pandas anyway.
I refactored to keep the sub-datasets in memory.
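For reference, the alternative raised in the first point (files stay on disk, but the fetch step returns the paths it wrote, so callers never rebuild them) might look roughly like this. The function name, its parameters and the injected `download` callable are assumptions for the sketch; the real `fetch_raw_datasets` takes no such arguments.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def fetch_raw_datasets_sketch(resource_ids: Iterable[str],
                              raw_data_folder: Path,
                              download: Callable[[str], bytes]) -> List[Path]:
    """Save each resource and return the paths that were written.

    The file naming lives only here, so callers use the returned
    paths instead of recomputing them.
    """
    raw_data_folder.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for resource_id in resource_ids:
        path = raw_data_folder / f"{resource_id}.parquet"
        path.write_bytes(download(resource_id))
        paths.append(path)
    return paths
```

The combine step would then iterate over the returned list rather than formatting `f"{resource_id}.parquet"` a second time.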
acornet
left a comment
LGTM, thanks! let's just add paths to config, plus some minor nits.
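One possible way to move the paths into config, as a sketch only: `ElusPathsConfig`, the `raw` sub-folder default and the method names are assumptions, while the two file name patterns are the ones visible in the snippets above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ElusPathsConfig:
    """Hypothetical config object holding every path the workflow uses."""

    data_folder: Path
    raw_subfolder: str = "raw"  # assumed layout, not taken from the PR

    @property
    def raw_data_folder(self) -> Path:
        return self.data_folder / self.raw_subfolder

    def dataset_index(self, dataset_id: str) -> Path:
        # Same pattern as save_filename in datagouv_api.py.
        return self.data_folder / f"dataset_{dataset_id}.parquet"

    def resource_file(self, resource_id: str) -> Path:
        # Same pattern as the per-resource parquet files.
        return self.raw_data_folder / f"{resource_id}.parquet"
```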
https://noco.services.dataforgood.fr/dashboard/#/nc/p6lbzxq9ra31no8/m6jcs2djf6sm3ee/Vue%20publique?rowId=36
Each resource in this dataset contains a different type of elected official (mayor, senator, deputy, ...).
The dataset page provides the links to all of the resources, which are then downloaded individually and converted to parquet.
The individual datasets are finally combined into a single dataset.
The interaction with the data.gouv API was refactored to reuse the datagouvsearcher logic in ElusWorkflow.