
[BACK] Add the elected officials (élus) dataset #43

Merged
cgoudet merged 26 commits into main from add_elus on Feb 24, 2025

Conversation

@cgoudet (Collaborator) commented Feb 19, 2025

https://noco.services.dataforgood.fr/dashboard/#/nc/p6lbzxq9ra31no8/m6jcs2djf6sm3ee/Vue%20publique?rowId=36

  • Each resource of this dataset contains a different type of elected official (mayor, senator, deputy, ...).

  • The dataset page provides the links to all the resources, which are then downloaded individually and converted to parquet.

  • The individual datasets are finally combined into a single dataset.

  • The interaction with the data.gouv API was reworked to reuse the datagouvsearcher logic in ElusWorkflow.
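The combination step described above can be sketched as follows. This is a minimal illustration, not the merged code: `build_elus_dataset` and the `type_elu` column name are assumptions, standing in for the per-resource frames after download and parquet conversion.

```python
import pandas as pd

def build_elus_dataset(resources: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Combine one DataFrame per elected-official type (mayor, senator, ...)
    into a single dataset, tagging each row with its source type."""
    # Hypothetical column name 'type_elu'; the PR's actual schema may differ.
    tagged = [df.assign(type_elu=label) for label, df in resources.items()]
    return pd.concat(tagged, ignore_index=True)
```

For example, two one-row frames keyed `"maire"` and `"senateur"` combine into a single two-row frame with a `type_elu` column identifying each source.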

@acornet (Contributor) left a comment

Thanks a lot :) a few remarks.

"""
Fetch information about all resources of a given dataset.
"""
save_filename = (savedir or Path(".")) / f"dataset_{dataset_id}.parquet"
Contributor

Open question for later: how do we handle updates (new files)?

@cgoudet (Collaborator, Author) replied Feb 21, 2025

I think we'll need a system of flags, both global and step-specific, to force the download again.

For this file it's easy: we can just remove the old files. But for some other content we may want to merge the old and new content. This will indeed need a discussion later on.
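The flag idea floated here could look like the sketch below. Everything in it is an assumption for illustration (`needs_download` and both flag names are not in the project); it only shows the gating logic: re-download when forced, or when the cached file is missing.

```python
from pathlib import Path

def needs_download(target: Path, force_all: bool = False, force_step: bool = False) -> bool:
    """Re-download when a global or step-specific force flag is set,
    or when the cached file is absent."""
    return force_all or force_step or not target.exists()
```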


def _run_elus(self):
    elus = ElusWorkflow(self.source_folder)
    elus.fetch_raw_datasets()
Contributor

Strange that the client needs to worry about this, IMO.

Collaborator Author

Do you mean that this should not be the responsibility of the workflow manager, or that it should only call .run() and hide the 2 steps in there?

Collaborator Author

I corrected for point 2.
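"Point 2" above (expose a single `.run()` and hide the individual steps) could be sketched like this; the class body is illustrative only, with a `steps` list standing in for the real fetch and combine work.

```python
class ElusWorkflow:
    """Toy stand-in for the workflow: run() hides the two internal steps."""

    def __init__(self, source_folder):
        self.source_folder = source_folder
        self.steps = []  # records executed steps, for illustration only

    def fetch_raw_datasets(self):
        self.steps.append("fetch")

    def combine(self):
        self.steps.append("combine")

    def run(self):
        # The client calls only run(); the step ordering stays internal.
        self.fetch_raw_datasets()
        self.combine()
```

With this shape, `_run_elus` reduces to `ElusWorkflow(self.source_folder).run()`.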

resources = dataset_resources(self.dataset_id, savedir=self.data_folder)
combined = []
for _, resource in tqdm(resources.iterrows()):
    df = pd.read_parquet(self.raw_data_folder / f"{resource['resource_id']}.parquet")
Contributor

I think the path of the resource is part of fetch_raw_datasets's API, so it looks like it should return an array of file paths? Having both fetch_raw_datasets and its client re-compute the same path seems wrong.
Also, it seems wrong to read again something that we had in memory, unless we are dealing with a crazy scale. Did you do this to reduce the memory footprint? What's the scale here?

Collaborator Author

In this case it was a bit overkill to write and re-read, as we concat through pandas anyway.
I reformatted the code to keep the sub-datasets in memory.
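The rework described in this reply can be sketched as: the fetch step returns the parsed DataFrames, so the combine step concatenates them directly instead of re-reading parquet files from disk. Function names here are assumptions based on the thread, not the merged code.

```python
import pandas as pd

def fetch_raw_datasets(raw_tables):
    # Stand-in for downloading and parsing each resource; the parsed
    # frames are returned in memory rather than round-tripped to disk.
    return [pd.DataFrame(t) for t in raw_tables]

def combine(frames):
    # Concatenate the in-memory frames directly; no read_parquet needed.
    return pd.concat(frames, ignore_index=True)
```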

@acornet (Contributor) left a comment

LGTM, thanks! Let's just add paths to config, plus some minor nits.

@cgoudet cgoudet merged commit 3bdd915 into main Feb 24, 2025
2 checks passed
@cgoudet cgoudet deleted the add_elus branch February 24, 2025 21:05