
feat: parse wfs for meta and layers infos #385

Open

abulte wants to merge 7 commits into datagouv:main from ecolabdata:feat/wfs

Conversation

@abulte (Contributor) commented Feb 3, 2026

This adds support for WFS services to hydra 🐙.

If a WFS is detected from the URL, hydra fetches general info (WFS version, output formats) and layer info (names and supported projections). The general workflow is preserved (this can be challenged): the WFS is handled like any other resource type, then augmented with the scraped info.
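For illustration, URL-based detection could look roughly like this (a minimal sketch; is_wfs_url and its heuristic are illustrative, not hydra's actual detection code):

from urllib.parse import parse_qs, urlparse


def is_wfs_url(url: str) -> bool:
    # Illustrative heuristic: treat a URL as WFS when its query string
    # declares SERVICE=WFS (case-insensitive). The real detection may use
    # more hints (resource.format, path patterns, ...).
    query = parse_qs(urlparse(url).query.lower())
    return "wfs" in query.get("service", [])


# is_wfs_url("https://geobretagne.fr/geoserver/hlc/wfs?REQUEST=GetCapabilities&SERVICE=WFS") -> True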

The scraped info is stored in checks.ogc_metadata as JSONB. It is sent to udata as analysis:parsing:ogc_metadata for every resource. I went with a full object since the layer structure has to be complex anyway (i.e. more than a list of strings).

There's a bit of future-proofing here: ogc_metadata is ready for other formats if needed (eg WMS).

owslib is used to parse the WFS XML: the lib seems mature, and it's always nice to avoid parsing XML manually 😬.
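For reference, a minimal sketch of the kind of extraction owslib enables (attribute and method names as I understand owslib's WFS API; the actual code in this PR may differ):

from owslib.wfs import WebFeatureService

url = "https://geobretagne.fr/geoserver/hlc/wfs"
wfs = WebFeatureService(url, version="2.0.0")

# wfs.contents maps feature type names to their metadata (title, CRS options, ...)
feature_types = [
    {"name": name, "crs": [crs.getcode() for crs in layer.crsOptions]}
    for name, layer in wfs.contents.items()
]

# Output formats advertised for GetFeature in the capabilities document
get_feature = wfs.getOperationByName("GetFeature")
output_formats = get_feature.parameters.get("outputFormat", {}).get("values", [])

metadata = {
    "format": "wfs",
    "version": wfs.version,
    "feature_types": feature_types,
    "output_formats": output_formats,
}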

I have tested end-to-end with a local udata instance.

Bonus: a bit of refactoring in cli.py with a _find_check helper.

❯ uv run udata-hydra analyse-ogc --url 'https://geobretagne.fr/geoserver/hlc/wfs?REQUEST=GetCapabilities&SERVICE=WFS'
2026-02-04 16:51:55 dev.local owslib.feature.wfs100[86462] WARNING pyproj not installed
2026-02-04 16:51:55 dev.local udata-hydra[86462] ERROR Could not find a check linked to the specified URL
2026-02-04 16:51:55 dev.local udata-hydra[86462] DEBUG Starting OGC analysis for https://geobretagne.fr/geoserver/hlc/wfs?REQUEST=GetCapabilities&SERVICE=WFS
2026-02-04 16:51:55 dev.local udata-hydra[86462] DEBUG OGC analysis complete for https://geobretagne.fr/geoserver/hlc/wfs?REQUEST=GetCapabilities&SERVICE=WFS: 10 feature types found
2026-02-04 16:51:55 dev.local udata-hydra[86462] INFO OGC analysis completed successfully.
2026-02-04 16:51:55 dev.local udata-hydra[86462] DEBUG {
  "format": "wfs",
  "version": "2.0.0",
  "feature_types": [
    {
      "name": "hlc:classement_voirie_po",
      "default_crs": "EPSG:3948",
      "other_crs": []
    },
    {
      "name": "hlc:classement_voirie_li",
      "default_crs": "EPSG:3948",
      "other_crs": []
    },
    {
      "name": "hlc:dechetteries_pt",
      "default_crs": "EPSG:3948",
      "other_crs": []
    },
    {
      "name": "hlc:lieuxcovoiturage_hlc_pt",
      "default_crs": "EPSG:2154",
      "other_crs": []
    },
    {
      "name": "hlc:mediatheque_hlc_pt",
      "default_crs": "EPSG:3948",
      "other_crs": []
    },
    {
      "name": "hlc:pav_zoneinfluence_parcelle_po",
      "default_crs": "EPSG:3948",
      "other_crs": []
    },
    {
      "name": "hlc:dechetpav_pt",
      "default_crs": "EPSG:3948",
      "other_crs": []
    },
    {
      "name": "hlc:qualite_eaux_baignade_station",
      "default_crs": "EPSG:3948",
      "other_crs": []
    },
    {
      "name": "hlc:voirie_statistiques_par_ville",
      "default_crs": "EPSG:3948",
      "other_crs": []
    },
    {
      "name": "hlc:sd_observation_terrain_flore_rare_hlc_pt",
      "default_crs": "EPSG:3948",
      "other_crs": []
    }
  ],
  "output_formats": [
    "application/gml+xml; version=3.2",
    "GML2",
    "KML",
    "SHAPE-ZIP",
    "application/geopackage+sqlite3",
    "application/json",
    "application/vnd.google-earth.kml xml",
    "application/vnd.google-earth.kml+xml",
    "application/vnd.ogc.fg+json",
    "application/x-gpkg",
    "csv",
    "dxf",
    "geopackage",
    "geopkg",
    "gml3",
    "gml32",
    "gpkg",
    "gpx",
    "json",
    "mif",
    "ods",
    "tab",
    "text/csv",
    "text/xml; subtype=gml/2.1.2",
    "text/xml; subtype=gml/3.1.1",
    "text/xml; subtype=gml/3.2",
    "xlsx"
  ]
}

Fix ecolabdata/ecospheres#892
Related ecolabdata/ecospheres#846

@abulte (Contributor, Author) commented Feb 3, 2026

TODO

  • final review of tests
  • should we keep urn:ogc:def:crs:EPSG::3948 for projections or a more standard EPSG:3948? Short codes are probably better for our usages (frontend libs and QGIS), and this is supported by owslib (see the sketch after this list)
  • use more hints for detection (resource.format)
  • future-proof naming (ogc vs wfs)
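If we go with short codes, the conversion from the URN form is mechanical; a minimal sketch (assuming the urn:ogc:def:crs:AUTHORITY:VERSION:CODE pattern shown above; owslib's Crs wrapper should give the same result):

import re

OGC_CRS_URN = re.compile(r"^urn:ogc:def:crs:(?P<authority>[^:]+):[^:]*:(?P<code>.+)$")


def to_short_crs(crs: str) -> str:
    # Turn "urn:ogc:def:crs:EPSG::3948" into "EPSG:3948"; pass through anything else.
    match = OGC_CRS_URN.match(crs)
    if match:
        return f"{match['authority']}:{match['code']}"
    return crs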

@abulte marked this pull request as ready for review on February 4, 2026 at 15:57
@maudetes (Contributor) left a comment


Nice! It works nicely 🎉
My main comment is on the scope of the hydra detection 😛


# Extract service metadata
try:
    metadata = {
@maudetes (Contributor) commented

I'm not sure we want to send the entire WFS capabilities for every layer in the resource extras?
For example, this resource would have 600+ layers described in its extras, with 115 CRS for each.
If we end up with one WFS URL for each layer, we would be bloating the dataset a bit.

Shouldn't we resolve the layer in hydra and ignore the others?

@abulte (Contributor, Author) commented Feb 5, 2026

Interesting! I was wary of doing too much business logic in hydra, but we could put the resource title / url layer name detection here.

But we also need the full layer list to show to the user or for the QGIS export when no valid layer name is detected (or maybe even to let the user switch layers easily, on a frontend map for example)...

So:

  1. do the layer matching game, and only send the full list of layers when no valid match is found?
  2. handle a MAX_LAYERS threshold
  3. both
  4. stay as is and let the consumers do the layer dance

WDYT?
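To make options 1 and 2 concrete, a sketch of what the combination could look like (MAX_LAYERS and the matching heuristic are illustrative, not existing hydra config):

MAX_LAYERS = 50  # hypothetical threshold, not an existing hydra setting


def resolve_layers(resource_title: str, url: str, feature_types: list[dict]) -> list[dict]:
    # Option 1: keep only the feature types whose local name appears in the
    # resource title or URL; option 2: fall back to a capped list when nothing matches.
    needles = (resource_title.lower(), url.lower())
    matched = [
        ft for ft in feature_types
        if any(ft["name"].lower().split(":")[-1] in needle for needle in needles)
    ]
    return matched or feature_types[:MAX_LAYERS]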

@abulte (Contributor, Author) commented

Claude says 600+ layers with 115 CRS would be a 1MB payload, so that's definitely an issue...

@abulte (Contributor, Author) commented

And obviously, the biggest issue is the CRS multiplication. We could:

  1. store only the default CRS
  2. store only the CRS defined in an accept list (most common)

I'd say go with 1 and keep all the layers for now.
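For the record, option 1 is a one-liner on the structure above (a sketch, keeping the field names from the CLI output in the description):

def keep_default_crs_only(feature_types: list[dict]) -> list[dict]:
    # Drop the potentially huge other_crs lists and keep only each layer's default CRS.
    return [
        {"name": ft["name"], "default_crs": ft["default_crs"]}
        for ft in feature_types
    ]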

@maudetes (Contributor) commented

I would agree with storing only the default CRS for now.
For the layers, I think the layer deduction logic is the most important part here. Listing the features is quite easy, but detecting the right layer is the more complex part and would thus help the different frontend usages the most, WDYT?

payload["document"]["analysis:parsing:geojson_size"] = check.get("geojson_size")
if config.OGC_ANALYSIS_ENABLED and check.get("ogc_metadata"):
ogc_metadata = check.get("ogc_metadata")
if isinstance(ogc_metadata, str):
@maudetes (Contributor) commented

Why would it be a string?

@abulte (Contributor, Author) commented

These days, it's always a string :p See here:

def convert_dict_values_to_json(data: dict) -> dict:
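Presumably the helper does something in this spirit, which would explain the isinstance(..., str) guard above (a sketch under that assumption, not the actual udata-hydra code):

import json


def convert_dict_values_to_json(data: dict) -> dict:
    # Sketch: serialise nested dict/list values to JSON strings so they can be
    # stored in JSONB columns, which is why the value read back from the check
    # arrives as a string and needs json.loads() before use.
    return {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in data.items()
    }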
