Lightweight modelling workflow for the EBDP TWIN2EXPAND project, supporting evidence-based approaches to urban design and planning.
NEW: This project includes a comprehensive data quality analysis framework that assigns confidence scores to each city based on POI data completeness and reliability.
- Quick Start: See DATA_QUALITY_QUICKSTART.md for a beginner-friendly guide
- Technical Details: See paper_data/code/README.md for the full pipeline
- Practical Guide: See paper_research/code/eg1_data_quality/README.md for filtering strategies
Key Features:
- Confidence scores (HIGH/MEDIUM/LOW) for all cities
- Validated against official datasets (France/Netherlands)
- 4 filtering scenarios for different research goals
- Automatic outputs (visualizations, reports, scores)
Project configuration is managed with a pyproject.toml file. For development, uv is used to install and manage packages and related upgrades. For example, uv sync installs the packages listed in pyproject.toml and creates a self-contained development environment in a .venv folder.
See the data_loading.md markdown file for data loading guidelines.
This repo depends on copyleft open source packages licensed under AGPLv3 and therefore adopts the same license. This is also in keeping with the TWIN2EXPAND project's intention to create openly reproducible workflows.
The Overture Maps data source is licensed under the Community Data License Agreement – Permissive, Version 2.0, with some layers licensed under the Open Data Commons Open Database License. OpenStreetMap data is © OpenStreetMap contributors.
The data sources are a combination of EU Copernicus data and Overture Maps. Overture data largely resembles OpenStreetMap, but Overture intends to provide a higher degree of data verification and issues fixed releases.
Boundaries are extracted from the 2021 Urban Centres / High Density Clusters dataset. This is a 1 x 1 km raster in which high density clusters are defined as contiguous 1 km² cells with at least 1,500 residents per km², forming urban clusters with a cumulative population of at least 50,000 people.
Download the dataset from the above link. Then run the generate_boundary_polys.py script to generate vector boundaries from the raster source. Provide the input path to the TIFF file and the output path for the generated vector boundaries in GPKG format. The generated GPKG will contain three layers named bounds, unioned_bounds_2000, and unioned_bounds_10000. The script automatically removes boundaries intersecting the UK.
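The core polygonisation step can be sketched as follows (a minimal sketch, assuming cluster cells are marked with values greater than zero in the raster; the actual script additionally writes the buffered unioned_bounds_2000 and unioned_bounds_10000 layers and drops UK-intersecting boundaries):

```python
import numpy as np
import rasterio
import rasterio.features
import geopandas as gpd
from shapely.geometry import shape

with rasterio.open("temp/HDENS-CLST-2021/HDENS_CLST_2021.tif") as src:
    data = src.read(1)
    mask = data > 0  # assumption: values > 0 mark high density cluster cells
    polys = [
        shape(geom)
        for geom, _ in rasterio.features.shapes(
            mask.astype(np.uint8), mask=mask, transform=src.transform
        )
    ]
    crs = src.crs

bounds = gpd.GeoDataFrame(geometry=polys, crs=crs)
# Merge touching cells into contiguous cluster polygons
bounds = bounds.dissolve().explode(index_parts=False).reset_index(drop=True)
bounds.to_file("temp/datasets/boundaries.gpkg", layer="bounds", driver="GPKG")
```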
Example:
python -m src.data.generate_boundary_polys temp/HDENS-CLST-2021/HDENS_CLST_2021.tif temp/datasets/boundaries.gpkg

Urban Atlas (~37 GB vectors)
Run the load_urban_atlas_blocks.py script to generate the blocks data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded Urban Atlas data and an output path for the generated blocks GPKG.
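The clipping step amounts to intersecting the Urban Atlas land-use polygons with the generated boundaries. A minimal sketch, assuming a single Urban Atlas GeoPackage (the file name below is hypothetical; the real script iterates over the downloaded UA_2018_3035_eu directory):

```python
import geopandas as gpd

bounds = gpd.read_file("temp/datasets/boundaries.gpkg", layer="bounds")
# Hypothetical file name for one city's Urban Atlas delivery.
ua = gpd.read_file("temp/UA_2018_3035_eu/some_city_UA2018.gpkg")
ua = ua.to_crs(bounds.crs)

# Keep only land-use polygons that fall inside the boundaries.
blocks = gpd.clip(ua, bounds)
blocks.to_file("temp/datasets/blocks.gpkg", layer="blocks", driver="GPKG")
```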
Example:
python -m src.data.load_urban_atlas_blocks \
temp/datasets/boundaries.gpkg \
temp/UA_2018_3035_eu \
temp/datasets/blocks.gpkg

Tree cover (~36 GB vectors)
Run the load_urban_atlas_trees.py script to generate the tree cover data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded STL data and an output path for the generated tree cover GPKG.
Example:
python -m src.data.load_urban_atlas_trees \
temp/datasets/boundaries.gpkg \
temp/STL_2018_3035_eu \
temp/datasets/tree_canopies.gpkg

Digital Height Model (~1 GB raster)
Run the load_bldg_hts_raster.py script to generate the building heights data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded building height data and an output folder path for the extracted building heights TIFF files.
Example:
python -m src.data.load_bldg_hts_raster \
temp/datasets/boundaries.gpkg \
temp/Building_Height_2012_3035_eu \
temp/cities_data/heights

Run the load_overture.py script to download and prepare the Overture data. The script will download the relevant Overture GPKG files for each boundary, clip them to the boundary, and save them to the output directory. Provide the path to the boundaries GPKG generated previously, as well as an output directory for the clipped Overture data. Optionally, specify the number of parallel workers to speed up processing; by default, 2 workers are used. Pass the additional argument --overwrite to redo processing for boundaries that already have corresponding Overture data in the output directory; otherwise, existing data will be skipped. Each boundary will be saved as a separate GPKG file named with the boundary ID, containing layers for buildings, street edges, street nodes, a cleaned version of the street edges (clean_edges), POI places, and infrastructure.
python -m src.data.load_overture \
temp/datasets/boundaries.gpkg \
temp/cities_data/overture \
--parallel_workers 6

The Overture POI schema is based on docs/schema/concepts/by-theme/places/overture_categories.csv.
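Once a boundary has been processed, its layers can be inspected directly. A minimal sketch, where the boundary ID in the file name is hypothetical and the places layer name is assumed from the description above:

```python
import fiona
import geopandas as gpd

# Hypothetical boundary ID; actual file names follow the IDs in boundaries.gpkg.
path = "temp/cities_data/overture/1234.gpkg"

# Expect layers for buildings, street edges/nodes, clean_edges, places, infrastructure.
print(fiona.listlayers(path))

places = gpd.read_file(path, layer="places")  # layer name assumed from the description
print(places.head())
```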
GeoStat Census data for 2021 is downloaded from Eurostat. These census statistics are aggregated to 1 km² cells.
Download the census ZIP dataset for Version 2021 (22 January 2025).
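Before running the metrics step, the census GPKG can be inspected along these lines (a sketch only; layer and attribute names are not documented here, so list them first):

```python
import fiona
import geopandas as gpd

path = "temp/Eurostat_Census-GRID_2021_V2/ESTAT_Census_2021_V2.gpkg"
print(fiona.listlayers(path))  # inspect available layers before reading

cells = gpd.read_file(path, layer=fiona.listlayers(path)[0])
bounds = gpd.read_file("temp/datasets/boundaries.gpkg", layer="bounds")

# Keep only the 1 km² cells that intersect the study boundaries.
cells_in_bounds = gpd.clip(cells.to_crs(bounds.crs), bounds)
print(len(cells_in_bounds), "census cells intersect the study boundaries")
```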
Compute metrics using the generate_metrics.py script. Provide the path to the boundaries GPKG, the directory containing the processed Overture data, the blocks GPKG, the tree canopies GPKG, the building heights directory, the census GPKG, and an output directory for the generated metrics GPKG files.
python -m src.processing.generate_metrics \
temp/datasets/boundaries.gpkg \
temp/cities_data/overture \
temp/datasets/blocks.gpkg \
temp/datasets/tree_canopies.gpkg \
temp/cities_data/heights \
temp/Eurostat_Census-GRID_2021_V2/ESTAT_Census_2021_V2.gpkg \
temp/cities_data/processed

The POI characterisation pipeline compares Overture Maps POI spatial patterns against official business registries (SIRENE for France, BAG for the Netherlands) across 127 reference cities. This establishes which types of analyses the data may support and at which spatial scales.
Run the characterisation pipeline:
python paper_data/code/validation/run_all.py

Or as part of the full data paper pipeline:
python paper_data/code/run_all.py

The pipeline:
- Density residuals: For each POI category, fits log(count) ~ log(population) via linear regression across all cities. Residuals indicate whether a city has more or fewer POIs than expected given its population.
- Multi-scale grid comparison: Compares Overture and official POI distributions at multiple grid resolutions (15 m–800 m) using Cohen's kappa (presence similarity) and Spearman rho (count correlation). A minimal sketch of these two steps follows this list.
- Bootstrap confidence intervals: City-level block bootstrap (N=1000) estimates uncertainty in similarity metrics.
- Country-split analysis: Compares France and Netherlands patterns to assess generalisability.
- Indicator analysis: Tests whether density residuals and mean Overture confidence correlate with pattern similarity (kappa) in reference cities.
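A minimal sketch of the density-residual and grid-comparison calculations, using illustrative values and assumed input shapes (a per-city table for one POI category, and per-cell counts for one city at one grid resolution):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# --- Density residuals: log(count) ~ log(population) across cities ---
cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "poi_count": [1200, 340, 5600, 880],          # illustrative values
    "population": [150_000, 60_000, 900_000, 120_000],
})
x, y = np.log(cities["population"]), np.log(cities["poi_count"])
fit = stats.linregress(x, y)
# Positive residual: more POIs than expected given the city's population.
cities["density_residual"] = y - (fit.intercept + fit.slope * x)

# --- Multi-scale comparison: presence similarity and count correlation ---
overture_counts = np.array([0, 3, 12, 0, 7, 1, 0, 25])   # per grid cell
official_counts = np.array([1, 5, 20, 0, 9, 0, 0, 40])   # per grid cell
kappa = cohen_kappa_score(overture_counts > 0, official_counts > 0)
rho, _ = stats.spearmanr(overture_counts, official_counts)

print(cities[["city", "density_residual"]])
print(f"kappa={kappa:.2f}, rho={rho:.2f}")
```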
Key findings:
- Pattern similarity improves with spatial aggregation: fine resolutions (<100 m) show inverted correlations; 400 m is the minimum for count-based analyses
- Consumer-facing categories (Retail, Eat & Drink) show highest similarity; Health & Medical is weakest
- Coverage averages ~45% of official counts, but spatial pattern similarity is independent of coverage
- Density residuals and Overture confidence may help screen non-characterised cities, but their predictive value varies by category
The POI characterisation pipeline compares Overture POI spatial patterns against official national business registries. These datasets are used for the characterisation described above and in the data paper but are not required to run the main SOAR pipeline.
Official Name: Base Sirène des entreprises et de leurs établissements (SIREN, SIRET)
Source: Institut national de la statistique et des études économiques (INSEE)
License: Open License 2.0 (Licence Ouverte / Etalab v2.0)
Description: National registry of all French businesses and establishments with economic activity codes (APE - Activité Principale Exercée), geographic coordinates, and administrative status.
Coverage: ~31 million establishments (as of January 2026)
Download:
- URL: https://www.data.gouv.fr/en/datasets/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/
- Direct link: https://files.data.gouv.fr/insee-sirene/StockEtablissement_utf8.zip
- Format: CSV (compressed as ZIP, ~3-4 GB) or Parquet
- Update frequency: Monthly
Citation:
INSEE (2026). Base Sirène des entreprises et de leurs établissements (SIREN, SIRET).
Institut National de la Statistique et des Études Économiques.
https://www.data.gouv.fr/en/datasets/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/
Retrieved: January 2026
Required columns:
- siret - Unique establishment identifier (14 digits)
- activitePrincipaleEtablissement - Economic activity code (APE/NAF)
- etatAdministratifEtablissement - Administrative status (A = active, F = closed)
- coordonneeLambertAbscisseEtablissement - X coordinate (Lambert 93 projection)
- coordonneeLambertOrdonneeEtablissement - Y coordinate (Lambert 93 projection)
Classification system: APE codes follow the French NAF classification (Nomenclature d'Activités Française), derived from EU NACE Rev. 2 standard. See: https://www.insee.fr/en/metadonnees/nafr2/
Characterisation usage: Harmonised to 5 POI categories via APE code mapping (see paper_data/code/validation/ for details).
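A minimal sketch of how these columns might be turned into a WGS84 point layer (assuming a Parquet input under temp/validation/ and Lambert 93 as EPSG:2154; the project's actual preparation script is referenced further below):

```python
import pandas as pd
import geopandas as gpd

cols = [
    "siret",
    "activitePrincipaleEtablissement",
    "etatAdministratifEtablissement",
    "coordonneeLambertAbscisseEtablissement",
    "coordonneeLambertOrdonneeEtablissement",
]
df = pd.read_parquet("temp/validation/StockEtablissement_utf8.parquet", columns=cols)

# Keep active establishments with usable coordinates.
df = df[df["etatAdministratifEtablissement"] == "A"].dropna(subset=cols[-2:])

gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(
        df["coordonneeLambertAbscisseEtablissement"],
        df["coordonneeLambertOrdonneeEtablissement"],
    ),
    crs="EPSG:2154",  # Lambert 93
).to_crs("EPSG:4326")

gdf.to_file("temp/validation/sirene_france.gpkg", driver="GPKG")
```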
Official Name: Basisregistratie Adressen en Gebouwen (BAG) - Basic Registration of Addresses and Buildings
Source: Kadaster (Dutch Cadastre, Land Registry and Mapping Agency)
License: CC0 1.0 Universal (Public Domain)
Description: National registry of all buildings and addresses in the Netherlands with usage designations (gebruiksdoel), geometric footprints, and construction status.
Coverage: ~10 million buildings with ~18 million address objects (as of January 2026)
Download:
- URL: https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract
- Direct link: https://service.pdok.nl/lv/bag/atom/downloads/lvbag-extract-nl.zip
- Format: XML/GML files (compressed as ZIP, ~5 GB total for full national extract)
- Update frequency: Daily
Citation:
Kadaster (2026). Basisregistratie Adressen en Gebouwen (BAG) 2.0 Extract.
Kadaster, Dutch Land Registry and Mapping Agency.
https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract
Retrieved: January 2026
Required file: 9999VBO08012026.zip from the BAG extract
- VBO = Verblijfsobject (dwelling object / address object with usage purpose)
Required fields:
- identificatie - Unique object identifier (16 digits)
- gebruiksdoel - Usage purpose designation (functional category)
- geometry - Building footprint or address point (RD New projection, EPSG:28992)
- status - Object status (use only active records)
Classification system: Gebruiksdoel (usage purposes) include:
- woonfunctie - Residential function
- winkelfunctie - Shop/retail function
- logiesfunctie - Lodging/accommodation function
- bijeenkomstfunctie - Meeting/assembly function
- gezondheidszorgfunctie - Healthcare function
- onderwijsfunctie - Education function
- And others (see the BAG documentation)
Documentation: https://zakelijk.kadaster.nl/bag-2.0-extract
Characterisation usage: Harmonised to 5 POI categories via usage purpose mapping (see paper_data/code/validation/ for details).
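A minimal sketch of the kind of usage-purpose mapping involved; the dictionary below is illustrative only, and category names beyond Retail, Eat & Drink, and Health & Medical are placeholders (see paper_data/code/validation/ for the project's actual mapping):

```python
from typing import Optional

# Illustrative mapping only; not the project's definitive harmonisation.
GEBRUIKSDOEL_TO_CATEGORY = {
    "winkelfunctie": "Retail",
    "bijeenkomstfunctie": "Eat & Drink",        # assumption: assembly incl. cafes/restaurants
    "gezondheidszorgfunctie": "Health & Medical",
    "onderwijsfunctie": "Education",            # placeholder category name
    "logiesfunctie": "Accommodation",           # placeholder category name
}

def harmonise(gebruiksdoel: str) -> Optional[str]:
    """Return a harmonised POI category, or None for out-of-scope purposes
    such as woonfunctie (residential)."""
    return GEBRUIKSDOEL_TO_CATEGORY.get(gebruiksdoel)

print(harmonise("winkelfunctie"))  # Retail
print(harmonise("woonfunctie"))    # None
```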
Location: Place reference datasets in temp/validation/ directory
Preparation script: paper_data/code/validation/run_all.py (which calls src/poi_patterns/setup/prepare_reference_data.py)
This script:
- Loads SIRENE Parquet/CSV and filters to active establishments with coordinates
- Extracts BAG VBO data from ZIP and filters to objects with usage purposes
- Converts coordinates to WGS84 (EPSG:4326)
- Filters to European geographic extent
- Saves processed GeoPackage files:
sirene_france.gpkg, bag_netherlands.gpkg
Usage:
# Place raw data in temp/validation/
# - StockEtablissement_utf8.parquet (or .csv)
# - lvbag-extract-nl/ (directory with ZIP files)
# Run preparation
python paper_research/code/eg1_data_quality/prepare_validation_data.py
# Outputs:
# - temp/validation/sirene_france.gpkg (~8-12 GB)
# - temp/validation/bag_netherlands.gpkg (~4-6 GB)

Validation methodology: See paper_research/code/eg1_data_quality/VALIDATION_FRAMEWORK.md and VALIDATION_PROCESS_SUMMARY.md for complete documentation of:
- Category harmonization approach
- City selection strategy (24 cities stratified by population)
- Statistical analysis methods (Spearman rank correlation, systematic bias quantification)
- Interpretation guidelines
When using SOAR with validation:
For methods section:
The SOAR dataset was validated against official national registries: the French SIRENE business registry (INSEE, 2026) covering ~31 million establishments and the Netherlands BAG building registry (Kadaster, 2026) covering ~10 million buildings.
For acknowledgments:
This research used data from INSEE (Institut National de la Statistique et des Études Économiques) and Kadaster (Dutch Land Registry and Mapping Agency).
For data availability statement:
SIRENE data are publicly available from https://www.data.gouv.fr/en/datasets/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/ under Open License 2.0. BAG data are publicly available from https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract under CC0 1.0 Universal license.