Lightweight modelling workflow for the EBDP TWIN2EXPAND project, supporting evidence-based approaches to urban design and planning.
NEW: This project includes a comprehensive data quality analysis framework that assigns confidence scores to each city based on POI data completeness and reliability.
- Quick Start: See DATA_QUALITY_QUICKSTART.md for a beginner-friendly guide
- Technical Details: See paper_data/code/README.md for the full pipeline
- Practical Guide: See paper_research/code/eg1_data_quality/README.md for filtering strategies
Key Features:
- Confidence scores (HIGH/MEDIUM/LOW) for all cities
- Validated against official datasets (France/Netherlands)
- 4 filtering scenarios for different research goals
- Automatic outputs (visualizations, reports, scores)
Project configuration is managed with a pyproject.toml file. For development, uv is used to install and manage packages and related upgrades. For example, uv sync installs the packages listed in pyproject.toml and creates a self-contained development environment in a .venv folder.
See the data_loading.md markdown file for data loading guidelines.
This repo depends on copyleft open source packages licensed under AGPLv3 and therefore adopts the same license. This is also in keeping with the TWIN2EXPAND project's intention to create openly reproducible workflows.
The Overture Maps data source is licensed under the Community Data License Agreement – Permissive, Version 2.0, with some layers licensed under the Open Data Commons Open Database License. OpenStreetMap data is © OpenStreetMap contributors.
The data sources are a combination of EU Copernicus data and Overture Maps. Overture data largely resembles OpenStreetMap, but Overture intends to provide a higher degree of data verification and issues fixed releases.
Boundaries are extracted from the 2021 Urban Centres / High Density Clusters dataset. This is a 1 x 1 km raster in which high density clusters are defined as contiguous 1 km² cells with at least 1,500 residents per km², forming urban clusters with a cumulative population of at least 50,000 people.
Download the dataset from the above link. Then run the generate_boundary_polys.py script to generate vector boundaries from the raster source. Provide the input path to the TIFF file and the output path for the generated vector boundaries in GPKG format. The generated GPKG will contain three layers named bounds, unioned_bounds_2000, and unioned_bounds_10000. The script automatically removes boundaries intersecting the UK.
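The core polygonisation step can be sketched as follows (a minimal sketch, assuming cluster cells are marked with values greater than zero in the raster; the actual script additionally writes the buffered unioned_bounds_2000 and unioned_bounds_10000 layers and drops UK-intersecting boundaries):

```python
import numpy as np
import rasterio
import rasterio.features
import geopandas as gpd
from shapely.geometry import shape

with rasterio.open("temp/HDENS-CLST-2021/HDENS_CLST_2021.tif") as src:
    data = src.read(1)
    mask = data > 0  # assumption: values > 0 mark high density cluster cells
    polys = [
        shape(geom)
        for geom, _ in rasterio.features.shapes(
            mask.astype(np.uint8), mask=mask, transform=src.transform
        )
    ]
    crs = src.crs

bounds = gpd.GeoDataFrame(geometry=polys, crs=crs)
# Merge touching cells into contiguous cluster polygons
bounds = bounds.dissolve().explode(index_parts=False).reset_index(drop=True)
bounds.to_file("temp/datasets/boundaries.gpkg", layer="bounds", driver="GPKG")
```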
Example:
python -m src.data.generate_boundary_polys temp/HDENS-CLST-2021/HDENS_CLST_2021.tif temp/datasets/boundaries.gpkg

Urban Atlas (~37 GB vectors)
Run the load_urban_atlas_blocks.py script to generate the blocks data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded Urban Atlas data and an output path for the generated blocks GPKG.
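The clipping step amounts to intersecting the Urban Atlas land-use polygons with the generated boundaries. A minimal sketch, assuming a single Urban Atlas GeoPackage (the file name below is hypothetical; the real script iterates over the downloaded UA_2018_3035_eu directory):

```python
import geopandas as gpd

bounds = gpd.read_file("temp/datasets/boundaries.gpkg", layer="bounds")
# Hypothetical file name for one city's Urban Atlas delivery.
ua = gpd.read_file("temp/UA_2018_3035_eu/some_city_UA2018.gpkg")
ua = ua.to_crs(bounds.crs)

# Keep only land-use polygons that fall inside the boundaries.
blocks = gpd.clip(ua, bounds)
blocks.to_file("temp/datasets/blocks.gpkg", layer="blocks", driver="GPKG")
```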
Example:
python -m src.data.load_urban_atlas_blocks \
temp/datasets/boundaries.gpkg \
temp/UA_2018_3035_eu \
temp/datasets/blocks.gpkg

Tree cover (~36 GB vectors)
Run the load_urban_atlas_trees.py script to generate the tree cover data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded STL data and an output path for the generated tree cover GPKG.
Example:
python -m src.data.load_urban_atlas_trees \
temp/datasets/boundaries.gpkg \
temp/STL_2018_3035_eu \
temp/datasets/tree_canopies.gpkg

Digital Height Model (~1 GB raster)
Run the load_bldg_hts_raster.py script to generate the building heights data. Provide the path to the boundaries GPKG generated previously, as well as the downloaded building height data and an output folder path for the extracted building heights TIFF files.
Example:
python -m src.data.load_bldg_hts_raster \
temp/datasets/boundaries.gpkg \
temp/Building_Height_2012_3035_eu \
temp/cities_data/heights

Run the load_overture.py script to download and prepare the Overture data. The script will download the relevant Overture GPKG files for each boundary, clip them to the boundary, and save them to the output directory. Provide the path to the boundaries GPKG generated previously, as well as an output directory for the clipped Overture data. Optionally, specify the number of parallel workers to speed up processing; by default, 2 workers are used. Pass the additional argument --overwrite to redo processing for boundaries that already have corresponding Overture data in the output directory; otherwise, existing data will be skipped. Each boundary will be saved as a separate GPKG file named with the boundary ID, containing layers for buildings, street edges, street nodes, a cleaned version of the street edges (clean_edges), POI places, and infrastructure.
python -m src.data.load_overture \
temp/datasets/boundaries.gpkg \
temp/cities_data/overture \
--parallel_workers 6

The Overture POI schema is based on docs/schema/concepts/by-theme/places/overture_categories.csv.
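Once a boundary has been processed, its layers can be inspected directly. A minimal sketch, where the boundary ID in the file name is hypothetical and the places layer name is assumed from the description above:

```python
import fiona
import geopandas as gpd

# Hypothetical boundary ID; actual file names follow the IDs in boundaries.gpkg.
path = "temp/cities_data/overture/1234.gpkg"

# Expect layers for buildings, street edges/nodes, clean_edges, places, infrastructure.
print(fiona.listlayers(path))

places = gpd.read_file(path, layer="places")  # layer name assumed from the description
print(places.head())
```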
GeoStat Census data for 2021 is downloaded from Eurostat. These census statistics are aggregated to 1 km² cells.
Download the census ZIP dataset for Version 2021 (22 January 2025).
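Before running the metrics step, the census GPKG can be inspected along these lines (a sketch only; layer and attribute names are not documented here, so list them first):

```python
import fiona
import geopandas as gpd

path = "temp/Eurostat_Census-GRID_2021_V2/ESTAT_Census_2021_V2.gpkg"
print(fiona.listlayers(path))  # inspect available layers before reading

cells = gpd.read_file(path, layer=fiona.listlayers(path)[0])
bounds = gpd.read_file("temp/datasets/boundaries.gpkg", layer="bounds")

# Keep only the 1 km² cells that intersect the study boundaries.
cells_in_bounds = gpd.clip(cells.to_crs(bounds.crs), bounds)
print(len(cells_in_bounds), "census cells intersect the study boundaries")
```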
Compute metrics using the generate_metrics.py script. Provide the path to the boundaries GPKG, the directory containing the processed Overture data, the blocks GPKG, the tree canopies GPKG, the building heights directory, the census GPKG, and an output directory for the generated metrics GPKG files.
python -m src.processing.generate_metrics \
temp/datasets/boundaries.gpkg \
temp/cities_data/overture \
temp/datasets/blocks.gpkg \
temp/datasets/tree_canopies.gpkg \
temp/cities_data/heights \
temp/Eurostat_Census-GRID_2021_V2/ESTAT_Census_2021_V2.gpkg \
temp/cities_data/processed

The POI characterisation pipeline compares Overture Maps POI spatial patterns against official business registries (SIRENE for France, BAG for the Netherlands) across 127 reference cities. This establishes which types of analyses the data may support and at which spatial scales.
Run the characterisation pipeline:
python paper_data/code/validation/run_all.py

Or as part of the full data paper pipeline:
python paper_data/code/run_all.py

The pipeline:
- Density residuals: For each POI category, fits log(count) ~ log(population) via linear regression across all cities. Residuals indicate whether a city has more or fewer POIs than expected given its population.
- Multi-scale grid comparison: Compares Overture and official POI distributions at multiple grid resolutions (15 m–800 m) using Cohen's kappa (presence similarity) and Spearman rho (count correlation). A minimal sketch of these two steps follows this list.
- Bootstrap confidence intervals: City-level block bootstrap (N=1000) estimates uncertainty in similarity metrics.
- Country-split analysis: Compares France and Netherlands patterns to assess generalisability.
- Indicator analysis: Tests whether density residuals and mean Overture confidence correlate with pattern similarity (kappa) in reference cities.
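A minimal sketch of the density-residual and grid-comparison calculations, using illustrative values and assumed input shapes (a per-city table for one POI category, and per-cell counts for one city at one grid resolution):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# --- Density residuals: log(count) ~ log(population) across cities ---
cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "poi_count": [1200, 340, 5600, 880],          # illustrative values
    "population": [150_000, 60_000, 900_000, 120_000],
})
x, y = np.log(cities["population"]), np.log(cities["poi_count"])
fit = stats.linregress(x, y)
# Positive residual: more POIs than expected given the city's population.
cities["density_residual"] = y - (fit.intercept + fit.slope * x)

# --- Multi-scale comparison: presence similarity and count correlation ---
overture_counts = np.array([0, 3, 12, 0, 7, 1, 0, 25])   # per grid cell
official_counts = np.array([1, 5, 20, 0, 9, 0, 0, 40])   # per grid cell
kappa = cohen_kappa_score(overture_counts > 0, official_counts > 0)
rho, _ = stats.spearmanr(overture_counts, official_counts)

print(cities[["city", "density_residual"]])
print(f"kappa={kappa:.2f}, rho={rho:.2f}")
```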
Key findings:
- Pattern similarity improves with spatial aggregation: fine resolutions (<100 m) show inverted correlations; 400 m is the minimum for count-based analyses
- Consumer-facing categories (Retail, Eat & Drink) show highest similarity; Health & Medical is weakest
- Coverage averages ~45% of official counts, but spatial pattern similarity is independent of coverage
- Density residuals and Overture confidence may help screen non-characterised cities, but their predictive value varies by category
The POI characterisation pipeline compares Overture POI spatial patterns against official national business registries. These datasets are used for the characterisation described above and in the data paper but are not required to run the main SOAR pipeline.
Official Name: Base Sirène des entreprises et de leurs établissements (SIREN, SIRET)
Source: Institut national de la statistique et des études économiques (INSEE)
License: Open License 2.0 (Licence Ouverte / Etalab v2.0)
Description: National registry of all French businesses and establishments with economic activity codes (APE - Activité Principale Exercée), geographic coordinates, and administrative status.
Coverage: ~31 million establishments (as of January 2026)
Download:
- URL: https://www.data.gouv.fr/en/datasets/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/
- Direct link: https://files.data.gouv.fr/insee-sirene/StockEtablissement_utf8.zip
- Format: CSV (compressed as ZIP, ~3-4 GB) or Parquet
- Update frequency: Monthly
Citation:
INSEE (2026). Base Sirène des entreprises et de leurs établissements (SIREN, SIRET).
Institut National de la Statistique et des Études Économiques.
https://www.data.gouv.fr/en/datasets/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/
Retrieved: January 2026
Required columns:
- siret - Unique establishment identifier (14 digits)
- activitePrincipaleEtablissement - Economic activity code (APE/NAF)
- etatAdministratifEtablissement - Administrative status (A = active, F = closed)
- coordonneeLambertAbscisseEtablissement - X coordinate (Lambert 93 projection)
- coordonneeLambertOrdonneeEtablissement - Y coordinate (Lambert 93 projection)
Classification system: APE codes follow the French NAF classification (Nomenclature d'Activités Française), derived from EU NACE Rev. 2 standard. See: https://www.insee.fr/en/metadonnees/nafr2/
Characterisation usage: Harmonised to 5 POI categories via APE code mapping (see paper_data/code/validation/ for details).
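A minimal sketch of how these columns might be turned into a WGS84 point layer (assuming a Parquet input under temp/validation/ and Lambert 93 as EPSG:2154; the project's actual preparation script is referenced further below):

```python
import pandas as pd
import geopandas as gpd

cols = [
    "siret",
    "activitePrincipaleEtablissement",
    "etatAdministratifEtablissement",
    "coordonneeLambertAbscisseEtablissement",
    "coordonneeLambertOrdonneeEtablissement",
]
df = pd.read_parquet("temp/validation/StockEtablissement_utf8.parquet", columns=cols)

# Keep active establishments with usable coordinates.
df = df[df["etatAdministratifEtablissement"] == "A"].dropna(subset=cols[-2:])

gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(
        df["coordonneeLambertAbscisseEtablissement"],
        df["coordonneeLambertOrdonneeEtablissement"],
    ),
    crs="EPSG:2154",  # Lambert 93
).to_crs("EPSG:4326")

gdf.to_file("temp/validation/sirene_france.gpkg", driver="GPKG")
```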
Official Name: Basisregistratie Adressen en Gebouwen (BAG) - Basic Registration of Addresses and Buildings
Source: Kadaster (Dutch Cadastre, Land Registry and Mapping Agency)
License: CC0 1.0 Universal (Public Domain)
Description: National registry of all buildings and addresses in the Netherlands with usage designations (gebruiksdoel), geometric footprints, and construction status.
Coverage: ~10 million buildings with ~18 million address objects (as of January 2026)
Download:
- URL: https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract
- Direct link: https://service.pdok.nl/lv/bag/atom/downloads/lvbag-extract-nl.zip
- Format: XML/GML files (compressed as ZIP, ~5 GB total for full national extract)
- Update frequency: Daily
Citation:
Kadaster (2026). Basisregistratie Adressen en Gebouwen (BAG) 2.0 Extract.
Kadaster, Dutch Land Registry and Mapping Agency.
https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract
Retrieved: January 2026
Required file: 9999VBO08012026.zip from the BAG extract
- VBO = Verblijfsobject (dwelling object / address object with usage purpose)
Required fields:
- identificatie - Unique object identifier (16 digits)
- gebruiksdoel - Usage purpose designation (functional category)
- geometry - Building footprint or address point (RD New projection, EPSG:28992)
- status - Object status (use only active records)
Classification system: Gebruiksdoel (usage purposes) include:
- woonfunctie - Residential function
- winkelfunctie - Shop/retail function
- logiesfunctie - Lodging/accommodation function
- bijeenkomstfunctie - Meeting/assembly function
- gezondheidszorgfunctie - Healthcare function
- onderwijsfunctie - Education function
- And others (see the BAG documentation)
Documentation: https://zakelijk.kadaster.nl/bag-2.0-extract
Characterisation usage: Harmonised to 5 POI categories via usage purpose mapping (see paper_data/code/validation/ for details).
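A minimal sketch of the kind of usage-purpose mapping involved; the dictionary below is illustrative only, and category names beyond Retail, Eat & Drink, and Health & Medical are placeholders (see paper_data/code/validation/ for the project's actual mapping):

```python
from typing import Optional

# Illustrative mapping only; not the project's definitive harmonisation.
GEBRUIKSDOEL_TO_CATEGORY = {
    "winkelfunctie": "Retail",
    "bijeenkomstfunctie": "Eat & Drink",        # assumption: assembly incl. cafes/restaurants
    "gezondheidszorgfunctie": "Health & Medical",
    "onderwijsfunctie": "Education",            # placeholder category name
    "logiesfunctie": "Accommodation",           # placeholder category name
}

def harmonise(gebruiksdoel: str) -> Optional[str]:
    """Return a harmonised POI category, or None for out-of-scope purposes
    such as woonfunctie (residential)."""
    return GEBRUIKSDOEL_TO_CATEGORY.get(gebruiksdoel)

print(harmonise("winkelfunctie"))  # Retail
print(harmonise("woonfunctie"))    # None
```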
Location: Place reference datasets in temp/validation/ directory
Preparation script: paper_data/code/validation/run_all.py (which calls src/poi_patterns/setup/prepare_reference_data.py)
This script:
- Loads SIRENE Parquet/CSV and filters to active establishments with coordinates
- Extracts BAG VBO data from ZIP and filters to objects with usage purposes
- Converts coordinates to WGS84 (EPSG:4326)
- Filters to European geographic extent
- Saves processed GeoPackage files:
sirene_france.gpkg, bag_netherlands.gpkg
Usage:
# Place raw data in temp/validation/
# - StockEtablissement_utf8.parquet (or .csv)
# - lvbag-extract-nl/ (directory with ZIP files)
# Run preparation
python paper_research/code/eg1_data_quality/prepare_validation_data.py
# Outputs:
# - temp/validation/sirene_france.gpkg (~8-12 GB)
# - temp/validation/bag_netherlands.gpkg (~4-6 GB)

Validation methodology: See paper_research/code/eg1_data_quality/VALIDATION_FRAMEWORK.md and VALIDATION_PROCESS_SUMMARY.md for complete documentation of:
- Category harmonization approach
- City selection strategy (24 cities stratified by population)
- Statistical analysis methods (Spearman rank correlation, systematic bias quantification)
- Interpretation guidelines
When using SOAR with validation:
For methods section:
The SOAR dataset was validated against official national registries: the French SIRENE business registry (INSEE, 2026) covering ~31 million establishments and the Netherlands BAG building registry (Kadaster, 2026) covering ~10 million buildings.
For acknowledgments:
This research used data from INSEE (Institut National de la Statistique et des Études Économiques) and Kadaster (Dutch Land Registry and Mapping Agency).
For data availability statement:
SIRENE data are publicly available from https://www.data.gouv.fr/en/datasets/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/ under Open License 2.0. BAG data are publicly available from https://www.kadaster.nl/zakelijk/producten/adressen-en-gebouwen/bag-2.0-extract under CC0 1.0 Universal license.