An ETL (Extract, Transform, Load) pipeline for collecting and processing solar production data from Dry Bridge Solar Farm. This pipeline extracts data via web scraping, transforms it into standardized energy metrics, and loads it into a PostgreSQL database for analysis.
This project uses modern Python tooling for dependency management and development. Choose one of the installation methods below:
uv is a fast Python package installer and resolver.
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install the project and its dependencies
uv sync
# Activate the virtual environment
source .venv/bin/activate
Copy the example environment file:
cp env.example .env
The env.example file contains credentials for the Docker development database. When you're ready to upload data to production, edit .env with the production database credentials:
DB_HOST=localhost
DB_PORT=5432
DB_NAME=dry_bridge_db
DB_USER=your_username
DB_PASSWORD=your_password
The database is set up and managed by the DBA team. The Office of Sustainability and Resiliency is the data owner.
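As a minimal sketch of how these settings might be consumed at runtime (assuming the variables are exported into the environment, e.g. via python-dotenv or the shell, and that psycopg2 is the driver; the project's actual connection code may differ):
import os
import psycopg2

# Read the same variables defined in .env and open a connection.
conn = psycopg2.connect(
    host=os.environ.get("DB_HOST", "localhost"),
    port=int(os.environ.get("DB_PORT", "5432")),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)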
The application provides a CLI interface with three main commands: extract, load, and refresh.
For production deployment in Kubernetes as a CronJob, see docs/DOCKER.md
Extract solar data from the web dashboard:
# Extract all available data (from July 1, 2023 to now) into ./output
dry-bridge extract
# Extract data for a specific date range
dry-bridge extract 2023-08-01 2024-08-01
# Extract to a custom output directory
dry-bridge extract all now ./custom_output
Re-run the extract command to continue; it automatically skips existing files.
Load extracted data into the database:
# Load both raw and processed data
dry-bridge load
# Load only raw data
dry-bridge load --no-transform
# Load only processed data
dry-bridge load --no-raw
# Load from custom output directory
dry-bridge load ./custom_output
Fill gaps and add new data (recommended for scheduled jobs):
# Query database for missing timestamps and fill gaps
dry-bridge refresh
The refresh command is designed to run on a schedule (e.g., every 15 minutes via cron or Kubernetes CronJob). It automatically does the following (a sketch of the gap-detection logic appears after the list):
- Detects missing 15-minute intervals in the raw data table
- Scrapes only the necessary dates to fill gaps
- Adds new data as it becomes available
- Retries failed fetches on the next run (no retry limits)
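As a rough sketch of the gap-detection step (the function below is illustrative, not the package's actual code), missing intervals can be found by diffing the expected 15-minute grid against the timestamps already present in the raw table:
from datetime import timedelta

def missing_intervals(existing, start, end):
    """Return the 15-minute timestamps between start and end that are not in `existing`."""
    gaps = []
    current = start
    while current <= end:
        if current not in existing:
            gaps.append(current)
        current += timedelta(minutes=15)
    return gaps
The real refresh command would populate `existing` from the raw table and then scrape only the dates that cover the gaps.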
For a complete initial ETL run:
# Extract all data and load into database
dry-bridge extract && dry-bridge load
The application creates two tables:
Processed table (dry_bridge_solar_processed):
- timestamp (TIMESTAMP, PRIMARY KEY): UTC timestamp
- kw (FLOAT): Power in kilowatts
- kwh (FLOAT): Energy in kilowatt-hours
- mmbtu (FLOAT): Energy in million British thermal units
- mtco2e (FLOAT): Carbon dioxide equivalent avoided, in metric tons
Raw table:
- timestamp (TEXT): Original timestamp string
- name (TEXT): Measurement source name
- type (TEXT): Measurement type
- units (TEXT): Units of measurement
- value (FLOAT): Raw measurement value
Note: The raw table is the source of truth for what data has been fetched from the API.
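As a worked example of how the processed metrics relate to each other (assuming kwh is derived from the average kW over each 15-minute interval; the CO2e factor below is a placeholder, not the project's actual emissions factor):
INTERVAL_HOURS = 0.25          # readings are 15-minute intervals
MMBTU_PER_KWH = 0.003412       # 1 kWh is about 3,412 Btu
MTCO2E_PER_KWH = 0.000433      # placeholder grid emissions factor (assumption)

kw = 120.0                     # example average power over one interval
kwh = kw * INTERVAL_HOURS      # 30.0 kWh
mmbtu = kwh * MMBTU_PER_KWH    # about 0.102 MMBtu
mtco2e = kwh * MTCO2E_PER_KWH  # about 0.013 MTCO2e avoided (with the assumed factor)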
The dry_bridge_fetch_attempts table is no longer used and can be safely dropped:
DROP TABLE IF EXISTS dry_bridge_fetch_attempts;
Export processed data to CSV:
\copy (SELECT * FROM dry_bridge_solar_processed ORDER BY timestamp) TO 'solar_production_export.csv' WITH CSV HEADER;
# Install with development dependencies
uv sync
This project uses several tools for code quality:
# Linting with ruff
uv run ruff check .
# Code formatting with ruff
uv run ruff format .
# Type checking
uv run mypy src/
# Dependency checking
uv run deptry .
Run the test suite:
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=dry_bridge
# Run specific test file
uv run pytest tests/transform_test.py
Note: Tests require a test database named test_dry_bridge_db with the same user credentials.
src/dry_bridge/
├── __init__.py # Package initialization and documentation
├── __main__.py # CLI application entry point
├── scrape.py # Web scraping functionality
├── transform.py # Data transformation and calculations
└── load.py # Database operations and loading
tests/
├── data/ # Test data files
└── transform_test.py # Unit tests
# Configuration files
├── pyproject.toml # Project metadata and dependencies
├── mise.toml # Tool version management
├── uv.lock # Locked dependency versions
└── .env.example # Environment variable template
- Authentication Errors: The scraper relies on session cookies from the dashboard. If you encounter authentication issues, the dashboard may have changed its authentication mechanism.
- Database Connection: Ensure PostgreSQL is running and your credentials in .env are correct.
- Missing Data: Some days may have no data available from the solar farm. This is normal and will be logged as failed downloads.
- Timezone Issues: The application handles Eastern time to UTC conversion automatically, accounting for daylight saving time transitions (see the sketch below).
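For reference, a minimal example of that conversion using the standard library's zoneinfo module (illustrative only; the project's own conversion code may differ):
from datetime import datetime
from zoneinfo import ZoneInfo

# Interpret a dashboard timestamp as Eastern time, then convert to UTC.
local = datetime(2024, 7, 1, 12, 0, tzinfo=ZoneInfo("America/New_York"))
utc = local.astimezone(ZoneInfo("UTC"))
print(utc.isoformat())  # 2024-07-01T16:00:00+00:00 (EDT is UTC-4 in July)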
The application uses Python's standard logging. To increase verbosity:
import logging
logging.basicConfig(level=logging.INFO)
This project is for internal use in monitoring Dry Bridge solar farm data.