
Dry Bridge Solar Data ETL

An ETL (Extract, Transform, Load) pipeline for collecting and processing solar production data from Dry Bridge Solar Farm. This pipeline extracts data via web scraping, transforms it into standardized energy metrics, and loads it into a PostgreSQL database for analysis.

Installation

This project uses uv, a fast Python package installer and resolver, for dependency management and development.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the project and its dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate

Configuration

Environment Setup

Copy the example environment file:

cp env.example .env

The env.example file contains credentials for the Docker development database. When you're ready to upload data to production, edit .env with the production database credentials:

DB_HOST=localhost
DB_PORT=5432
DB_NAME=dry_bridge_db
DB_USER=your_username
DB_PASSWORD=your_password
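
If you want to sanity-check that these variables are being picked up, a minimal sketch is shown below. It assumes python-dotenv is available; the project itself may read the environment differently.

# check_db_config.py -- minimal sketch; assumes python-dotenv is installed
import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from .env in the current directory

dsn = (
    f"host={os.environ['DB_HOST']} port={os.environ['DB_PORT']} "
    f"dbname={os.environ['DB_NAME']} user={os.environ['DB_USER']} "
    f"password={os.environ['DB_PASSWORD']}"
)
print("Connecting with:", dsn.replace(os.environ['DB_PASSWORD'], '***'))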

Database Setup

The database is set up and managed by the DBA team. The Office of Sustainability and Resiliency is the data owner.

Usage

The application provides a CLI interface with three main commands: extract, load, and refresh.

Docker Deployment

For production deployment in Kubernetes as a CronJob, see docs/DOCKER.md.

Extracting Data

Extract solar data from the web dashboard:

# Extract all available data (from July 1, 2023 to now) into ./output
dry-bridge extract

# Extract data for a specific date range
dry-bridge extract 2023-08-01 2024-08-01

# Extract to a custom output directory
dry-bridge extract all now ./custom_output

Re-run the extract command to resume an interrupted extraction; it automatically skips files that already exist.
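
Conceptually, the skip logic only needs to check whether a day's file is already on disk before fetching it again. The sketch below is illustrative only; the file naming is an assumption, not the scraper's actual convention.

# Sketch of skip-existing behaviour; file naming is illustrative only.
from datetime import date, timedelta
from pathlib import Path

def dates_to_fetch(start: date, end: date, output_dir: Path) -> list[date]:
    """Return only the dates whose CSV is not already in output_dir."""
    missing = []
    day = start
    while day <= end:
        if not (output_dir / f"{day.isoformat()}.csv").exists():
            missing.append(day)
        day += timedelta(days=1)
    return missing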

Loading Data

Load extracted data into the database:

# Load both raw and processed data
dry-bridge load

# Load only raw data
dry-bridge load --no-transform

# Load only processed data  
dry-bridge load --no-raw

# Load from custom output directory
dry-bridge load ./custom_output
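
Because timestamp is the primary key of the processed table, loads can be made idempotent. The sketch below uses psycopg2 and ON CONFLICT; the actual loader in load.py may use a different driver or batching strategy.

# Illustrative only: inserts processed rows, skipping timestamps already loaded.
import psycopg2

def load_processed(rows, conn_params):
    """rows: iterable of (timestamp, kw, kwh, mmbtu, mtco2e) tuples."""
    with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO dry_bridge_solar_processed (timestamp, kw, kwh, mmbtu, mtco2e)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (timestamp) DO NOTHING
            """,
            rows,
        )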

Refreshing Data

Fill gaps and add new data (recommended for scheduled jobs):

# Query database for missing timestamps and fill gaps
dry-bridge refresh

The refresh command is designed to run on a schedule (e.g., every 15 minutes via cron or Kubernetes CronJob). It automatically:

  • Detects missing 15-minute intervals in the raw data table (see the sketch after this list)
  • Scrapes only the necessary dates to fill gaps
  • Adds new data as it becomes available
  • Retries failed fetches on the next run (no retry limits)
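
A sketch of the gap-detection idea, using PostgreSQL's generate_series to find 15-minute slots with no raw rows. The actual query in the project may differ, and since the raw timestamp column is stored as text, the cast here is illustrative.

# Sketch: find 15-minute timestamps between the first raw row and now that have no data.
import psycopg2

GAP_QUERY = """
SELECT g.ts
FROM generate_series(
         (SELECT min(timestamp::timestamptz) FROM dry_bridge_solar_raw),
         now(),
         interval '15 minutes'
     ) AS g(ts)
LEFT JOIN dry_bridge_solar_raw r ON r.timestamp::timestamptz = g.ts
WHERE r.timestamp IS NULL
ORDER BY g.ts;
"""

def missing_intervals(conn_params):
    with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
        cur.execute(GAP_QUERY)
        return [row[0] for row in cur.fetchall()]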

Complete ETL Pipeline

For a complete initial ETL run:

# Extract all data and load into database
dry-bridge extract && dry-bridge load

Database Schema

The application creates two tables:

dry_bridge_solar_processed

  • timestamp (TIMESTAMP, PRIMARY KEY): UTC timestamp
  • kw (FLOAT): Power in kilowatts
  • kwh (FLOAT): Energy in kilowatt-hours
  • mmbtu (FLOAT): Energy in million British thermal units
  • mtco2e (FLOAT): Carbon dioxide equivalent in metric tons avoided

dry_bridge_solar_raw

  • timestamp (TEXT): Original timestamp string
  • name (TEXT): Measurement source name
  • type (TEXT): Measurement type
  • units (TEXT): Units of measurement
  • value (FLOAT): Raw measurement value

Note: The raw table is the source of truth for what data has been fetched from the API.
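
For reference, the processed metrics are related by standard unit conversions. The sketch below uses the physical constant 1 kWh = 0.003412 MMBtu, assumes a 15-minute interval for the kW-to-kWh step, and uses a placeholder grid emissions factor; the pipeline's actual factors live in transform.py and may differ.

# Illustrative conversions; the real factors are defined in transform.py.
KWH_PER_INTERVAL_HOURS = 0.25   # 15-minute interval
MMBTU_PER_KWH = 0.003412        # physical constant
MTCO2E_PER_KWH = 0.000386       # placeholder grid emissions factor (assumption)

def derive_metrics(kw: float) -> dict[str, float]:
    kwh = kw * KWH_PER_INTERVAL_HOURS
    return {
        "kw": kw,
        "kwh": kwh,
        "mmbtu": kwh * MMBTU_PER_KWH,
        "mtco2e": kwh * MTCO2E_PER_KWH,
    }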

Database Migration

The dry_bridge_fetch_attempts table is no longer used and can be safely dropped:

DROP TABLE IF EXISTS dry_bridge_fetch_attempts;

Data Export

Export processed data to CSV:

\copy (SELECT * FROM dry_bridge_solar_processed ORDER BY timestamp) TO 'solar_production_export.csv' WITH CSV HEADER;
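
If you prefer to export from Python rather than psql, a rough alternative using pandas and SQLAlchemy is sketched below (not part of the CLI; connection details follow the example .env values).

# Alternative export sketch using pandas and SQLAlchemy.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://your_username:your_password@localhost:5432/dry_bridge_db"
)
df = pd.read_sql("SELECT * FROM dry_bridge_solar_processed ORDER BY timestamp", engine)
df.to_csv("solar_production_export.csv", index=False)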

Development

Development Setup

# Install with development dependencies
uv sync

Code Quality

This project uses several tools for code quality:

# Linting with ruff
uv run ruff check .

# Code formatting with ruff
uv run ruff format .

# Type checking
uv run mypy src/

# Dependency checking
uv run deptry .

Testing

Run the test suite:

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=dry_bridge

# Run specific test file
uv run pytest tests/transform_test.py

Note: Tests require a test database named test_dry_bridge_db with the same user credentials.

Project Structure

src/dry_bridge/
├── __init__.py          # Package initialization and documentation
├── __main__.py          # CLI application entry point
├── scrape.py           # Web scraping functionality
├── transform.py        # Data transformation and calculations
└── load.py             # Database operations and loading

tests/
├── data/               # Test data files
└── transform_test.py   # Unit tests

# Configuration files
├── pyproject.toml      # Project metadata and dependencies
├── mise.toml          # Tool version management
├── uv.lock            # Locked dependency versions
└── .env.example       # Environment variable template

Troubleshooting

Common Issues

  1. Authentication Errors: The scraper relies on session cookies from the dashboard. If you encounter authentication issues, the dashboard may have changed its authentication mechanism.

  2. Database Connection: Ensure PostgreSQL is running and your credentials in .env are correct.

  3. Missing Data: Some days may have no data available from the solar farm. This is normal and will be logged as failed downloads.

  4. Timezone Issues: The application handles Eastern time to UTC conversion automatically, accounting for daylight saving time transitions.
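
For illustration, the Eastern-to-UTC conversion can be done with the standard library's zoneinfo, which accounts for DST automatically. The project's own conversion may be implemented differently.

# Sketch of Eastern -> UTC conversion using the standard library.
from datetime import datetime
from zoneinfo import ZoneInfo

def eastern_to_utc(naive_local: datetime) -> datetime:
    """Interpret a naive timestamp as America/New_York time and return it in UTC."""
    return naive_local.replace(tzinfo=ZoneInfo("America/New_York")).astimezone(ZoneInfo("UTC"))

print(eastern_to_utc(datetime(2023, 7, 1, 12, 0)))   # 2023-07-01 16:00:00+00:00 (EDT)
print(eastern_to_utc(datetime(2023, 12, 1, 12, 0)))  # 2023-12-01 17:00:00+00:00 (EST)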

Logging

The application uses Python's standard logging. To increase verbosity:

import logging
logging.basicConfig(level=logging.INFO)

License

This project is for internal use in monitoring Dry Bridge solar farm data.
