A professional, reproducible MLOps-style project built on Colombia’s GEIH household survey to study whether subsidies reduce inequality and to predict potential subsidy candidates under extreme class imbalance. The original notebook pipelines are preserved in `notebooks/`, while the production path is modularized into pipelines, configs, and CLI tools.
- Overview
- Descriptive Analysis (Selected Figures)
- MLOps Scope
- Project Structure
- End-to-End Pipeline
- Model Results Summary
- Configuration
- Local CLI Workflows
- Docker Strategy
- Docker Compose Profiles
- Orchestration & Reproducibility
- Experiment Tracking
- API Deployment
- Monitoring & Drift
- Artifacts
- Testing
- Notebooks
- Roadmap
## Overview

This repository delivers:
- A modular ML stack for subsidy prediction with severe class imbalance.
- A robust supervised cascade pipeline (XGBoost + RandomForest) with feature engineering, hyperparameter search, and threshold optimization.
- Unsupervised anomaly baselines (IsolationForest / OneClassSVM) with score-threshold tuning.
- Operational MLOps components: MLflow tracking, DVC pipelines, Kubeflow compilation, FastAPI serving, and Evidently drift checks.
- Container-first execution for both training jobs and serving workloads.
## Descriptive Analysis (Selected Figures)

Below are selected figures from the original descriptive notebook. They are intentionally curated (not every plot is shown) and laid out for readability.
*Figure 2.1. Summary distribution & key diagnostics (left) and complementary descriptive patterns (right).*

*Figure 2.2. Additional distributional comparisons and subgroup contrasts.*
## MLOps Scope

This project is organized as a production-oriented MLOps workflow:
- Data layer: deterministic data preparation with config-driven inputs.
- Training layer: supervised and unsupervised pipelines with reproducible splits and saved artifacts.
- Evaluation layer: metrics persistence (`metrics.json`, `metrics_eval.json`) and threshold-aware reporting.
- Serving layer: FastAPI API with request/response schemas and model metadata endpoints.
- Monitoring layer: Evidently drift checks over reference vs current data.
- Orchestration layer: DVC stage graph + Kubeflow pipeline compilation.
- Tracking layer: optional MLflow logging for params, metrics, artifacts, and run tags.
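The evaluation layer's metrics persistence can be pictured with a minimal sketch (the `persist_metrics` helper name is illustrative, not the repository's actual function; the metric values shown come from the results table in this README):

```python
import json
from pathlib import Path

def persist_metrics(metrics: dict, out_dir: str = "artifacts/cascade") -> Path:
    """Write threshold-aware evaluation metrics next to the model artifacts."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    target = path / "metrics_eval.json"
    target.write_text(json.dumps(metrics, indent=2))
    return target

# Example: persist cascade metrics so DVC and the evaluation stage can track them.
written = persist_metrics(
    {"precision_subsidio_1": 0.218, "recall_subsidio_1": 0.711}
)
print(written)
```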
## Project Structure

```text
.
├─ artifacts/                 # model artifacts, predictions, drift reports, mlflow backend
├─ configs/                   # yaml configs for dataset, training, drift
├─ data/
│  ├─ raw/                    # GEIH raw sources
│  └─ processed/              # training-ready tables
├─ docs/
├─ notebooks/                 # research / analysis history
├─ scripts/                   # thin wrappers for jobs
├─ src/colombia_subsidy_ml/
│  ├─ api/                    # FastAPI app, schemas, model loading
│  ├─ data/                   # ingestion and dataset building
│  ├─ features/               # preprocessing and feature pipeline
│  ├─ mlops/                  # kubeflow pipeline compilation
│  ├─ models/                 # cascade model, factory, tuning, artifact io
│  ├─ pipelines/              # train/evaluate/predict/drift workflows
│  ├─ tracking/               # MLflow helpers
│  └─ utils/
├─ tests/
├─ .dockerignore
├─ docker-compose.yml
├─ dvc.yaml
├─ Dockerfile
└─ pyproject.toml
```
## End-to-End Pipeline

```text
Raw GEIH Data
  -> build-dataset
  -> train (cascade / anomaly)
  -> evaluate + predict
  -> drift-check (Evidently)
  -> API serving (FastAPI)
```
Cross-cutting concerns:
- Tracking: MLflow (optional, config-driven).
- Reproducibility: DVC stage graph + deterministic configs.
- Orchestration: Kubeflow pipeline compilation for CI/CD or cluster execution.
## Model Results Summary

The following table summarizes key reference results obtained in the modeling notebook (`notebooks/Full Maching Learning Modeling.ipynb`):
| Model / Strategy | Precision (Subsidio=1) | Recall (Subsidio=1) | F1 (Subsidio=1) | Main Trade-off |
|---|---|---|---|---|
| Cascade (XGBoost + RF + threshold tuning) | 0.218 | 0.711 | ~0.33 | Meets recall target with moderate precision |
| One-Class SVM (anomaly framing) | 0.934 | 0.264 | 0.411 | Very high precision, low recall |
| IsolationForest (anomaly framing) | 0.926 | 0.303 | 0.457 | Very high precision, low recall |
For production runs, use the current artifacts (`artifacts/*/metrics.json`) as the source of truth.
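The cascade's threshold optimization step can be illustrated with a small sketch: pick the decision threshold that maximizes precision subject to a recall floor on the minority class. The function name, toy data, and 0.70 recall target here are illustrative; the repository's actual tuning logic lives in the training pipeline.

```python
def tune_threshold(y_true, scores, recall_target=0.70):
    """Return (precision, threshold) maximizing precision with recall >= target."""
    best = None
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, y_true) if p and y)
        fp = sum(1 for p, y in zip(preds, y_true) if p and not y)
        fn = sum(1 for p, y in zip(preds, y_true) if not p and y)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        if recall >= recall_target and (best is None or precision > best[0]):
            best = (precision, t)
    return best

# Toy imbalanced sample: three positives with overlapping scores.
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
s = [0.1, 0.2, 0.15, 0.3, 0.55, 0.4, 0.6, 0.8, 0.35, 0.7]
print(tune_threshold(y, s))
# -> (0.5, 0.35): all three positives recovered at the cost of three false positives
```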
## Configuration

Core configs:

- `configs/dataset.yaml`: input raw tables and processed output path.
- `configs/train_cascade.yaml`: supervised cascade config (feature engineering, resampling, search, thresholds, MLflow).
- `configs/train_anomaly.yaml`: anomaly model config (search + score thresholding + MLflow).
- `configs/drift.yaml`: reference/current dataset and Evidently report output.
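As an orientation, a cascade training config might have a shape along these lines. This is an illustrative sketch only; the actual keys and values in `configs/train_cascade.yaml` may differ.

```yaml
# Illustrative shape -- check configs/train_cascade.yaml for the real schema.
data:
  input_path: data/processed/Base_Modelo_Subsidios.csv
  target: Subsidio
training:
  resampling: smote          # strategy for the minority class (assumption)
  search:
    n_iter: 50
  threshold:
    recall_target: 0.70
mlflow:
  enabled: false
  experiment_name: subsidy-cascade
  tracking_uri: http://localhost:5000
  tags:
    stage: dev
```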
## Local CLI Workflows

Install:

```bash
pip install -e .
pip install -e ".[mlops]"   # optional extras for the full MLOps stack
```

Run pipelines:
```bash
python -m colombia_subsidy_ml build-dataset --config configs/dataset.yaml
python -m colombia_subsidy_ml train --config configs/train_cascade.yaml
python -m colombia_subsidy_ml train-anomaly --config configs/train_anomaly.yaml
python -m colombia_subsidy_ml evaluate --config configs/train_cascade.yaml
python -m colombia_subsidy_ml predict --config configs/train_cascade.yaml --input data/processed/Base_Modelo_Subsidios.csv --output artifacts/predictions.csv
python -m colombia_subsidy_ml drift-check --config configs/drift.yaml
python -m colombia_subsidy_ml compile-kubeflow --output artifacts/kubeflow/subsidy_pipeline.yaml
```

## Docker Strategy

The repository ships a multi-stage Dockerfile with dedicated targets:
| Docker target | Purpose | Included dependencies | Default command |
|---|---|---|---|
| `train` | Lightweight training/evaluation image | Base ML stack (`requirements.txt`) | `subsidy-ml train --config configs/train_cascade.yaml` |
| `api` | Serving image | Full MLOps stack (`requirements.txt` + `requirements-mlops.txt`) | `subsidy-ml serve-api --host 0.0.0.0 --port 8000` |
| `mlops` | Jobs and orchestration tooling | Full MLOps stack | `bash` |
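A multi-stage layout matching that table could look roughly like the following. This is a sketch under assumptions (base image, file layout); the repository's real Dockerfile may differ.

```dockerfile
# Sketch only -- the actual Dockerfile stages and base image may differ.
FROM python:3.11-slim AS base
WORKDIR /app
COPY pyproject.toml requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ src/
RUN pip install --no-cache-dir -e .

# Lightweight training/evaluation target: base ML stack only.
FROM base AS train
CMD ["subsidy-ml", "train", "--config", "configs/train_cascade.yaml"]

# Jobs/orchestration target: adds the full MLOps stack.
FROM base AS mlops
COPY requirements-mlops.txt ./
RUN pip install --no-cache-dir -r requirements-mlops.txt
CMD ["bash"]

# Serving target: full stack plus the API entrypoint.
FROM mlops AS api
CMD ["subsidy-ml", "serve-api", "--host", "0.0.0.0", "--port", "8000"]
```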
Build examples:

```bash
docker build --target train -t colombia-subsidy-ml:train .
docker build --target api -t colombia-subsidy-ml:api .
docker build --target mlops -t colombia-subsidy-ml:mlops .
```

## Docker Compose Profiles

`docker-compose.yml` defines production-friendly profiles:

- `api`: FastAPI online inference service.
- `jobs`: one-off training/anomaly/drift jobs.
- `tracking`: local MLflow server.
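An illustrative fragment of how those profiles can be wired up (service and volume details are assumptions; consult the repository's `docker-compose.yml` for the real definition):

```yaml
# Illustrative fragment only -- see docker-compose.yml for the actual services.
services:
  api:
    build: { context: ., target: api }
    ports: ["8000:8000"]
  train-cascade:
    profiles: ["jobs"]
    build: { context: ., target: train }
    command: subsidy-ml train --config configs/train_cascade.yaml
  drift-check:
    profiles: ["jobs"]
    build: { context: ., target: mlops }
    command: subsidy-ml drift-check --config configs/drift.yaml
  mlflow:
    profiles: ["tracking"]
    build: { context: ., target: mlops }
    command: mlflow server --host 0.0.0.0 --port 5000
    ports: ["5000:5000"]
```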
Examples:

```bash
# API serving
docker compose up api

# Run training jobs
docker compose --profile jobs run --rm train-cascade
docker compose --profile jobs run --rm train-anomaly

# Drift monitoring job
docker compose --profile jobs run --rm drift-check

# MLflow tracking server
docker compose --profile tracking up mlflow
```

## Orchestration & Reproducibility

Reproduce the full stage graph with DVC:

```bash
dvc repro
```

Stages:
- dataset build
- cascade training
- anomaly training
- cascade evaluation
- drift monitoring
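The stage graph in `dvc.yaml` roughly follows this shape (stage names, dependencies, and outputs here are illustrative, not a copy of the real file):

```yaml
# Illustrative sketch of the dvc.yaml stage graph.
stages:
  build_dataset:
    cmd: python -m colombia_subsidy_ml build-dataset --config configs/dataset.yaml
    deps: [data/raw, configs/dataset.yaml]
    outs: [data/processed/Base_Modelo_Subsidios.csv]
  train_cascade:
    cmd: python -m colombia_subsidy_ml train --config configs/train_cascade.yaml
    deps: [data/processed/Base_Modelo_Subsidios.csv, configs/train_cascade.yaml]
    outs: [artifacts/cascade]
  evaluate:
    cmd: python -m colombia_subsidy_ml evaluate --config configs/train_cascade.yaml
    deps: [artifacts/cascade]
    metrics: [artifacts/cascade/metrics_eval.json]
  drift_check:
    cmd: python -m colombia_subsidy_ml drift-check --config configs/drift.yaml
    deps: [artifacts/cascade, configs/drift.yaml]
    outs: [artifacts/drift]
```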
Compile the Kubeflow pipeline:

```bash
python -m colombia_subsidy_ml compile-kubeflow --output artifacts/kubeflow/subsidy_pipeline.yaml
```

## Experiment Tracking

MLflow is optional and controlled by each YAML config under the `mlflow:` key:

- `enabled`
- `experiment_name`
- `tracking_uri`
- `tags`
When enabled, runs log:
- flattened params
- validation/test metrics
- generated artifacts (models, metadata, reports)
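The param flattening and the enabled-flag gate can be sketched as follows. The helper names are illustrative; the repository's real helpers live in `src/colombia_subsidy_ml/tracking/`.

```python
from typing import Any, Dict

def flatten_params(cfg: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested config dict into dotted keys suitable for MLflow params."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def log_run(cfg: Dict[str, Any], metrics: Dict[str, float]) -> None:
    """Log params/metrics only when tracking is enabled in the config."""
    tracking = cfg.get("mlflow", {})
    if not tracking.get("enabled", False):
        return  # tracking is opt-in; nothing to do
    import mlflow  # optional dependency, imported lazily

    mlflow.set_tracking_uri(tracking["tracking_uri"])
    mlflow.set_experiment(tracking["experiment_name"])
    with mlflow.start_run(tags=tracking.get("tags")):
        mlflow.log_params(flatten_params(cfg.get("training", {})))
        mlflow.log_metrics(metrics)

print(flatten_params({"search": {"n_iter": 50}, "resampling": "smote"}))
# -> {'search.n_iter': 50, 'resampling': 'smote'}
```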
## API Deployment

Start the API:

```bash
python -m colombia_subsidy_ml serve-api --host 0.0.0.0 --port 8000
```

Endpoints:

- `GET /health`
- `GET /metadata`
- `POST /predict`
Docs:

- Swagger: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
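Calling the service might look like the following sketch. The payload shape and feature names are placeholders, not the real request schema; check the Swagger docs of a running instance for the actual contract.

```python
import json
from urllib import request

def build_predict_request(records, base_url="http://localhost:8000"):
    """Build a POST /predict request; the payload shape here is illustrative."""
    body = json.dumps({"records": records}).encode("utf-8")
    return request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical feature names -- the real schema is served at /docs.
req = build_predict_request([{"edad": 34, "ingreso": 1200000}])
print(req.full_url)
# -> http://localhost:8000/predict

# Against a running service:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```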
Optional artifact override:

```bash
export SUBSIDY_ARTIFACTS_DIR=artifacts/cascade
```

## Monitoring & Drift

Run drift detection:

```bash
python -m colombia_subsidy_ml drift-check --config configs/drift.yaml
```

Outputs:

- `artifacts/drift/drift_report.html`
- `artifacts/drift/drift_report.json`
- `artifacts/drift/drift_summary.json`
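Downstream jobs can gate retraining on the summary file. A minimal sketch, assuming the summary exposes a top-level `dataset_drift` flag (the key name is an assumption about the file's shape, not a documented contract):

```python
import json
from pathlib import Path

def drift_detected(summary_path: str) -> bool:
    """Return True when the drift summary flags dataset-level drift."""
    summary = json.loads(Path(summary_path).read_text())
    return bool(summary.get("dataset_drift", False))  # key name is an assumption

# Toy round-trip with a fabricated summary file.
Path("drift_summary.json").write_text(
    json.dumps({"dataset_drift": True, "share_drifted_columns": 0.4})
)
print(drift_detected("drift_summary.json"))
# -> True
```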
## Artifacts

Typical outputs:

- `artifacts/cascade/` (preprocessor, cascade model, metadata, metrics, split indices)
- `artifacts/anomaly/` (preprocessor, anomaly model, metadata, metrics, split indices)
- `artifacts/predictions.csv`
- `artifacts/drift/*`
## Testing

```bash
pytest -q
```

## Notebooks

Original notebooks are kept for traceability:

- `notebooks/Subsidy Analysis.ipynb`
- `notebooks/Full Maching Learning Modeling.ipynb`
## Roadmap

- Add CI pipeline (lint, tests, build, security scan).
- Add model registry promotion rules per environment.
- Add scheduled drift checks and alerting integration.
- Add canary/champion-challenger deployment policy.
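For the CI item above, a hypothetical GitHub Actions starting point could look like this (not yet part of the repository; workflow name and steps are assumptions):

```yaml
# Hypothetical CI workflow sketch -- not shipped with the repository.
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -e ".[mlops]"
      - run: pytest -q
```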
## License

Apache License 2.0. Feel free to use the code and all of the pipelines.