A collection and processing system for validated clinical case vignettes, built for CollectiveGood applications. It aggregates cases from academic sources, transforms them into a canonical schema with SNOMED/LOINC/RxNorm coding, and exports them to multiple platforms, including the Ethiopia CME app and the SAIF validation platform.
- 30,000+ validated cases from Gold-tier academic sources
- Multi-stage enrichment pipeline with regex and LLM extraction
- Medical coding with LOINC (180+ codes), SNOMED-CT (100+ diagnoses), RxNorm (150+ medications)
- Quality scoring with richness metrics and platform eligibility checks
- Multi-platform export to Ethiopia CME, SAIF, and FHIR R4 formats
┌─────────────┐    ┌───────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   INGEST    │───▸│   TRANSFORM   │───▸│   ENRICH    │───▸│   VALIDATE   │───▸│   EXPORT    │
│             │    │               │    │             │    │              │    │             │
│ HuggingFace │    │ Canonical     │    │ Regex + LLM │    │ Quality      │    │ Ethiopia    │
│ PubMed/PMC  │    │ Schema        │    │ Medical     │    │ Scoring      │    │ SAIF        │
│ Journals    │    │ Pydantic v2   │    │ Coding      │    │ Tier         │    │ FHIR R4     │
└─────────────┘    └───────────────┘    └─────────────┘    └──────────────┘    └─────────────┘
# Clone the repository
git clone https://github.com/DHEPLab/clinical-case-pipeline.git
cd clinical-case-pipeline
# Install with uv (recommended)
uv pip install -e .
# Or with pip
pip install -e .
# Install optional dependencies
pip install -e ".[dev]" # Development tools
pip install -e ".[llm]" # LLM extraction (anthropic, openai)

# 1. Load Gold tier cases from HuggingFace
case-pipeline ingest medcasereasoning --max 100
# 2. Enrich with structured field extraction
case-pipeline enrich --input data/processed/cases.json
# 3. Validate quality and generate report
case-pipeline validate --min-richness 70
# 4. Export to platform format
case-pipeline export ethiopia --min-richness 70
# View statistics
case-pipeline stats

# Load specific dataset
case-pipeline ingest medcasereasoning --max 100 --output data/cases.json
# Load all available datasets
case-pipeline ingest all
# Available datasets: medcasereasoning, medqa, medxpertqa, all

# Regex extraction (fast, free)
case-pipeline enrich --input data/cases.json
# With LLM extraction (more thorough, requires ANTHROPIC_API_KEY)
export ANTHROPIC_API_KEY="sk-..."
case-pipeline enrich --use-llm --batch-size 10
# Preview without saving
case-pipeline enrich --dry-run --max 10

Extraction targets:
- Chief complaint
- Vital signs (BP, HR, RR, Temp, SpO2)
- Lab results (CBC, BMP, LFTs)
- Physical exam findings
- Past medical history
- Medications
- Medical coding (LOINC, SNOMED, RxNorm)
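
The regex stage for the vital-sign targets above can be sketched roughly as follows. This is a minimal illustration, not the contents of `regex_patterns.py`: the pattern table `VITAL_PATTERNS`, the function `extract_vitals`, and the output keys are all hypothetical names chosen for this example.

```python
import re

# Hypothetical patterns for illustration; the pipeline's real patterns may differ.
VITAL_PATTERNS = {
    "blood_pressure": re.compile(
        r"\b(?:BP|blood pressure)\D{0,15}(\d{2,3})\s*/\s*(\d{2,3})", re.I
    ),
    "heart_rate": re.compile(r"\b(?:HR|heart rate|pulse)\D{0,15}(\d{2,3})", re.I),
    "respiratory_rate": re.compile(r"\b(?:RR|respiratory rate)\D{0,15}(\d{1,2})", re.I),
    "temperature": re.compile(r"\b(?:temp|temperature)\D{0,15}(\d{2,3}(?:\.\d)?)", re.I),
    "spo2": re.compile(r"\b(?:SpO2|oxygen saturation)\D{0,15}(\d{2,3})\s*%", re.I),
}


def extract_vitals(text: str) -> dict:
    """Return the first match for each vital sign found in a vignette."""
    vitals = {}
    for name, pattern in VITAL_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue  # leave missing vitals out rather than guessing
        if name == "blood_pressure":
            vitals[name] = {
                "systolic": int(match.group(1)),
                "diastolic": int(match.group(2)),
            }
        else:
            vitals[name] = float(match.group(1))
    return vitals


if __name__ == "__main__":
    sample = (
        "A 54-year-old man presents with fever. BP 142/88 mmHg, HR 104 bpm, "
        "RR 22, temperature 38.6 C, SpO2 93% on room air."
    )
    print(extract_vitals(sample))
```

Patterns like these are fast and free but brittle with unusual phrasing, which is presumably why the pipeline also offers the `--use-llm` option.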
# Basic validation
case-pipeline validate --input data/cases.json
# With quality thresholds
case-pipeline validate --min-richness 70 --require-reasoning
# Generate report
case-pipeline validate --output reports/quality.json

# Ethiopia CME app format (Supabase-ready)
case-pipeline export ethiopia --min-richness 70
# SAIF platform format (FHIR-compatible)
case-pipeline export saif
# FHIR R4 Bundle format
case-pipeline export fhir
# Export to all formats
case-pipeline export all --output-dir data/validated

# Preview available articles
case-pipeline scrape discover nejm --max 10
# Scrape cases from journal
case-pipeline scrape journal nejm --max 50
# With LLM extraction fallback
case-pipeline scrape journal nejm --use-llm
# Check storage status
case-pipeline scrape status

# View case statistics
case-pipeline stats
# Show pipeline info and available sources
case-pipeline info

| Dataset | Cases | Tier | Description | License |
|---|---|---|---|---|
| MedCaseReasoning | 14,489 | Gold | Clinician-authored cases with diagnostic reasoning | CC-BY-4.0 |
| MedQA | 12,723 | Gold | USMLE exam questions | Open-Research |
| MedXpertQA | 4,460 | Gold | Expert-curated diagnostic cases | Open |
- Gold: Has diagnosis AND diagnostic reasoning (clinician-authored or expert-validated)
- Silver: Has diagnosis with reasoning but from less rigorous sources
- Bronze: Has diagnosis but NO reasoning (needs clinician validation)
- Raw: Unknown validation status (needs review)
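
One way to read the tier definitions above as a decision rule is sketched below. The enum `ValidationTier`, the function `assign_tier`, and its three boolean inputs are hypothetical names for this example. Mapping cases without a diagnosis to Raw is an assumption, since the definitions only say that Raw covers unknown validation status.

```python
from enum import Enum


class ValidationTier(str, Enum):
    GOLD = "gold"
    SILVER = "silver"
    BRONZE = "bronze"
    RAW = "raw"


def assign_tier(
    has_diagnosis: bool, has_reasoning: bool, rigorous_source: bool
) -> ValidationTier:
    """Map the three criteria from the tier list onto a tier (illustrative only)."""
    if has_diagnosis and has_reasoning:
        # Reasoning is present: source rigor separates Gold from Silver.
        return ValidationTier.GOLD if rigorous_source else ValidationTier.SILVER
    if has_diagnosis:
        # Diagnosis without reasoning needs clinician validation.
        return ValidationTier.BRONZE
    # Assumption: anything without a diagnosis is treated as unreviewed.
    return ValidationTier.RAW


if __name__ == "__main__":
    print(assign_tier(has_diagnosis=True, has_reasoning=False, rigorous_source=True).value)
```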
After enrichment pipeline improvements (2026-01-30):
| Metric | Before | After | Improvement |
|---|---|---|---|
| Chief Complaint Fill Rate | 3% | 87% | +84pp |
| Mean Richness Score | - | 75.4 | - |
| Ethiopia CME Eligible | 0% | 87.8% | +87.8pp |
| SAIF Eligible | - | 85%+ | - |
The canonical schema uses standard terminologies for interoperability:
| Standard | Purpose | Examples |
|---|---|---|
| SNOMED-CT | Diagnoses, findings | 233604007 (Pneumonia), 44054006 (Type 2 diabetes mellitus) |
| LOINC | Labs, vitals | 2339-0 (Glucose), 8867-4 (Heart rate) |
| RxNorm | Medications | 1049221 (Metformin 500mg), 197361 (Lisinopril) |
| ICD-10 | Classification | J18.9 (Pneumonia), E11.9 (Type 2 DM) |
All cases are transformed to a unified schema:
from clinical_case_pipeline.transform import ClinicalCaseCanonical
# Each case has:
case.id # Unique identifier
case.source # Provenance (dataset, license, citation)
case.vignette_text # Full case narrative
case.patient_demographics # Age, sex, location
case.chief_complaint # Primary presenting complaint
case.diagnosis # SNOMED/ICD-10 coded diagnosis
case.diagnostic_reasoning # Explanation (required for Gold tier)
case.vital_signs # LOINC-coded measurements
case.lab_results # LOINC-coded results
case.medications # RxNorm-coded medications
case.validation_tier # gold, silver, bronze, raw

Matches the Case interface for the Ethiopia Clinical Training app:
- Supabase-ready JSON
- Includes difficulty rating, specialty tags
- Amharic translation support planned
FHIR-compatible format for CollectiveGood's SAIF validation platform:
- SNOMED/LOINC coded observations
- OMOP CDM concepts for research
- Validation tier metadata
Standard FHIR R4 Bundle format:
- Patient, Condition, Observation resources
- Suitable for EHR integration
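
For orientation, a FHIR R4 Bundle holding the three resource types listed above might look like the sketch below, built from plain dictionaries. The function `build_fhir_bundle` and its parameters are hypothetical; the exporter's real output may carry more fields and use a different bundle type. The SNOMED and LOINC codes are the ones from the terminology table.

```python
import json


def build_fhir_bundle(
    case_id: str, gender: str, snomed_code: str, diagnosis: str, heart_rate: float
) -> dict:
    """Assemble a minimal Bundle with Patient, Condition, and Observation."""
    subject = {"reference": f"Patient/{case_id}"}
    patient = {"resourceType": "Patient", "id": case_id, "gender": gender}
    condition = {
        "resourceType": "Condition",
        "subject": subject,
        "code": {
            "coding": [
                {
                    "system": "http://snomed.info/sct",
                    "code": snomed_code,
                    "display": diagnosis,
                }
            ]
        },
    }
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "subject": subject,
        "code": {
            "coding": [
                {"system": "http://loinc.org", "code": "8867-4", "display": "Heart rate"}
            ]
        },
        "valueQuantity": {
            "value": heart_rate,
            "unit": "beats/minute",
            "system": "http://unitsofmeasure.org",
            "code": "/min",
        },
    }
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": r} for r in (patient, condition, observation)],
    }


if __name__ == "__main__":
    bundle = build_fhir_bundle("case-001", "male", "233604007", "Pneumonia", 104)
    print(json.dumps(bundle, indent=2))
```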
| Platform | Min Score | Required Fields |
|---|---|---|
| Ethiopia CME | 70 | Chief complaint, HPI, demographics, diagnosis, reasoning |
| SAIF Validation | 60 | Demographics, presentation, diagnosis |
| Research | 50 | Presentation, diagnosis |
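
The table above can be read as a simple eligibility rule: a case qualifies for a platform when its richness score meets the minimum and every required field is filled. The sketch below illustrates that reading. The names `PLATFORM_REQUIREMENTS` and `eligible_platforms`, the field keys, and the inclusive threshold are assumptions for this example, not the logic in `quality_checks.py`.

```python
# Minimum score and required fields per platform, taken from the table above.
# Field keys are hypothetical shorthand for the table's column entries.
PLATFORM_REQUIREMENTS = {
    "ethiopia_cme": (
        70,
        ("chief_complaint", "hpi", "demographics", "diagnosis", "reasoning"),
    ),
    "saif": (60, ("demographics", "presentation", "diagnosis")),
    "research": (50, ("presentation", "diagnosis")),
}


def eligible_platforms(richness: float, fields: dict) -> list:
    """List the platforms whose score and field requirements a case meets."""
    present = {name for name, value in fields.items() if value}
    return [
        platform
        for platform, (min_score, required) in PLATFORM_REQUIREMENTS.items()
        # Assumption: the minimum score is inclusive.
        if richness >= min_score and set(required) <= present
    ]


if __name__ == "__main__":
    case_fields = {
        "demographics": {"age": 54, "sex": "male"},
        "presentation": "Fever and productive cough for three days.",
        "diagnosis": "Pneumonia",
        "reasoning": "",  # empty values count as missing
    }
    print(eligible_platforms(65, case_fields))
```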
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=clinical_case_pipeline
# Lint
ruff check src/
# Type check
mypy src/

src/clinical_case_pipeline/
├── ingest/ # Dataset loaders
│ ├── huggingface_loader.py
│ └── scrapers/ # Journal scrapers
├── transform/ # Canonical schema
│ └── schemas.py # Pydantic models
├── enrich/ # Enrichment pipeline
│ ├── enricher.py # Orchestrator
│ ├── regex_patterns.py
│ ├── llm_extractor.py
│ └── medical_coding.py
├── validate/ # Quality checks
│ └── quality_checks.py
├── export/ # Platform adapters
│ ├── ethiopia_adapter.py
│ └── saif_adapter.py
└── cli.py # Typer CLI
MIT License - See LICENSE file for details.
- Ethiopia Clinical Training - CME platform
- CG Validation Demo API - SAIF validation platform
Developed by DHEPLab for CollectiveGood