Clinical case vignette collection and processing system for CollectiveGood applications

DHEPLab/clinical-case-pipeline

Clinical Case Pipeline

A validated clinical case vignette collection and processing system for CollectiveGood applications. It aggregates cases from academic sources, transforms them to a canonical schema with SNOMED/LOINC/RxNorm coding, and exports them to multiple platforms, including the Ethiopia CME app and the SAIF validation platform.

Key Features

  • 30,000+ validated cases from Gold-tier academic sources
  • Multi-stage enrichment pipeline with regex and LLM extraction
  • Medical coding with LOINC (180+ codes), SNOMED-CT (100+ diagnoses), RxNorm (150+ medications)
  • Quality scoring with richness metrics and platform eligibility checks
  • Multi-platform export to Ethiopia CME, SAIF, and FHIR R4 formats

Pipeline Architecture

┌─────────────┐    ┌───────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   INGEST    │───▸│   TRANSFORM   │───▸│   ENRICH    │───▸│   VALIDATE   │───▸│   EXPORT    │
│             │    │               │    │             │    │              │    │             │
│ HuggingFace │    │ Canonical     │    │ Regex + LLM │    │ Quality      │    │ Ethiopia    │
│ PubMed/PMC  │    │ Schema        │    │ Medical     │    │ Scoring      │    │ SAIF        │
│ Journals    │    │ Pydantic v2   │    │ Coding      │    │ Tier         │    │ FHIR R4     │
└─────────────┘    └───────────────┘    └─────────────┘    └──────────────┘    └─────────────┘
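
The stage ordering in the diagram can be sketched as a chain of functions that each take and return a list of cases. This is only an illustration of the data flow: the stage bodies, the `run_pipeline` helper, and the dictionary field names are hypothetical and are not taken from the project's code.

```python
from functools import reduce
from typing import Callable

Case = dict  # stand-in for the project's canonical case model
Stage = Callable[[list], list]


def transform(cases: list) -> list:
    # Normalise raw records to one shape (hypothetical field names).
    return [{"id": c["id"], "vignette_text": c.get("text", "").strip()} for c in cases]


def enrich(cases: list) -> list:
    # Placeholder enrichment: record the narrative length.
    return [{**c, "n_chars": len(c["vignette_text"])} for c in cases]


def validate(cases: list) -> list:
    # Drop cases whose narrative is empty.
    return [c for c in cases if c["vignette_text"]]


def run_pipeline(cases: list, stages: list) -> list:
    # Feed the output of each stage into the next one, left to right.
    return reduce(lambda acc, stage: stage(acc), stages, cases)


raw = [
    {"id": "a", "text": " 45-year-old with chest pain. "},
    {"id": "b", "text": ""},
]
result = run_pipeline(raw, [transform, enrich, validate])
```

Keeping every stage on the same list-in, list-out signature means a stage can be skipped or reordered without touching the others.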

Quick Start

Installation

# Clone the repository
git clone https://github.com/DHEPLab/clinical-case-pipeline.git
cd clinical-case-pipeline

# Install with uv (recommended)
uv pip install -e .

# Or with pip
pip install -e .

# Install optional dependencies
pip install -e ".[dev]"    # Development tools
pip install -e ".[llm]"    # LLM extraction (anthropic, openai)

Basic Workflow

# 1. Load Gold tier cases from HuggingFace
case-pipeline ingest medcasereasoning --max 100

# 2. Enrich with structured field extraction
case-pipeline enrich --input data/processed/cases.json

# 3. Validate quality and generate report
case-pipeline validate --min-richness 70

# 4. Export to platform format
case-pipeline export ethiopia --min-richness 70

# View statistics
case-pipeline stats

CLI Commands

ingest - Load Datasets

# Load specific dataset
case-pipeline ingest medcasereasoning --max 100 --output data/cases.json

# Load all available datasets
case-pipeline ingest all

# Available datasets: medcasereasoning, medqa, medxpertqa, all

enrich - Extract Structured Fields

# Regex extraction (fast, free)
case-pipeline enrich --input data/cases.json

# With LLM extraction (more thorough, requires ANTHROPIC_API_KEY)
export ANTHROPIC_API_KEY="sk-..."
case-pipeline enrich --use-llm --batch-size 10

# Preview without saving
case-pipeline enrich --dry-run --max 10

Extraction targets:

  • Chief complaint
  • Vital signs (BP, HR, RR, Temp, SpO2)
  • Lab results (CBC, BMP, LFTs)
  • Physical exam findings
  • Past medical history
  • Medications
  • Medical coding (LOINC, SNOMED, RxNorm)
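
As a rough illustration of the regex path for vital signs, the sketch below pulls the five listed vitals out of free text. The patterns and the `extract_vitals` function are assumptions made for this example; they are not the contents of the project's `regex_patterns.py` and would miss many real-world phrasings.

```python
import re

# Hypothetical patterns for illustration only.
VITAL_PATTERNS = {
    "blood_pressure": re.compile(
        r"\b(?:BP|blood pressure)\D{0,15}?(\d{2,3})\s*/\s*(\d{2,3})", re.I
    ),
    "heart_rate": re.compile(r"\b(?:HR|heart rate|pulse)\D{0,15}?(\d{2,3})", re.I),
    "respiratory_rate": re.compile(r"\b(?:RR|respiratory rate)\D{0,15}?(\d{1,2})", re.I),
    "temperature": re.compile(r"\b(?:temp(?:erature)?)\D{0,15}?(\d{2,3}(?:\.\d)?)", re.I),
    "spo2": re.compile(r"\b(?:SpO2|oxygen saturation)\D{0,15}?(\d{2,3})\s*%", re.I),
}


def extract_vitals(text: str) -> dict:
    """Return the first match for each vital sign found in the text."""
    vitals = {}
    for name, pattern in VITAL_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue  # leave the field unset rather than guessing
        values = [float(group) for group in match.groups()]
        # Blood pressure has two captures (systolic, diastolic); the rest have one.
        vitals[name] = values[0] if len(values) == 1 else tuple(values)
    return vitals


note = "BP 150/90 mmHg, HR 102 bpm, RR 22, Temp 38.4 C, SpO2 94% on room air."
vitals = extract_vitals(note)
```

Fields with no match are simply absent from the result, which leaves room for a later pass (such as the LLM extraction) to fill them.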

validate - Quality Checks

# Basic validation
case-pipeline validate --input data/cases.json

# With quality thresholds
case-pipeline validate --min-richness 70 --require-reasoning

# Generate report
case-pipeline validate --output reports/quality.json

export - Platform Export

# Ethiopia CME app format (Supabase-ready)
case-pipeline export ethiopia --min-richness 70

# SAIF platform format (FHIR-compatible)
case-pipeline export saif

# FHIR R4 Bundle format
case-pipeline export fhir

# Export to all formats
case-pipeline export all --output-dir data/validated

scrape - Journal Scrapers

# Preview available articles
case-pipeline scrape discover nejm --max 10

# Scrape cases from journal
case-pipeline scrape journal nejm --max 50

# With LLM extraction fallback
case-pipeline scrape journal nejm --use-llm

# Check storage status
case-pipeline scrape status

stats & info - Diagnostics

# View case statistics
case-pipeline stats

# Show pipeline info and available sources
case-pipeline info

Available Datasets

| Dataset | Cases | Tier | Description | License |
|---|---|---|---|---|
| MedCaseReasoning | 14,489 | Gold | Clinician-authored cases with diagnostic reasoning | CC-BY-4.0 |
| MedQA | 12,723 | Gold | USMLE exam questions | Open-Research |
| MedXpertQA | 4,460 | Gold | Expert-curated diagnostic cases | Open |

Validation Tiers

  • Gold: Has diagnosis AND diagnostic reasoning (clinician-authored or expert-validated)
  • Silver: Has diagnosis with reasoning but from less rigorous sources
  • Bronze: Has diagnosis but NO reasoning (needs clinician validation)
  • Raw: Unknown validation status (needs review)
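
The tier rules above can be read as a small decision function. The sketch below is one possible encoding, not the project's implementation: the `assign_tier` name and its boolean inputs are hypothetical, and treating a case without a diagnosis as Raw is an assumption, since the list only describes Raw as "unknown validation status".

```python
from enum import Enum


class ValidationTier(str, Enum):
    GOLD = "gold"
    SILVER = "silver"
    BRONZE = "bronze"
    RAW = "raw"


def assign_tier(has_diagnosis: bool, has_reasoning: bool, rigorous_source: bool) -> ValidationTier:
    """Map the three yes/no facts about a case onto a tier."""
    if not has_diagnosis:
        # Assumption: no diagnosis means the status cannot be judged yet.
        return ValidationTier.RAW
    if not has_reasoning:
        # Diagnosis but no reasoning: needs clinician validation.
        return ValidationTier.BRONZE
    # Diagnosis and reasoning: the source's rigour separates Gold from Silver.
    return ValidationTier.GOLD if rigorous_source else ValidationTier.SILVER
```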

Performance Metrics

After enrichment pipeline improvements (2026-01-30):

| Metric | Before | After | Improvement |
|---|---|---|---|
| Chief Complaint Fill Rate | 3% | 87% | +84pp |
| Mean Richness Score | - | 75.4 | - |
| Ethiopia CME Eligible | 0% | 87.8% | +87.8pp |
| SAIF Eligible | - | 85%+ | - |
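
For readers unfamiliar with the "pp" (percentage point) notation, the sketch below shows how a fill rate and its change could be computed. The helper names and the toy cases are hypothetical; only the 3% and 87% figures come from the table.

```python
def fill_rate(cases: list, field: str) -> float:
    """Percentage of cases in which the field is present and non-empty."""
    if not cases:
        return 0.0
    filled = sum(1 for case in cases if case.get(field))
    return 100.0 * filled / len(cases)


def percentage_point_change(before: float, after: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after - before


# Toy data: three of four cases have a chief complaint.
toy_cases = [
    {"chief_complaint": "chest pain"},
    {"chief_complaint": "fever"},
    {"chief_complaint": ""},
    {"chief_complaint": "cough"},
]
toy_rate = fill_rate(toy_cases, "chief_complaint")

# The reported chief complaint fill rates: 3% before, 87% after.
reported_change = percentage_point_change(3.0, 87.0)
```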

Medical Coding Standards

The canonical schema uses standard terminologies for interoperability:

| Standard | Purpose | Examples |
|---|---|---|
| SNOMED-CT | Diagnoses, findings | 233604007 (Pneumonia), 44054006 (Diabetes) |
| LOINC | Labs, vitals | 2339-0 (Glucose), 8867-4 (Heart rate) |
| RxNorm | Medications | 1049221 (Metformin 500mg), 197361 (Lisinopril) |
| ICD-10 | Classification | J18.9 (Pneumonia), E11.9 (Type 2 DM) |
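
A minimal sketch of how a term-to-code lookup might work, using a few of the example codes from the table. The `Coding` tuple, `CODE_TABLE`, and `lookup` are hypothetical names for this illustration and do not describe the project's `medical_coding.py`; the system URIs are the ones FHIR uses for SNOMED CT and LOINC.

```python
from typing import NamedTuple, Optional


class Coding(NamedTuple):
    system: str
    code: str
    display: str


SNOMED = "http://snomed.info/sct"
LOINC = "http://loinc.org"

# A tiny hand-written table; a real mapping would be far larger.
CODE_TABLE = {
    "pneumonia": Coding(SNOMED, "233604007", "Pneumonia"),
    "glucose": Coding(LOINC, "2339-0", "Glucose"),
    "heart rate": Coding(LOINC, "8867-4", "Heart rate"),
}


def lookup(term: str) -> Optional[Coding]:
    """Return the coding for a term, or None when the term is not in the table."""
    return CODE_TABLE.get(term.strip().lower())
```

Returning `None` for unknown terms, rather than a best guess, keeps uncoded values visible to later quality checks.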

Canonical Schema

All cases are transformed to a unified schema:

from clinical_case_pipeline.transform import ClinicalCaseCanonical

# Given `case`, a ClinicalCaseCanonical instance, the main attributes are:
case.id                    # Unique identifier
case.source                # Provenance (dataset, license, citation)
case.vignette_text         # Full case narrative
case.patient_demographics  # Age, sex, location
case.chief_complaint       # Primary presenting complaint
case.diagnosis             # SNOMED/ICD-10 coded diagnosis
case.diagnostic_reasoning  # Explanation (required for Gold tier)
case.vital_signs           # LOINC-coded measurements
case.lab_results           # LOINC-coded results
case.medications           # RxNorm-coded medications
case.validation_tier       # gold, silver, bronze, raw
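
To make the shape of such a record concrete, here is a rough stand-in built with standard-library dataclasses. It is not the project's `ClinicalCaseCanonical` (which the architecture diagram says is a Pydantic v2 model); the `CaseSketch` name, the field types, and the defaults are assumptions. The only rule taken from the listing above is that Gold tier requires diagnostic reasoning.

```python
from dataclasses import dataclass, field
from typing import Optional

ALLOWED_TIERS = {"gold", "silver", "bronze", "raw"}


@dataclass
class CaseSketch:
    # Hypothetical stand-in for the canonical model; types are guesses.
    id: str
    source: dict
    vignette_text: str
    chief_complaint: Optional[str] = None
    diagnosis: Optional[str] = None
    diagnostic_reasoning: Optional[str] = None
    patient_demographics: dict = field(default_factory=dict)
    vital_signs: list = field(default_factory=list)
    lab_results: list = field(default_factory=list)
    medications: list = field(default_factory=list)
    validation_tier: str = "raw"

    def __post_init__(self) -> None:
        if self.validation_tier not in ALLOWED_TIERS:
            raise ValueError(f"unknown tier: {self.validation_tier!r}")
        if self.validation_tier == "gold" and not self.diagnostic_reasoning:
            raise ValueError("gold tier requires diagnostic_reasoning")


example = CaseSketch(
    id="case-001",
    source={"dataset": "medcasereasoning", "license": "CC-BY-4.0"},
    vignette_text="A 45-year-old presents with fever and productive cough.",
    diagnosis="Pneumonia",
    diagnostic_reasoning="Fever, cough, and focal crackles suggest pneumonia.",
    validation_tier="gold",
)
```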

Platform Export Formats

Ethiopia CME

Matches the Case interface for the Ethiopia Clinical Training app:

  • Supabase-ready JSON
  • Includes difficulty rating, specialty tags
  • Amharic translation support planned

SAIF Platform

FHIR-compatible format for CollectiveGood's SAIF validation platform:

  • SNOMED/LOINC coded observations
  • OMOP CDM concepts for research
  • Validation tier metadata

FHIR R4

Standard FHIR R4 Bundle format:

  • Patient, Condition, Observation resources
  • Suitable for EHR integration
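
For orientation, a FHIR R4 Bundle with the three resource types listed above could look roughly like the dictionary built below. This is a hand-written sketch, not output from `case-pipeline export fhir`: the `build_bundle` function and its parameters are hypothetical, and a production exporter would include more required and recommended elements.

```python
import json


def build_bundle(case_id: str, sex: str, snomed_code: str,
                 diagnosis_display: str, heart_rate_bpm: float) -> dict:
    """Assemble a minimal Bundle holding a Patient, a Condition, and an Observation."""
    patient_ref = {"reference": f"Patient/{case_id}"}
    patient = {"resourceType": "Patient", "id": case_id, "gender": sex}
    condition = {
        "resourceType": "Condition",
        "subject": patient_ref,
        "code": {"coding": [{
            "system": "http://snomed.info/sct",
            "code": snomed_code,
            "display": diagnosis_display,
        }]},
    }
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "subject": patient_ref,
        "code": {"coding": [{
            "system": "http://loinc.org",
            "code": "8867-4",
            "display": "Heart rate",
        }]},
        "valueQuantity": {
            "value": heart_rate_bpm,
            "unit": "beats/minute",
            "system": "http://unitsofmeasure.org",
            "code": "/min",
        },
    }
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": r} for r in (patient, condition, observation)],
    }


bundle = build_bundle("case-001", "female", "233604007", "Pneumonia", 102)
bundle_json = json.dumps(bundle, indent=2)
```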

Quality Thresholds

| Platform | Min Score | Required Fields |
|---|---|---|
| Ethiopia CME | 70 | Chief complaint, HPI, demographics, diagnosis, reasoning |
| SAIF Validation | 60 | Demographics, presentation, diagnosis |
| Research | 50 | Presentation, diagnosis |
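
The thresholds in the table combine a minimum score with a set of required fields. The sketch below shows one way to check both; the `eligible_platforms` function and the dictionary keys (`hpi`, `presentation`, and so on) are hypothetical names chosen for the example, and the richness score is taken as an input rather than computed.

```python
# (minimum richness score, required fields) per platform, from the table above.
THRESHOLDS = {
    "ethiopia": (70, ("chief_complaint", "hpi", "demographics", "diagnosis", "reasoning")),
    "saif": (60, ("demographics", "presentation", "diagnosis")),
    "research": (50, ("presentation", "diagnosis")),
}


def eligible_platforms(case: dict, richness: float) -> list:
    """List the platforms whose score and field requirements the case meets."""
    eligible = []
    for platform, (min_score, required) in THRESHOLDS.items():
        has_fields = all(case.get(name) for name in required)
        if richness >= min_score and has_fields:
            eligible.append(platform)
    return eligible


case = {
    "demographics": {"age": 45, "sex": "female"},
    "presentation": "Fever and productive cough for three days.",
    "diagnosis": "Pneumonia",
}
platforms = eligible_platforms(case, richness=65)
```

Here the case qualifies for SAIF and Research but not Ethiopia CME, because it scores below 70 and lacks a chief complaint, HPI, and reasoning.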

Development

Setup

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=clinical_case_pipeline

# Lint
ruff check src/

# Type check
mypy src/

Project Structure

src/clinical_case_pipeline/
├── ingest/              # Dataset loaders
│   ├── huggingface_loader.py
│   └── scrapers/        # Journal scrapers
├── transform/           # Canonical schema
│   └── schemas.py       # Pydantic models
├── enrich/              # Enrichment pipeline
│   ├── enricher.py      # Orchestrator
│   ├── regex_patterns.py
│   ├── llm_extractor.py
│   └── medical_coding.py
├── validate/            # Quality checks
│   └── quality_checks.py
├── export/              # Platform adapters
│   ├── ethiopia_adapter.py
│   └── saif_adapter.py
└── cli.py               # Typer CLI

License

MIT License - See LICENSE file for details.

Related Projects


Developed by DHEPLab for CollectiveGood
