This project uses OCR and machine learning to extract CBC values from reports and predict urgency levels. As of now, it supports image/pdf inputs, manual corrections, and SHAP explainability. Ideal for medical AI, healthcare OCR, and automated lab report analysis.

antarades/CBC_report_interpreter

CBC Report Interpreter using Explainable AI

An OCR-powered, ML-assisted diagnostic helper for CBC (Complete Blood Count) reports. This Streamlit web app reads CBC reports (images and PDFs), extracts values with Tesseract OCR, lets users correct them, predicts urgency using a hybrid ML + rule engine, and explains decisions with SHAP.


🚀 Features

OCR Extraction (Highly Robust)

  • Uses Tesseract OCR + OpenCV preprocessing for high-accuracy extraction
  • Handles noisy scans, low DPI images, hyphens, separators, and malformed text
  • Fuzzy-matches medical terms like “RDW-CV”, “RAW CV”, “row cv”, “rdw cv”, etc.
  • Smart numerical sanity checks (e.g., CV range 8–25, SD range 30–80)
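The fuzzy matching and range checks listed above could look roughly like the sketch below. It uses the standard-library `difflib` rather than whatever `extractor.py` actually uses, and the alias list, the `cutoff` value, and the function names are assumptions made for illustration. Only the CV (8–25) and SD (30–80) ranges come from this README.

```python
import difflib

# Hypothetical alias table; the real extractor's vocabulary is not shown here.
RDW_ALIASES = ["rdw cv", "rdw sd"]

# Sanity ranges quoted in the feature list above.
RANGES = {"RDWCV": (8, 25), "RDWSD": (30, 80)}


def normalize(label):
    """Lowercase the label and turn hyphens/separators into single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in label.lower())
    return " ".join(cleaned.split())


def match_rdw_label(raw_label, cutoff=0.6):
    """Return 'RDWCV', 'RDWSD', or None for a possibly misread OCR label."""
    hits = difflib.get_close_matches(
        normalize(raw_label), RDW_ALIASES, n=1, cutoff=cutoff
    )
    if not hits:
        return None
    return "RDWCV" if hits[0] == "rdw cv" else "RDWSD"


def plausible(field, value):
    """Check a parsed number against the sanity range for its field."""
    low, high = RANGES[field]
    return low <= value <= high
```

With this sketch, `match_rdw_label("RAW CV")` resolves to `"RDWCV"`, and a reading such as 132 for RDW-CV (a dropped decimal point) fails `plausible` and can be flagged for manual correction.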

Editable Values (Double-Check Mechanism)

  • After extraction, users can toggle Edit Mode
  • Manually correct any value before analysis
  • Ensures reliable predictions even if OCR misreads something

Hybrid Urgency Classification

The system uses two layers:

  1. Machine Learning Model (RandomForestClassifier)

    • Predicts Normal / Mild / Urgent / Emergency
    • Uses 10 features: HGB, WBC, RBC, PLT, HCT, MCV, MCH, MCHC, RDWSD, RDWCV
  2. Medical Rule Engine

    • Applies domain-based thresholds & weighted deviations

    • Computes severity scores

    • Overrides model when:

      • Model overreacts
      • Rules indicate normal or mild deviation

✅ Prevents false “Urgent” predictions
✅ Ensures safety-first interpretation
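A minimal sketch of the override described above, assuming both layers emit one of the four urgency labels. The function name `combine` and the ordering list are illustrative and not taken from `file_predict.py`.

```python
# Urgency labels from least to most severe, as listed in this README.
LEVELS = ["Normal", "Mild", "Urgent", "Emergency"]


def combine(ml_label, rule_label):
    """Return the final label, capping the model when the rules disagree.

    The model is overridden only when it predicts something more severe
    than the rules AND the rules indicate a normal or mild deviation.
    """
    ml_rank = LEVELS.index(ml_label)
    rule_rank = LEVELS.index(rule_label)
    if rule_label in ("Normal", "Mild") and ml_rank > rule_rank:
        return rule_label
    return ml_label
```

For example, `combine("Urgent", "Mild")` yields `"Mild"`, while `combine("Emergency", "Urgent")` keeps the model's `"Emergency"`. This README does not say what happens when the rules are more severe than the model, so the sketch leaves that case to the model's label.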

SHAP Explainability

Each prediction comes with a SHAP bar plot:

  • Shows which CBC values influenced the ML model
  • Helps doctors and patients understand model reasoning

PDF + Image Support

OCR works on:

  • JPG
  • PNG
  • JPEG
  • PDF

🧠 How It Works

1. Upload CBC report

The app automatically:

  • Reads text using Tesseract
  • Preprocesses images (resize, denoise, sharpen, threshold)
  • Extracts CBC values using fuzzy matching + regex + range logic
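To illustrate the regex part of the extraction step, here is a hedged sketch that pulls the first number following a known label on each OCR line. The label patterns and the function name `parse_lines` are hypothetical, and the real parser also applies fuzzy matching and range logic that this sketch omits.

```python
import re

# Hypothetical label patterns; the real extractor's vocabulary may differ.
LABELS = {
    "HGB": r"h(?:a)?emoglobin|hgb|hb",
    "WBC": r"wbc|total\s+leu[ck]ocyte\s+count",
    "PLT": r"platelet(?:\s+count)?|plt",
}
NUMBER = r"(\d+(?:\.\d+)?)"


def parse_lines(ocr_text):
    """Map field names to the first number found after each label."""
    values = {}
    for line in ocr_text.splitlines():
        for field, pattern in LABELS.items():
            # Label, then any non-digit separators (":", "-", units), then a number.
            match = re.search(
                rf"\b(?:{pattern})\b[^\d\n]*{NUMBER}", line, re.IGNORECASE
            )
            if match and field not in values:
                values[field] = float(match.group(1))
    return values
```

Given lines such as `Hemoglobin (Hb) : 13.5 g/dL` or `WBC - 7.2 10^3/uL`, the sketch tolerates the varying separators between label and value and keeps only the first number after the label.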

2. Edit (Optional)

A toggle appears:

[ ] Edit values

When enabled:

  • Editable numeric fields appear
  • User-entered values override the OCR output
  • Updated values are fed to the ML model and the rule engine

3. Classification

  • ML predicts urgency
  • Rules evaluate deviation from normal ranges
  • Final decision is combined for safety

4. Results

You get:

  • ✅ Overall urgency
  • ✅ What stands out
  • ✅ What it could mean
  • ✅ What you should do
  • ✅ SHAP feature impact
  • ✅ Downloadable auto-generated PDF report

📸 Screenshots


🔹 1. Home Page / Upload Screen

Upload Screen

Users upload CBC reports in JPG/PNG/JPEG/PDF formats.


🔹 2. Extracted Values (OCR Output)

Extracted Values

Automatically detected CBC readings with an option to edit.


🔹 3. Edit Mode (Manual Correction)

Edit Mode

Users can correct misread values before analysis.


🔹 4. Final Urgency Result

Urgency Prediction

Combined ML + medical rule-based urgency classification.


🔹 5. SHAP Explainability Plot & Downloadable Summary Report Option

SHAP Plot

Visual explanation showing feature importance in model decision and PDF summary including findings and recommendations.


🛠️ Installation

1. Clone repo

git clone https://github.com/antarades/CBC_report_interpreter.git
cd CBC_report_interpreter

2. Create virtual environment

python -m venv .venv
source .venv/bin/activate          # macOS/Linux
.venv\Scripts\activate             # Windows

3. Install requirements

pip install -r requirements.txt

4. Install Tesseract

Download from: https://github.com/tesseract-ocr/tesseract

Update path inside extractor.py:

pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/Tesseract-OCR/tesseract.exe"
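The hard-coded path above only suits a default Windows install. One possible alternative, sketched here as an assumption rather than what `extractor.py` does, is to look for `tesseract` on `PATH` first and fall back to that Windows location. The helper name `resolve_tesseract_cmd` is hypothetical.

```python
import os
import shutil

# Default Windows install location, as given in the instructions above.
WINDOWS_DEFAULT = r"C:/Program Files/Tesseract-OCR/tesseract.exe"


def resolve_tesseract_cmd():
    """Return a usable tesseract executable path, or None if none is found."""
    found = shutil.which("tesseract")  # searches PATH on any platform
    if found:
        return found
    if os.path.isfile(WINDOWS_DEFAULT):
        return WINDOWS_DEFAULT
    return None


# In extractor.py one might then write (pytesseract import assumed):
#   cmd = resolve_tesseract_cmd()
#   if cmd:
#       pytesseract.pytesseract.tesseract_cmd = cmd
```

On macOS and Linux, where package managers usually place `tesseract` on `PATH`, this avoids editing the source file at all.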

▶️ Running the App

streamlit run app.py

📂 Project Structure

project/
│
├── app.py                     # Streamlit UI
├── extractor.py               # OCR + line parsing + fuzzy RDW logic
├── file_predict.py            # Rules, normalization, final decision logic
├── explain_cbc_model.py       # SHAP visualizer script
├── cbc_model.pkl              # Trained RandomForest model
├── label_encoder.pkl          # Label encoder for urgency classes
└── requirements.txt

🧪 Machine Learning Model

Model

  • RandomForestClassifier
  • Trained on labeled CBC dataset (Normal, Mild, Urgent, Emergency)

Features

HGB, WBC, RBC, PLT, HCT, MCV, MCH, MCHC, RDWSD, RDWCV

Safeguards

  • Missing RDW values (from the old model version) are handled automatically
  • Rules override unsafe ML predictions
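One possible reading of the RDW safeguard is that an older model was trained without the two RDW features, so the input row must be trimmed to match. The sketch below follows that reading. It is an assumption, not the confirmed behaviour of `file_predict.py`, and `build_row` and `OLD_FEATURES` are hypothetical names. The feature order is the one listed above.

```python
# Feature order as listed in this README.
FEATURES = ["HGB", "WBC", "RBC", "PLT", "HCT", "MCV", "MCH", "MCHC", "RDWSD", "RDWCV"]

# Assumption: an older model version used only the first eight features (no RDW).
OLD_FEATURES = FEATURES[:8]


def build_row(values, n_expected=len(FEATURES)):
    """Build a feature row in a fixed order for a model expecting n_expected inputs."""
    if n_expected not in (len(OLD_FEATURES), len(FEATURES)):
        raise ValueError(f"unsupported feature count: {n_expected}")
    names = FEATURES[:n_expected]
    missing = [name for name in names if name not in values]
    if missing:
        raise ValueError(f"missing CBC values: {missing}")
    return [float(values[name]) for name in names]
```

A fitted scikit-learn estimator exposes `n_features_in_`, which could be passed as `n_expected` so the same code serves either model version.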

⚠️ Limitations

  • OCR accuracy depends on scan clarity
  • Model should be retrained periodically with better and larger datasets
  • Does not replace medical diagnosis; provides guidance only

✅ Future Enhancements

  • Vision transformer for OCR
  • Fine-tuned lightweight model for structured table extraction
  • Multi-report batch processing
  • Confidence scoring
