An OCR-powered, ML-assisted diagnostic helper for CBC (Complete Blood Count) reports. This Streamlit web app reads CBC reports (images and PDFs), extracts values with OCR, lets users correct them, predicts urgency with a hybrid ML + rule engine, and explains decisions with SHAP.
- Uses Tesseract OCR + OpenCV preprocessing for high-accuracy extraction
- Handles noisy scans, low DPI images, hyphens, separators, and malformed text
- Fuzzy-matches medical terms like "RDW-CV", "RAW CV", "row cv", "rdw cv", etc.
- Smart numerical sanity checks (e.g., CV range 8–25, SD range 30–80)
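The fuzzy matching and sanity checks listed above could be sketched with the standard library alone. This is a minimal illustration, not the code in `extractor.py`: the alias table, the `normalize_label` and `passes_sanity_check` helpers, and the `0.7` cutoff are all assumptions; only the two RDW ranges come from the list above.

```python
import difflib
import re

# Hypothetical alias table; the project's real vocabulary may differ.
CANONICAL_LABELS = {
    "rdwcv": "RDWCV",
    "rdwsd": "RDWSD",
    "hgb": "HGB",
    "wbc": "WBC",
    "plt": "PLT",
}

# Sanity ranges quoted in the feature list above.
SANITY_RANGES = {"RDWCV": (8.0, 25.0), "RDWSD": (30.0, 80.0)}


def normalize_label(raw, cutoff=0.7):
    """Map a noisy OCR label to a canonical name, or None if nothing is close."""
    key = re.sub(r"[^a-z]", "", raw.lower())  # drop hyphens, spaces, digits
    match = difflib.get_close_matches(key, list(CANONICAL_LABELS), n=1, cutoff=cutoff)
    return CANONICAL_LABELS[match[0]] if match else None


def passes_sanity_check(label, value):
    """Accept a value unless it falls outside a known plausible range."""
    low, high = SANITY_RANGES.get(label, (float("-inf"), float("inf")))
    return low <= value <= high
```

With this sketch, `normalize_label("RAW CV")` and `normalize_label("row cv")` both resolve to `"RDWCV"`, and a misread such as `135.0` for RDW-CV is rejected by `passes_sanity_check`.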
- After extraction, users can toggle Edit Mode
- Manually correct any value before analysis
- Ensures reliable predictions even if OCR misreads something
The system uses two layers:
- Machine Learning Model (RandomForestClassifier)
  - Predicts Normal / Mild / Urgent / Emergency
  - Uses 10 features: HGB, WBC, RBC, PLT, HCT, MCV, MCH, MCHC, RDWSD, RDWCV
- Medical Rule Engine
  - Applies domain-based thresholds and weighted deviations
  - Computes severity scores
  - Overrides the model when:
    - The model overreacts
    - Rules indicate a normal or mild deviation
  - Prevents false "Urgent" predictions
  - Ensures safety-first interpretation
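A weighted-deviation severity score of the kind the rule engine describes might look roughly like the sketch below. The reference ranges, weights, score cut-offs, and function names are illustrative placeholders chosen for the example; they are not the thresholds used in `file_predict.py` and are not clinical guidance.

```python
# Illustrative reference ranges and weights only; NOT the project's thresholds.
REFERENCE_RANGES = {
    "HGB": (12.0, 17.0),
    "WBC": (4.0, 11.0),
    "PLT": (150.0, 450.0),
}
WEIGHTS = {"HGB": 2.0, "WBC": 1.5, "PLT": 1.5}


def relative_deviation(value, low, high):
    """Return 0 inside the range, else the overshoot as a fraction of the range width."""
    width = high - low
    if value < low:
        return (low - value) / width
    if value > high:
        return (value - high) / width
    return 0.0


def severity_score(values):
    """Sum weighted deviations over the features that have a reference range."""
    return sum(
        WEIGHTS[name] * relative_deviation(values[name], *REFERENCE_RANGES[name])
        for name in REFERENCE_RANGES
        if name in values
    )


def rule_label(score):
    """Map a score to an urgency label using hypothetical cut-offs."""
    if score == 0:
        return "Normal"
    if score < 0.5:
        return "Mild"
    if score < 1.5:
        return "Urgent"
    return "Emergency"
```

Expressing each deviation relative to the width of its range keeps features with very different units (g/dL versus cells per microlitre) on a comparable scale before weighting.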
Each prediction comes with a SHAP bar plot:
- Shows which CBC values influenced the ML model
- Helps doctors and patients understand model reasoning
OCR works on:
- JPG
- PNG
- JPEG
The app automatically:
- Reads text using Tesseract
- Preprocesses images (resize, denoise, sharpen, threshold)
- Extracts CBC values using fuzzy matching + regex + range logic
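The regex part of the extraction step can be illustrated on plain OCR text. This is a simplified sketch under stated assumptions: the `LABEL_PATTERNS` table and the `extract_values` helper are hypothetical, cover only three features, and take the first number that follows a recognised label on the same line.

```python
import re

# Hypothetical label patterns; real reports vary far more than this.
LABEL_PATTERNS = {
    "HGB": r"\b(?:ha?emoglobin|hgb|hb)\b",
    "WBC": r"\b(?:wbc|total\s+leu[ck]ocyte\s+count)\b",
    "PLT": r"\b(?:platelet(?:\s+count)?|plt)\b",
}
# A number with an optional decimal part; OCR often reads "." as ",".
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def extract_values(ocr_text):
    """Return {feature: float} using the first number after a recognised label."""
    found = {}
    for line in ocr_text.splitlines():
        for name, pattern in LABEL_PATTERNS.items():
            if name in found:
                continue  # keep the first occurrence of each feature
            label = re.search(pattern, line, flags=re.IGNORECASE)
            if label is None:
                continue
            number = NUMBER.search(line, label.end())
            if number:
                found[name] = float(number.group().replace(",", "."))
    return found
```

Searching for the number only after the end of the label match avoids picking up digits that precede the label, such as row numbers in a scanned table.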
A toggle appears:
[ ] Edit values
When enabled:
- Editable numeric fields appear
- User-entered values override the OCR results
- Updated values are fed to the ML model and the rule engine
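The override step can be thought of as a merge of user edits over the extracted values. The helper below is a hypothetical sketch, not the logic in `app.py`; it assumes edits arrive as raw text or numbers keyed by feature name, and that blank or non-numeric entries should leave the OCR value untouched.

```python
def apply_overrides(extracted, edits):
    """Return a new dict where valid user edits replace OCR values.

    `edits` maps feature names to whatever the user typed; blank or
    non-numeric entries are ignored so the OCR value is kept.
    """
    merged = dict(extracted)  # copy, so the OCR result is not mutated
    for name, raw in edits.items():
        try:
            merged[name] = float(raw)
        except (TypeError, ValueError):
            continue
    return merged
```

For example, `apply_overrides({"HGB": 1.35}, {"HGB": "13.5"})` yields `{"HGB": 13.5}`, which is what would then be passed on to the model and the rules.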
- ML predicts urgency
- Rules evaluate deviation from normal ranges
- Final decision is combined for safety
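One way to combine the two labels is sketched below. The first branch follows the behaviour described in this README (rules win when the model overreacts and the rules see at most a mild deviation); the fallback of keeping the more severe label is an assumption made for the example, and `final_decision` is a hypothetical name.

```python
SEVERITY_ORDER = ["Normal", "Mild", "Urgent", "Emergency"]


def final_decision(ml_label, rule_label):
    """Combine the ML label and the rule label into one urgency level."""
    ml_rank = SEVERITY_ORDER.index(ml_label)
    rule_rank = SEVERITY_ORDER.index(rule_label)
    # Model overreacts: rules see at most a mild deviation, so the rules win.
    if rule_rank <= 1 and ml_rank > rule_rank:
        return rule_label
    # Assumed fallback: otherwise keep the more severe of the two labels.
    return SEVERITY_ORDER[max(ml_rank, rule_rank)]
```

Under this policy, `final_decision("Urgent", "Mild")` returns `"Mild"`, while `final_decision("Normal", "Urgent")` returns `"Urgent"`.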
You get:
- Overall urgency
- What stands out
- What it could mean
- What you should do
- SHAP feature impact
- Downloadable auto-generated PDF report
Users upload CBC reports in JPG/PNG/JPEG/PDF formats.
Automatically detected CBC readings with an option to edit.
Users can correct misread values before analysis.
Combined ML + medical rule-based urgency classification.
Visual explanation showing feature importance in model decision and PDF summary including findings and recommendations.
git clone https://github.com/antarades/CBC_report_interpreter.git
cd CBC_report_interpreter
python -m venv .venv
source .venv/bin/activate # macOS/Linux
.venv\Scripts\activate # Windows
pip install -r requirements.txt
Download Tesseract from: https://github.com/tesseract-ocr/tesseract
Update path inside extractor.py:
pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/Tesseract-OCR/tesseract.exe"
streamlit run app.py
project/
│
├── app.py # Streamlit UI
├── extractor.py # OCR + line parsing + fuzzy RDW logic
├── file_predict.py # Rules, normalization, final decision logic
├── explain_cbc_model.py # SHAP visualizer script
├── cbc_model.pkl # Trained RandomForest model
├── label_encoder.pkl # Label encoder for urgency classes
└── requirements.txt
- RandomForestClassifier
- Trained on a labeled CBC dataset (Normal, Mild, Urgent, Emergency)
- Input features: HGB, WBC, RBC, PLT, HCT, MCV, MCH, MCHC, RDWSD, RDWCV
- Missing RDW values from the old model version are handled automatically
- Rules override unsafe ML predictions
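How the RDW handling works is not spelled out here, so the following is only one plausible reading: an older model trained without the two RDW columns receives an 8-value row, while the current model receives all 10. The `build_feature_row` helper is hypothetical; the feature order is the one listed above. For a fitted scikit-learn estimator, the expected column count is available as `n_features_in_`.

```python
FEATURES = ["HGB", "WBC", "RBC", "PLT", "HCT", "MCV", "MCH", "MCHC", "RDWSD", "RDWCV"]
RDW_FEATURES = {"RDWSD", "RDWCV"}


def build_feature_row(values, n_model_features):
    """Order values for the model; drop the RDW columns for an 8-feature model."""
    if n_model_features == len(FEATURES):
        names = FEATURES
    else:
        # Assumption: any other count means an older model trained without RDW.
        names = [f for f in FEATURES if f not in RDW_FEATURES]
    missing = [f for f in names if f not in values]
    if missing:
        raise ValueError(f"missing CBC values: {missing}")
    return [float(values[f]) for f in names]
```

A caller could then pass `build_feature_row(values, model.n_features_in_)` wrapped in a list to `model.predict`, where `model` is the estimator loaded from `cbc_model.pkl`.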
- OCR accuracy depends on scan clarity
- Model should be retrained periodically with better and larger datasets
- Does not replace medical diagnosis; provides guidance only
- Vision transformer for OCR
- Fine-tuned lightweight model for structured table extraction
- Multi-report batch processing
- Confidence scoring




