This project processes breast histology slides to classify samples by copy number (CN) events, focusing on 1q gain detection. The training pipeline processes Hamamatsu NDPI slide scans with NDPA annotations. The scans were taken after laser capture microdissection (LCM), and the microdissected samples were subsequently whole genome sequenced to determine copy number changes.
The project consists of training scripts that run in sequence to:
- Pair slide images (.ndpi) with their most recent free-form annotations (.ndpa) (01_get_ndpa_pairs.py)
- Extract a cropped image from the cut area, determined from the coordinates of the annotated regions (02_extract_cuts_from_slide_scan.py)
- Map samples to copy number events (1q gain or no CN event) (03_map_slide_to_CN.py)
- Train a machine learning model to classify copy number events such as 1q gain (04_train_model.py)
Purpose: Automatically discovers and pairs Hamamatsu NDPI slide files with their corresponding NDPA annotation files.
What it does:
- Scans a base directory (/Volumes/casm-lcm/tb14/Breast/) for NDPI files
- Filters for slides containing "slide" and "PD" in the path
- Matches each NDPI file with its corresponding NDPA annotation file based on overlapping basename
- Handles filename variations (e.g., .edited suffixes, ISO-8601 date stamps)
- Selects the most recently modified NDPA file if multiple exist
- Outputs a JSON mapping of NDPI paths to NDPA paths
Output: ndpi_ndpa_pairs.json - A JSON file mapping NDPI slide paths to their annotation files
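The pairing steps above can be sketched as follows. This is a minimal illustration, not the contents of 01_get_ndpa_pairs.py: the function name `pair_ndpi_with_ndpa`, the `required` parameter, and the matching rule (an annotation sits next to its slide and its filename starts with the slide's basename) are assumptions made for the example.

```python
from pathlib import Path


def pair_ndpi_with_ndpa(base_dir, required=("slide", "PD")):
    """Map each .ndpi path to its most recently modified matching .ndpa path."""
    pairs = {}
    for ndpi in sorted(Path(base_dir).rglob("*.ndpi")):
        # Keep only slides whose path contains every required substring.
        if not all(token in str(ndpi) for token in required):
            continue
        # Assumed matching rule: same directory, filename starts with the
        # slide's basename (covers extra suffixes such as ".edited").
        candidates = [
            p for p in ndpi.parent.glob("*.ndpa")
            if p.name.startswith(ndpi.stem)
        ]
        if candidates:
            # Several annotation files may exist; keep the newest one.
            newest = max(candidates, key=lambda p: p.stat().st_mtime)
            pairs[str(ndpi)] = str(newest)
    return pairs
```

The returned dictionary can be written with `json.dump` to produce a file shaped like ndpi_ndpa_pairs.json.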
Usage:
python 01_get_ndpa_pairs.py
Purpose: Extracts cropped tissue regions from NDPI slides based on NDPA polygon annotations.
What it does:
- Reads NDPA XML files containing free-form polygon annotations
- Parses annotation coordinates and converts them to pixel coordinates
- Extracts a cropped image for each of the annotated regions
- Saves crops as WebP images (80% quality) for efficient storage
- Dynamically handles coordinate transformations and slide offsets based on the slide metadata
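The coordinate handling described above can be illustrated with a short sketch. It assumes the commonly seen NDPA layout (each annotation holds a `pointlist` of `point` elements with `x` and `y` children) and that coordinates are in nanometres relative to the slide centre. The offsets and resolution are passed in as plain numbers; in practice they would likely come from slide metadata such as the OpenSlide properties `hamamatsu.XOffsetFromSlideCentre`, `hamamatsu.YOffsetFromSlideCentre`, `openslide.mpp-x` and `openslide.mpp-y`. The function names are hypothetical and the actual script may differ.

```python
import xml.etree.ElementTree as ET


def ndpa_polygons_nm(ndpa_xml):
    """Return one list of (x_nm, y_nm) points per annotation in an NDPA string."""
    root = ET.fromstring(ndpa_xml)
    polygons = []
    for pointlist in root.iter("pointlist"):
        points = [
            (int(p.findtext("x")), int(p.findtext("y")))
            for p in pointlist.findall("point")
        ]
        if points:
            polygons.append(points)
    return polygons


def nm_to_pixel(x_nm, y_nm, *, x_offset_nm, y_offset_nm,
                mpp_x, mpp_y, width_px, height_px):
    """Convert slide-centre-relative nanometres to level-0 pixel coordinates."""
    # Microns per pixel times 1000 gives nanometres per pixel.
    px = (x_nm - x_offset_nm) / (mpp_x * 1000) + width_px / 2
    py = (y_nm - y_offset_nm) / (mpp_y * 1000) + height_px / 2
    return int(round(px)), int(round(py))


def bounding_box(points_px):
    """Return (left, top, width, height) enclosing a polygon in pixels."""
    xs, ys = zip(*points_px)
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)
```

The bounding box gives the location and size of the region to read from the slide before saving the crop.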
Usage:
# Batch mode (recommended)
python 02_extract_cuts_from_slide_scan.py \
--pairs Output/ndpi_ndpa_pairs.json \
--out Output/ndpi_crops \
--verbose
Output: Directory structure with one subdirectory per slide, each containing WebP crops
Purpose: Associates extracted tissue crops with copy number event data and patient metadata.
What it does:
- Reads copy number events spreadsheet (contains sample IDs and CN events)
- Reads master section details spreadsheet (contains patient metadata)
- Scans the cropped images directory to find available tissue samples
- Categorizes samples into the following groups:
  - "tumour": Tumour samples
  - "1q": Samples with 1q gain events
  - "16q": Samples with 16q loss events
  - "1q or 16q": Samples with 1q gain or 16q loss events
  - "no CN": Samples with no copy number events
- Associates each sample with its metadata and image paths
- Creates a hierarchical JSON structure for downstream analysis
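One plausible reading of the grouping logic is sketched below. The event labels `"1q gain"` and `"16q loss"`, the function name `groups_for_sample`, and the choice to treat "tumour" as exclusive of the CN groups are all assumptions for illustration; the actual strings in the Event column and the rules in 03_map_slide_to_CN.py may differ.

```python
GAIN_1Q = "1q gain"    # hypothetical label for a 1q gain event
LOSS_16Q = "16q loss"  # hypothetical label for a 16q loss event


def groups_for_sample(events, is_tumour=False):
    """Return the set of group names one sample belongs to."""
    if is_tumour:
        return {"tumour"}
    # Normalise whitespace and case, and drop empty entries.
    events = {e.strip().lower() for e in events if e and e.strip()}
    if not events:
        return {"no CN"}
    groups = set()
    if GAIN_1Q in events:
        groups.add("1q")
    if LOSS_16Q in events:
        groups.add("16q")
    if groups:
        groups.add("1q or 16q")
    # Samples with only other CN events fall into none of the listed groups.
    return groups
```

Returning a set lets a sample appear in more than one group, which is what the overlapping "1q" and "1q or 16q" categories suggest.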
Input Requirements:
- Copy number events spreadsheet: Must contain Sample and Event columns
- Master section details spreadsheet: Must contain sampleID, Patient_Category, PD_ID, Slide_Description, Tissue_Type
- Cropped images directory: WebP files from script 2
Output Structure:
{
"tumour": {
"<sampleID>": {
"paths": ["path/to/image1.webp", "path/to/image2.webp"],
"Patient_Category": "...",
"PD_ID": "...",
"Slide_Description": "...",
"Tissue_Type": "..."
}
},
"normal": {
"<sampleID>": { ... }
}
}
Usage:
python 03_map_slide_to_CN.py \
--cn Data/copy_number_events_GRCh38_211108.xlsx \
--master Data/Breast_LCM_Section_Details_210323.xlsx \
--slides Output/ndpi_crops \
--out_dir Output/
To run the entire pipeline:
# Step 1: Pair NDPI slides with annotations
python 01_get_ndpa_pairs.py
# Step 2: Extract tissue crops
python 02_extract_cuts_from_slide_scan.py \
--pairs Output/ndpi_ndpa_pairs.json \
--out Output/ndpi_crops
# Step 3: Map to copy number events
python 03_map_slide_to_CN.py \
--cn Data/copy_number_events_GRCh38_211108.xlsx \
--master Data/Breast_LCM_Section_Details_210323.xlsx \
--slides Output/ndpi_crops \
--out_dir Output/
Requirements:
- Python 3.x
- pandas: For spreadsheet processing
- openpyxl: For Excel file reading
- openslide: For NDPI slide reading
- Pillow (PIL): For image processing and WebP output
- pathlib: For file path handling (part of the Python standard library)
The project expects the following directory structure:
├── Data/
│ ├── copy_number_events_GRCh38_211108.xlsx
│ └── Breast_LCM_Section_Details_210323.xlsx
├── Output/
│ ├── ndpi_ndpa_pairs.json # Generated by script 1
│ ├── ndpi_crops/ # Generated by script 2
│ ├── tumour_vs_normal.json # Generated by script 3
│ ├── normal_CN_vs_noCN.json # Generated by script 3
│ ├── normal_1q_vs_noCN.json # Generated by script 3
│ └── normal_1q_or_16q_loss_vs_noCN.json # Generated by script 3
Notes:
- The pipeline is designed for breast epithelium slides
- Sample IDs must match between the spreadsheets and image filenames
- The system automatically handles filename variations and coordinate transformations
- Output images are optimized for storage efficiency using WebP format
- The final JSON mapping can be used for training machine learning models to classify 1q gain events
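As a sketch of how the JSON mapping could feed a training script, the snippet below flattens the documented structure (label, then sample ID, then a record with a "paths" list) into rows. The function name `labelled_paths` is hypothetical and not part of the pipeline.

```python
import json


def labelled_paths(mapping):
    """Flatten {label: {sampleID: {"paths": [...]}}} into (path, label, sampleID) rows."""
    rows = []
    for label, samples in mapping.items():
        for sample_id, record in samples.items():
            for path in record.get("paths", []):
                rows.append((path, label, sample_id))
    return rows


def load_labelled_paths(json_path):
    """Read one of the script 3 JSON outputs and return its flattened rows."""
    with open(json_path, encoding="utf-8") as fh:
        return labelled_paths(json.load(fh))
```

For example, `load_labelled_paths("Output/tumour_vs_normal.json")` would return one row per crop. Keeping the sample ID in each row makes it possible to split training and test data by sample rather than by image, so that crops from the same sample do not land on both sides of the split.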
Using the provided pipeline, we trained a UNI ViT-L/16 classifier for tumour vs normal on breast LCM crops and achieved:
Test F1 (normal): 0.979
Command used:
python 04_train_UNI_on_CN.py \
--json Output/tumour_vs_normal.json \
--root Output/ndpi_crops \
--out_dir Output/ClassificationTraining/uni_tumour_vs_normal \
--epochs 20 --batch 24 --patience 10
Run summary:
- Best validation F1 (normal): 0.986
- Test F1 (macro): 0.979
- Epochs trained: 20
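For readers unfamiliar with the metrics in the run summary, here is a minimal sketch using the standard definitions: per-class F1 is 2TP / (2TP + FP + FN), and macro F1 is the unweighted mean of the per-class scores. Whether 04_train_UNI_on_CN.py computes them in exactly this way is an assumption, and the function names are hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} for every label seen in the truth or the predictions."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)
```

Macro averaging gives each class equal weight regardless of how many samples it has, which matters when the classes are imbalanced.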
We trained a UNI ViT-L/16 classifier to distinguish normal samples with 1q gain from those with no CN event.
Test F1 (normal_1q): 0.993
Command used:
python 04_train_UNI_on_CN.py \
--json Output/normal_1q_vs_noCN.json \
--root Output/ndpi_crops \
--out_dir Output/ClassificationTraining/uni_1q_v_noCN \
--epochs 30 --batch 24 --patience 10
Run summary:
- Best validation F1 (normal_1q): 0.997
- Test F1 (normal_1q): 0.993
- Test F1 (macro): 0.993
- Epochs trained: 29 (early stopping)
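The early stopping noted above is typically driven by a patience counter. The sketch below shows one common form, assuming the monitored quantity is a validation score where higher is better; the class name `EarlyStopping` and the exact rule used by the training script are assumptions.

```python
class EarlyStopping:
    """Signal a stop once the score has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's validation score; return True when training should stop."""
        if score > self.best:
            # New best score: remember it and reset the counter.
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Under this rule, a run with `--patience 10` that stops after epoch 29 would have seen its best validation score at epoch 19.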
