An Explainable, Language-Guided Framework for Open-Set Temporal Localization in Endoscopic Videos

Python 3.9+ PyTorch Transformers License: MIT

This repository contains the official implementation for the dissertation titled: β€œAn Explainable, Language-Guided Framework for Open-Set Temporal Localization on Endoscopic Videos” by Soheil Jafarifard Bidgoli (MSc Computer Science, Aston University).

The project introduces a novel framework for localizing arbitrary, language-described events in long-form surgical videos, moving beyond the limitations of traditional closed-set recognition models.

πŸš€ Overview

Surgical and endoscopic procedures generate vast amounts of video data. Clinicians often need to find specific moments, but traditional AI models can only recognize a fixed, predefined set of events (e.g., "Phase 1," "Phase 2"). This framework breaks that limitation by enabling open-vocabulary temporal localization. Users can query long, untrimmed videos with free-form natural language to find relevant events (e.g., β€œfind when the grasper retracts the gallbladder”).

Our framework is built on three pillars:

  1. Open-Vocabulary Localization: Leverages powerful vision-language models to understand and locate events described by arbitrary text queries, not just fixed labels.
  2. Architectural Scalability: Employs a hybrid Transformer and Structured State Space Model (SSM) architecture to efficiently process long-form surgical videos, overcoming the quadratic complexity of traditional attention mechanisms.
  3. Trustworthiness & Explainability: Integrates Evidential Deep Learning to quantify model uncertainty and provides visual attention maps to explain its predictions, fostering clinical trust and safety.

Visualizing Temporal Localization

The following demonstration shows the framework successfully localizing the "Calot triangle dissection phase" in a 40-minute untrimmed procedure based solely on a natural language query.

[Surgical Localization Demo] Query: "Calot triangle dissection phase" | Peak Confidence: 99.9%

✨ Key Features

  • End-to-End Open-Vocabulary TAL: A complete pipeline from data preprocessing to language-guided inference for surgical video analysis.
  • Flexible Vision Backbones: Supports both a powerful MΒ²CRL pretrained video transformer and a highly efficient EndoMamba (SSM) backbone for long-sequence modeling.
  • Parameter-Efficient Fine-Tuning (PEFT): Uses Low-Rank Adaptation (LoRA) to efficiently adapt a pretrained CLIP text encoder to the surgical domain with minimal computational cost (see the sketch after this list).
  • Advanced Temporal Modeling: Features a state-of-the-art Mamba-based Temporal Head that scales linearly with sequence length, making it ideal for hour-long procedural videos.
  • Bi-Level Consistency Loss: A novel training objective that enforces temporal consistency at both the semantic and spatial levels using optical flow (RAFT) to regularize the model.
  • Uncertainty Quantification: Implements Evidential Deep Learning (EDL) to allow the model to express its own confidence, reliably identifying out-of-distribution or ambiguous events.
  • Built-in Explainability (XAI): Generates cross-modal attention maps to visualize which parts of a frame the model focused on to make its decision, a critical feature for clinical validation.
  • Comprehensive Baseline Suite: Includes code and instructions to benchmark against canonical baselines like CLIP, X-CLIP, Moment-DETR, and TeCNO.
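
For readers unfamiliar with LoRA, the adapter setup can be sketched with the HuggingFace transformers and peft libraries; the rank, scaling, and target-module names below are illustrative assumptions, not the repository's exact configuration.

```python
# A minimal sketch of LoRA adaptation of a CLIP text encoder, assuming the
# HuggingFace `transformers` and `peft` packages; rank/alpha values and target
# module names are illustrative, not the repository's exact settings.
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from peft import LoraConfig, get_peft_model

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch16")

# Inject low-rank adapters into the attention projections; the frozen CLIP
# weights stay intact and only the small LoRA matrices are trained.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"], bias="none",
)
text_encoder = get_peft_model(text_encoder, lora_config)
text_encoder.print_trainable_parameters()  # typically <1% of total parameters

tokens = tokenizer(["the grasper retracts the gallbladder"],
                   padding=True, return_tensors="pt")
with torch.no_grad():
    text_features = text_encoder(**tokens).pooler_output  # (1, 512) query embedding
```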

πŸ—οΈ Framework Architecture

The system is a multi-stage pipeline designed to process spatial, semantic, and temporal information through specialized components:

  1. Vision Backbone: A pretrained MΒ²CRL model extracts a grid of powerful visual feature vectors from each video frame.
  2. Text Encoder: A LoRA-adapted CLIP text encoder processes the natural language query into a semantic feature vector.
  3. Language-Guided Fusion Head: A cross-modal transformer uses attention to fuse the visual and textual features. It identifies relevant spatial regions in the frame corresponding to the query, outputting initial raw_scores, intermediate features for the consistency loss, and attention_weights for XAI.
  4. Temporal Head (SSM/Mamba): This highly efficient head analyzes the sequence of fused features from the entire clip. It models long-range context to smooth predictions and fill gaps, producing final, contextually-aware refined_scores.
  5. Uncertainty & Prediction Head: In its SOTA configuration, this head uses Evidential Deep Learning to output not just a final score but also the parameters of a Beta distribution (evidential_output), allowing for robust uncertainty quantification.

πŸ“‚ Repository Structure

Language-Guided-Endoscopy-Localization/
β”‚
β”œβ”€β”€ backbone/                         # Vision backbones
β”‚   β”œβ”€β”€ endomamba.py                  # EndoMamba (SSM-based backbone)
β”‚   └── vision_transformer.py         # ViT-based backbone (MΒ²CRL, etc.)
β”‚
β”œβ”€β”€ checkpoints/                      # Saved checkpoints and logs
β”‚
β”œβ”€β”€ comparison_models/                # Baseline and benchmark models
β”‚   β”œβ”€β”€ clip_baseline/
β”‚   β”‚   └── clip_baseline.py          # CLIP zero-shot / linear probe
β”‚   β”‚
β”‚   β”œβ”€β”€ Moment-DETR/                  # Moment-DETR temporal grounding
β”‚   β”‚   β”œβ”€β”€ run_evaluation.py
β”‚   β”‚   β”œβ”€β”€ run_feature_extraction.py
β”‚   β”‚   β”œβ”€β”€ run_preprocessing.py
β”‚   β”‚   β”œβ”€β”€ run_training.py
β”‚   β”‚   └── moment_detr_module/
β”‚   β”‚       β”œβ”€β”€ __init__.py
β”‚   β”‚       β”œβ”€β”€ configs.py
β”‚   β”‚       β”œβ”€β”€ dataset.py
β”‚   β”‚       β”œβ”€β”€ engine.py
β”‚   β”‚       β”œβ”€β”€ loss.py
β”‚   β”‚       β”œβ”€β”€ matcher.py
β”‚   β”‚       β”œβ”€β”€ modeling.py
β”‚   β”‚       β”œβ”€β”€ position_encoding.py
β”‚   β”‚       β”œβ”€β”€ transformer.py
β”‚   β”‚       β”œβ”€β”€ utils.py
β”‚   β”‚       └── README.md
β”‚   β”‚
β”‚   └── xclip_baseline/               # X-CLIP video-language baseline
β”‚       β”œβ”€β”€ train_xclip.py
β”‚       β”œβ”€β”€ eval_xclip.py
β”‚       β”œβ”€β”€ infer_xclip.py
β”‚       β”œβ”€β”€ requirements.txt
β”‚       β”œβ”€β”€ project_config.py
β”‚       β”œβ”€β”€ README_XCLIP.md
β”‚       └── xclip_package/
β”‚           └── xclip/
β”‚               β”œβ”€β”€ __init__.py
β”‚               β”œβ”€β”€ data.py
β”‚               β”œβ”€β”€ losses.py
β”‚               β”œβ”€β”€ metrics.py
β”‚               β”œβ”€β”€ model.py
β”‚               └── utils.py
β”‚
β”œβ”€β”€ dataset_preprocessing/            # Preprocessing for Cholec80 dataset
β”‚   β”œβ”€β”€ create_splits.py
β”‚   β”œβ”€β”€ extract_cholec80_frames.py
β”‚   └── prepare_cholec80.py
β”‚
β”œβ”€β”€ pretrained/                       # Pretrained model weights
β”‚   └── checkpoint.pth
β”‚
β”œβ”€β”€ dataset.py                        # Dataset wrapper
β”œβ”€β”€ inference.py                      # Inference script (language-guided)
β”œβ”€β”€ models.py                         # Main model components
β”œβ”€β”€ project_config.py                 # Config file for project settings
β”œβ”€β”€ train.py                          # Training entry point
β”‚
β”œβ”€β”€ README.md                         # Project documentation
└── .gitignore

πŸ§‘β€βš•οΈ Dataset: Cholec80

This framework is developed and evaluated on the Cholec80 dataset, which contains 80 videos of laparoscopic cholecystectomy procedures. Our preprocessing pipeline transforms this dataset into a format suitable for open-vocabulary learning.

Preprocessing Pipeline

  1. Frame Extraction: Videos are decoded into individual frames at a specified sampling rate.
    python dataset_preprocessing/extract_cholec80_frames.py --cholec80_videos_dir /path/to/videos --output_frames_dir /path/to/frames
  2. Create Data Splits: The 80 videos are randomly partitioned into training, validation, and test sets to ensure fair evaluation.
    python dataset_preprocessing/create_splits.py --video_dir /path/to/videos
  3. Generate Language Triplets: The core preprocessing step. This script reads the official phase and tool annotations and generates a CSV file of (frame_path, text_query, relevance_label) triplets. This creates positive and negative examples for training the vision-language alignment.
    # Run for each split
    python dataset_preprocessing/prepare_cholec80.py --split train
    python dataset_preprocessing/prepare_cholec80.py --split val
    python dataset_preprocessing/prepare_cholec80.py --split test
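
The generated triplet files are plain tabular data and can be inspected directly; the sketch below assumes a pandas workflow, and the file name shown is illustrative.

```python
# A minimal sketch of reading the generated triplets with pandas; the file name
# and exact column names are assumptions based on the description above.
import pandas as pd

triplets = pd.read_csv("train_triplets.csv")
# Expected layout, one row per (frame, query) pair:
#   frame_path                        text_query                          relevance_label
#   frames/video01/frame_000123.jpg   "calot triangle dissection phase"   1
#   frames/video01/frame_000123.jpg   "gallbladder packaging phase"       0
positives = triplets[triplets["relevance_label"] == 1]
print(f"{len(triplets)} triplets, {len(positives)} positive")
```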

βš™οΈ Setup and Usage

Installation

  1. Clone the repository:

    git clone https://github.com/soheil-jafari/language-guided-endoscopy-localization.git
    cd language-guided-endoscopy-localization
  2. Create a Python environment and install dependencies. We recommend using Conda.

    conda create -n endo-tal python=3.9 -y
    conda activate endo-tal
    pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118
    pip install -r requirements.txt 
  3. Configuration: Before running any scripts, review and update the paths in project_config.py to match your system's directory structure.

Training

The main training script train.py handles model training with support for various configurations controlled by project_config.py and command-line arguments.

  • To train the model from scratch or fine-tune:
    python train.py
  • To fine-tune from an existing checkpoint:
    python train.py --finetune_from /path/to/your/checkpoint.pth
  • To run in debug mode on a small data subset:
    python train.py --debug
  • To adjust the training subset size (e.g., 50% of the data):
    python train.py --subset 0.5

Inference

Use inference.py to run a trained model on a video to localize a specific language query.

python inference.py \
    --video_path /path/to/your/video.mp4 \
    --text_query "a grasper is present" \
    --checkpoint_path /path/to/your/best_model.pth
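
The model produces per-frame relevance scores; if you need discrete temporal segments, a simple post-processing step is thresholding and merging contiguous runs, as in the sketch below (the threshold value and helper function are illustrative, not part of inference.py).

```python
# A minimal post-processing sketch: turn per-frame relevance scores into
# (start, end) segments by thresholding and merging contiguous runs. The
# threshold and the example scores are illustrative; inference.py's actual
# output format may differ.
import numpy as np

def scores_to_segments(scores, fps=1.0, threshold=0.5):
    """Return a list of (start_sec, end_sec) intervals where scores exceed threshold."""
    above = np.asarray(scores) >= threshold
    segments, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(above) / fps))
    return segments

print(scores_to_segments([0.1, 0.7, 0.9, 0.8, 0.2, 0.6], fps=1.0))  # [(1.0, 4.0), (5.0, 6.0)]
```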

πŸ“ˆ Performance & Key Results

The framework was evaluated on the held-out Cholec80 test split (8 full-length videos). Although computational constraints limited training to only 4 epochs, the model demonstrated state-of-the-art potential in frame-level discrimination.

Quantitative Benchmarking

Our framework significantly outperforms general-domain baselines when applied to the specialized surgical environment.

| Model | AUROC (Discrimination) | AUPRC (Precision-Recall) | Best F1-Score |
|---|---|---|---|
| Proposed Framework (4 Epochs) | 0.933 | 0.887 | 0.820 |
| CLIP Baseline (Linear Probe) | 0.52 | 0.08 | 0.10 |
| X-CLIP Baseline | 0.50 | 0.07 | 0.12 |
| Moment-DETR | 0.51 | 0.06 | 0.11 |

Note: Baselines were trained and evaluated under identical conditions using the same Cholec80 splits.
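
These frame-level metrics can be reproduced from per-frame labels and scores with scikit-learn; the sketch below uses illustrative placeholder arrays in place of the actual test-set predictions.

```python
# A minimal sketch of the frame-level metrics reported above, assuming
# scikit-learn; y_true/y_score are illustrative placeholders for per-frame
# ground-truth labels and model relevance scores.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.7, 0.3, 0.6, 0.2])

auroc = roc_auc_score(y_true, y_score)                 # discrimination
auprc = average_precision_score(y_true, y_score)       # precision-recall summary
precision, recall, _ = precision_recall_curve(y_true, y_score)
best_f1 = np.max(2 * precision * recall / np.maximum(precision + recall, 1e-8))
print(f"AUROC={auroc:.3f}  AUPRC={auprc:.3f}  Best F1={best_f1:.3f}")
```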

Visualizing Model Accuracy

The following Precision-Recall curve demonstrates the framework's ability to maintain high precision across various recall levels, achieving an AUPRC of 0.887.

Precision-Recall Curve

Qualitative Success

In a focused case study on the "Calot triangle dissection phase" (Video 05), the model successfully identified the correct temporal neighborhood, achieving a peak confidence score of 0.9994.

Temporal Localization Analysis

The timeline below illustrates the model's activation scores across Video 05. Note the distinct probability peak aligned with the "Calot triangle dissection phase" ground truth.

Temporal Localization Timeline

πŸ“Š Baselines and Comparisons

This repository includes the necessary code and instructions to benchmark our framework against three key families of models:

  • General Vision-Language Models (open-set, text-driven evaluation):
    • CLIP: Zero-shot and linear-probe per-frame relevance scoring.
    • X-CLIP: A powerful video-language model for scoring short clips.
  • Temporal Grounding Models (direct localization of events from text):
    • Moment-DETR: Predicts start/end boundaries from a language query.
  • Surgical Specialist Models (closed-set baselines trained specifically for Cholec80):
    • TeCNO: A temporal convolutional network for surgical phase recognition.

The code for these baselines can be found in the comparison_models/ directory. Each subfolder contains a README with specific instructions for running that model.

πŸ” Explainability & Trustworthiness

In high-stakes clinical environments, "black-box" predictions are insufficient. This framework builds in trustworthiness by design through two key mechanisms:

1. Uncertainty Quantification (EDL)

The system employs Evidential Deep Learning (EDL) to predict the parameters of a Beta distribution ($\alpha, \beta$) for every frame.

  • Inverse Uncertainty: The total evidence ($S_t = \alpha_t + \beta_t$) provides an explicit measure of model confidence.
  • Safety: This allows the system to flag ambiguous or out-of-distribution events that require human surgeon review.
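
In the standard binary EDL formulation, both the predicted relevance and its uncertainty follow directly from the Beta parameters; the sketch below uses illustrative tensors in place of the model's evidential_output.

```python
# A minimal sketch of deriving relevance and uncertainty from predicted Beta
# parameters in a standard binary EDL formulation; the tensors here are
# illustrative placeholders for the model's evidential_output.
import torch

alpha = torch.tensor([45.0, 1.3, 12.0])   # evidence for "relevant"
beta  = torch.tensor([2.0, 1.1, 30.0])    # evidence for "not relevant"

S = alpha + beta                  # total evidence per frame
p_relevant = alpha / S            # expected relevance probability
uncertainty = 2.0 / S             # vacuity: high when little evidence is available

for p, u in zip(p_relevant, uncertainty):
    flag = "REVIEW" if u > 0.5 else "ok"
    print(f"p={p:.3f}  uncertainty={u:.3f}  [{flag}]")
```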

2. Visual Rationales

The framework generates Cross-Modal Attention Maps to visualize the model's reasoning.

  • Focus: These heatmaps highlight the specific surgical tools (e.g., clip applier, grasper) or anatomical structures the model prioritized during a query.
  • Validation: This provides clinicians with interpretable evidence that aligns AI predictions with familiar visual cues.
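
As an illustration, per-patch attention weights can be reshaped to the backbone's patch grid, upsampled, and overlaid on the frame with matplotlib; the 14x14 grid, frame size, and plotting choices below are assumptions, not the repository's exact visualization code.

```python
# A minimal visualization sketch: reshape per-patch attention weights into the
# backbone's patch grid, upsample to the frame resolution, and overlay as a
# heatmap. The 14x14 grid, frame size, and alpha value are illustrative.
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F

frame = np.random.rand(224, 224, 3)            # placeholder for an RGB frame
attention_weights = torch.rand(196)            # per-patch weights for one frame (14*14)

grid = attention_weights.reshape(1, 1, 14, 14)
heatmap = F.interpolate(grid, size=frame.shape[:2], mode="bilinear", align_corners=False)
heatmap = heatmap.squeeze().numpy()

plt.imshow(frame)
plt.imshow(heatmap, cmap="jet", alpha=0.4)     # semi-transparent attention overlay
plt.axis("off")
plt.title('Cross-modal attention for query: "a grasper is present"')
plt.savefig("attention_overlay.png", bbox_inches="tight")
```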

πŸ“š Citation

If you use this framework or ideas from our work in your research, please cite the following dissertation:

@mastersthesis{jafarifard2025,
  title={An Explainable, Language-Guided Framework for Open-Set Temporal Localization on Endoscopic Videos},
  author={Soheil Jafarifard Bidgoli},
  school={Aston University},
  year={2025}
}

🀝 Acknowledgements

This work was completed as part of the CS4700 Dissertation for the MSc in Computer Science at Aston University.

Supervisor: Dr. Zhuangzhuang Dai.

This project builds upon the foundational work of the Cholec80 dataset creators and the authors of MΒ²CRL, VideoMamba, CLIP, Moment-DETR, and other referenced works.
