This repository contains the official implementation for the dissertation titled "An Explainable, Language-Guided Framework for Open-Set Temporal Localization on Endoscopic Videos" by Soheil Jafarifard Bidgoli (MSc Computer Science, Aston University).
The project introduces a novel framework for localizing arbitrary, language-described events in long-form surgical videos, moving beyond the limitations of traditional closed-set recognition models.
Surgical and endoscopic procedures generate vast amounts of video data. Clinicians often need to find specific moments, but traditional AI models can only recognize a fixed, predefined set of events (e.g., "Phase 1," "Phase 2"). This framework breaks that limitation by enabling open-vocabulary temporal localization. Users can query long, untrimmed videos with free-form natural language to find relevant events (e.g., "find when the grasper retracts the gallbladder").
Our framework is built on three pillars:
- Open-Vocabulary Localization: Leverages powerful vision-language models to understand and locate events described by arbitrary text queries, not just fixed labels.
- Architectural Scalability: Employs a hybrid Transformer and Structured State Space Model (SSM) architecture to efficiently process long-form surgical videos, overcoming the quadratic complexity of traditional attention mechanisms.
- Trustworthiness & Explainability: Integrates Evidential Deep Learning to quantify model uncertainty and provides visual attention maps to explain its predictions, fostering clinical trust and safety.
The following demonstration shows the framework successfully localizing the "Calot triangle dissection phase" in a 40-minute untrimmed procedure based solely on a natural language query.
Query: "Calot triangle dissection phase" | Peak Confidence: 99.9%
- End-to-End Open-Vocabulary TAL: A complete pipeline from data preprocessing to language-guided inference for surgical video analysis.
- Flexible Vision Backbones: Supports both a powerful M²CRL pretrained video transformer and a highly efficient EndoMamba (SSM) backbone for long-sequence modeling.
- Parameter-Efficient Fine-Tuning (PEFT): Uses Low-Rank Adaptation (LoRA) to efficiently adapt a pretrained CLIP text encoder to the surgical domain with minimal computational cost (see the sketch after this list).
- Advanced Temporal Modeling: Features a state-of-the-art Mamba-based Temporal Head that scales linearly with sequence length, making it ideal for hour-long procedural videos.
- Bi-Level Consistency Loss: A novel training objective that enforces temporal consistency at both the semantic and spatial levels using optical flow (RAFT) to regularize the model.
- Uncertainty Quantification: Implements Evidential Deep Learning (EDL) to allow the model to express its own confidence, reliably identifying out-of-distribution or ambiguous events.
- Built-in Explainability (XAI): Generates cross-modal attention maps to visualize which parts of a frame the model focused on to make its decision, a critical feature for clinical validation.
- Comprehensive Baseline Suite: Includes code and instructions to benchmark against canonical baselines like CLIP, X-CLIP, Moment-DETR, and TeCNO.
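To make the LoRA adaptation mentioned above concrete, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer. The rank, scaling, and the module names in the commented usage example are illustrative assumptions, not the exact configuration used in this repository:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage: wrap the query/value projections of a CLIP-style text encoder.
# `text_encoder` and the attribute names below are placeholders for illustration.
# for block in text_encoder.transformer.layers:
#     block.attn.q_proj = LoRALinear(block.attn.q_proj, rank=8)
#     block.attn.v_proj = LoRALinear(block.attn.v_proj, rank=8)
```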
The system is a multi-stage pipeline designed to process spatial, semantic, and temporal information through specialized components:
- Vision Backbone: A pretrained M²CRL model extracts a grid of powerful visual feature vectors from each video frame.
- Text Encoder: A LoRA-adapted CLIP text encoder processes the natural language query into a semantic feature vector.
- Language-Guided Fusion Head: A cross-modal transformer uses attention to fuse the visual and textual features. It identifies relevant spatial regions in the frame corresponding to the query, outputting initial raw_scores, intermediate features for the consistency loss, and attention_weights for XAI.
- Temporal Head (SSM/Mamba): This highly efficient head analyzes the sequence of fused features from the entire clip. It models long-range context to smooth predictions and fill gaps, producing final, contextually aware refined_scores.
- Uncertainty & Prediction Head: In its SOTA configuration, this head uses Evidential Deep Learning to output not just a final score but also the parameters of a Beta distribution (evidential_output), allowing for robust uncertainty quantification.
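The data flow through these components can be summarized in a schematic forward pass. All module names and tensor shapes below are placeholders for illustration, not the actual classes in models.py:

```python
import torch

def localize(frames, query_tokens, vision_backbone, text_encoder,
             fusion_head, temporal_head, evidential_head):
    """Schematic forward pass over a clip of T frames (shapes are illustrative)."""
    # 1. Vision backbone: per-frame grid of visual tokens, e.g. (T, N_patches, D)
    visual_tokens = vision_backbone(frames)

    # 2. LoRA-adapted text encoder: one semantic vector for the query, e.g. (1, D)
    text_feat = text_encoder(query_tokens)

    # 3. Language-guided fusion: cross-attention between the query and patch tokens
    raw_scores, fused_feats, attention_weights = fusion_head(visual_tokens, text_feat)

    # 4. Temporal head (Mamba/SSM): long-range smoothing over the T fused features
    refined_scores = temporal_head(fused_feats)

    # 5. Evidential head: Beta-distribution parameters per frame for uncertainty
    alpha, beta = evidential_head(fused_feats)
    confidence = alpha / (alpha + beta)

    return refined_scores, confidence, attention_weights
```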
Language-Guided-Endoscopy-Localization/
│
├── backbone/                       # Vision backbones
│   ├── endomamba.py                # EndoMamba (SSM-based backbone)
│   └── vision_transformer.py       # ViT-based backbone (M²CRL, etc.)
│
├── checkpoints/                    # Saved checkpoints and logs
│
├── comparison_models/              # Baseline and benchmark models
│   ├── clip_baseline/
│   │   └── clip_baseline.py        # CLIP zero-shot / linear probe
│   │
│   ├── Moment-DETR/                # Moment-DETR temporal grounding
│   │   ├── run_evaluation.py
│   │   ├── run_feature_extraction.py
│   │   ├── run_preprocessing.py
│   │   ├── run_training.py
│   │   └── moment_detr_module/
│   │       ├── __init__.py
│   │       ├── configs.py
│   │       ├── dataset.py
│   │       ├── engine.py
│   │       ├── loss.py
│   │       ├── matcher.py
│   │       ├── modeling.py
│   │       ├── position_encoding.py
│   │       ├── transformer.py
│   │       ├── utils.py
│   │       └── README.md
│   │
│   └── xclip_baseline/             # X-CLIP video-language baseline
│       ├── train_xclip.py
│       ├── eval_xclip.py
│       ├── infer_xclip.py
│       ├── requirements.txt
│       ├── project_config.py
│       ├── README_XCLIP.md
│       └── xclip_package/
│           └── xclip/
│               ├── __init__.py
│               ├── data.py
│               ├── losses.py
│               ├── metrics.py
│               ├── model.py
│               └── utils.py
│
├── dataset_preprocessing/          # Preprocessing for Cholec80 dataset
│   ├── create_splits.py
│   ├── extract_cholec80_frames.py
│   └── prepare_cholec80.py
│
├── pretrained/                     # Pretrained model weights
│   └── checkpoint.pth
│
├── dataset.py                      # Dataset wrapper
├── inference.py                    # Inference script (language-guided)
├── models.py                       # Main model components
├── project_config.py               # Config file for project settings
├── train.py                        # Training entry point
│
├── README.md                       # Project documentation
└── .gitignore
This framework is developed and evaluated on the Cholec80 dataset, which contains 80 videos of laparoscopic cholecystectomy procedures. Our preprocessing pipeline transforms this dataset into a format suitable for open-vocabulary learning.
- Frame Extraction: Videos are decoded into individual frames at a specified sampling rate.
python dataset_preprocessing/extract_cholec80_frames.py --cholec80_videos_dir /path/to/videos --output_frames_dir /path/to/frames
- Create Data Splits: The 80 videos are randomly partitioned into training, validation, and test sets to ensure fair evaluation.
python dataset_preprocessing/create_splits.py --video_dir /path/to/videos
- Generate Language Triplets: The core preprocessing step. This script reads the official phase and tool annotations and generates a CSV file of (frame_path, text_query, relevance_label) triplets, creating positive and negative examples for training the vision-language alignment.
# Run for each split
python dataset_preprocessing/prepare_cholec80.py --split train
python dataset_preprocessing/prepare_cholec80.py --split val
python dataset_preprocessing/prepare_cholec80.py --split test
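The resulting CSV can be consumed by a straightforward PyTorch dataset. The sketch below assumes only the three columns named above; the actual dataset.py wrapper in this repository may differ:

```python
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class TripletDataset(Dataset):
    """Reads (frame_path, text_query, relevance_label) rows produced by prepare_cholec80.py."""
    def __init__(self, csv_path, transform=None):
        self.rows = pd.read_csv(csv_path)
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows.iloc[idx]
        image = Image.open(row["frame_path"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        label = torch.tensor(float(row["relevance_label"]))
        return image, row["text_query"], label
```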
- Clone the repository:
git clone https://github.com/soheil-jafari/language-guided-endoscopy-localization.git
cd language-guided-endoscopy-localization
- Create a Python environment and install dependencies. We recommend using Conda.
conda create -n endo-tal python=3.9 -y
conda activate endo-tal
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
- Configuration: Before running any scripts, review and update the paths in project_config.py to match your system's directory structure.
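For orientation only, a minimal project_config.py might expose paths along the following lines; the variable names and directories are hypothetical and the actual keys in this repository may differ:

```python
# Hypothetical structure of project_config.py -- adjust to your own directories.
from pathlib import Path

CHOLEC80_VIDEOS_DIR = Path("/data/cholec80/videos")      # raw procedure videos
CHOLEC80_FRAMES_DIR = Path("/data/cholec80/frames")      # output of frame extraction
SPLITS_DIR = Path("/data/cholec80/splits")               # train/val/test video lists
TRIPLETS_DIR = Path("/data/cholec80/triplets")           # CSVs from prepare_cholec80.py
PRETRAINED_CHECKPOINT = Path("pretrained/checkpoint.pth")
CHECKPOINTS_DIR = Path("checkpoints")                    # training outputs and logs
```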
The main training script, train.py, supports a range of configurations controlled by project_config.py and command-line arguments.
- To train the model from scratch or fine-tune:
python train.py
- To fine-tune from an existing checkpoint:
python train.py --finetune_from /path/to/your/checkpoint.pth
- To run in debug mode on a small data subset:
python train.py --debug
- To adjust the training subset size (e.g., 50% of the data):
python train.py --subset 0.5
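For reference, the flags documented above map onto a command-line interface roughly like the sketch below; this is a hedged illustration of how train.py might parse them, not its exact code:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Train the language-guided localization model.")
    parser.add_argument("--finetune_from", type=str, default=None,
                        help="Path to a checkpoint (.pth) to fine-tune from.")
    parser.add_argument("--debug", action="store_true",
                        help="Run on a small data subset for quick sanity checks.")
    parser.add_argument("--subset", type=float, default=1.0,
                        help="Fraction of the training data to use, e.g. 0.5 for 50%%.")
    return parser.parse_args()
```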
Use inference.py to run a trained model on a video to localize a specific language query.
python inference.py \
--video_path /path/to/your/video.mp4 \
--text_query "a grasper is present" \
--checkpoint_path /path/to/your/best_model.pth

The framework was evaluated on the held-out Cholec80 test split (8 full-length videos). Despite computational constraints that limited training to only 4 epochs, the model demonstrated state-of-the-art potential in frame-level discrimination.
Our framework significantly outperforms general-domain baselines when applied to the specialized surgical environment.
| Model | AUROC (Discrimination) | AUPRC (Precision-Recall) | Best F1-Score |
|---|---|---|---|
| Proposed Framework (4 Epochs) | 0.933 | 0.887 | 0.820 |
| CLIP Baseline (Linear Probe) | 0.52 | 0.08 | 0.10 |
| X-CLIP Baseline | 0.50 | 0.07 | 0.12 |
| Moment-DETR | 0.51 | 0.06 | 0.11 |
Note: Baselines were trained and evaluated under identical conditions using the same Cholec80 splits.
The following Precision-Recall curve demonstrates the framework's ability to maintain high precision across various recall levels, achieving an AUPRC of 0.887.
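These are standard frame-level classification metrics. As a reproducibility aid, a minimal computation with scikit-learn, assuming arrays of per-frame relevance scores and binary ground-truth labels, looks like this (treating "Best F1" as the maximum F1 over thresholds is our assumption):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

def frame_level_metrics(labels: np.ndarray, scores: np.ndarray):
    """labels: binary ground truth per frame; scores: predicted relevance per frame."""
    auroc = roc_auc_score(labels, scores)
    auprc = average_precision_score(labels, scores)           # area under the PR curve
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    return auroc, auprc, f1.max()                             # best F1 across thresholds
```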
In a focused case study on the "Calot triangle dissection phase" (Video 05), the model successfully identified the correct temporal neighborhood, achieving a peak confidence score of 0.9994.
The timeline below illustrates the model's activation scores across Video 05. Note the distinct probability peak aligned with the "Calot triangle dissection phase" ground truth.
Baselines and Comparisons

This repository includes the necessary code and instructions to benchmark our framework against three key families of models:
- General Vision-Language Models: For open-set, text-driven evaluation.
- CLIP: Zero-shot and linear-probe per-frame relevance scoring.
- X-CLIP: A powerful video-language model for scoring short clips.
- Temporal Grounding Models: For the direct task of localizing events from text.
- Moment-DETR: Predicts start/end boundaries from a language query.
- Surgical Specialist Models: Closed-set baselines trained specifically for Cholec80.
- TeCNO: A temporal convolutional network for surgical phase recognition.
- The code for these baselines can be found in the comparison_models/ directory. Each subfolder contains a README with specific instructions for running that model.
In high-stakes clinical environments, "black-box" predictions are insufficient. This framework integrates designed-in trustworthiness through two key mechanisms:
The system employs Evidential Deep Learning (EDL) to predict the parameters of a Beta distribution ($\alpha_t$, $\beta_t$) for each frame.
- Inverse Uncertainty: The total evidence ($S_t = \alpha_t + \beta_t$) provides an explicit measure of model confidence.
- Safety: This allows the system to flag ambiguous or out-of-distribution events that require review by a human surgeon.
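A minimal evidential output head in this spirit can be sketched as follows; the layer sizes and the uncertainty formula (the standard 2/S for a two-parameter Beta) are assumptions for illustration, not the dissertation's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Predicts Beta-distribution parameters (alpha_t, beta_t) for each frame."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2)

    def forward(self, features):                      # features: (T, dim)
        evidence = F.softplus(self.proj(features))    # non-negative evidence
        alpha = evidence[:, 0] + 1.0
        beta = evidence[:, 1] + 1.0
        relevance = alpha / (alpha + beta)             # expected relevance score
        total_evidence = alpha + beta                  # S_t: high => confident
        uncertainty = 2.0 / total_evidence             # low evidence => high uncertainty
        return relevance, uncertainty
```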
The framework generates Cross-Modal Attention Maps to visualize the model's reasoning.
- Focus: These heatmaps highlight the specific surgical tools (e.g., clip applier, grasper) or anatomical structures the model prioritized during a query.
- Validation: This provides clinicians with interpretable evidence that aligns AI predictions with familiar visual cues.
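Turning the fusion head's attention_weights into such a heatmap amounts to reshaping the per-patch attention onto the frame's patch grid and upsampling it to image resolution. The sketch below assumes a square patch grid and a flat attention vector, which are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def attention_heatmap(attention_weights: torch.Tensor, grid_size: int, frame_hw: tuple):
    """attention_weights: (N_patches,) cross-modal attention from the text query to image patches."""
    h, w = frame_hw
    heat = attention_weights.reshape(1, 1, grid_size, grid_size)
    heat = F.interpolate(heat, size=(h, w), mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
    return heat.squeeze()   # overlay this map on the original frame for visualization
```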
If you use this framework or ideas from our work in your research, please cite the following dissertation:
@mastersthesis{jafarifard2025,
title={An Explainable, Language-Guided Framework for Open-Set Temporal Localization on Endoscopic Videos},
author={Soheil Jafarifard Bidgoli},
school={Aston University},
year={2025}
}

This work was completed as part of the CS4700 Dissertation for the MSc in Computer Science at Aston University.
Supervisor: Dr. Zhuangzhuang Dai.
This project builds upon the foundational work of the Cholec80 dataset creators and the authors of M²CRL, VideoMamba, CLIP, Moment-DETR, and other referenced works.