This repository contains the official implementation for the dissertation titled "An Explainable, Language-Guided Framework for Open-Set Temporal Localization on Endoscopic Videos" by Soheil Jafarifard Bidgoli (MSc Computer Science, Aston University).
The project introduces a novel framework for localizing arbitrary, language-described events in long-form surgical videos, moving beyond the limitations of traditional closed-set recognition models.
Surgical and endoscopic procedures generate vast amounts of video data. Clinicians often need to find specific moments, but traditional AI models can only recognize a fixed, predefined set of events (e.g., "Phase 1," "Phase 2"). This framework breaks that limitation by enabling open-vocabulary temporal localization. Users can query long, untrimmed videos with free-form natural language to find relevant events (e.g., "find when the grasper retracts the gallbladder").
Our framework is built on three pillars:
- Open-Vocabulary Localization: Leverages powerful vision-language models to understand and locate events described by arbitrary text queries, not just fixed labels.
- Architectural Scalability: Employs a hybrid Transformer and Structured State Space Model (SSM) architecture to efficiently process long-form surgical videos, overcoming the quadratic complexity of traditional attention mechanisms.
- Trustworthiness & Explainability: Integrates Evidential Deep Learning to quantify model uncertainty and provides visual attention maps to explain its predictions, fostering clinical trust and safety.
The following demonstration shows the framework successfully localizing the "Calot triangle dissection phase" in a 40-minute untrimmed procedure based solely on a natural language query.
Query: "Calot triangle dissection phase" | Peak Confidence: 99.9%
- End-to-End Open-Vocabulary TAL: A complete pipeline from data preprocessing to language-guided inference for surgical video analysis.
- Flexible Vision Backbones: Supports both a powerful M²CRL pretrained video transformer and a highly efficient EndoMamba (SSM) backbone for long-sequence modeling.
- Parameter-Efficient Fine-Tuning (PEFT): Uses Low-Rank Adaptation (LoRA) to efficiently adapt a pretrained CLIP text encoder to the surgical domain with minimal computational cost (see the sketch after this list).
- Advanced Temporal Modeling: Features a state-of-the-art Mamba-based Temporal Head that scales linearly with sequence length, making it ideal for hour-long procedural videos.
- Bi-Level Consistency Loss: A novel training objective that enforces temporal consistency at both the semantic and spatial levels using optical flow (RAFT) to regularize the model.
- Uncertainty Quantification: Implements Evidential Deep Learning (EDL) to allow the model to express its own confidence, reliably identifying out-of-distribution or ambiguous events.
- Built-in Explainability (XAI): Generates cross-modal attention maps to visualize which parts of a frame the model focused on to make its decision, a critical feature for clinical validation.
- Comprehensive Baseline Suite: Includes code and instructions to benchmark against canonical baselines like CLIP, X-CLIP, Moment-DETR, and TeCNO.
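To make the LoRA adaptation mentioned above concrete, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer. The rank, scaling, and the module names in the commented usage example are illustrative assumptions, not the exact configuration used in this repository:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage: wrap the query/value projections of a CLIP-style text encoder.
# `text_encoder` and the attribute names below are placeholders for illustration.
# for block in text_encoder.transformer.layers:
#     block.attn.q_proj = LoRALinear(block.attn.q_proj, rank=8)
#     block.attn.v_proj = LoRALinear(block.attn.v_proj, rank=8)
```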
The system is a multi-stage pipeline designed to process spatial, semantic, and temporal information through specialized components:
- Vision Backbone: A pretrained M²CRL model extracts a grid of powerful visual feature vectors from each video frame.
- Text Encoder: A LoRA-adapted CLIP text encoder processes the natural language query into a semantic feature vector.
- Language-Guided Fusion Head: A cross-modal transformer uses attention to fuse the visual and textual features. It identifies relevant spatial regions in the frame corresponding to the query, outputting initial raw_scores, intermediate features for the consistency loss, and attention_weights for XAI.
- Temporal Head (SSM/Mamba): This highly efficient head analyzes the sequence of fused features from the entire clip. It models long-range context to smooth predictions and fill gaps, producing final, contextually aware refined_scores.
- Uncertainty & Prediction Head: In its SOTA configuration, this head uses Evidential Deep Learning to output not just a final score but also the parameters of a Beta distribution (evidential_output), allowing for robust uncertainty quantification.
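The data flow through these components can be summarized in a schematic forward pass. All module names and tensor shapes below are placeholders for illustration, not the actual classes in models.py:

```python
import torch

def localize(frames, query_tokens, vision_backbone, text_encoder,
             fusion_head, temporal_head, evidential_head):
    """Schematic forward pass over a clip of T frames (shapes are illustrative)."""
    # 1. Vision backbone: per-frame grid of visual tokens, e.g. (T, N_patches, D)
    visual_tokens = vision_backbone(frames)

    # 2. LoRA-adapted text encoder: one semantic vector for the query, e.g. (1, D)
    text_feat = text_encoder(query_tokens)

    # 3. Language-guided fusion: cross-attention between the query and patch tokens
    raw_scores, fused_feats, attention_weights = fusion_head(visual_tokens, text_feat)

    # 4. Temporal head (Mamba/SSM): long-range smoothing over the T fused features
    refined_scores = temporal_head(fused_feats)

    # 5. Evidential head: Beta-distribution parameters per frame for uncertainty
    alpha, beta = evidential_head(fused_feats)
    confidence = alpha / (alpha + beta)

    return refined_scores, confidence, attention_weights
```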
Language-Guided-Endoscopy-Localization/
│
├── backbone/                       # Vision backbones
│   ├── endomamba.py                # EndoMamba (SSM-based backbone)
│   └── vision_transformer.py       # ViT-based backbone (M²CRL, etc.)
│
├── checkpoints/                    # Saved checkpoints and logs
│
├── comparison_models/              # Baseline and benchmark models
│   ├── clip_baseline/
│   │   └── clip_baseline.py        # CLIP zero-shot / linear probe
│   │
│   ├── Moment-DETR/                # Moment-DETR temporal grounding
│   │   ├── run_evaluation.py
│   │   ├── run_feature_extraction.py
│   │   ├── run_preprocessing.py
│   │   ├── run_training.py
│   │   └── moment_detr_module/
│   │       ├── __init__.py
│   │       ├── configs.py
│   │       ├── dataset.py
│   │       ├── engine.py
│   │       ├── loss.py
│   │       ├── matcher.py
│   │       ├── modeling.py
│   │       ├── position_encoding.py
│   │       ├── transformer.py
│   │       ├── utils.py
│   │       └── README.md
│   │
│   └── xclip_baseline/             # X-CLIP video-language baseline
│       ├── train_xclip.py
│       ├── eval_xclip.py
│       ├── infer_xclip.py
│       ├── requirements.txt
│       ├── project_config.py
│       ├── README_XCLIP.md
│       └── xclip_package/
│           └── xclip/
│               ├── __init__.py
│               ├── data.py
│               ├── losses.py
│               ├── metrics.py
│               ├── model.py
│               └── utils.py
│
├── dataset_preprocessing/          # Preprocessing for Cholec80 dataset
│   ├── create_splits.py
│   ├── extract_cholec80_frames.py
│   └── prepare_cholec80.py
│
├── pretrained/                     # Pretrained model weights
│   └── checkpoint.pth
│
├── dataset.py                      # Dataset wrapper
├── inference.py                    # Inference script (language-guided)
├── models.py                       # Main model components
├── project_config.py               # Config file for project settings
├── train.py                        # Training entry point
│
├── README.md                       # Project documentation
└── .gitignore
This framework is developed and evaluated on the Cholec80 dataset, which contains 80 videos of laparoscopic cholecystectomy procedures. Our preprocessing pipeline transforms this dataset into a format suitable for open-vocabulary learning.
- Frame Extraction: Videos are decoded into individual frames at a specified sampling rate.
python dataset_preprocessing/extract_cholec80_frames.py --cholec80_videos_dir /path/to/videos --output_frames_dir /path/to/frames
- Create Data Splits: The 80 videos are randomly partitioned into training, validation, and test sets to ensure fair evaluation.
python dataset_preprocessing/create_splits.py --video_dir /path/to/videos
- Generate Language Triplets: The core preprocessing step. This script reads the official phase and tool annotations and generates a CSV file of (frame_path, text_query, relevance_label) triplets, creating positive and negative examples for training the vision-language alignment.
# Run for each split
python dataset_preprocessing/prepare_cholec80.py --split train
python dataset_preprocessing/prepare_cholec80.py --split val
python dataset_preprocessing/prepare_cholec80.py --split test
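The resulting CSV can be consumed by a straightforward PyTorch dataset. The sketch below assumes only the three columns named above; the actual dataset.py wrapper in this repository may differ:

```python
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class TripletDataset(Dataset):
    """Reads (frame_path, text_query, relevance_label) rows produced by prepare_cholec80.py."""
    def __init__(self, csv_path, transform=None):
        self.rows = pd.read_csv(csv_path)
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows.iloc[idx]
        image = Image.open(row["frame_path"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        label = torch.tensor(float(row["relevance_label"]))
        return image, row["text_query"], label
```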
- Clone the repository:
git clone https://github.com/soheil-jafari/language-guided-endoscopy-localization.git
cd language-guided-endoscopy-localization
- Create a Python environment and install dependencies. We recommend using Conda.
conda create -n endo-tal python=3.9 -y
conda activate endo-tal
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
- Configuration: Before running any scripts, review and update the paths in project_config.py to match your system's directory structure.
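For orientation only, a minimal project_config.py might expose paths along the following lines; the variable names and directories are hypothetical and the actual keys in this repository may differ:

```python
# Hypothetical structure of project_config.py -- adjust to your own directories.
from pathlib import Path

CHOLEC80_VIDEOS_DIR = Path("/data/cholec80/videos")      # raw procedure videos
CHOLEC80_FRAMES_DIR = Path("/data/cholec80/frames")      # output of frame extraction
SPLITS_DIR = Path("/data/cholec80/splits")               # train/val/test video lists
TRIPLETS_DIR = Path("/data/cholec80/triplets")           # CSVs from prepare_cholec80.py
PRETRAINED_CHECKPOINT = Path("pretrained/checkpoint.pth")
CHECKPOINTS_DIR = Path("checkpoints")                    # training outputs and logs
```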
The main training script, train.py, supports a range of configurations controlled by project_config.py and command-line arguments.
- To train the model from scratch or fine-tune:
python train.py
- To fine-tune from an existing checkpoint:
python train.py --finetune_from /path/to/your/checkpoint.pth
- To run in debug mode on a small data subset:
python train.py --debug
- To adjust the training subset size (e.g., 50% of the data):
python train.py --subset 0.5
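For reference, the flags documented above map onto a command-line interface roughly like the sketch below; this is a hedged illustration of how train.py might parse them, not its exact code:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Train the language-guided localization model.")
    parser.add_argument("--finetune_from", type=str, default=None,
                        help="Path to a checkpoint (.pth) to fine-tune from.")
    parser.add_argument("--debug", action="store_true",
                        help="Run on a small data subset for quick sanity checks.")
    parser.add_argument("--subset", type=float, default=1.0,
                        help="Fraction of the training data to use, e.g. 0.5 for 50%%.")
    return parser.parse_args()
```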
Use inference.py to run a trained model on a video to localize a specific language query.
python inference.py \
--video_path /path/to/your/video.mp4 \
--text_query "a grasper is present" \
--checkpoint_path /path/to/your/best_model.pth

The framework was evaluated on the held-out Cholec80 test split (8 full-length videos). Despite computational constraints that limited training to only 4 epochs, the model demonstrated state-of-the-art potential in frame-level discrimination.
Our framework significantly outperforms general-domain baselines when applied to the specialized surgical environment.
| Model | AUROC (Discrimination) | AUPRC (Precision-Recall) | Best F1-Score |
|---|---|---|---|
| Proposed Framework (4 Epochs) | 0.933 | 0.887 | 0.820 |
| CLIP Baseline (Linear Probe) | 0.52 | 0.08 | 0.10 |
| X-CLIP Baseline | 0.50 | 0.07 | 0.12 |
| Moment-DETR | 0.51 | 0.06 | 0.11 |
Note: Baselines were trained and evaluated under identical conditions using the same Cholec80 splits.
The following Precision-Recall curve demonstrates the framework's ability to maintain high precision across various recall levels, achieving an AUPRC of 0.887.
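These are standard frame-level classification metrics. As a reproducibility aid, a minimal computation with scikit-learn, assuming arrays of per-frame relevance scores and binary ground-truth labels, looks like this (treating "Best F1" as the maximum F1 over thresholds is our assumption):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

def frame_level_metrics(labels: np.ndarray, scores: np.ndarray):
    """labels: binary ground truth per frame; scores: predicted relevance per frame."""
    auroc = roc_auc_score(labels, scores)
    auprc = average_precision_score(labels, scores)           # area under the PR curve
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-8, None)
    return auroc, auprc, f1.max()                             # best F1 across thresholds
```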
In a focused case study on the "Calot triangle dissection phase" (Video 05), the model successfully identified the correct temporal neighborhood, achieving a peak confidence score of 0.9994.
The timeline below illustrates the model's activation scores across Video 05. Note the distinct probability peak aligned with the "Calot triangle dissection phase" ground truth.
Baselines and Comparisons

This repository includes the necessary code and instructions to benchmark our framework against three key families of models:
- General Vision-Language Models: For open-set, text-driven evaluation.
- CLIP: Zero-shot and linear-probe per-frame relevance scoring.
- X-CLIP: A powerful video-language model for scoring short clips.
- Temporal Grounding Models: For the direct task of localizing events from text.
- Moment-DETR: Predicts start/end boundaries from a language query.
- Surgical Specialist Models: Closed-set baselines trained specifically for Cholec80.
- TeCNO: A temporal convolutional network for surgical phase recognition.
- The code for these baselines can be found in the comparison_models/ directory. Each subfolder contains a README with specific instructions for running that model.
In high-stakes clinical environments, "black-box" predictions are insufficient. This framework integrates designed-in trustworthiness through two key mechanisms:
The system employs Evidential Deep Learning (EDL) to predict the parameters of a Beta distribution ($\alpha_t$, $\beta_t$) for each frame.
- Inverse Uncertainty: The total evidence ($S_t = \alpha_t + \beta_t$) provides an explicit measure of model confidence.
- Safety: This allows the system to flag ambiguous or out-of-distribution events that require review by a human surgeon.
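A minimal evidential output head in this spirit can be sketched as follows; the layer sizes and the uncertainty formula (the standard 2/S for a two-parameter Beta) are assumptions for illustration, not the dissertation's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Predicts Beta-distribution parameters (alpha_t, beta_t) for each frame."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2)

    def forward(self, features):                      # features: (T, dim)
        evidence = F.softplus(self.proj(features))    # non-negative evidence
        alpha = evidence[:, 0] + 1.0
        beta = evidence[:, 1] + 1.0
        relevance = alpha / (alpha + beta)             # expected relevance score
        total_evidence = alpha + beta                  # S_t: high => confident
        uncertainty = 2.0 / total_evidence             # low evidence => high uncertainty
        return relevance, uncertainty
```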
The framework generates Cross-Modal Attention Maps to visualize the model's reasoning.
- Focus: These heatmaps highlight the specific surgical tools (e.g., clip applier, grasper) or anatomical structures the model prioritized during a query.
- Validation: This provides clinicians with interpretable evidence that aligns AI predictions with familiar visual cues.
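Turning the fusion head's attention_weights into such a heatmap amounts to reshaping the per-patch attention onto the frame's patch grid and upsampling it to image resolution. The sketch below assumes a square patch grid and a flat attention vector, which are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def attention_heatmap(attention_weights: torch.Tensor, grid_size: int, frame_hw: tuple):
    """attention_weights: (N_patches,) cross-modal attention from the text query to image patches."""
    h, w = frame_hw
    heat = attention_weights.reshape(1, 1, grid_size, grid_size)
    heat = F.interpolate(heat, size=(h, w), mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
    return heat.squeeze()   # overlay this map on the original frame for visualization
```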
If you use this framework or ideas from our work in your research, please cite the following dissertation:
@mastersthesis{jafarifard2025,
title={An Explainable, Language-Guided Framework for Open-Set Temporal Localization on Endoscopic Videos},
author={Soheil Jafarifard Bidgoli},
school={Aston University},
year={2025}
}

This work was completed as part of the CS4700 Dissertation for the MSc in Computer Science at Aston University.
Supervisor: Dr. Zhuangzhuang Dai.
This project builds upon the foundational work of the Cholec80 dataset creators and the authors of M²CRL, VideoMamba, CLIP, Moment-DETR, and other referenced works.