PyTorch implementation
- This repository provides code and trained models from our paper "An Experimental Study on Generating Plausible Textual Explanations for Video Summarization", written by Thomas Eleftheriadis, Evlampios Apostolidis and Vasileios Mezaris, and accepted for publication in the Proceedings of the IEEE Int. Conf. on Content-Based Multimedia Indexing (CBMI 2025), Dublin, Ireland, Oct. 2025.
- This software can be used to generate plausible textual explanations for the outcomes of a video summarization model. More specifically, our framework produces: a) visual explanations including the video fragments that influenced the summarizer's decisions the most, using the model-specific (attention-based) and model-agnostic (LIME-based) explanation methods from Tsigos et al. (2024), and b) plausible textual explanations, by integrating a state-of-the-art Large Multimodal Model (LLaVA-OneVision) and prompting it to produce natural language descriptions of the produced visual explanations. The plausibility of a visual explanation is quantified by measuring the semantic overlap between its textual description and the textual description of the corresponding video summary, using two sentence embedding methods (SBERT, SimCSE). Using this framework, a state-of-the-art video summarization method (CA-SUM) and two benchmark datasets (SumMe, TVSum), we ran experiments to examine whether the most faithful explanations are also the most plausible ones, and to identify the most appropriate approach for generating plausible textual explanations for video summarization.
- This repository includes:
- Details about the main dependencies of the released code.
- Information for obtaining the videos of the utilized datasets.
- The features and pretrained models employed in our experiments.
- Instructions for producing visual and textual explanations for the videos of the utilized datasets, as well as for individual videos.
- Instructions for obtaining the reported evaluation results.
- Other details (citation, licence, acknowledgement).
The code was developed, checked and verified on an Ubuntu 20.04.6 PC with an NVIDIA RTX 4090 GPU and an i5-12600K CPU. All dependencies can be found inside the requirements.txt file, which can be used to set up the necessary virtual environment.
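For example, the virtual environment can be set up with standard Python tooling (the environment name below is arbitrary):

python3 -m venv .venv

source .venv/bin/activate

pip install -r requirements.txt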
Regarding the temporal segmentation of the videos, the fragments utilized in our experiments are available in the data folder. These fragments were produced by the TransNetV2 shot segmentation method (for multi-shot videos) and the motion-driven method for sub-shot segmentation (for single-shot videos) described in Apostolidis et al. (2018). In case there is a need to re-run shot segmentation, please use the code from the official GitHub repository and set up the necessary environment following the instructions in that repository. In case there is a need to also re-run sub-shot segmentation, please contact us to get access to the utilized method.
The path of the TransNetV2 project, along with the path of its corresponding virtual environment, can be set in the video_segmentation.py file. Please note that the path of the project is given relative to the parent directory of this project, while the path of the virtual environment is given relative to the root directory of the TransNetV2 project.
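For illustration only, the relevant settings inside video_segmentation.py could resemble the following (the variable names are hypothetical placeholders, not necessarily those used in the actual script):

```python
# Hypothetical placeholders illustrating the two configurable paths in video_segmentation.py.
# The project path is interpreted relative to the parent directory of this project,
# while the virtual environment path is interpreted relative to the TransNetV2 root directory.
TRANSNETV2_PROJECT_DIR = "TransNetV2"  # i.e., ../TransNetV2 as seen from this project
TRANSNETV2_VENV_DIR = ".venv"          # i.e., ../TransNetV2/.venv
```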
If there is a need to use the default paths:
- Set the name of the root directory of the project to TransNetV2 and place it in the parent directory of this project.
- Set the name of the virtual environment of the project to .venv and place it inside the root directory of the corresponding project. This will result in the following project structure:
```
/Parent Directory
    /TransNetV2
        /.venv
        ...
    ...
    /Text-XAI-Video-Summaries
        ...
```
The videos of the SumMe and TVSum datasets are available here. These videos have to be placed into the SumMe and TVSum directories of the data folder. Then, they have to be renamed according to the utilized naming format, using the provided rename_videos.py script.
The extracted deep features for the SumMe and TVSum videos are already available in the aforementioned directories. In case there is a need to extract these deep features from scratch (and store them in h5 files), please run the feature_extraction.py script. Otherwise, an h5 file will be produced automatically for each video and stored in the relevant directory of the data folder.
The produced h5 files have the following structure:
```
/key
    /features     2D-array with shape (n_steps, feature-dimension)
    /n_frames     number of frames in original video
```
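For instance, such a file can be inspected with h5py as follows (a minimal sketch; the file path is only an example and should be adjusted to your data folder):

```python
import h5py

# Inspect a produced h5 file: one group per video key, holding the frame features
# and the number of frames of the original video (structure as described above).
with h5py.File("../data/SumMe/video_1/video_1.h5", "r") as hdf:
    for key in hdf.keys():
        features = hdf[f"{key}/features"][...]      # shape: (n_steps, feature-dimension)
        n_frames = int(hdf[f"{key}/n_frames"][()])  # number of frames in the original video
        print(key, features.shape, n_frames)
```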
The utilized pre-trained models of the CA-SUM method are available within the models directory. Their performance, as well as some other training details, are reported below.
| Model | F1 score (%) | Epoch | Split | Reg. Factor |
|---|---|---|---|---|
| summe.pkl | 59.14 | 383 | 4 | 0.5 |
| tvsum.pkl | 63.46 | 44 | 4 | 0.5 |
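These checkpoints can be inspected with PyTorch, e.g. as in the following minimal sketch, which assumes the .pkl files store the model's state dict:

```python
import torch

# Load a pre-trained CA-SUM checkpoint and list its parameters
# (assumes the .pkl file contains a PyTorch state dict).
state_dict = torch.load("models/tvsum.pkl", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```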
To produce visual explanations for the videos of the SumMe and TVSum datasets, and compute faithfulness (Disc+) scores for these explanations, please run the following command:
bash explain.sh
For each video in these datasets, this command:
- creates a new folder (if it does not already exist) in the directory where the video is stored,
- extracts deep features from the video frames and identifies the shots of the video, and stores the obtained data in h5 and txt files, respectively (if the files containing these data do not already exist),
- and creates a subfolder, named visual_explanation, with the following files:
- "attention_explanations.txt": contains the selected video fragments by the attention-based explanation method, in temporal order (i.e. based on their occurence in the video)
- "attention_importance.txt": contains the selected video fragments by the attention-based explanation method, ranked based on the assigned scores
- "lime_explanations.txt": contains the selected video fragments by the LIME-based explanation method, in temporal order (i.e. based on their occurence in the video)
- "lime_importance.txt": contains the selected video fragments by the LIME-based explanation method, ranked based on the assigned scores
- "sum_shots.txt": contains information (indices of the start and end frame) about the fragments of the video summary
- "fragments_explanation.txt": contains a ranking (in descending order) of the video fragments (represented by the indices of the start and end frame) according to the assigned scores by each explanation method
- "fragments_explanation_evaluation_metrics.csv": contains the computed faithfulness (Disc+) scores for each explanation method
- "indexes.csv": contains the indices of the video fragments ranked (in descending order) according to the assigned scores by each explanation method
Then, to produce textual descriptions of the created visual explanations, run the following command:
python explanation/text_explanation.py -d ../data/SumMe ../data/TVSum
For each video in these datasets, this command calls a subprocess, named LLAVA, which creates another subfolder, named textual_explanation, with the following files:
- "video_id_text.txt": contains the generated textual descriptions of the visual explanations and the video summary
- "video_id_similarities.csv": contains the computed SimCSE and SBERT scores for these textual explanations
To produce visual explanations for an individual video using both the model-specific (attention-based) and model-agnostic (LIME-based) methods of the framework, please run:
python explanation/explain.py --model MODEL_PATH --video VIDEO_PATH --fragments NUM_OF_FRAGMENTS
where MODEL_PATH is the path of the trained summarization model, VIDEO_PATH is the path of the video, and NUM_OF_FRAGMENTS is the number of video fragments used for generating the explanations (optional; default = 3).
Then, to produce textual explanations for this video, please run:
python explanation/text_explanation.py -d VIDEO_PATH
Example:
python explanation/explain.py --model models/tvsum.pkl --video ../data/TVSum/video_17.mp4
python explanation/text_explanation.py -d ../data/TVSum/video_17.mp4
To get the overall evaluation results (for all videos of the used datasets) about the faithfulness (Disc+) of the produced visual explanations, please run the combine_fragment_evaluation_files.py script. The final scores are saved in the final_scores directory, which is located inside the explanation folder. Then, to get the overall evaluation results about the plausibility of the produced textual explanations, please run the combine_similarities_files.py script. The average scores are saved as "averages_top_1.csv" and "averages_top_3.csv" in the same folder.
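Assuming both scripts reside in the explanation folder (adjust the paths if your layout differs), the corresponding commands are:

python explanation/combine_fragment_evaluation_files.py

python explanation/combine_similarities_files.py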
If you find our work, code or trained models useful in your work, please cite the following publication:
T. Eleftheriadis, E. Apostolidis, V. Mezaris, "An Experimental Study on Generating Plausible Textual Explanations for Video Summarization", IEEE Int. Conf. on Content-Based Multimedia Indexing (CBMI 2025), Dublin, Ireland, Oct. 2025.
BibTeX:
@inproceedings{eleftheriadis2025cbmi,
title={An Experimental Study on Generating Plausible Textual Explanations for Video Summarization},
author={Thomas Eleftheriadis and Evlampios Apostolidis and Vasileios Mezaris},
booktitle={2025 IEEE Int. Conf. on Content-Based Multimedia Indexing (CBMI)},
year={2025},
organization={IEEE}
}
You may also want to have a look at our previous publication, in which the extraction of visual (non-textual) explanations for video summarization was presented:
K. Tsigos, E. Apostolidis, V. Mezaris, "An Integrated Framework for Multi-Granular Explanation of Video Summarization", Frontiers in Signal Processing, vol. 4, 2024. DOI:10.3389/frsip.2024.1433388
BibTeX:
@ARTICLE{10.3389/frsip.2024.1433388,
AUTHOR={Tsigos, Konstantinos and Apostolidis, Evlampios and Mezaris, Vasileios},
TITLE={An integrated framework for multi-granular explanation of video summarization},
JOURNAL={Frontiers in Signal Processing},
VOLUME={4},
YEAR={2024},
URL={https://www.frontiersin.org/journals/signal-processing/articles/10.3389/frsip.2024.1433388},
DOI={10.3389/frsip.2024.1433388},
ISSN={2673-8198},
}
Copyright (c) 2025, Thomas Eleftheriadis, Evlampios Apostolidis, Vasileios Mezaris / CERTH-ITI. All rights reserved. This code is provided for academic, non-commercial use only. Redistribution and use in source and binary forms, with or without modification, are permitted for academic non-commercial use provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation provided with the distribution.
This software is provided by the authors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the authors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
This work was supported by the EU Horizon Europe programme under grant agreement 101070109 TransMIXR.