TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

tianyi-lab/TSRBench

📄 Paper | 🤗 Dataset | 🏠 Project Website

Fangxu Yu¹, Xingang Guo², Lingzhi Yuan¹, Haoqiang Kang³, Hongyu Zhao¹, Lianhui Qin³, Furong Huang¹, Bin Hu², Tianyi Zhou⁴

¹University of Maryland, College Park   ²University of Illinois Urbana-Champaign   ³University of California, San Diego   ⁴Mohamed Bin Zayed University of Artificial Intelligence

🔥 Overview

TSRBench is a large-scale, comprehensive benchmark designed to stress-test the time series understanding and reasoning capabilities of generalist models (LLMs, VLMs, and TSLLMs). Time series data pervades real-world environments and underpins decision-making in high-stakes domains like finance, healthcare, and industrial systems. However, existing benchmarks often treat time series as isolated numerical sequences, stripping away the semantic context essential for complex problem-solving, or focusing solely on surface-level pattern recognition.

TSRBench is more than a benchmark: it is a multifaceted, standardized evaluation platform that uncovers the current challenges in time series reasoning and provides actionable insights for advancing it.

🚀 Key Features

  • 🛠️ Comprehensive Taxonomy & Scale: TSRBench categorizes capabilities into 4 major dimensions (Perception, Reasoning, Prediction, Decision-Making) spanning 15 specific tasks, with 4,125 problems drawn from 13 diverse domains.

  • 🎯 Native Multi-Modal Support: Designed for generalist models, TSRBench supports four distinct modalities: text, image, text-image interleaved, and time series embeddings.

  • 🏹 Unified Evaluation Pipeline (API & Local): We provide a standardized setup to evaluate a wide range of models effortlessly:

    • Proprietary Models: Seamless integration with APIs (e.g., GPT-5, Gemini-2.5, DeepSeek).
    • Open-Source Models: Local execution support via vLLM for efficient inference.
  • 🔍 Fine-Grained Capability Assessment: TSRBench assesses complex cognitive abilities across the 4 dimensions and 15 tasks above.
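
As a rough illustration of the text modality listed above, the sketch below serializes a numeric series and a question into a single plain-text prompt. The function name `series_to_text_prompt`, its parameters, and the prompt layout are all hypothetical choices made for this example; the prompt format that TSRBench itself uses is not shown in this README.

```python
from typing import Sequence


def series_to_text_prompt(
    values: Sequence[float],
    question: str,
    name: str = "series",
    precision: int = 2,
) -> str:
    """Render a numeric series and a question as one plain-text prompt.

    Hypothetical helper for illustration only; not part of TSRBench.
    """
    if len(values) == 0:
        raise ValueError("values must contain at least one observation")
    # Fixed precision keeps the rendered series compact and uniform.
    rendered = ", ".join(f"{v:.{precision}f}" for v in values)
    return (
        f"Time series '{name}' ({len(values)} points): [{rendered}]\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = series_to_text_prompt(
    [1.0, 2.5, 4.25], "Is the trend increasing?", name="load"
)
print(prompt)
```

The same series could instead be plotted and passed as an image for the vision modality, or supplied in both forms for the interleaved setting.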

Comparison with related benchmarks

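| Benchmark | Multi-Dom. | # Tasks | # Questions | Multivariate | Perception | Reasoning | Prediction | Decision | Modality |
|---|---|---|---|---|---|---|---|---|---|
| TimeMMD | | 1 | 16K | | | | | | T |
| CiK | | 1 | 0.3K | | | | | | T |
| TimeSeriesExam | | 5 | 0.7K | | | | | | T, V |
| MTBench | | 4 | 2.4K | | | | | | T |
| EngineMT-QA | | 4 | 11K | | | | | | T |
| SciTS | | 7 | 51K | | | | | | T |
| TimeMQA | | 5 | 200K | | | | | | T |
| TSR-SUITE | | 4 | 4K | | | | | | T |
| TSRBench (Ours) | | 15 | 4.1K | | | | | | T, V, T+V |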

🖥️ Installation

Download repo

git clone git@github.com:tianyi-lab/TSRBench.git
cd TSRBench

Install vLLM for local inference

uv venv myenv --python 3.12 --seed
source myenv/bin/activate
uv pip install vllm

Install openai for API inference

pip install openai==2.2.0
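
After installing, it can be useful to confirm which of the two packages are visible in the active environment. The snippet below is a minimal sketch using only the Python standard library; `installed_versions` is a hypothetical helper name and is not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            found[name] = None
    return found


for name, ver in installed_versions(["vllm", "openai"]).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Only the package needed for the chosen inference route has to be present: vllm for local models, openai for API models.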

🚀 Quick Start

Proprietary Models

To evaluate LLMs on textual time series, run:

bash inference/text_gpt/text_inference.sh "your_oai_api_base_url" "your_oai_api_key"

To evaluate VLMs on visual time series, run:

bash inference/vision_gpt/vision_inference.sh "your_oai_api_base_url" "your_oai_api_key"

To give VLMs both the textual and the visual time series, run:

bash inference/multimodal_gpt/multimodal_inference.sh "your_oai_api_base_url" "your_oai_api_key"
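
To run all three API settings in one go, the commands above can be assembled programmatically. The following is a sketch under stated assumptions: the script paths are the ones listed above, while the helper names (`build_commands`, `run_all`) and the environment variable names `OAI_API_BASE_URL` and `OAI_API_KEY` are hypothetical and chosen only for this example.

```python
import os
import subprocess

# Script paths as listed in the Quick Start; the keys are labels for this sketch.
API_SCRIPTS = {
    "text": "inference/text_gpt/text_inference.sh",
    "vision": "inference/vision_gpt/vision_inference.sh",
    "multimodal": "inference/multimodal_gpt/multimodal_inference.sh",
}


def build_commands(base_url, api_key, modalities=("text", "vision", "multimodal")):
    """Return one argv list per requested modality, without executing anything."""
    unknown = [m for m in modalities if m not in API_SCRIPTS]
    if unknown:
        raise ValueError(f"unknown modalities: {unknown}")
    return [["bash", API_SCRIPTS[m], base_url, api_key] for m in modalities]


def run_all(repo_root="."):
    """Run each script from the repository root, stopping on the first failure."""
    # Hypothetical variable names; export them in your shell before calling.
    base_url = os.environ["OAI_API_BASE_URL"]
    api_key = os.environ["OAI_API_KEY"]
    for argv in build_commands(base_url, api_key):
        subprocess.run(argv, cwd=repo_root, check=True)
```

Calling `run_all()` from the repository root would then execute the text, vision, and multimodal scripts in sequence. Reading the key from the environment keeps it out of shell history.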

Open-source Models

To evaluate open-source LLMs on textual time series, run:

bash inference/text_opensource/text_inference.sh

To evaluate open-source VLMs on visual time series, run:

bash inference/vision_opensource/vision_inference.sh

To evaluate additional models, add them to the corresponding *.sh files.

Citation

@article{yu2026tsrbench,
  title={TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models},
  author={Yu, Fangxu and Guo, Xingang and Yuan, Lingzhi and Kang, Haoqiang and Zhao, Hongyu and Qin, Lianhui and Huang, Furong and Hu, Bin and Zhou, Tianyi},
  journal={arXiv preprint arXiv:2601.18744},
  year={2026}
}
