prompt-prix

Find your optimal open-weights model.

prompt-prix is a visual tool for running benchmark test suites across multiple LLMs simultaneously, helping you discover which model and quantization best fits your VRAM constraints and task requirements.

The Problem

You have a 24GB GPU. Should you run qwen2.5-72b-instruct-q4_k_m or llama-3.1-70b-instruct-q5_k_s for tool calling? BFCL gives you leaderboard scores for full-precision models, but that doesn't tell you how those models behave once quantized and squeezed into your VRAM budget. prompt-prix measures a different kind of metric: how your candidate models perform on your hardware, on your tasks.

The Solution

Run existing benchmarks against your candidate models, on your hardware, and see results side-by-side.

  • Fan-out dispatch: Same test case → N models in parallel (see the sketch after this list)
  • Work-stealing scheduler: Efficient multi-GPU utilization across heterogeneous workstations
  • Visual comparison: Real-time streaming with Model × Test result grid
  • Benchmark-native: Consumes BFCL and Inspect AI test formats directly
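
Conceptually, the fan-out step is the same test case dispatched concurrently to every candidate model through an OpenAI-compatible endpoint. A minimal sketch, assuming LM Studio's default local endpoint and illustrative model names (not the project's actual code):

    # Hypothetical fan-out dispatch: one test case sent to N models in parallel
    # via an OpenAI-compatible /v1/chat/completions endpoint (e.g. LM Studio).
    # The base URL, model names, and test-case shape are assumptions for illustration.
    import asyncio
    import httpx

    BASE_URL = "http://localhost:1234/v1"  # assumed LM Studio default

    async def run_case(client: httpx.AsyncClient, model: str, messages: list[dict]) -> dict:
        resp = await client.post(
            f"{BASE_URL}/chat/completions",
            json={"model": model, "messages": messages},
            timeout=120.0,
        )
        resp.raise_for_status()
        return {"model": model, "response": resp.json()}

    async def fan_out(models: list[str], messages: list[dict]) -> list[dict]:
        async with httpx.AsyncClient() as client:
            # Same test case dispatched to every candidate model concurrently.
            return await asyncio.gather(*(run_case(client, m, messages) for m in models))

    if __name__ == "__main__":
        case = [{"role": "user", "content": "What is the weather in Paris? Use the weather tool."}]
        results = asyncio.run(fan_out(["qwen2.5-7b-instruct", "llama-3.1-8b-instruct"], case))
        for r in results:
            print(r["model"], "->", r["response"]["choices"][0]["message"])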

Status

🚧 Active Development

The working codebase is on the development/testing branch.

Ecosystem Position

Tool          Purpose
BFCL          Function-calling benchmark with leaderboard
Inspect AI    Evaluation framework (UK AISI)
prompt-prix   Visual fan-out for model selection

prompt-prix complements these tools—it's a visual layer for comparing models during selection, not a replacement for rigorous evaluation.

Architecture Highlights

  • Adapter pattern: OpenAI-compatible API now (LM Studio), extensible to Ollama/vLLM (see the sketch after this list)
  • Fail-fast validation: Invalid benchmark files rejected immediately
  • Pydantic state management: Explicit, typed, observable
  • Work-stealing dispatcher: Asymmetric GPU setups handled automatically
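
A minimal sketch of the adapter idea, assuming a hypothetical ModelAdapter interface and LM Studio's OpenAI-compatible endpoint; class and method names are illustrative, not the project's actual API:

    # Hypothetical adapter pattern: a small base interface that an
    # OpenAI-compatible backend implements today, leaving room for other backends.
    from abc import ABC, abstractmethod
    import httpx

    class ModelAdapter(ABC):
        @abstractmethod
        async def complete(self, model: str, messages: list[dict]) -> dict:
            """Send one chat-completion request and return the raw response."""

    class OpenAICompatibleAdapter(ModelAdapter):
        def __init__(self, base_url: str = "http://localhost:1234/v1"):
            self.base_url = base_url

        async def complete(self, model: str, messages: list[dict]) -> dict:
            async with httpx.AsyncClient() as client:
                resp = await client.post(
                    f"{self.base_url}/chat/completions",
                    json={"model": model, "messages": messages},
                    timeout=120.0,
                )
                resp.raise_for_status()
                return resp.json()

    # Since Ollama and vLLM also expose OpenAI-compatible endpoints, swapping
    # backends can be as small as pointing the same adapter at a different base URL.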

License

MIT

(C) 2025 Reflective Attention