Find your optimal open-weights model.
prompt-prix is a visual tool for running benchmark test suites across multiple LLMs simultaneously, helping you discover which model and quantization best fit your VRAM constraints and task requirements.
You have a 24GB GPU. Should you run qwen2.5-72b-instruct-q4_k_m or llama-3.1-70b-instruct-q5_k_s for tool calling? BFCL gives you leaderboard scores for full-precision models, which doesn't answer that question: how a quantized model behaves on your hardware is a different kind of metric.
Run existing benchmarks against your candidate models, on your hardware, and see the results side by side.
- Fan-out dispatch: Same test case → N models in parallel (see the sketch after this list)
- Work-stealing scheduler: Efficient multi-GPU utilization across heterogeneous workstations
- Visual comparison: Real-time streaming with Model × Test result grid
- Benchmark-native: Consumes BFCL and Inspect AI test formats directly
🚧 Active Development
The working codebase is on the development/testing branch.
| Tool | Purpose |
|---|---|
| BFCL | Function-calling benchmark with leaderboard |
| Inspect AI | Evaluation framework (UK AISI) |
| prompt-prix | Visual fan-out for model selection |
prompt-prix complements these tools—it's a visual layer for comparing models during selection, not a replacement for rigorous evaluation.
- Adapter pattern: OpenAI-compatible API now (LM Studio), extensible to Ollama/vLLM
- Fail-fast validation: Invalid benchmark files rejected immediately (see the sketch after this list)
- Pydantic state management: Explicit, typed, observable
- Work-stealing dispatcher: Asymmetric GPU setups handled automatically
MIT
- Development branch — working code
- BFCL — upstream benchmark source
- Inspect AI — UK AISI evaluation framework
(C) 2025 Reflective Attention