Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Toward Expert-Level Medical Text Validation with Language Models
Open WebUI Logic Tree Multi-Model LLM Project - Advanced AI reasoning with Sakana AI's AB-MCTS and sophisticated multi-model collaboration
A course on fine-tuning LLMs with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves model reasoning with minimal data. Covers RFT concepts, reward design, LLM-as-a-judge evaluation, and deploying jobs on the Predibase platform.
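The heart of GRPO is sampling several completions of the same prompt and normalizing each reward against its group. A minimal sketch of that group-relative advantage step (the function name and the example judge rewards are illustrative, not the course's code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize per-completion rewards against the group mean and standard
    deviation, the group-relative baseline used by GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero std when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Rewards from an LLM-as-a-judge for four sampled completions of one prompt.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```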
A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors.
Evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.
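One plausible way such a judge assembles its input, assuming an OpenAI-style vision chat format (the function, rubric, and field values below are hypothetical, not the repo's code):

```python
import base64

def build_judge_messages(task: str, trajectory: list[str],
                         screenshot_path: str, result: str) -> list[dict]:
    """Pack the task, action trajectory, final screenshot, and reported result
    into a chat payload asking an LLM judge for a score and feedback."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    rubric = ("Score the agent from 1-10 on task completion and efficiency, then give "
              "one paragraph of feedback. Respond as JSON with keys 'score' and 'feedback'.")
    return [
        {"role": "system", "content": "You are a strict evaluator of web-browsing agents."},
        {"role": "user", "content": [
            {"type": "text",
             "text": f"Task: {task}\nActions:\n" + "\n".join(trajectory)
                     + f"\nFinal result: {result}\n\n{rubric}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]},
    ]
```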
Quantifying uncertainty in LLM-as-judge evals with conformal prediction.
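A minimal split-conformal sketch of that idea, assuming NumPy ≥ 1.22 and a small calibration set of judge scores paired with human labels (not the repo's implementation):

```python
import numpy as np

def conformal_interval(cal_judge, cal_human, new_judge, alpha=0.1):
    """Wrap a new judge score in an interval with ~(1 - alpha) coverage using
    |judge - human| residuals from a held-out calibration set."""
    residuals = np.abs(np.asarray(cal_judge, float) - np.asarray(cal_human, float))
    n = len(residuals)
    # Finite-sample-corrected quantile of the calibration residuals.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, level, method="higher")
    return new_judge - q, new_judge + q

# Calibration judge scores vs. human labels, then a new judge score of 7.0.
print(conformal_interval([6, 8, 5, 9, 7], [7, 8, 4, 9, 6], 7.0))
```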
Universal quality evaluation plugin for Claude Code — 7-dimension scoring (correctness, completeness, adherence, efficiency, safety), configurable rubrics, threshold blocking, auto-hooks & /judge command.
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
Evaluation of meeting summaries on different topics in English and German, using NLP evaluation metrics and an LLM-as-Judge (llama-3.1-8b). A state-of-the-art atomic fact extraction step is implemented, which substantially improved the evaluation scores.
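Atomic-fact-based scoring typically reduces to "what fraction of the summary's facts does the judge find supported by the transcript". A toy sketch of that aggregation, with a trivial stand-in where the repo would call its llama-3.1-8b judge:

```python
def factual_consistency(facts: list[str], transcript: str, is_supported) -> float:
    """Score a summary as the fraction of its atomic facts the judge deems
    supported by the meeting transcript."""
    verdicts = [is_supported(fact, transcript) for fact in facts]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# `is_supported` would wrap the LLM-as-Judge call; here a naive substring check stands in.
score = factual_consistency(
    ["The budget was approved.", "The launch moved to May."],
    "After discussion the budget was approved; the launch date is unchanged.",
    lambda fact, text: fact.lower().rstrip(".") in text.lower(),
)
print(score)  # 0.5 - only the first fact is supported
```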
LLM evaluation framework with custom metrics, LLM-as-judge, and comprehensive reporting
The Self-Hosted AI Firewall & Gateway. Drop-in guardrails for LLMs running entirely on CPU. Blocks jailbreaks, enforces policies, and ensures compliance in real time.
🚀 Professional-grade AI Agent Evaluation Platform. Multi-provider LLM-as-a-Judge (OpenAI, Anthropic, Gemini), automated testing, A/B benchmarking, and safety auditing.
A simple contextual long-term memory store that remembers users' personas.
Evaluate translations with either a self-hosted embedder or ChatGPT as an LLM-as-judge.
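The embedder route boils down to cosine similarity between multilingual sentence embeddings of the source and the candidate; a sketch using sentence-transformers (the LaBSE model choice is an assumption, not necessarily the repo's):

```python
from sentence_transformers import SentenceTransformer, util

def embedding_similarity(source: str, translation: str,
                         model_name: str = "sentence-transformers/LaBSE") -> float:
    """Score a translation by cosine similarity between multilingual sentence
    embeddings of the source text and the candidate translation."""
    model = SentenceTransformer(model_name)
    src_emb, tgt_emb = model.encode([source, translation], convert_to_tensor=True)
    return float(util.cos_sim(src_emb, tgt_emb))

print(embedding_similarity("Das Wetter ist heute schön.", "The weather is nice today."))
```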