Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Toward Expert-Level Medical Text Validation with Language Models
Open WebUI Logic Tree Multi-Model LLM Project - Advanced AI reasoning with Sakana AI's AB-MCTS and sophisticated multi-model collaboration
A course on fine-tuning LLMs with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves model reasoning with minimal data. Covers RFT concepts, reward design, LLM-as-a-judge evaluation, and deploying jobs on the Predibase platform.
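The heart of GRPO is sampling several completions of the same prompt and normalizing each reward against its group. A minimal sketch of that group-relative advantage step (the function name and the example judge rewards are illustrative, not the course's code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize per-completion rewards against the group mean and standard
    deviation, the group-relative baseline used by GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero std when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Rewards from an LLM-as-a-judge for four sampled completions of one prompt.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```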
A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors.
Evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.
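One plausible way such a judge assembles its input, assuming an OpenAI-style vision chat format (the function, rubric, and field values below are hypothetical, not the repo's code):

```python
import base64

def build_judge_messages(task: str, trajectory: list[str],
                         screenshot_path: str, result: str) -> list[dict]:
    """Pack the task, action trajectory, final screenshot, and reported result
    into a chat payload asking an LLM judge for a score and feedback."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    rubric = ("Score the agent from 1-10 on task completion and efficiency, then give "
              "one paragraph of feedback. Respond as JSON with keys 'score' and 'feedback'.")
    return [
        {"role": "system", "content": "You are a strict evaluator of web-browsing agents."},
        {"role": "user", "content": [
            {"type": "text",
             "text": f"Task: {task}\nActions:\n" + "\n".join(trajectory)
                     + f"\nFinal result: {result}\n\n{rubric}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]},
    ]
```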
Quantifying uncertainty in LLM-as-judge evals with conformal prediction.
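A minimal split-conformal sketch of that idea, assuming NumPy ≥ 1.22 and a small calibration set of judge scores paired with human labels (not the repo's implementation):

```python
import numpy as np

def conformal_interval(cal_judge, cal_human, new_judge, alpha=0.1):
    """Wrap a new judge score in an interval with ~(1 - alpha) coverage using
    |judge - human| residuals from a held-out calibration set."""
    residuals = np.abs(np.asarray(cal_judge, float) - np.asarray(cal_human, float))
    n = len(residuals)
    # Finite-sample-corrected quantile of the calibration residuals.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, level, method="higher")
    return new_judge - q, new_judge + q

# Calibration judge scores vs. human labels, then a new judge score of 7.0.
print(conformal_interval([6, 8, 5, 9, 7], [7, 8, 4, 9, 6], 7.0))
```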
Universal quality evaluation plugin for Claude Code — 7-dimension scoring (correctness, completeness, adherence, efficiency, safety), configurable rubrics, threshold blocking, auto-hooks & /judge command.
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
Evaluation of meeting summaries on different topics in English and German, using NLP evaluation metrics and an LLM-as-Judge (llama-3.1-8b). A state-of-the-art atomic fact extraction step is implemented, which substantially improved the evaluation scores.
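Atomic-fact-based scoring typically reduces to "what fraction of the summary's facts does the judge find supported by the transcript". A toy sketch of that aggregation, with a trivial stand-in where the repo would call its llama-3.1-8b judge:

```python
def factual_consistency(facts: list[str], transcript: str, is_supported) -> float:
    """Score a summary as the fraction of its atomic facts the judge deems
    supported by the meeting transcript."""
    verdicts = [is_supported(fact, transcript) for fact in facts]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# `is_supported` would wrap the LLM-as-Judge call; here a naive substring check stands in.
score = factual_consistency(
    ["The budget was approved.", "The launch moved to May."],
    "After discussion the budget was approved; the launch date is unchanged.",
    lambda fact, text: fact.lower().rstrip(".") in text.lower(),
)
print(score)  # 0.5 - only the first fact is supported
```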
LLM evaluation framework with custom metrics, LLM-as-judge, and comprehensive reporting
The Self-Hosted AI Firewall & Gateway. Drop-in guardrails for LLMs running entirely on CPU. Blocks jailbreaks, enforces policies, and ensures compliance in real time.
🚀 Professional-grade AI Agent Evaluation Platform. Multi-provider LLM-as-a-Judge (OpenAI, Anthropic, Gemini), automated testing, A/B benchmarking, and safety auditing.
A simple contextual long-term memory store that remembers users' personas.
Evaluate translations with either a self-hosted embedder or ChatGPT as an LLM-as-judge.
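The embedder route boils down to cosine similarity between multilingual sentence embeddings of the source and the candidate; a sketch using sentence-transformers (the LaBSE model choice is an assumption, not necessarily the repo's):

```python
from sentence_transformers import SentenceTransformer, util

def embedding_similarity(source: str, translation: str,
                         model_name: str = "sentence-transformers/LaBSE") -> float:
    """Score a translation by cosine similarity between multilingual sentence
    embeddings of the source text and the candidate translation."""
    model = SentenceTransformer(model_name)
    src_emb, tgt_emb = model.encode([source, translation], convert_to_tensor=True)
    return float(util.cos_sim(src_emb, tgt_emb))

print(embedding_similarity("Das Wetter ist heute schön.", "The weather is nice today."))
```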