
LLM-As-a-Judge

A Streamlit-based application for evaluating and comparing responses generated by Large Language Models (LLMs). This tool supports multiple evaluation methods, including pairwise comparisons, reference-based evaluations, criteria-based evaluations, hallucination detection, and traditional NLP metrics.

Features

  1. Non-LLM Evaluation: compare bot responses against a ground-truth answer using traditional NLP metrics such as BLEU, ROUGE, BERTScore, and edit distance (see the first sketch below this list).
  2. Pairwise Comparison: compare two LLM responses directly to determine which is better, based on a detailed evaluation (see the second sketch below this list).
  3. Reference-Free Criteria Evaluation: evaluate responses on criteria such as accuracy, coherence, creativity, and relevance without requiring a ground truth.
  4. Reference-Based Evaluation: evaluate responses against a reference or ground-truth answer, with detailed scoring and explanations.
  5. Hallucination Detection: detect hallucinations in LLM-generated responses by comparing them to a provided context.
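
For the Non-LLM mode, the sketch below shows how the listed metrics might be computed with common Python packages. The package choices (nltk for BLEU and edit distance, rouge-score for ROUGE) are assumptions about suitable libraries, not the repository's actual code, and BERTScore is omitted because it requires a model download.

```python
# Minimal sketch of reference-based metrics, assuming nltk and rouge-score.
from nltk import edit_distance
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

def non_llm_scores(response: str, ground_truth: str) -> dict:
    """Score a bot response against a ground-truth answer."""
    # BLEU over whitespace tokens, smoothed so short strings don't zero out.
    bleu = sentence_bleu(
        [ground_truth.split()],
        response.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    # ROUGE-L F1 between the two strings.
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(ground_truth, response)
    return {
        "bleu": bleu,
        "rougeL_f1": rouge_l["rougeL"].fmeasure,
        # Character-level Levenshtein distance (lower means closer).
        "edit_distance": edit_distance(response, ground_truth),
    }

print(non_llm_scores("Paris is the capital of France.",
                     "The capital of France is Paris."))
```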
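
For the LLM-based modes (items 2-5), the core pattern is to send the material to an LLM with judging instructions. The sketch below illustrates a pairwise comparison; the prompt wording, model name, and output format are illustrative assumptions, not the app's actual implementation.

```python
# Hedged sketch of a pairwise LLM-as-a-judge call via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_pairwise(question: str, response_a: str, response_b: str) -> str:
    prompt = (
        "You are an impartial judge. Given a question and two candidate "
        "answers, decide which answer is better and explain why.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {response_a}\n\n"
        f"Answer B: {response_b}\n\n"
        'Reply with "A" or "B" followed by a one-paragraph justification.'
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the app may use another
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```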

Usage

  1. Enter your OpenAI API key in the sidebar to enable LLM-based evaluations.
  2. Select an evaluation method from the sidebar (a sketch of this flow follows the list):
     • Non-LLM Evaluation
     • Pairwise Comparison
     • Reference-Free Criteria Evaluation
     • Reference-Based Evaluation
     • Hallucination Detection
  3. Follow the prompts in the main interface to input your data and generate evaluations.
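
The sketch below shows how this sidebar flow could look in Streamlit. The widget labels and method names are taken from this README, but the actual source may structure things differently.

```python
# Minimal Streamlit sidebar sketch, assuming the flow described above.
import streamlit as st

api_key = st.sidebar.text_input("OpenAI API key", type="password")
method = st.sidebar.selectbox(
    "Evaluation method",
    [
        "Non-LLM Evaluation",
        "Pairwise Comparison",
        "Reference-Free Criteria Evaluation",
        "Reference-Based Evaluation",
        "Hallucination Detection",
    ],
)

if method == "Non-LLM Evaluation":
    response = st.text_area("Bot response")
    ground_truth = st.text_area("Ground truth")
    if st.button("Evaluate") and response and ground_truth:
        # Compute BLEU/ROUGE/edit distance here and display the results.
        st.json({"bleu": 0.0})  # placeholder output
```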
