A Streamlit-based application for evaluating and comparing responses generated by Large Language Models (LLMs). This tool supports multiple evaluation methods, including pairwise comparisons, reference-based evaluations, criteria-based evaluations, hallucination detection, and traditional NLP metrics.
Non-LLM Evaluation:
- Compare bot responses against ground truth using traditional NLP metrics like BLEU, ROUGE, BERTScore, and Edit Distance.
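A minimal sketch of these metrics, assuming the `nltk`, `rouge-score`, and `bert-score` packages are installed; the function name `compute_metrics` is illustrative, not the app's actual API:

```python
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def compute_metrics(response: str, ground_truth: str) -> dict:
    ref_tokens, hyp_tokens = ground_truth.split(), response.split()

    # BLEU with smoothing so short strings don't collapse to zero.
    bleu = sentence_bleu(
        [ref_tokens], hyp_tokens,
        smoothing_function=SmoothingFunction().method1,
    )

    # ROUGE-1 and ROUGE-L F-measures.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(ground_truth, response)

    # BERTScore F1 (downloads a model on first use).
    _, _, f1 = bert_score([response], [ground_truth], lang="en")

    # Character-level Levenshtein edit distance.
    edit = nltk.edit_distance(response, ground_truth)

    return {
        "bleu": bleu,
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rougeL_f": rouge["rougeL"].fmeasure,
        "bertscore_f1": f1.item(),
        "edit_distance": edit,
    }
```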
Pairwise Comparison:
- Compare two LLM responses head-to-head to determine which is better, with a detailed justification for the verdict.
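A minimal sketch of a pairwise judge using the official `openai` client; the model name, prompt wording, and function name are assumptions rather than the app's exact implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def pairwise_judge(question: str, response_a: str, response_b: str) -> str:
    prompt = (
        "You are an impartial judge. Given the question and two responses, "
        "decide which response is better and explain why.\n\n"
        f"Question: {question}\n\nResponse A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        'Answer with "A" or "B" followed by a short justification.'
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return result.choices[0].message.content
```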
Reference-Free Criteria Evaluation:
- Evaluate responses on criteria such as accuracy, coherence, creativity, and relevance without requiring a ground truth.
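A minimal sketch of reference-free criteria scoring with structured JSON output; the criteria list, 1-5 scale, and model name are illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()
CRITERIA = ["accuracy", "coherence", "creativity", "relevance"]

def score_criteria(question: str, response: str) -> dict:
    prompt = (
        "Rate the response to the question on each criterion from 1 to 5: "
        f"{', '.join(CRITERIA)}. Reply as a JSON object mapping each "
        'criterion to {"score": int, "explanation": str}.\n\n'
        f"Question: {question}\nResponse: {response}"
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)
```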
Reference-Based Evaluation:
- Evaluate responses against a reference or ground truth answer with detailed scoring and explanations.
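A minimal sketch of reference-based scoring; the prompt wording and the 1-10 scale are assumptions, not the app's exact rubric:

```python
from openai import OpenAI

client = OpenAI()

def score_against_reference(question: str, response: str, reference: str) -> str:
    prompt = (
        "Compare the response to the reference answer. Give a score from "
        "1 to 10 for how well the response matches the reference, then "
        "explain any gaps.\n\n"
        f"Question: {question}\nReference: {reference}\nResponse: {response}"
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return result.choices[0].message.content
```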
Hallucination Detection:
- Detect hallucinations in LLM-generated responses by comparing them to a provided context.
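A minimal sketch of context-grounded hallucination detection, where the model is asked to flag claims the context does not support; the prompt and output format are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()

def detect_hallucinations(context: str, response: str) -> str:
    prompt = (
        "List every factual claim in the response that is not supported "
        "by the context, quoting the claim and explaining why it is "
        "unsupported. If all claims are grounded, reply "
        "'No hallucinations detected.'\n\n"
        f"Context: {context}\n\nResponse: {response}"
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return result.choices[0].message.content
```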
Usage:
- Enter your OpenAI API key in the sidebar to enable LLM-based evaluations.
- Select an evaluation method from the sidebar (see the sketch after these steps):
  - Non-LLM Evaluation
  - Pairwise Comparison
  - Reference-Free Criteria Evaluation
  - Reference-Based Evaluation
  - Hallucination Detection
- Follow the prompts in the main interface to input your data and generate evaluations.
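A minimal sketch of the sidebar wiring in Streamlit; the widget labels and the warning logic are assumptions about how the app is structured:

```python
import streamlit as st

# API key input, masked like a password field.
api_key = st.sidebar.text_input("OpenAI API Key", type="password")

# Evaluation method selector matching the five methods above.
method = st.sidebar.selectbox(
    "Evaluation Method",
    [
        "Non-LLM Evaluation",
        "Pairwise Comparison",
        "Reference-Free Criteria Evaluation",
        "Reference-Based Evaluation",
        "Hallucination Detection",
    ],
)

# Only the non-LLM metrics work without a key.
if method != "Non-LLM Evaluation" and not api_key:
    st.sidebar.warning("Enter an API key to enable LLM-based evaluations.")
```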