This repository provides reference implementations of the evaluation metrics introduced in Learning to Judge: LLMs Designing and Applying Evaluation Rubrics.
Large language models (LLMs) are increasingly used as evaluators for natural language generation, typically applying human-defined rubrics to assess system outputs. GER-Eval investigates whether LLMs can instead design and apply their own evaluation rubrics, and how such LLM-defined criteria compare to human-defined ones in terms of reliability and alignment.
This README is intentionally focused on two things: reproducing the evaluation pipeline end to end, and extending it with your own datasets or LLM backends.
Tip: If you only want to analyze existing outputs, jump to `final_results/` (precomputed runs + summaries) and `datasets/` (data used for experiments).
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
pip install python-dotenv

Create a `.env` file in the repo root:
OPENAI_API_KEY=...
# optional: OPENAI_MODEL=...
MISTRALAI_API_KEY=...
DEEPINFRA_API_KEY=...

The pipeline runs in five stages:
- metric/rubric generation (definition)
- instruction generation (instructions)
- scoring (scoring)
- tagging student metrics vs teacher metrics (tagging)
- computing tag-based summary scores (tag_scores)
If you want the easiest “run it all again” option, use the provided runners:
python definement_total_runner.py
python instruction_total_runner.py
python scoring_total_runner.py
python tagging_total_runner.py
python tag_scores_runner.py

These runners iterate over datasets and settings and write outputs into `results/`.
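Under the hood these runners are just loops over datasets and models that call the stage scripts. If you only want to reproduce a subset, a minimal sketch of that pattern looks like the following (the dataset and model lists are illustrative, not the paper's full configuration):

```python
# minimal_runner.py -- illustrative only; the provided *_total_runner.py
# scripts are the canonical way to reproduce the full runs.
import subprocess

DATASETS = ["SummEval"]   # illustrative subset
APIS = ["gpt-4o-mini"]    # illustrative model list

for data in DATASETS:
    for api in APIS:
        # run the "definition" stage for each dataset/model pair
        subprocess.run(
            ["python", "scripts/definement.py",
             "--data", data, "--api", api, "--number_of_shots", "3"],
            check=True,
        )
```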
If you want to validate your environment before running everything, run one dataset + one model:
A) Generate a rubric (“definition”)
python scripts/definement.py --data SummEval --api gpt-4o-mini --number_of_shots 3 --use_restriction

B) Generate instructions for that rubric

python scripts/instructions.py --data SummEval --api gpt-4o-mini --setting_name <DEFINITION_FILE.json>

C) Score dataset outputs

python scripts/scoring.py --data SummEval --api gpt-4o-mini --setting_name <DEFINITION_FILE.json> --use_sample --number_of_shots 5

D) Tag student metrics to teacher metrics

python scripts/tagging.py --data SummEval --setting_name <DEFINITION_FILE.json>

E) Compute summary tag scores (coverage/diversity/unseen)

python scripts/tag_scores.py --data SummEval --setting_name <DEFINITION_FILE.json> --use_similarity True --similarity_threshold 0.82 --device cpu

By default, outputs go into stage-specific folders under `results/` (created automatically). Typical paths:
- `results/definition/<DATASET>/...json`
- `results/tag/<DATASET>/...json`
- `results/tag/<DATASET>/scores/{embedding|name}/...json`
- plus summary CSVs in the `scores/` folders
The repo includes a final_results/ directory containing precomputed outputs (and often aggregated summaries) that you can use to:
- compare settings without rerunning the LLM calls
- compute additional statistics
- build plots/tables for your own analysis
If you’re doing new analyses, start from final_results/ and/or copy artifacts into your own analysis scripts/notebooks.
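For example, a typical analysis script just walks `final_results/` and loads whatever JSON/CSV artifacts it finds (exact filenames vary per run and setting, so the glob patterns below are deliberately generic):

```python
import json
from pathlib import Path

import pandas as pd

ROOT = Path("final_results")

# collect every JSON artifact under final_results/ (filenames vary by run/setting)
for path in ROOT.rglob("*.json"):
    with path.open() as f:
        payload = json.load(f)
    print(path, type(payload))

# summary CSVs (if present) load directly into pandas for plots/tables
for csv_path in ROOT.rglob("*.csv"):
    df = pd.read_csv(csv_path)
    print(csv_path, df.shape)
```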
This codebase is dataset-driven: most scripts assume that each dataset has a corresponding data_information/ package (pickled artifacts such as prompts, metric definitions, and other cached metadata). So the core workflow is:
1) Check compatibility → 2) Create `data_information/` pickles → 3) Run the pipeline.
Before you touch anything, compare your dataset to the existing ones (in datasets/ + data_information/) and confirm you can provide the same logical fields the pipeline relies on, typically:
- an input prompt (or instruction)
- a model output / candidate response to be judged
- optionally: a reference / gold text (for some tasks)
- a teacher rubric (ground-truth metrics) if you want alignment/coverage analyses
If your dataset can be represented in the same “prompt → response (→ reference)” structure as existing datasets, it’s usually compatible.
If not, you can still add it, but you’ll likely need to adapt the dataset loader + prompt construction (see “When things don’t match” below).
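Concretely, it helps to check that every example in your dataset reduces to a small record like the one below; the field names are illustrative, not the loader's actual keys:

```python
# One logical evaluation example; field names here are illustrative only.
example = {
    "prompt": "Summarize the following article: ...",  # input prompt / instruction
    "response": "The article discusses ...",           # candidate output to be judged
    "reference": None,                                  # optional gold text, if the task has one
}

# Dataset-level teacher metrics (only needed for alignment/coverage analyses);
# SummEval-style names shown purely as an example.
teacher_metrics = ["coherence", "consistency", "fluency", "relevance"]
```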
Create:
datasets/<YourDatasetName>/
Place your raw files there (often `.csv`). Many runners also expect a smaller subset: `test.csv` (recommended) for quick runs and controlled evaluation.
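If your raw file is large, a small fixed-seed sample is usually enough for quick runs; a minimal sketch, assuming the raw data sits in a single CSV (the filename and sample size are placeholders):

```python
import pandas as pd

# assumption: the raw dataset lives in a single CSV next to test.csv
full = pd.read_csv("datasets/YourDatasetName/full.csv")

# fixed seed so the quick-run subset is reproducible
subset = full.sample(n=min(100, len(full)), random_state=42)
subset.to_csv("datasets/YourDatasetName/test.csv", index=False)
```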
The pipeline expects preprocessed dataset “information” to exist under:
data_information/<YourDatasetName>/
These files are pickle objects containing the prompts, metrics/teacher rubric objects, and other dataset-specific cached structures the scripts read at runtime.
To create them, use the provided notebook:
create_dataset_informations.ipynb (canonical way to generate the pickles)
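The notebook is the canonical path; conceptually it just serializes the per-dataset structures with `pickle`, roughly like the sketch below (the object shapes and filenames are assumptions; check the notebook for the real ones):

```python
import pickle
from pathlib import Path

out_dir = Path("data_information/YourDatasetName")
out_dir.mkdir(parents=True, exist_ok=True)

# illustrative structure -- the notebook defines the actual shapes
prompts = {"definition": "You are designing an evaluation rubric for ..."}

# hypothetical filename, shown only to illustrate the pickle step
with (out_dir / "prompts.pkl").open("wb") as f:
    pickle.dump(prompts, f)
```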
If you have a teacher rubric (human-defined metrics), make sure you can:
- represent the metrics in the same format as the existing datasets (names + descriptions + scale)
- generate or provide evaluation instructions per metric (used by the `instructions` stage)
This is important because the tagging + tag-score stages compare student rubrics to teacher metrics.
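In practice a teacher rubric is a list of named metrics with a description and a scale, plus per-metric evaluation instructions; an illustrative shape (the field names are assumptions, so match them to the existing `data_information/` pickles):

```python
# Illustrative teacher rubric + per-metric instructions (field names are assumptions).
teacher_rubric = [
    {
        "name": "coherence",
        "description": "The summary should be well-structured and well-organized.",
        "scale": "1-5",
    },
    {
        "name": "relevance",
        "description": "The summary should include only important content from the source.",
        "scale": "1-5",
    },
]

# per-metric instructions consumed by the instructions stage
instructions = {
    m["name"]: f"Rate the {m['name']} of the response on a {m['scale']} scale. {m['description']}"
    for m in teacher_rubric
}
```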
Once your data_information/<YourDatasetName>/ exists, wire the dataset name into the code so it can be selected via CLI:
- add `<YourDatasetName>` where dataset choices are defined in `scripts/*.py`
- add a corresponding loader entry in the dataset loader (e.g., `data/data_loader.py`), as sketched below, pointing to:
  - `datasets/<YourDatasetName>/...`
  - `data_information/<YourDatasetName>/...`
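The exact registration code depends on how `data/data_loader.py` is organized; the snippet below only illustrates the shape of such an entry (the dict name and helper are hypothetical, not the loader's real API):

```python
# data/data_loader.py (illustrative only -- adapt to the loader's real structure)
DATASET_PATHS = {
    # existing datasets ...
    "YourDatasetName": {
        "raw": "datasets/YourDatasetName/test.csv",
        "information": "data_information/YourDatasetName/",
    },
}

def load_dataset(name: str) -> dict:
    """Hypothetical helper: resolve the paths registered for a dataset."""
    try:
        return DATASET_PATHS[name]
    except KeyError:
        raise ValueError(f"Unknown dataset: {name}") from None
```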
Run a minimal end-to-end test:
python scripts/definement.py --data <YourDatasetName> --api gpt-4o-mini --number_of_shots 1
python scripts/tag_scores.py --data <YourDatasetName> --setting_name <DEFINITION_FILE.json> --device cpu

If this works, your dataset is correctly wired.
Even if your dataset is “compatible,” you may still need light adaptation, for example:
- renaming columns / fields to match what the loaders expect
- tweaking prompt formatting (so the LLM sees the same structure used in the paper runs)
- adjusting sampling (few-shot selection) logic for your dataset size or splits
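Of these, column renaming is usually the smallest change, e.g. mapping your headers onto whatever names the loader reads (the target names below are placeholders):

```python
import pandas as pd

df = pd.read_csv("datasets/YourDatasetName/test.csv")

# map your headers onto the names the loader expects (placeholder names here)
df = df.rename(columns={"article": "prompt", "summary": "response", "gold": "reference"})
df.to_csv("datasets/YourDatasetName/test.csv", index=False)
```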
If your dataset differs substantially (multi-turn dialogues, multiple references, etc.), you may need to:
- add a dataset-specific adapter in the loader
- modify prompt builders in `util/prompting_utils.py`
- update formatting/parsing in `util/formatting_tools.py`
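For example, a multi-turn dialogue dataset can often be flattened into the single prompt → response structure with a small adapter before it reaches the prompt builders; a sketch under that assumption (the turn format is hypothetical):

```python
def flatten_dialogue(turns: list[dict]) -> dict:
    """Collapse a multi-turn dialogue into one prompt/response pair.

    Assumes each turn looks like {"speaker": "user"|"assistant", "text": "..."}
    and that the final assistant turn is the output to be judged.
    """
    *history, last = turns
    prompt = "\n".join(f"{t['speaker']}: {t['text']}" for t in history)
    return {"prompt": prompt, "response": last["text"], "reference": None}
```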
LLMs are routed through the API abstraction (see models/api_loader.py).
Add a new backend/client that returns model outputs in the same shape used by the existing backends (text completions + optional structured parsing).
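The class below is only a sketch of the general shape such a client might take, assuming an OpenAI-compatible chat endpoint; the method name, constructor arguments, and URL are assumptions, so mirror whatever interface the existing backends in `models/api_loader.py` actually expose:

```python
import os
import requests

class MyProviderClient:
    """Sketch of a new backend that returns plain text completions."""

    def __init__(self, model: str, base_url: str = "https://api.my-provider.example/v1"):
        self.model = model
        self.base_url = base_url
        self.api_key = os.environ["MY_PROVIDER_API_KEY"]

    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        # assumes an OpenAI-compatible chat/completions endpoint
        resp = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": temperature,
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```

However the real interface looks, the key constraint from the pipeline's point of view is that the backend returns text the downstream rubric/score parsing can handle.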
Then register it so it can be used via:
--api <your_backend_name>

Add any new keys to `.env`, for example:

MY_PROVIDER_API_KEY=...

Run a minimal rubric generation call:

python scripts/definement.py --data SummEval --api <your_backend_name> --number_of_shots 1

If this works, the rest of the pipeline typically works too (instructions → scoring → tagging → tag_scores).
Different LLMs may require small adjustments to get stable parsing and consistent rubric formatting, e.g.:
- stricter prompt constraints / formatting instructions
- tweaks to output parsing rules (if your model is “creative” with structure)
- temperature / decoding changes
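A common adjustment is a more forgiving parser that pulls the rubric JSON out of whatever surrounds it (code fences, explanations); a small best-effort sketch:

```python
import json
import re

def extract_json(text: str):
    """Best-effort extraction of a JSON object from an LLM response."""
    # strip markdown code fences if the model wrapped its answer in them
    text = re.sub(r"`{3}(?:json)?", "", text)
    # try the whole string first, then the outermost {...} span
    for candidate in (text, text[text.find("{"): text.rfind("}") + 1]):
        try:
            return json.loads(candidate)
        except (json.JSONDecodeError, ValueError):
            continue
    return None
```

If even this fails for a given model, tightening the prompt's formatting instructions (the first bullet above) is usually the more reliable fix.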
Questions or ideas? You can reach me at alianpourya@gmail.com — I’m happy to help.
If you find an issue (bug, reproducibility problem, unclear docs, etc.), please open a GitHub Issue so we can track it and fix it.