This repository provides reference implementations of the evaluation metrics introduced in Learning to Judge: LLMs Designing and Applying Evaluation Rubrics.
Large language models (LLMs) are increasingly used as evaluators for natural language generation, typically applying human-defined rubrics to assess system outputs. GER-Eval investigates whether LLMs can instead design and apply their own evaluation rubrics, and how such LLM-defined criteria compare to human-defined ones in terms of reliability and alignment.
This README is intentionally focused on two things: reproducing the evaluation pipeline end to end, and extending it with your own datasets or LLM backends.
Tip: If you only want to analyze existing outputs, jump to `final_results/` (precomputed runs + summaries) and `datasets/` (data used for experiments).
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
pip install python-dotenv

Create a `.env` file in the repo root:
OPENAI_API_KEY=...
# optional: OPENAI_MODEL=...
MISTRALAI_API_KEY=...
DEEPINFRA_API_KEY=...

The pipeline runs in five stages:
- metric/rubric generation (definition)
- instruction generation (instructions)
- scoring (scoring)
- tagging student metrics vs teacher metrics (tagging)
- computing tag-based summary scores (tag_scores)
If you want the easiest “run it all again” option, use the provided runners:
python definement_total_runner.py
python instruction_total_runner.py
python scoring_total_runner.py
python tagging_total_runner.py
python tag_scores_runner.py

These runners iterate over datasets and settings and write outputs into `results/`.
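Under the hood these runners are just loops over datasets and models that call the stage scripts. If you only want to reproduce a subset, a minimal sketch of that pattern looks like the following (the dataset and model lists are illustrative, not the paper's full configuration):

```python
# minimal_runner.py -- illustrative only; the provided *_total_runner.py
# scripts are the canonical way to reproduce the full runs.
import subprocess

DATASETS = ["SummEval"]   # illustrative subset
APIS = ["gpt-4o-mini"]    # illustrative model list

for data in DATASETS:
    for api in APIS:
        # run the "definition" stage for each dataset/model pair
        subprocess.run(
            ["python", "scripts/definement.py",
             "--data", data, "--api", api, "--number_of_shots", "3"],
            check=True,
        )
```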
If you want to validate your environment before running everything, run one dataset + one model:
A) Generate a rubric (“definition”)
python scripts/definement.py --data SummEval --api gpt-4o-mini --number_of_shots 3 --use_restriction

B) Generate instructions for that rubric

python scripts/instructions.py --data SummEval --api gpt-4o-mini --setting_name <DEFINITION_FILE.json>

C) Score dataset outputs

python scripts/scoring.py --data SummEval --api gpt-4o-mini --setting_name <DEFINITION_FILE.json> --use_sample --number_of_shots 5

D) Tag student metrics to teacher metrics

python scripts/tagging.py --data SummEval --setting_name <DEFINITION_FILE.json>

E) Compute summary tag scores (coverage/diversity/unseen)

python scripts/tag_scores.py --data SummEval --setting_name <DEFINITION_FILE.json> --use_similarity True --similarity_threshold 0.82 --device cpu

By default, outputs go into stage-specific folders under `results/` (created automatically). Typical paths:
- `results/definition/<DATASET>/...json`
- `results/tag/<DATASET>/...json`
- `results/tag/<DATASET>/scores/{embedding|name}/...json`
- plus summary CSVs in the `scores/` folders
The repo includes a final_results/ directory containing precomputed outputs (and often aggregated summaries) that you can use to:
- compare settings without rerunning the LLM calls
- compute additional statistics
- build plots/tables for your own analysis
If you’re doing new analyses, start from final_results/ and/or copy artifacts into your own analysis scripts/notebooks.
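For example, a typical analysis script just walks `final_results/` and loads whatever JSON/CSV artifacts it finds (exact filenames vary per run and setting, so the glob patterns below are deliberately generic):

```python
import json
from pathlib import Path

import pandas as pd

ROOT = Path("final_results")

# collect every JSON artifact under final_results/ (filenames vary by run/setting)
for path in ROOT.rglob("*.json"):
    with path.open() as f:
        payload = json.load(f)
    print(path, type(payload))

# summary CSVs (if present) load directly into pandas for plots/tables
for csv_path in ROOT.rglob("*.csv"):
    df = pd.read_csv(csv_path)
    print(csv_path, df.shape)
```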
This codebase is dataset-driven: most scripts assume that each dataset has a corresponding data_information/ package (pickled artifacts such as prompts, metric definitions, and other cached metadata). So the core workflow is:
1) Check compatibility → 2) Create `data_information/` pickles → 3) Run the pipeline.
Before you touch anything, compare your dataset to the existing ones (in datasets/ + data_information/) and confirm you can provide the same logical fields the pipeline relies on, typically:
- an input prompt (or instruction)
- a model output / candidate response to be judged
- optionally: a reference / gold text (for some tasks)
- a teacher rubric (ground-truth metrics) if you want alignment/coverage analyses
If your dataset can be represented in the same “prompt → response (→ reference)” structure as existing datasets, it’s usually compatible.
If not, you can still add it, but you’ll likely need to adapt the dataset loader + prompt construction (see “When things don’t match” below).
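Concretely, it helps to check that every example in your dataset reduces to a small record like the one below; the field names are illustrative, not the loader's actual keys:

```python
# One logical evaluation example; field names here are illustrative only.
example = {
    "prompt": "Summarize the following article: ...",  # input prompt / instruction
    "response": "The article discusses ...",           # candidate output to be judged
    "reference": None,                                  # optional gold text, if the task has one
}

# Dataset-level teacher metrics (only needed for alignment/coverage analyses);
# SummEval-style names shown purely as an example.
teacher_metrics = ["coherence", "consistency", "fluency", "relevance"]
```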
Create:
datasets/<YourDatasetName>/
Place your raw files there (often `.csv`). Many runners also expect a smaller subset: `test.csv` (recommended) for quick runs and controlled evaluation.
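If your raw file is large, a small fixed-seed sample is usually enough for quick runs; a minimal sketch, assuming the raw data sits in a single CSV (the filename and sample size are placeholders):

```python
import pandas as pd

# assumption: the raw dataset lives in a single CSV next to test.csv
full = pd.read_csv("datasets/YourDatasetName/full.csv")

# fixed seed so the quick-run subset is reproducible
subset = full.sample(n=min(100, len(full)), random_state=42)
subset.to_csv("datasets/YourDatasetName/test.csv", index=False)
```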
The pipeline expects preprocessed dataset “information” to exist under:
data_information/<YourDatasetName>/
These files are pickle objects containing the prompts, metrics/teacher rubric objects, and other dataset-specific cached structures the scripts read at runtime.
To create them, use the provided notebook:
create_dataset_informations.ipynb (canonical way to generate the pickles)
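The notebook is the canonical path; conceptually it just serializes the per-dataset structures with `pickle`, roughly like the sketch below (the object shapes and filenames are assumptions; check the notebook for the real ones):

```python
import pickle
from pathlib import Path

out_dir = Path("data_information/YourDatasetName")
out_dir.mkdir(parents=True, exist_ok=True)

# illustrative structure -- the notebook defines the actual shapes
prompts = {"definition": "You are designing an evaluation rubric for ..."}

# hypothetical filename, shown only to illustrate the pickle step
with (out_dir / "prompts.pkl").open("wb") as f:
    pickle.dump(prompts, f)
```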
If you have a teacher rubric (human-defined metrics), make sure you can:
- represent the metrics in the same format as the existing datasets (names + descriptions + scale)
- generate or provide evaluation instructions per metric (used by the `instructions` stage)
This is important because the tagging + tag-score stages compare student rubrics to teacher metrics.
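In practice a teacher rubric is a list of named metrics with a description and a scale, plus per-metric evaluation instructions; an illustrative shape (the field names are assumptions, so match them to the existing `data_information/` pickles):

```python
# Illustrative teacher rubric + per-metric instructions (field names are assumptions).
teacher_rubric = [
    {
        "name": "coherence",
        "description": "The summary should be well-structured and well-organized.",
        "scale": "1-5",
    },
    {
        "name": "relevance",
        "description": "The summary should include only important content from the source.",
        "scale": "1-5",
    },
]

# per-metric instructions consumed by the instructions stage
instructions = {
    m["name"]: f"Rate the {m['name']} of the response on a {m['scale']} scale. {m['description']}"
    for m in teacher_rubric
}
```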
Once your data_information/<YourDatasetName>/ exists, wire the dataset name into the code so it can be selected via CLI:
- add `<YourDatasetName>` where dataset choices are defined in `scripts/*.py`
- add a corresponding loader entry in the dataset loader (e.g., `data/data_loader.py`), as sketched below, pointing to:
  - `datasets/<YourDatasetName>/...`
  - `data_information/<YourDatasetName>/...`
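The exact registration code depends on how `data/data_loader.py` is organized; the snippet below only illustrates the shape of such an entry (the dict name and helper are hypothetical, not the loader's real API):

```python
# data/data_loader.py (illustrative only -- adapt to the loader's real structure)
DATASET_PATHS = {
    # existing datasets ...
    "YourDatasetName": {
        "raw": "datasets/YourDatasetName/test.csv",
        "information": "data_information/YourDatasetName/",
    },
}

def load_dataset(name: str) -> dict:
    """Hypothetical helper: resolve the paths registered for a dataset."""
    try:
        return DATASET_PATHS[name]
    except KeyError:
        raise ValueError(f"Unknown dataset: {name}") from None
```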
Run a minimal end-to-end test:
python scripts/definement.py --data <YourDatasetName> --api gpt-4o-mini --number_of_shots 1
python scripts/tag_scores.py --data <YourDatasetName> --setting_name <DEFINITION_FILE.json> --device cpu

If this works, your dataset is correctly wired.
Even if your dataset is “compatible,” you may still need light adaptation, for example:
- renaming columns / fields to match what the loaders expect
- tweaking prompt formatting (so the LLM sees the same structure used in the paper runs)
- adjusting sampling (few-shot selection) logic for your dataset size or splits
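Of these, column renaming is usually the smallest change, e.g. mapping your headers onto whatever names the loader reads (the target names below are placeholders):

```python
import pandas as pd

df = pd.read_csv("datasets/YourDatasetName/test.csv")

# map your headers onto the names the loader expects (placeholder names here)
df = df.rename(columns={"article": "prompt", "summary": "response", "gold": "reference"})
df.to_csv("datasets/YourDatasetName/test.csv", index=False)
```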
If your dataset differs substantially (multi-turn dialogues, multiple references, etc.), you may need to:
- add a dataset-specific adapter in the loader
- modify prompt builders in `util/prompting_utils.py`
- update formatting/parsing in `util/formatting_tools.py`
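For example, a multi-turn dialogue dataset can often be flattened into the single prompt → response structure with a small adapter before it reaches the prompt builders; a sketch under that assumption (the turn format is hypothetical):

```python
def flatten_dialogue(turns: list[dict]) -> dict:
    """Collapse a multi-turn dialogue into one prompt/response pair.

    Assumes each turn looks like {"speaker": "user"|"assistant", "text": "..."}
    and that the final assistant turn is the output to be judged.
    """
    *history, last = turns
    prompt = "\n".join(f"{t['speaker']}: {t['text']}" for t in history)
    return {"prompt": prompt, "response": last["text"], "reference": None}
```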
LLMs are routed through the API abstraction (see models/api_loader.py).
Add a new backend/client that returns model outputs in the same shape used by the existing backends (text completions + optional structured parsing).
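The class below is only a sketch of the general shape such a client might take, assuming an OpenAI-compatible chat endpoint; the method name, constructor arguments, and URL are assumptions, so mirror whatever interface the existing backends in `models/api_loader.py` actually expose:

```python
import os
import requests

class MyProviderClient:
    """Sketch of a new backend that returns plain text completions."""

    def __init__(self, model: str, base_url: str = "https://api.my-provider.example/v1"):
        self.model = model
        self.base_url = base_url
        self.api_key = os.environ["MY_PROVIDER_API_KEY"]

    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        # assumes an OpenAI-compatible chat/completions endpoint
        resp = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": temperature,
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```

However the real interface looks, the key constraint from the pipeline's point of view is that the backend returns text the downstream rubric/score parsing can handle.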
Then register it so it can be used via:
--api <your_backend_name>

Add any new keys to `.env`, for example:

MY_PROVIDER_API_KEY=...

Run a minimal rubric generation call:

python scripts/definement.py --data SummEval --api <your_backend_name> --number_of_shots 1

If this works, the rest of the pipeline typically works too (instructions → scoring → tagging → tag_scores).
Different LLMs may require small adjustments to get stable parsing and consistent rubric formatting, e.g.:
- stricter prompt constraints / formatting instructions
- tweaks to output parsing rules (if your model is “creative” with structure)
- temperature / decoding changes
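A common adjustment is a more forgiving parser that pulls the rubric JSON out of whatever surrounds it (code fences, explanations); a small best-effort sketch:

```python
import json
import re

def extract_json(text: str):
    """Best-effort extraction of a JSON object from an LLM response."""
    # strip markdown code fences if the model wrapped its answer in them
    text = re.sub(r"`{3}(?:json)?", "", text)
    # try the whole string first, then the outermost {...} span
    for candidate in (text, text[text.find("{"): text.rfind("}") + 1]):
        try:
            return json.loads(candidate)
        except (json.JSONDecodeError, ValueError):
            continue
    return None
```

If even this fails for a given model, tightening the prompt's formatting instructions (the first bullet above) is usually the more reliable fix.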
Questions or ideas? You can reach me at alianpourya@gmail.com — I’m happy to help.
If you find an issue (bug, reproducibility problem, unclear docs, etc.), please open a GitHub Issue so we can track it and fix it.