
Adding terraform generation to openbench. #343

Open
kishorealliiita wants to merge 1 commit into groq:main from kishorealliiita:feature/adding-benchmarking-for-terraform-generation

Conversation


kishorealliiita commented Jan 31, 2026

Summary

Adds a Terraform Generation benchmark to openbench. The model receives natural-language prompts and must produce Terraform (HCL) for two tasks: a VPC with 3 subnets and 3 EC2 instances, and an S3 bucket with a bucket policy. The scorer extracts .tf code blocks from the model output, runs terraform fmt, terraform init -backend=false, and terraform validate, and awards 1.0 only if init and validate succeed (no plan/apply, no LocalStack). The PR includes the dataset, eval, scorer, config entry, registry import, unit tests for extraction/dataset/scorer, and an integration test for bench eval terraform_generation. A sketch of the extraction step follows.
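A minimal illustration of that extraction step in Python, assuming the helper is a regex over fenced code blocks; the name _TF_FENCE, the accepted fence tags, and the fallback to untagged fences are guesses here, not the PR's actual code:

````python
import re

# Hypothetical pattern: accepts ```terraform, ```tf, ```hcl, or untagged fences.
_TF_FENCE = re.compile(r"```(?:terraform|tf|hcl)?[ \t]*\n(.*?)```", re.DOTALL)

def _extract_tf_blocks(text: str) -> list[str]:
    """Return the contents of Terraform-looking fenced code blocks in text."""
    return [block.strip() for block in _TF_FENCE.findall(text)]
````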

What are you adding?

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New benchmark/evaluation
  • [ ] New model provider
  • [ ] CLI enhancement
  • [ ] Performance improvement
  • [ ] Documentation update
  • [ ] API/SDK feature
  • [ ] Integration (CI/CD, tools)
  • [ ] Export/import functionality
  • [ ] Code refactoring
  • [ ] Breaking change
  • [ ] Other

Changes Made

  • Dataset (src/openbench/datasets/terraform_generation.py): load_dataset() returns a MemoryDataset of 2 samples (VPC + 3 subnets + 3 EC2; S3 bucket + bucket policy), each with the prompt text, target="pass", and a task_id in metadata (sketched after this list).
  • Scorer (src/openbench/scorers/terraform_generation.py): terraform_generation_scorer() parses .tf code blocks from the last assistant message, writes them to a temp dir, and runs terraform fmt, terraform init -backend=false, and terraform validate; it returns Score(1.0) only if init and validate succeed, otherwise Score(0.0). No plan/apply, no LocalStack (sketched below).
  • Eval (src/openbench/evals/terraform_generation.py): @task terraform_generation() builds a Task from the dataset above, the solver [generate()], and the custom scorer (sketched below).
  • Config (src/openbench/config.py): adds terraform_generation to _BUILTIN_BENCHMARKS with BenchmarkMetadata (name, description, category, tags, module_path, function_name; sketched below).
  • Registry (src/openbench/_registry.py): imports and re-exports terraform_generation from openbench.evals.terraform_generation.
  • Unit tests (tests/test_terraform_generation.py): cover _extract_tf_blocks, dataset loading, and the scorer (no output, no blocks, invalid Terraform, valid minimal Terraform); one is sketched below.
  • Integration test (tests/integration/test_cli.py): test_basic_terraform_generation() runs bench eval terraform_generation --limit 1 with a Groq model.
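A minimal sketch of the dataset module, using inspect_ai's MemoryDataset and Sample. The prompt wording and the task_id values here are placeholders, not the PR's actual strings:

```python
from inspect_ai.dataset import MemoryDataset, Sample

def load_dataset() -> MemoryDataset:
    return MemoryDataset(
        samples=[
            Sample(
                input="Write Terraform (HCL) for a VPC with 3 subnets and 3 EC2 instances.",
                target="pass",
                metadata={"task_id": "vpc_subnets_ec2"},  # placeholder id
            ),
            Sample(
                input="Write Terraform (HCL) for an S3 bucket with a bucket policy.",
                target="pass",
                metadata={"task_id": "s3_bucket_policy"},  # placeholder id
            ),
        ]
    )
```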
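The scorer flow, sketched under the same caveat: the subprocess handling, the single-file main.tf layout, and the error messages are assumptions; only the three terraform commands and the init/validate gating come from the PR description. It reuses the _extract_tf_blocks helper sketched in the Summary:

```python
import subprocess
import tempfile
from pathlib import Path

from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def terraform_generation_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # _extract_tf_blocks as sketched earlier in this description
        blocks = _extract_tf_blocks(state.output.completion)
        if not blocks:
            return Score(value=0.0, explanation="no .tf code blocks in output")
        with tempfile.TemporaryDirectory() as tmp:
            # Assumed layout: all extracted blocks concatenated into one file.
            Path(tmp, "main.tf").write_text("\n\n".join(blocks))
            subprocess.run(["terraform", "fmt"], cwd=tmp, capture_output=True)
            for cmd in (["terraform", "init", "-backend=false"],
                        ["terraform", "validate"]):
                result = subprocess.run(cmd, cwd=tmp, capture_output=True, text=True)
                if result.returncode != 0:  # only init/validate gate the score
                    return Score(value=0.0, explanation=result.stderr)
        return Score(value=1.0)

    return score
```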
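The task wiring is the most standard piece and follows inspect_ai conventions directly; this sketch assumes the import paths named in the bullets above:

```python
from inspect_ai import Task, task
from inspect_ai.solver import generate

from openbench.datasets.terraform_generation import load_dataset
from openbench.scorers.terraform_generation import terraform_generation_scorer

@task
def terraform_generation() -> Task:
    return Task(
        dataset=load_dataset(),
        solver=[generate()],
        scorer=terraform_generation_scorer(),
    )
```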
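The config entry and registry wiring might look roughly like this. The BenchmarkMetadata field names come from the bullet above, but every value here (display name, description, category, tags) is a guess:

```python
# src/openbench/config.py -- hypothetical entry in _BUILTIN_BENCHMARKS
"terraform_generation": BenchmarkMetadata(
    name="Terraform Generation",          # guessed display name
    description="Generate Terraform (HCL) from natural-language prompts; "
                "scored by terraform init/validate.",
    category="coding",                    # guessed category
    tags=["terraform", "iac"],            # guessed tags
    module_path="openbench.evals.terraform_generation",
    function_name="terraform_generation",
),

# src/openbench/_registry.py -- re-export
from openbench.evals.terraform_generation import terraform_generation
```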
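And one of the unit tests, sketched for the extraction helper; the test name and the import path of _extract_tf_blocks are assumptions:

````python
from openbench.scorers.terraform_generation import _extract_tf_blocks

def test_extract_tf_blocks_finds_fenced_hcl():
    completion = (
        "Here is the config:\n"
        "```tf\n"
        'resource "aws_s3_bucket" "b" {}\n'
        "```\n"
    )
    assert _extract_tf_blocks(completion) == ['resource "aws_s3_bucket" "b" {}']
````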

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

