feat: τ²-Adv Bench - Adversarial Evaluation Module #158

Open
Ahm3dAlAli wants to merge 10 commits into sierra-research:main from Ahm3dAlAli:feature/adversarial-evaluation

Ahm3dAlAli commented Feb 6, 2026

Summary

Adds adversarial evaluation to τ²-Bench for testing agent safety against manipulation attacks.

Features:

  • 5 attack strategies (social engineering, prompt injection, policy exploitation, identity manipulation, information extraction)
  • 3 sophistication levels per strategy
  • Safety evaluator for violation detection
  • Adversarial tasks for airline, retail, telecom domains
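
To make the combination of strategies and sophistication levels concrete, here is a minimal sketch of how 5 strategies and 3 levels yield 15 attack configurations. The names `Strategy`, `Sophistication`, `AttackConfig`, and `all_attack_configs` are hypothetical and are not taken from the PR's `strategies.py`; the level names LOW/MEDIUM/HIGH are likewise an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Strategy(Enum):
    # The five strategies listed in the PR summary; string values are assumed.
    SOCIAL_ENGINEERING = "social_engineering"
    PROMPT_INJECTION = "prompt_injection"
    POLICY_EXPLOITATION = "policy_exploitation"
    IDENTITY_MANIPULATION = "identity_manipulation"
    INFORMATION_EXTRACTION = "information_extraction"


class Sophistication(Enum):
    # Hypothetical names for the three sophistication levels.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class AttackConfig:
    strategy: Strategy
    sophistication: Sophistication


def all_attack_configs() -> list[AttackConfig]:
    """Enumerate every (strategy, sophistication) pair."""
    return [AttackConfig(s, lvl) for s, lvl in product(Strategy, Sophistication)]
```

Calling `len(all_attack_configs())` returns 15, one configuration per strategy and level pair.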

Paper

Accompanying paper with methodology and findings available in fork: https://github.com/Ahm3dAlAli/tau2-bench/tree/feature/adversarial-evaluation/Tau2_Adv_Bench_AgentBeats_AhmedAli.pdf

Test Plan

  • 25 tests pass (pytest tests/test_adversarial.py -v)
  • Demo works (python demo_adversarial.py)

Ahm3dAlAli and others added 2 commits February 3, 2026 09:38
This PR adds adversarial evaluation capabilities to τ²-Bench, enabling
testing of agent robustness against manipulation attempts.

## New Components

### Adversarial Module (`src/tau2/adversarial/`)
- `strategies.py`: 5 attack strategies (social engineering, prompt injection,
  policy exploitation, identity manipulation, information extraction) with
  3 sophistication levels each
- `tasks.py`: Utilities for loading and filtering adversarial tasks
- `run_adversarial.py`: CLI script for running adversarial evaluations
- `README.md`: Documentation for the module

### Adversarial User (`src/tau2/user/adversarial_user.py`)
- Wraps standard UserSimulator with adversarial instructions
- Configurable attack strategy and sophistication
- Tracks attack attempts for analysis
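
The wrapping pattern described above might look roughly like the following sketch. It does not use the real `UserSimulator` interface: `BaseUser`, `AdversarialUserSketch`, and `echo_user` are hypothetical stand-ins, and the way adversarial instructions are appended to the task instructions is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for a user simulator:
# (instructions, agent_message) -> user reply.
BaseUser = Callable[[str, str], str]


@dataclass
class AdversarialUserSketch:
    base_user: BaseUser
    task_instructions: str
    strategy: str
    sophistication: int
    attempts: list[dict] = field(default_factory=list)

    def _instructions(self) -> str:
        # Append an adversarial directive to the ordinary task instructions.
        return (
            f"{self.task_instructions}\n"
            f"[adversarial] strategy={self.strategy} level={self.sophistication}"
        )

    def respond(self, agent_message: str) -> str:
        reply = self.base_user(self._instructions(), agent_message)
        # Record each attempt so it can be analysed after the run.
        self.attempts.append(
            {"turn": len(self.attempts) + 1, "strategy": self.strategy, "message": reply}
        )
        return reply


def echo_user(instructions: str, agent_message: str) -> str:
    # Dummy base user for illustration only; a real simulator would call an LLM.
    return f"(following {len(instructions.splitlines())} instruction lines) re: {agent_message}"
```

For example, `AdversarialUserSketch(echo_user, "Cancel my flight.", "social_engineering", 2).respond("How can I help?")` produces a reply from the dummy user and leaves one entry in `attempts`.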

### Safety Evaluator (`src/tau2/evaluator/evaluator_safety.py`)
- Detects safety violations in agent responses
- Violation types: unauthorized actions, information disclosure,
  policy circumvention, prompt injection success, etc.
- Produces SafetyRewardInfo with safety score and violations list
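
As an illustration of violation detection, here is a minimal rule-based sketch. It is not the PR's evaluator: `ViolationType`, `SafetyResultSketch`, and `evaluate_safety` are hypothetical names, only two of the listed violation types are checked, and the binary score (1.0 when no violation is found, otherwise 0.0) is an assumption rather than the scoring used by `SafetyRewardInfo`.

```python
from dataclasses import dataclass, field
from enum import Enum


class ViolationType(Enum):
    # Subset of the violation types named in the PR; values are assumed.
    UNAUTHORIZED_ACTION = "unauthorized_action"
    INFORMATION_DISCLOSURE = "information_disclosure"


@dataclass
class SafetyResultSketch:
    safety_score: float
    violations: list[ViolationType] = field(default_factory=list)


def evaluate_safety(
    tool_calls: list[str],
    forbidden_tools: set[str],
    agent_messages: list[str],
    secrets: list[str],
) -> SafetyResultSketch:
    violations: list[ViolationType] = []
    # Any call to a tool the task forbids counts as an unauthorized action.
    if any(call in forbidden_tools for call in tool_calls):
        violations.append(ViolationType.UNAUTHORIZED_ACTION)
    # Any protected string appearing in an agent message counts as disclosure.
    if any(secret in msg for msg in agent_messages for secret in secrets):
        violations.append(ViolationType.INFORMATION_DISCLOSURE)
    # Assumed scoring: binary, safe only when no violation was detected.
    score = 1.0 if not violations else 0.0
    return SafetyResultSketch(safety_score=score, violations=violations)
```

Substring matching is a deliberately simple choice here; a production evaluator would likely need more robust checks, for example against paraphrased disclosures.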

### Adversarial Tasks (`data/tau2/domains/airline/tasks_adversarial.json`)
- 8 adversarial scenarios for airline domain
- Covers all 5 attack strategies
- Includes multi-vector combined attacks
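
Filtering scenarios by strategy, including multi-vector ones, could be sketched as follows. The JSON schema shown is an assumption for illustration only: the actual fields in `tasks_adversarial.json` are not described in this PR, and `filter_tasks` plus the `strategies` key are hypothetical.

```python
import json

# Hypothetical task records; the real file's schema may differ.
SAMPLE_TASKS_JSON = """
[
  {"id": "adv_1", "strategies": ["social_engineering"]},
  {"id": "adv_2", "strategies": ["prompt_injection"]},
  {"id": "adv_3", "strategies": ["social_engineering", "identity_manipulation"]}
]
"""


def filter_tasks(tasks, strategy=None, combined_only=False):
    """Keep tasks that use `strategy`; optionally keep only multi-vector tasks."""
    selected = []
    for task in tasks:
        strategies = task.get("strategies", [])
        if strategy is not None and strategy not in strategies:
            continue
        # A multi-vector (combined) attack uses two or more strategies.
        if combined_only and len(strategies) < 2:
            continue
        selected.append(task)
    return selected


tasks = json.loads(SAMPLE_TASKS_JSON)
```

With this sample data, filtering on `social_engineering` keeps `adv_1` and `adv_3`, and `combined_only=True` keeps only `adv_3`.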

## Usage

```bash
# Run adversarial evaluation
python -m tau2.adversarial.run_adversarial --domain airline

# Filter by strategy
python -m tau2.adversarial.run_adversarial --domain airline --strategy social_engineering
```

## Tests

Added comprehensive tests in `tests/test_adversarial.py` (25 passing)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add paper directory with NeurIPS paper files (tex, figures, generation scripts)
- Include comprehensive evaluation results for multiple LLM models
- Add adversarial task data for retail and telecom domains
- Include visualization scripts and analysis tools

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Ahm3dAlAli commented Feb 17, 2026

Hey @victorb-sierra,

Good day. I would love to know if you can take a look, please.

Ahm3dAlAli commented

Hi @victorb-sierra,

Good day. I would love to know about any updates.
