|
1 | 1 | # Description |
2 | 2 |
|
3 | | -Data links: ? |
| 3 | +This resource server verifies terminal-based agent actions. It evaluates agent responses, which represent terminal command sequences, against expected answers. The server supports two schema formats (`terminus_1` and `terminus_2`) for terminal interaction tasks. |
| 4 | + |
| 5 | +For each verification request, the agent's JSON output is validated through multiple checks: |
| 6 | +1. **JSON Parsing**: The model output must be valid JSON |
| 7 | +2. **Schema Validation**: The response must conform to the specified harness schema (`terminus_1` or `terminus_2`) |
| 8 | +3. **Task Completion**: If the expected answer requires task completion, the agent must also indicate completion |
| 9 | +4. **Command Correctness**: The predicted keystrokes must match the expected keystrokes in order |
| 10 | + - This is evaluated via string similarity and an LLM-as-judge equivalence check |
| 11 | + |
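The four checks above can be sketched in plain Python. This is only an illustration under stated assumptions, not the server's implementation: the function name `verify`, the `REQUIRED_KEYS` table, the pass/fail reward values, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical, and the LLM-as-judge step is omitted entirely.

```python
import json
from difflib import SequenceMatcher

# Hypothetical required keys per harness, derived from the field lists in this README.
REQUIRED_KEYS = {
    "terminus_1": {"state_analysis", "explanation", "commands", "is_task_complete"},
    "terminus_2": {"analysis", "plan", "commands"},  # task_complete is optional
}
COMPLETE_KEY = {"terminus_1": "is_task_complete", "terminus_2": "task_complete"}


def joined_keystrokes(response):
    """Concatenate the keystrokes of all commands, preserving order."""
    return "".join(cmd["keystrokes"] for cmd in response["commands"])


def verify(model_output, expected_answer, harness, threshold):
    """Return 1.0 if every illustrative check passes, else 0.0 (assumed reward values)."""
    # 1. JSON parsing
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0

    # 2. Schema validation (simplified to key presence; a real schema check is stricter)
    if not isinstance(predicted, dict) or not REQUIRED_KEYS[harness].issubset(predicted):
        return 0.0
    commands = predicted["commands"]
    if not isinstance(commands, list):
        return 0.0
    if not all(isinstance(c, dict) and "keystrokes" in c for c in commands):
        return 0.0

    # 3. Task completion: only enforced when the expected answer marks completion
    expected = json.loads(expected_answer)
    key = COMPLETE_KEY[harness]
    if expected.get(key, False) and not predicted.get(key, False):
        return 0.0

    # 4. Command correctness via string similarity (stand-in for the real metric)
    ratio = SequenceMatcher(
        None, joined_keystrokes(predicted), joined_keystrokes(expected)
    ).ratio()
    return 1.0 if ratio >= threshold else 0.0
```

Ordering the checks from cheapest to most expensive means a malformed response is rejected before any similarity computation or judge call is needed.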
| 12 | + |
| 13 | +## Supported Schemas |
| 14 | + |
| 15 | +### terminus_1 |
| 16 | +- `state_analysis`: Description of the current terminal state |
| 17 | +- `explanation`: Brief explanation of what the commands will do |
| 18 | +- `commands`: List of command objects with `keystrokes`, `is_blocking`, and `timeout_sec` |
| 19 | +- `is_task_complete`: Boolean indicating if the task is complete |
| 20 | + |
| 21 | +### terminus_2 |
| 22 | +- `analysis`: Analysis of the current state based on terminal output |
| 23 | +- `plan`: Description of the plan for next steps |
| 24 | +- `commands`: List of command objects with `keystrokes` and optional `duration` |
| 25 | +- `task_complete`: Boolean indicating if the task is complete (optional) |
| 26 | + |
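To make the two schemas concrete, the sketch below builds one response in each format. The field names come from the lists above; every value (the commands, timeouts, and descriptions) is invented for illustration.

```python
import json

# Illustrative terminus_1 response; all values are made up.
terminus_1_response = {
    "state_analysis": "The shell is at the project root with no files listed yet.",
    "explanation": "List the directory contents to find the config file.",
    "commands": [
        {"keystrokes": "ls -la\n", "is_blocking": True, "timeout_sec": 5},
    ],
    "is_task_complete": False,
}

# Illustrative terminus_2 response; `duration` and `task_complete` are optional.
terminus_2_response = {
    "analysis": "The previous command printed the directory listing.",
    "plan": "Print the config file to inspect its settings.",
    "commands": [
        {"keystrokes": "cat config.yaml\n", "duration": 0.5},
    ],
    "task_complete": False,
}

# The agent is expected to emit these as JSON strings.
terminus_1_json = json.dumps(terminus_1_response)
terminus_2_json = json.dumps(terminus_2_response)
```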
| 27 | + |
| 28 | +# Example usage |
| 29 | + |
| 30 | +## Running servers |
| 31 | + |
| 32 | +The following command can be used to run this resource server, along with the simple agent and a policy model: |
| 33 | + |
| 34 | +```bash |
| 35 | +config_paths="resources_servers/terminus_judge/configs/terminus_judge.yaml,\ |
| 36 | +responses_api_models/openai_model/configs/openai_model.yaml" |
| 37 | + |
| 38 | +ng_run "+config_paths=[$config_paths]" \ |
| 39 | + +terminus_judge_resources_server.resources_servers.terminus_judge.judge_responses_create_params.max_output_tokens=512 |
| 40 | +``` |
| 41 | + |
| 42 | +Then, rollouts can be collected using a command such as the following: |
| 43 | + |
| 44 | +```bash |
| 45 | +ng_collect_rollouts +agent_name=terminus_judge_simple_agent \ |
| 46 | + +input_jsonl_fpath=resources_servers/terminus_judge/data/example.jsonl \ |
| 47 | + +output_jsonl_fpath=resources_servers/terminus_judge/example_rollouts.jsonl |
| 48 | +``` |
| 49 | + |
| 50 | +## Expected Data Format |
| 51 | + |
| 52 | +Each data sample should include: |
| 53 | +- `expected_answer`: A JSON string containing the expected terminal commands |
| 54 | +- `metadata.harness`: Either `"terminus_1"` or `"terminus_2"` to specify the schema format |
| 55 | +- `threshold`: String-similarity threshold used when calculating the reward |
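A single sample with these three fields might be assembled as follows. This is a minimal sketch: the threshold value and the expected commands are invented, and real entries in `example.jsonl` may carry additional fields (such as the model request parameters) that are not shown here.

```python
import json

# Expected answer in the terminus_2 format; the contents are illustrative.
expected = {
    "analysis": "The working directory is unknown.",
    "plan": "Print the working directory.",
    "commands": [{"keystrokes": "pwd\n"}],
    "task_complete": True,
}

sample = {
    # `expected_answer` is a JSON string, not a nested object.
    "expected_answer": json.dumps(expected),
    "metadata": {"harness": "terminus_2"},
    "threshold": 0.9,  # assumed value for illustration
}

# One JSONL line per sample.
line = json.dumps(sample)
print(line)
```

Because `expected_answer` is itself a JSON string, a reader of the file has to decode it a second time after parsing the line.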
4 | 56 |
|
5 | 57 | # Licensing information |
6 | | -Code: ? |
7 | | -Data: ? |
8 | 58 |
|
9 | | -Dependencies |
| 59 | +Code: Apache 2.0<br> |
| 60 | +Data: TBD |
| 61 | + |
| 62 | +## Dependencies |
| 63 | + |
10 | 64 | - nemo_gym: Apache 2.0 |
11 | | -? |
| 65 | +- openapi-schema-validator: BSD-3-Clause |