Commit 987f526

Expanding Terminus Slicing PR (#597)
Expanding PR to include reward logic for string similarity and schema validation

Signed-off-by: Khushi Bhardwaj <kbhardwaj@nvidia.com>
1 parent 2e96cb6 commit 987f526

9 files changed: +996 -317 lines changed
Lines changed: 59 additions & 5 deletions
@@ -1,11 +1,65 @@
 # Description

-Data links: ?
+This is a resource server for verifying terminal-based agent actions. It evaluates agent responses that represent terminal command sequences against expected answers. The server supports two different schema formats (`terminus_1` and `terminus_2`) for terminal interaction tasks.
+
+For each verification request, the agent's JSON output is validated through multiple checks:
+1. **JSON Parsing**: The model output must be valid JSON
+2. **Schema Validation**: The response must conform to the specified harness schema (`terminus_1` or `terminus_2`)
+3. **Task Completion**: If the expected answer requires task completion, the agent must also indicate completion
+4. **Command Correctness**: The predicted keystrokes must exactly match the expected keystrokes in order
+   - This is evaluated via string similarity and an LLM-as-judge equivalence check
+
+
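The checks listed above can be sketched in Python roughly as follows. This is a minimal illustration, not the server's actual implementation: the helper names `check_response` and `extract_keystrokes` are hypothetical, schema validation (check 2) is left out, and check 4 is shown in its strict form (exact, ordered equality) rather than the similarity or LLM-as-judge variants.

```python
import json

# Completion flag per harness, taken from the schema field lists in this README.
COMPLETION_KEY = {"terminus_1": "is_task_complete", "terminus_2": "task_complete"}


def extract_keystrokes(response: dict) -> list[str]:
    """Return the keystrokes of each command, preserving order."""
    return [cmd["keystrokes"] for cmd in response["commands"]]


def check_response(model_output: str, expected_answer: str, harness: str) -> bool:
    """Hypothetical sketch of checks 1, 3 and 4 (check 2 is omitted)."""
    try:
        predicted = json.loads(model_output)  # check 1: output must be valid JSON
        expected = json.loads(expected_answer)
    except json.JSONDecodeError:
        return False
    if not isinstance(predicted, dict) or not isinstance(predicted.get("commands"), list):
        return False

    key = COMPLETION_KEY[harness]
    # check 3: completion is only required when the expected answer signals it
    if expected.get(key, False) and not predicted.get(key, False):
        return False

    # check 4 (strict variant): same keystrokes in the same order
    try:
        return extract_keystrokes(predicted) == extract_keystrokes(expected)
    except (KeyError, TypeError):
        return False
```

Each failed check short-circuits to `False`, so a malformed response never reaches the keystroke comparison.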
+## Supported Schemas
+
+### terminus_1
+- `state_analysis`: Description of the current terminal state
+- `explanation`: Brief explanation of what the commands will do
+- `commands`: List of command objects with `keystrokes`, `is_blocking`, and `timeout_sec`
+- `is_task_complete`: Boolean indicating if the task is complete
+
+### terminus_2
+- `analysis`: Analysis of the current state based on terminal output
+- `plan`: Description of the plan for next steps
+- `commands`: List of command objects with `keystrokes` and optional `duration`
+- `task_complete`: Boolean indicating if the task is complete (optional)
+
+
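To make the two field lists concrete, here is a hand-rolled conformance check using only the Python standard library. It is a sketch under stated assumptions: the field types, and the reading that every `terminus_1` field is required, are inferred from the lists above and are not confirmed by the source. The names `SCHEMAS` and `conforms` are hypothetical, and the server itself may validate differently (the commit lists `openapi-schema-validator` as a dependency).

```python
# Field names come from the README lists; the types are assumptions.
NUMBER = (int, float)  # note: bool is a subclass of int, so True also passes this check

SCHEMAS = {
    "terminus_1": {
        "required": {"state_analysis": str, "explanation": str,
                     "commands": list, "is_task_complete": bool},
        "optional": {},
        "command_required": {"keystrokes": str, "is_blocking": bool, "timeout_sec": NUMBER},
        "command_optional": {},
    },
    "terminus_2": {
        "required": {"analysis": str, "plan": str, "commands": list},
        "optional": {"task_complete": bool},
        "command_required": {"keystrokes": str},
        "command_optional": {"duration": NUMBER},
    },
}


def _fields_ok(obj, required, optional) -> bool:
    """Check that obj is a dict with all required fields and well-typed optional ones."""
    if not isinstance(obj, dict):
        return False
    for name, typ in required.items():
        if name not in obj or not isinstance(obj[name], typ):
            return False
    for name, typ in optional.items():
        if name in obj and not isinstance(obj[name], typ):
            return False
    return True


def conforms(response, harness: str) -> bool:
    """Return True if response matches the (assumed) shape for the given harness."""
    spec = SCHEMAS.get(harness)
    if spec is None:
        return False
    if not _fields_ok(response, spec["required"], spec["optional"]):
        return False
    return all(
        _fields_ok(cmd, spec["command_required"], spec["command_optional"])
        for cmd in response["commands"]
    )
```

A `terminus_1` response fails the `terminus_2` check (and vice versa) because the top-level field names differ, which is why the harness must be specified per sample.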
+# Example usage
+
+## Running servers
+
+The following command can be used to run this resource server, along with the simple agent and a policy model:
+
+```bash
+config_paths="resources_servers/terminus_judge/configs/terminus_judge.yaml,\
+responses_api_models/openai_model/configs/openai_model.yaml"
+
+ng_run "+config_paths=[$config_paths]" \
+    +terminus_judge_resources_server.resources_servers.terminus_judge.judge_responses_create_params.max_output_tokens=512
+```
+
+Then, rollouts can be collected using a command such as the following:
+
+```bash
+ng_collect_rollouts +agent_name=terminus_judge_simple_agent \
+    +input_jsonl_fpath=resources_servers/terminus_judge/data/example.jsonl \
+    +output_jsonl_fpath=resources_servers/terminus_judge/example_rollouts.jsonl
+```
+
+## Expected Data Format
+
+Each data sample should include:
+- `expected_answer`: A JSON string containing the expected terminal commands
+- `metadata.harness`: Either `"terminus_1"` or `"terminus_2"` to specify the schema format
+- `threshold`: The string-similarity threshold used to calculate the reward
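As an illustration of how these three fields might fit together, the sketch below builds one hypothetical sample and applies a thresholded similarity score. Several details are assumptions, not facts from the source: `threshold` is placed at the top level of the sample, the reward is treated as binary (1.0 when the similarity reaches the threshold), and `difflib.SequenceMatcher` stands in for whatever similarity metric the server actually uses.

```python
import json
from difflib import SequenceMatcher

# Hypothetical sample; the values and the placement of `threshold` are assumptions.
sample = {
    "expected_answer": json.dumps({
        "analysis": "The repository has not been listed yet.",
        "plan": "List the files in the working directory.",
        "commands": [{"keystrokes": "ls -la\n"}],
        "task_complete": False,
    }),
    "metadata": {"harness": "terminus_2"},
    "threshold": 0.9,
}


def keystroke_text(answer_json: str) -> str:
    """Concatenate the keystrokes of all commands, in order."""
    commands = json.loads(answer_json)["commands"]
    return "".join(cmd["keystrokes"] for cmd in commands)


def similarity_reward(predicted_json: str, sample: dict) -> float:
    """Return 1.0 if the keystroke similarity reaches the sample's threshold, else 0.0."""
    score = SequenceMatcher(
        None, keystroke_text(predicted_json), keystroke_text(sample["expected_answer"])
    ).ratio()
    return 1.0 if score >= sample["threshold"] else 0.0
```

Serializing `sample` with `json.dumps` yields one line of the kind that would go in an input JSONL file. A lower `threshold` tolerates small differences such as flag order or whitespace, at the cost of accepting commands that may not be equivalent.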

 # Licensing information
-Code: ?
-Data: ?

-Dependencies
+Code: Apache 2.0<br>
+Data: TBD
+
+## Dependencies
+
 - nemo_gym: Apache 2.0
-?
+- openapi-schema-validator: BSD-3-Clause
