Commit 970b437

docs: issue 626 (#638)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
1 parent 2007202 commit 970b437

File tree

4 files changed: +258, -391 lines

docs/data/download-huggingface.md

Lines changed: 88 additions & 18 deletions
@@ -92,10 +92,71 @@ ng_download_dataset_from_hf \

::::

::::{tab-item} Python Script
Downloads using the `datasets` library directly, with streaming support.

**Use when**: You need custom preprocessing, streaming for large datasets, or specific split handling.

```python
import json
from datasets import load_dataset

output_file = "train.jsonl"
dataset_name = "nvidia/OpenMathInstruct-2"
split_name = "train_1M"  # Check dataset page for available splits

# Stream rows one at a time and write each as a JSON line
with open(output_file, "w", encoding="utf-8") as f:
    for row in load_dataset(dataset_name, split=split_name, streaming=True):
        f.write(json.dumps(row) + "\n")
```

Run the script:

```bash
uv run download.py
```

Verify the download:

```bash
wc -l train.jsonl
# Expected: 1000000 train.jsonl
```

**Streaming benefits**:
- Memory-efficient for large datasets (millions of rows)
- Progress visible during download (see the sketch below)
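
If you want an explicit progress readout, you can wrap the streaming iterator. A minimal sketch, assuming `tqdm` is available in your environment (it is not required by the snippet above):

```python
import json
from datasets import load_dataset
from tqdm import tqdm

# Streaming datasets have no known length up front,
# so tqdm shows a running row count and rate instead of a bar.
with open("train.jsonl", "w", encoding="utf-8") as f:
    rows = load_dataset("nvidia/OpenMathInstruct-2", split="train_1M", streaming=True)
    for row in tqdm(rows, desc="Downloading rows"):
        f.write(json.dumps(row) + "\n")
```
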
:::{note}
For gated or private datasets, authenticate first:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
```

Or use `huggingface-cli login` before running the script.
:::
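
You can also authenticate from Python before the download starts. A minimal sketch using `huggingface_hub.login`, assuming `HF_TOKEN` is already exported as above:

```python
import os
from huggingface_hub import login

# Reads the token from the environment rather than hard-coding it in the script.
login(token=os.environ["HF_TOKEN"])
```
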
::::

:::::

---

## NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset | Repository | Domain |
|---------|-----------|--------|
| OpenMathReasoning | `nvidia/Nemotron-RL-math-OpenMathReasoning` | Math |
| Competitive Coding | `nvidia/nemotron-RL-coding-competitive_coding` | Code |
| Workplace Assistant | `nvidia/Nemotron-RL-agent-workplace_assistant` | Agent |
| Structured Outputs | `nvidia/Nemotron-RL-instruction_following-structured_outputs` | Instruction |
| MCQA | `nvidia/Nemotron-RL-knowledge-mcqa` | Knowledge |
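
These repositories work with the same download patterns shown above. As a quick smoke test, a sketch assuming the repository is accessible from your account and exposes a `train` split (check each dataset page for the actual split names):

```python
from itertools import islice
from datasets import load_dataset

# Peek at the first few rows without downloading the full dataset.
ds = load_dataset("nvidia/Nemotron-RL-knowledge-mcqa", split="train", streaming=True)
for row in islice(ds, 3):
    print(list(row.keys()))
```
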
---

## Troubleshooting

::::{dropdown} Authentication Failed (401)
@@ -147,7 +208,7 @@ Avoid passing tokens on the command line—they appear in shell history.
**Recommended** — Use environment variable:

```bash
-export hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
ng_download_dataset_from_hf \
+repo_id=my-org/private-dataset \
+output_dirpath=./data/
@@ -168,20 +229,6 @@ ng_download_dataset_from_hf \
```
:::

----
-
-## NVIDIA Datasets
-
-| Dataset | Repository |
-|---------|-----------|
-| OpenMathReasoning | `nvidia/Nemotron-RL-math-OpenMathReasoning` |
-| Competitive Coding | `nvidia/nemotron-RL-coding-competitive_coding` |
-| Workplace Assistant | `nvidia/Nemotron-RL-agent-workplace_assistant` |
-| Structured Outputs | `nvidia/Nemotron-RL-instruction_following-structured_outputs` |
-| MCQA | `nvidia/Nemotron-RL-knowledge-mcqa` |
-
----
-

:::{dropdown} Automatic Downloads During Data Preparation
:icon: download

@@ -238,7 +285,30 @@ rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
| Auto-download | `nemo_gym/train_data_utils.py:476-494` |
:::

-## Related
## Next Steps

::::{grid} 1 2 2 2
:gutter: 3

:::{grid-item-card} {octicon}`checklist;1.5em;sd-mr-1` Prepare and Validate
:link: prepare-validate
:link-type: doc

Preprocess raw data, run `ng_prepare_data`, and add `agent_ref` routing.
:::

:::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Collect Rollouts
:link: /get-started/rollout-collection
:link-type: doc

-- {doc}`prepare-validate` — Validate downloaded datasets
-- {doc}`/reference/cli-commands` — Full CLI reference
Generate training examples by running your agent on prepared data.
:::

:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Train with NeMo RL
:link: /tutorials/nemo-rl-grpo/index
:link-type: doc

Use validated data with NeMo RL for GRPO training.
:::

::::

docs/data/index.md

Lines changed: 33 additions & 1 deletion
@@ -1,7 +1,7 @@
(data-index)=
# Data

-NeMo Gym datasets use JSONL format for reinforcement learning (RL) training. Each dataset connects to an agent server—the component that orchestrates agent-environment interactions during training.
NeMo Gym datasets use JSONL format for reinforcement learning (RL) training. Each dataset connects to an **agent server** (orchestrates agent-environment interactions), which routes requests to a **resources server** (provides tools and computes rewards).

## Prerequisites

@@ -28,6 +28,38 @@ Additional fields like `expected_answer` vary by resources server—the componen

**Source**: `nemo_gym/base_resources_server.py:35-36`

### Required Fields

| Field | Added By | Description |
|-------|----------|-------------|
| `responses_create_params` | User | Input to the model during training. Contains `input` (messages) and optional `tools`, `temperature`, etc. |
| `agent_ref` | `ng_prepare_data` | Routes each row to its resource server. Auto-generated during data preparation. |

### Optional Fields

| Field | Description |
|-------|-------------|
| `expected_answer` | Ground truth for verification (task-specific). |
| `question` | Original question text (for reference). |
| `id` | Tracking identifier. |

:::{tip}
Check `resources_servers/<name>/README.md` for fields required by each resource server's `verify()` method.
:::

### The `agent_ref` Field

The `agent_ref` field maps each row to a specific resource server. A training dataset can blend multiple resource servers in a single file—`agent_ref` tells NeMo Gym which server handles each row.

```json
{
  "responses_create_params": {"input": [{"role": "user", "content": "..."}]},
  "agent_ref": {"type": "responses_api_agents", "name": "math_with_judge_simple_agent"}
}
```

**You don't create `agent_ref` manually.** The `ng_prepare_data` tool adds it automatically based on your config file. The tool matches the agent type (`responses_api_agents`) with the agent name from the config.
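
Because one JSONL file can blend rows for several resource servers, it helps to see how rows are routed. A minimal sketch that tallies `agent_ref` names, assuming a prepared file named `train.jsonl` (the filename is illustrative):

```python
import json
from collections import Counter

# Tally how many rows route to each agent; every prepared row carries agent_ref.
counts = Counter()
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        counts[row["agent_ref"]["name"]] += 1

print(counts)  # e.g. Counter({'math_with_judge_simple_agent': 1000, ...})
```
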
### Example Data

```json

docs/data/prepare-validate.md

Lines changed: 137 additions & 3 deletions
@@ -32,7 +32,9 @@ Success output:
####################################################################################################
```

-This generates `data/test/example_metrics.json` with dataset statistics.
This generates two types of output:
- **Per-dataset metrics**: `resources_servers/example_multi_step/data/example_metrics.json` (alongside source JSONL)
- **Aggregated metrics**: `data/test/example_metrics.json` (in output directory)
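
Either file can be inspected with plain `json`. A sketch assuming the aggregated path above; the exact set of metric keys depends on your datasets:

```python
import json

# Print the top-level statistics from the aggregated metrics file.
with open("data/test/example_metrics.json", encoding="utf-8") as f:
    metrics = json.load(f)

for key, value in metrics.items():
    print(f"{key}: {value}")
```
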

---

@@ -85,6 +87,130 @@ Check `resources_servers/<name>/README.md` for required fields specific to each

---

## Preprocess Raw Datasets

If your dataset doesn't have `responses_create_params`, you need to preprocess it before using `ng_prepare_data`.

**When to preprocess**:
- Downloaded datasets without NeMo Gym format
- Custom data that needs system prompts
- Data that must be split into train/validation sets

### Add `responses_create_params`

The `responses_create_params` field wraps your input in the Responses API format. This typically includes a system prompt and the user content.

::::{dropdown} Preprocessing script (preprocess.py)
:icon: code
:open:

Save this script as `preprocess.py`. It reads a raw JSONL file, adds `responses_create_params`, and splits into train/validation:

```python
import json
import os

# Configuration — customize these for your dataset
INPUT_FIELD = "problem"  # Field containing the input text (e.g., "problem", "question", "prompt")
FILENAME = "raw_data.jsonl"
SYSTEM_PROMPT = "Your task is to solve a math problem. Put the answer inside \\boxed{}."
TRAIN_RATIO = 0.999  # 99.9% train, 0.1% validation

dirpath = os.path.dirname(FILENAME) or "."
with open(FILENAME, "r", encoding="utf-8") as fin, \
     open(os.path.join(dirpath, "train.jsonl"), "w", encoding="utf-8") as ftrain, \
     open(os.path.join(dirpath, "validation.jsonl"), "w", encoding="utf-8") as fval:

    lines = list(fin)
    split_idx = int(len(lines) * TRAIN_RATIO)

    for i, line in enumerate(lines):
        if not line.strip():
            continue
        row = json.loads(line)

        # Remove fields not needed for training (optional)
        row.pop("generated_solution", None)
        row.pop("problem_source", None)

        # Add responses_create_params
        row["responses_create_params"] = {
            "input": [
                {"role": "developer", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row.get(INPUT_FIELD, "")},
            ]
        }

        out = json.dumps(row) + "\n"
        (ftrain if i < split_idx else fval).write(out)
```

:::{important}
You must customize these variables for your dataset:
- `INPUT_FIELD`: The field name containing your input text. Common values: `"problem"` (math), `"question"` (QA), `"prompt"` (general), `"instruction"` (instruction-following)
- `SYSTEM_PROMPT`: Task-specific instructions for the model
- `TRAIN_RATIO`: Train/validation split ratio
:::

::::

Run and verify:

```bash
uv run preprocess.py
wc -l train.jsonl validation.jsonl
```
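
Beyond line counts, it is worth spot-checking that the wrapper landed where expected. A minimal sketch over the first row of `train.jsonl`:

```python
import json

# Confirm the first preprocessed row has the Responses API wrapper
# with the developer (system prompt) and user messages in order.
with open("train.jsonl", encoding="utf-8") as f:
    row = json.loads(f.readline())

messages = row["responses_create_params"]["input"]
assert [m["role"] for m in messages] == ["developer", "user"]
print(messages[1]["content"][:80])  # preview the user input
```
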
### Create Config for Custom Data

After preprocessing, create a config file to point `ng_prepare_data` at your local files.

::::{dropdown} Example config: custom_data.yaml
:icon: file-code

```yaml
custom_resources_server:
  resources_servers:
    custom_server:
      entrypoint: app.py
      domain: math  # math | coding | agent | knowledge | other
      description: Custom math dataset
      verified: false

custom_simple_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: custom_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: train
          type: train
          jsonl_fpath: train.jsonl
          license: Creative Commons Attribution 4.0 International
        - name: validation
          type: validation
          jsonl_fpath: validation.jsonl
          license: Creative Commons Attribution 4.0 International
```

::::

Run data preparation:

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model_for_training.yaml,custom_data.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +mode=train_preparation +output_dirpath=data
```

This validates your data and adds the `agent_ref` field to each row, routing samples to your resource server.
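
As a quick post-run check, a sketch assuming the prepared training file lands at `data/train.jsonl` (adjust to your `output_dirpath`):

```python
import json

# Every prepared row should now carry both required fields.
with open("data/train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        row = json.loads(line)
        assert "responses_create_params" in row, f"row {i} missing input params"
        assert "agent_ref" in row, f"row {i} was not routed"

print("all rows routed")
```
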
---

## Validation Modes

| Mode | Purpose | Validates |
@@ -130,7 +256,9 @@ ng_prepare_data "+config_paths=[resources_servers/workplace_assistant/configs/wo
| Invalid role | Sample skipped | Use `user`, `assistant`, `system`, or `developer` |
| Missing dataset file | `AssertionError` | Create file or set `+should_download=true` |

-**Key behavior**: Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data.
:::{warning}
Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format.
:::

::::{dropdown} Find invalid samples
:icon: code
@@ -174,9 +302,15 @@ with open("your_data.jsonl") as f:
4. **Compute metrics** — Aggregate statistics
5. **Collate** — Combine samples with agent references

### Output Locations

Metrics files are written to two locations:
- **Per-dataset**: `{dataset_jsonl_path}_metrics.json` — alongside each source JSONL file
- **Aggregated**: `{output_dirpath}/{type}_metrics.json` — combined metrics per dataset type

### Re-Running

-- **Output files** (`train.jsonl`, `validation.jsonl`) are overwritten
- **Output files** (`train.jsonl`, `validation.jsonl`) are overwritten in `output_dirpath`
- **Metrics files** (`*_metrics.json`) are compared — delete them if your data changed
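
If your source data changed, a sketch for clearing stale metrics before re-running; the glob follows the two locations listed above, so run it from the repository root and review the printed paths before trusting it:

```python
from pathlib import Path

# Remove cached metrics so ng_prepare_data recomputes them on the next run.
for path in Path(".").rglob("*_metrics.json"):
    print(f"removing {path}")
    path.unlink()
```
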

### Generated Metrics
