Commit 970b437

docs: issue 626 (#638)
Signed-off-by: Lawrence Lane <llane@nvidia.com>
1 parent 2007202 commit 970b437

File tree

4 files changed: +258, -391 lines

docs/data/download-huggingface.md

Lines changed: 88 additions & 18 deletions
@@ -92,10 +92,71 @@ ng_download_dataset_from_hf \

::::

::::{tab-item} Python Script
Downloads using the `datasets` library directly, with streaming support.

**Use when**: You need custom preprocessing, streaming for large datasets, or specific split handling.

```python
import json
from datasets import load_dataset

output_file = "train.jsonl"
dataset_name = "nvidia/OpenMathInstruct-2"
split_name = "train_1M"  # Check dataset page for available splits

# Stream rows one at a time and write each as a JSON line
with open(output_file, "w", encoding="utf-8") as f:
    for row in load_dataset(dataset_name, split=split_name, streaming=True):
        f.write(json.dumps(row) + "\n")
```

Run the script:

```bash
uv run download.py
```

Verify the download:

```bash
wc -l train.jsonl
# Expected: 1000000 train.jsonl
```

**Streaming benefits**:
- Memory-efficient for large datasets (millions of rows)
- Progress visible during download (see the sketch below)
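
If you want an explicit progress readout, you can wrap the streaming iterator. A minimal sketch, assuming `tqdm` is available in your environment (it is not required by the snippet above):

```python
import json
from datasets import load_dataset
from tqdm import tqdm

# Streaming datasets have no known length up front,
# so tqdm shows a running row count and rate instead of a bar.
with open("train.jsonl", "w", encoding="utf-8") as f:
    rows = load_dataset("nvidia/OpenMathInstruct-2", split="train_1M", streaming=True)
    for row in tqdm(rows, desc="Downloading rows"):
        f.write(json.dumps(row) + "\n")
```
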
:::{note}
For gated or private datasets, authenticate first:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
```

Or use `huggingface-cli login` before running the script.
:::
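
You can also authenticate from Python before the download starts. A minimal sketch using `huggingface_hub.login`, assuming `HF_TOKEN` is already exported as above:

```python
import os
from huggingface_hub import login

# Reads the token from the environment rather than hard-coding it in the script.
login(token=os.environ["HF_TOKEN"])
```
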
::::

:::::

---

## NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset | Repository | Domain |
|---------|-----------|--------|
| OpenMathReasoning | `nvidia/Nemotron-RL-math-OpenMathReasoning` | Math |
| Competitive Coding | `nvidia/nemotron-RL-coding-competitive_coding` | Code |
| Workplace Assistant | `nvidia/Nemotron-RL-agent-workplace_assistant` | Agent |
| Structured Outputs | `nvidia/Nemotron-RL-instruction_following-structured_outputs` | Instruction |
| MCQA | `nvidia/Nemotron-RL-knowledge-mcqa` | Knowledge |
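
These repositories work with the same download patterns shown above. As a quick smoke test, a sketch assuming the repository is accessible from your account and exposes a `train` split (check each dataset page for the actual split names):

```python
from itertools import islice
from datasets import load_dataset

# Peek at the first few rows without downloading the full dataset.
ds = load_dataset("nvidia/Nemotron-RL-knowledge-mcqa", split="train", streaming=True)
for row in islice(ds, 3):
    print(list(row.keys()))
```
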
---

## Troubleshooting

::::{dropdown} Authentication Failed (401)
@@ -147,7 +208,7 @@ Avoid passing tokens on the command line—they appear in shell history.
**Recommended** — Use environment variable:

```bash
-export hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
ng_download_dataset_from_hf \
+repo_id=my-org/private-dataset \
+output_dirpath=./data/
@@ -168,20 +229,6 @@ ng_download_dataset_from_hf \
```
:::

----
-
-## NVIDIA Datasets
-
-| Dataset | Repository |
-|---------|-----------|
-| OpenMathReasoning | `nvidia/Nemotron-RL-math-OpenMathReasoning` |
-| Competitive Coding | `nvidia/nemotron-RL-coding-competitive_coding` |
-| Workplace Assistant | `nvidia/Nemotron-RL-agent-workplace_assistant` |
-| Structured Outputs | `nvidia/Nemotron-RL-instruction_following-structured_outputs` |
-| MCQA | `nvidia/Nemotron-RL-knowledge-mcqa` |
-
----
-

:::{dropdown} Automatic Downloads During Data Preparation
:icon: download

@@ -238,7 +285,30 @@ rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
| Auto-download | `nemo_gym/train_data_utils.py:476-494` |
:::

-## Related
## Next Steps

::::{grid} 1 2 2 2
:gutter: 3

:::{grid-item-card} {octicon}`checklist;1.5em;sd-mr-1` Prepare and Validate
:link: prepare-validate
:link-type: doc

Preprocess raw data, run `ng_prepare_data`, and add `agent_ref` routing.
:::

:::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Collect Rollouts
:link: /get-started/rollout-collection
:link-type: doc

-- {doc}`prepare-validate` — Validate downloaded datasets
-- {doc}`/reference/cli-commands` — Full CLI reference
Generate training examples by running your agent on prepared data.
:::

:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Train with NeMo RL
:link: /tutorials/nemo-rl-grpo/index
:link-type: doc

Use validated data with NeMo RL for GRPO training.
:::

::::

docs/data/index.md

Lines changed: 33 additions & 1 deletion
@@ -1,7 +1,7 @@
(data-index)=
# Data

-NeMo Gym datasets use JSONL format for reinforcement learning (RL) training. Each dataset connects to an agent server—the component that orchestrates agent-environment interactions during training.
NeMo Gym datasets use JSONL format for reinforcement learning (RL) training. Each dataset connects to an **agent server** (orchestrates agent-environment interactions), which routes requests to a **resources server** (provides tools and computes rewards).

## Prerequisites

@@ -28,6 +28,38 @@ Additional fields like `expected_answer` vary by resources server—the componen

**Source**: `nemo_gym/base_resources_server.py:35-36`

### Required Fields

| Field | Added By | Description |
|-------|----------|-------------|
| `responses_create_params` | User | Input to the model during training. Contains `input` (messages) and optional `tools`, `temperature`, etc. |
| `agent_ref` | `ng_prepare_data` | Routes each row to its resource server. Auto-generated during data preparation. |

### Optional Fields

| Field | Description |
|-------|-------------|
| `expected_answer` | Ground truth for verification (task-specific). |
| `question` | Original question text (for reference). |
| `id` | Tracking identifier. |

:::{tip}
Check `resources_servers/<name>/README.md` for fields required by each resource server's `verify()` method.
:::

### The `agent_ref` Field

The `agent_ref` field maps each row to a specific resource server. A training dataset can blend multiple resource servers in a single file—`agent_ref` tells NeMo Gym which server handles each row.

```json
{
  "responses_create_params": {"input": [{"role": "user", "content": "..."}]},
  "agent_ref": {"type": "responses_api_agents", "name": "math_with_judge_simple_agent"}
}
```

**You don't create `agent_ref` manually.** The `ng_prepare_data` tool adds it automatically based on your config file. The tool matches the agent type (`responses_api_agents`) with the agent name from the config.
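
Because one JSONL file can blend rows for several resource servers, it helps to see how rows are routed. A minimal sketch that tallies `agent_ref` names, assuming a prepared file named `train.jsonl` (the filename is illustrative):

```python
import json
from collections import Counter

# Tally how many rows route to each agent; every prepared row carries agent_ref.
counts = Counter()
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        counts[row["agent_ref"]["name"]] += 1

print(counts)  # e.g. Counter({'math_with_judge_simple_agent': 1000, ...})
```
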
### Example Data

```json

docs/data/prepare-validate.md

Lines changed: 137 additions & 3 deletions
@@ -32,7 +32,9 @@ Success output:
####################################################################################################
```

-This generates `data/test/example_metrics.json` with dataset statistics.
This generates two types of output:
- **Per-dataset metrics**: `resources_servers/example_multi_step/data/example_metrics.json` (alongside source JSONL)
- **Aggregated metrics**: `data/test/example_metrics.json` (in output directory)
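
Either file can be inspected with plain `json`. A sketch assuming the aggregated path above; the exact set of metric keys depends on your datasets:

```python
import json

# Print the top-level statistics from the aggregated metrics file.
with open("data/test/example_metrics.json", encoding="utf-8") as f:
    metrics = json.load(f)

for key, value in metrics.items():
    print(f"{key}: {value}")
```
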

---

@@ -85,6 +87,130 @@ Check `resources_servers/<name>/README.md` for required fields specific to each

---

## Preprocess Raw Datasets

If your dataset doesn't have `responses_create_params`, you need to preprocess it before using `ng_prepare_data`.

**When to preprocess**:
- Downloaded datasets without NeMo Gym format
- Custom data that needs system prompts
- Data that must be split into train/validation sets

### Add `responses_create_params`

The `responses_create_params` field wraps your input in the Responses API format. This typically includes a system prompt and the user content.

::::{dropdown} Preprocessing script (preprocess.py)
:icon: code
:open:

Save this script as `preprocess.py`. It reads a raw JSONL file, adds `responses_create_params`, and splits into train/validation:

```python
import json
import os

# Configuration — customize these for your dataset
INPUT_FIELD = "problem"  # Field containing the input text (e.g., "problem", "question", "prompt")
FILENAME = "raw_data.jsonl"
SYSTEM_PROMPT = "Your task is to solve a math problem. Put the answer inside \\boxed{}."
TRAIN_RATIO = 0.999  # 99.9% train, 0.1% validation

dirpath = os.path.dirname(FILENAME) or "."
with open(FILENAME, "r", encoding="utf-8") as fin, \
     open(os.path.join(dirpath, "train.jsonl"), "w", encoding="utf-8") as ftrain, \
     open(os.path.join(dirpath, "validation.jsonl"), "w", encoding="utf-8") as fval:

    lines = list(fin)
    split_idx = int(len(lines) * TRAIN_RATIO)

    for i, line in enumerate(lines):
        if not line.strip():
            continue
        row = json.loads(line)

        # Remove fields not needed for training (optional)
        row.pop("generated_solution", None)
        row.pop("problem_source", None)

        # Add responses_create_params
        row["responses_create_params"] = {
            "input": [
                {"role": "developer", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row.get(INPUT_FIELD, "")},
            ]
        }

        out = json.dumps(row) + "\n"
        (ftrain if i < split_idx else fval).write(out)
```

:::{important}
You must customize these variables for your dataset:
- `INPUT_FIELD`: The field name containing your input text. Common values: `"problem"` (math), `"question"` (QA), `"prompt"` (general), `"instruction"` (instruction-following)
- `SYSTEM_PROMPT`: Task-specific instructions for the model
- `TRAIN_RATIO`: Train/validation split ratio
:::

::::

Run and verify:

```bash
uv run preprocess.py
wc -l train.jsonl validation.jsonl
```
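
Beyond line counts, it is worth spot-checking that the wrapper landed where expected. A minimal sketch over the first row of `train.jsonl`:

```python
import json

# Confirm the first preprocessed row has the Responses API wrapper
# with the developer (system prompt) and user messages in order.
with open("train.jsonl", encoding="utf-8") as f:
    row = json.loads(f.readline())

messages = row["responses_create_params"]["input"]
assert [m["role"] for m in messages] == ["developer", "user"]
print(messages[1]["content"][:80])  # preview the user input
```
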
### Create Config for Custom Data

After preprocessing, create a config file to point `ng_prepare_data` at your local files.

::::{dropdown} Example config: custom_data.yaml
:icon: file-code

```yaml
custom_resources_server:
  resources_servers:
    custom_server:
      entrypoint: app.py
      domain: math  # math | coding | agent | knowledge | other
      description: Custom math dataset
      verified: false

custom_simple_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: custom_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: train
          type: train
          jsonl_fpath: train.jsonl
          license: Creative Commons Attribution 4.0 International
        - name: validation
          type: validation
          jsonl_fpath: validation.jsonl
          license: Creative Commons Attribution 4.0 International
```

::::

Run data preparation:

```bash
config_paths="responses_api_models/vllm_model/configs/vllm_model_for_training.yaml,custom_data.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +mode=train_preparation +output_dirpath=data
```

This validates your data and adds the `agent_ref` field to each row, routing samples to your resource server.
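
As a quick post-run check, a sketch assuming the prepared training file lands at `data/train.jsonl` (adjust to your `output_dirpath`):

```python
import json

# Every prepared row should now carry both required fields.
with open("data/train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        row = json.loads(line)
        assert "responses_create_params" in row, f"row {i} missing input params"
        assert "agent_ref" in row, f"row {i} was not routed"

print("all rows routed")
```
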
---

## Validation Modes

| Mode | Purpose | Validates |
@@ -130,7 +256,9 @@ ng_prepare_data "+config_paths=[resources_servers/workplace_assistant/configs/wo
| Invalid role | Sample skipped | Use `user`, `assistant`, `system`, or `developer` |
| Missing dataset file | `AssertionError` | Create file or set `+should_download=true` |

-**Key behavior**: Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data.
:::{warning}
Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format.
:::

::::{dropdown} Find invalid samples
:icon: code
@@ -174,9 +302,15 @@ with open("your_data.jsonl") as f:
4. **Compute metrics** — Aggregate statistics
5. **Collate** — Combine samples with agent references

### Output Locations

Metrics files are written to two locations:
- **Per-dataset**: `{dataset_jsonl_path}_metrics.json` — alongside each source JSONL file
- **Aggregated**: `{output_dirpath}/{type}_metrics.json` — combined metrics per dataset type

### Re-Running

-- **Output files** (`train.jsonl`, `validation.jsonl`) are overwritten
- **Output files** (`train.jsonl`, `validation.jsonl`) are overwritten in `output_dirpath`
- **Metrics files** (`*_metrics.json`) are compared — delete them if your data changed
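
If your source data changed, a sketch for clearing stale metrics before re-running; the glob follows the two locations listed above, so run it from the repository root and review the printed paths before trusting it:

```python
from pathlib import Path

# Remove cached metrics so ng_prepare_data recomputes them on the next run.
for path in Path(".").rglob("*_metrics.json"):
    print(f"removing {path}")
    path.unlink()
```
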

### Generated Metrics
