System Prompt and Format Reward inconsistent in stage 1

1.1 thinking vs think
Parquet 训练数据（data/bfcl_train_base.parquet、data/bfcl_val.parquet）中全部 200 条 system prompt 使用 thinking /thinking 作为 XML 标签，但以下两处代码只匹配 think /think：

- env_tuning/interaction/utils.py:46 — parse_model_response() 正则 think ... /think， 不匹配 thinking
- env_tuning/interaction/execution_manager.py:110 — format_execution_response() 中 user_hint 使用 think /think

影响：模型按 system prompt 输出 thinking，parser 只匹配 think，所有输出在第一步即被判定格式错误（score = -3）

1.2 answer 标签未在 system prompt 中定义
System prompt 对不需要工具调用的情况指示为：

- "If no tool calls are necessary or possible: Directly provide a user-facing response in plain text."

但 env_tuning/interaction/utils.py:85-89 的 parse_model_response() 要求必须用 answer.../answer 包裹最终回答，否则返回格式错误。answer 仅在env_tuning/interaction/execution_manager.py:110-112 的 user_hint 中后续引入，模型在初始 turn 无从得知需要使用该标签。




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

System Prompt and Format Reward inconsistent in stage 1 #12

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

System Prompt and Format Reward inconsistent in stage 1 #12

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions