UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
UnsafeChain is a correction-first dataset designed to fine-tune large reasoning models to be safer and more factual. Unlike prior work that filters for safe completions (e.g., SafeChain), we teach models the distinction between unsafe and safe responses through explicit correction using GPT-4.1.
- finetune/: Finetuning scripts for all models and datasets.
- evaluation/: Evaluation scripts for all 11 benchmarks (WildJailbreak, StrongReject, TruthfulQA, MBPP, GSM8K, Alignment/Coherence, WildChat, JailbreakBench, MATH-500, HumanEval, Emergent Misalignment).
- utils/: Helpers for moderation and other utilities.
- requirements.txt: All Python dependencies.
- .env.example: Example environment variables file (HuggingFace and OpenAI keys).
git clone https://github.com/yuxiaw/UnsafeChain.git
cd UnsafeChain
pip install -r requirements.txt
Copy .env.example to .env and fill in your keys:
HF_TOKEN=your_huggingface_token
OPENAI_API_KEY=your_openai_api_key
Or export them in your shell:
export HF_TOKEN=your_huggingface_token
export OPENAI_API_KEY=your_openai_api_key
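The scripts read both keys from the environment. If you want to confirm they are visible before launching a long run, a minimal sketch (not part of the repo; it assumes the optional python-dotenv package for .env support):

# check_keys.py -- illustrative only; verifies the two keys are visible to Python.
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed; skip this if you export keys in the shell

load_dotenv()  # loads .env from the current directory, if present
for key in ("HF_TOKEN", "OPENAI_API_KEY"):
    value = os.environ.get(key)
    print(f"{key}: {'set' if value else 'MISSING'}")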
Run finetuning on any supported dataset and model with:
python finetune/finetune.py --model <hf_model_name_or_path> --dataset <hf_dataset_name> [--subset <subset>] --output <output_dir>
Example:
python finetune/finetune.py --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --dataset UCSC-VLAA/STAR-1 --output ./finetuned_R1-8B-STAR1
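Before running the benchmarks, you can sanity-check the saved checkpoint with a quick generation. A minimal sketch (not part of the repo) that assumes the example output directory above and a standard transformers + accelerate installation:

# sanity_check.py -- illustrative sketch; loads the checkpoint produced by the example command above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./finetuned_R1-8B-STAR1"  # hypothetical path taken from the example above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "How can I make my home network more secure?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens and print only the model's reply.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))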
All evaluation scripts are in evaluation/ and take --model and other relevant arguments. Example usage for each:
# 1. WildJailbreak
python evaluation/eval_wildjailbreak.py --model <model_path_or_name>
# 2. StrongReject
python evaluation/eval_strongreject.py --model <model_path_or_name>
# 3. TruthfulQA MC
python evaluation/eval_truthfulqa_mc.py --model <model_path_or_name>
# 4. TruthfulQA (requires test CSV)
python evaluation/eval_truthfulqa.py --model <model_path_or_name> --input_csv <truthfulqa_test.csv>
# 5. MBPP (requires test CSV)
python evaluation/eval_mbpp.py --model <model_path_or_name> --input_csv <mbpp_test.csv>
# 6. GSM8K
python evaluation/eval_gsm8k.py --model <model_path_or_name>
# 7. Alignment/Coherence (Emergent Misalignment YAMLs required)
python evaluation/eval_alignment_coherence.py --model <model_path_or_name> --yaml_dir <yaml_dir>
# 8. WildChat
python evaluation/eval_wildchat.py --model <model_path_or_name>
# 9. JailbreakBench
python evaluation/eval_jailbreakbench.py --model <model_path_or_name>
# 10. MATH-500
python evaluation/eval_math500.py --model <model_path_or_name>
# 11. HumanEval
python evaluation/eval_humaneval.py --model <model_path_or_name>
Each script runs end-to-end: pass the model path or name plus any required arguments noted above.
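To run several benchmarks back-to-back against the same checkpoint, a small driver loop is enough. A minimal sketch (not part of the repo) that only reuses the commands listed above; adjust the list to the benchmarks you need and add the extra arguments (e.g., --input_csv, --yaml_dir) where required:

# run_evals.py -- illustrative convenience wrapper around the per-benchmark scripts above.
import subprocess
import sys

# Model path/name from the command line; the default is the hypothetical checkpoint from the finetuning example.
MODEL = sys.argv[1] if len(sys.argv) > 1 else "./finetuned_R1-8B-STAR1"
SCRIPTS = [
    "evaluation/eval_wildjailbreak.py",
    "evaluation/eval_strongreject.py",
    "evaluation/eval_truthfulqa_mc.py",
    "evaluation/eval_gsm8k.py",
    "evaluation/eval_math500.py",
    "evaluation/eval_humaneval.py",
]  # scripts that need extra arguments (--input_csv, --yaml_dir) are omitted here

for script in SCRIPTS:
    print(f"=== {script} ===")
    subprocess.run([sys.executable, script, "--model", MODEL], check=True)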
- All scripts use HuggingFace Datasets (no CSVs required unless noted).
- All prompts and hyperparameters are as in the paper.
- For OpenAI and HuggingFace API keys, use environment variables as above.
If you use UnsafeChain in your research, please cite:
@article{tomar2025safechain++,
  title   = {UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases},
  author  = {Raj Vardhan Tomar and Preslav Nakov and Yuxia Wang},
  journal = {arXiv preprint arXiv:2507.21652},
  year    = {2025},
  url     = {https://doi.org/10.48550/arXiv.2507.21652}
}