
# UnsafeChain

Enhancing Reasoning Model Safety via Hard Cases

[arXiv](https://arxiv.org/abs/2507.21652)


## 🔍 Overview

UnsafeChain is a correction-first dataset for fine-tuning large reasoning models to be safer and more factual. Unlike prior work that filters for safe completions (e.g., SafeChain), we teach models to distinguish unsafe from safe responses by explicitly correcting unsafe completions with GPT-4.1.
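
For a quick look at the data before fine-tuning, it can be loaded with the `datasets` library. This is only a sketch: the dataset ID and column layout below are assumptions, not confirmed by this README, so check the project's Hugging Face page for the actual identifier.

```python
# Sketch only: the dataset ID is an assumption, not confirmed by this README.
# Check the project's Hugging Face page for the real identifier and columns.
from datasets import load_dataset

ds = load_dataset("mbzuai-nlp/UnsafeChain", split="train")  # assumed ID
print(ds)      # column names and number of examples
print(ds[0])   # one prompt / corrected-response pair
```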


## 📁 Repo Structure

- `finetune/`: Fine-tuning scripts for all models and datasets.
- `evaluation/`: Evaluation scripts for all 11 benchmarks (WildJailbreak, StrongReject, TruthfulQA, MBPP, GSM8K, Alignment/Coherence, WildChat, JailbreakBench, MATH-500, HumanEval, Emergent Misalignment).
- `utils/`: Helpers for moderation and other utilities.
- `requirements.txt`: All Python dependencies.
- `.env.example`: Example environment variables (Hugging Face and OpenAI keys).

## 🚀 Setup

### 1. Clone the repo

```bash
git clone https://github.com/yuxiaw/UnsafeChain.git
cd UnsafeChain
```

### 2. Install requirements

```bash
pip install -r requirements.txt
```

### 3. Environment variables

Copy `.env.example` to `.env` and fill in your keys:

```
HF_TOKEN=your_huggingface_token
OPENAI_API_KEY=your_openai_api_key
```

Or export them in your shell:

```bash
export HF_TOKEN=your_huggingface_token
export OPENAI_API_KEY=your_openai_api_key
```
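
To confirm the keys are visible before launching a run, here is a minimal check in Python. Loading `.env` via `python-dotenv` is an assumption (the repository's scripts may read the variables differently); plain exported shell variables work regardless.

```python
# Minimal sketch: confirm the required keys are visible to Python.
# The python-dotenv dependency is an assumption; exported shell variables
# are picked up through os.environ either way.
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
    load_dotenv()                   # read a local .env file if one exists
except ImportError:
    pass                            # fall back to variables exported in the shell

for key in ("HF_TOKEN", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise SystemExit(f"Missing {key}; set it in .env or export it in your shell.")
print("Environment variables look good.")
```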

## 🏋️‍♂️ Finetuning

Run fine-tuning on any supported dataset and model with:

```bash
python finetune/finetune.py --model <hf_model_name_or_path> --dataset <hf_dataset_name> [--subset <subset>] --output <output_dir>
```

Example:

```bash
python finetune/finetune.py --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --dataset UCSC-VLAA/STAR-1 --output ./finetuned_R1-8B-STAR1
```
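
To fine-tune several model/dataset combinations in one go, the same CLI can be driven from Python using only the flags documented above. The second entry in `runs` is a placeholder, not a model or dataset the repository necessarily supports.

```python
# Sketch: launch several fine-tuning runs through finetune/finetune.py,
# using only the command-line flags documented above.
import subprocess

runs = [
    ("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "UCSC-VLAA/STAR-1", "./finetuned_R1-8B-STAR1"),
    ("<another_hf_model>", "<hf_dataset_name>", "./finetuned_other"),  # placeholders
]

for model, dataset, output in runs:
    subprocess.run(
        ["python", "finetune/finetune.py",
         "--model", model, "--dataset", dataset,
         "--output", output],
        check=True,  # stop the sweep if a run fails
    )
```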

## 🧪 Evaluation (11 Benchmarks)

All evaluation scripts are in `evaluation/` and take `--model` plus any other relevant arguments. Example usage for each:

```bash
# 1. WildJailbreak
python evaluation/eval_wildjailbreak.py --model <model_path_or_name>

# 2. StrongReject
python evaluation/eval_strongreject.py --model <model_path_or_name>

# 3. TruthfulQA MC
python evaluation/eval_truthfulqa_mc.py --model <model_path_or_name>

# 4. TruthfulQA (requires test CSV)
python evaluation/eval_truthfulqa.py --model <model_path_or_name> --input_csv <truthfulqa_test.csv>

# 5. MBPP (requires test CSV)
python evaluation/eval_mbpp.py --model <model_path_or_name> --input_csv <mbpp_test.csv>

# 6. GSM8K
python evaluation/eval_gsm8k.py --model <model_path_or_name>

# 7. Alignment/Coherence (requires Emergent Misalignment YAMLs)
python evaluation/eval_alignment_coherence.py --model <model_path_or_name> --yaml_dir <yaml_dir>

# 8. WildChat
python evaluation/eval_wildchat.py --model <model_path_or_name>

# 9. JailbreakBench
python evaluation/eval_jailbreakbench.py --model <model_path_or_name>

# 10. MATH-500
python evaluation/eval_math500.py --model <model_path_or_name>

# 11. HumanEval
python evaluation/eval_humaneval.py --model <model_path_or_name>
```

Each script needs only the model path/name and the arguments noted above; evaluation then runs end-to-end.
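
To run the full suite for a single model, the eleven commands above can be scripted. This is only a convenience sketch; the CSV and YAML paths are placeholders you must supply.

```python
# Sketch: run all 11 benchmark scripts for one model, using only the
# command-line flags documented above. Paths in angle brackets are placeholders.
import subprocess

MODEL = "<model_path_or_name>"

commands = [
    ["python", "evaluation/eval_wildjailbreak.py", "--model", MODEL],
    ["python", "evaluation/eval_strongreject.py", "--model", MODEL],
    ["python", "evaluation/eval_truthfulqa_mc.py", "--model", MODEL],
    ["python", "evaluation/eval_truthfulqa.py", "--model", MODEL, "--input_csv", "<truthfulqa_test.csv>"],
    ["python", "evaluation/eval_mbpp.py", "--model", MODEL, "--input_csv", "<mbpp_test.csv>"],
    ["python", "evaluation/eval_gsm8k.py", "--model", MODEL],
    ["python", "evaluation/eval_alignment_coherence.py", "--model", MODEL, "--yaml_dir", "<yaml_dir>"],
    ["python", "evaluation/eval_wildchat.py", "--model", MODEL],
    ["python", "evaluation/eval_jailbreakbench.py", "--model", MODEL],
    ["python", "evaluation/eval_math500.py", "--model", MODEL],
    ["python", "evaluation/eval_humaneval.py", "--model", MODEL],
]

for cmd in commands:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop at the first failing benchmark
```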


## Notes

- All scripts use Hugging Face Datasets; no CSVs are required unless noted above.
- All prompts and hyperparameters match those used in the paper.
- Provide the OpenAI and Hugging Face API keys through environment variables, as described in Setup.

## Cite

If you use UnsafeChain in your research, please cite:

```bibtex
@article{tomar2025safechain++,
      title   = {UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases},
      author  = {Raj Vardhan Tomar and Preslav Nakov and Yuxia Wang},
      journal = {arXiv preprint arXiv:2507.21652},
      year    = {2025},
      url     = {https://doi.org/10.48550/arXiv.2507.21652}
}
```
