
# UnsafeChain

Enhancing Reasoning Model Safety via Hard Cases

[arXiv](https://arxiv.org/abs/2507.21652)


## 🔍 Overview

UnsafeChain is a correction-first dataset for fine-tuning large reasoning models to be safer and more factual. Unlike prior work that filters for safe completions (e.g., SafeChain), we teach models to distinguish unsafe from safe responses by explicitly correcting unsafe completions with GPT-4.1.
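
For a quick look at the data before fine-tuning, it can be loaded with the `datasets` library. This is only a sketch: the dataset ID and column layout below are assumptions, not confirmed by this README, so check the project's Hugging Face page for the actual identifier.

```python
# Sketch only: the dataset ID is an assumption, not confirmed by this README.
# Check the project's Hugging Face page for the real identifier and columns.
from datasets import load_dataset

ds = load_dataset("mbzuai-nlp/UnsafeChain", split="train")  # assumed ID
print(ds)      # column names and number of examples
print(ds[0])   # one prompt / corrected-response pair
```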


## 📁 Repo Structure

- `finetune/`: Fine-tuning scripts for all models and datasets.
- `evaluation/`: Evaluation scripts for all 11 benchmarks (WildJailbreak, StrongReject, TruthfulQA, MBPP, GSM8K, Alignment/Coherence, WildChat, JailbreakBench, MATH-500, HumanEval, Emergent Misalignment).
- `utils/`: Helpers for moderation and other utilities.
- `requirements.txt`: All Python dependencies.
- `.env.example`: Example environment variables (Hugging Face and OpenAI keys).

## 🚀 Setup

### 1. Clone the repo

```bash
git clone https://github.com/yuxiaw/UnsafeChain.git
cd UnsafeChain
```

### 2. Install requirements

```bash
pip install -r requirements.txt
```

### 3. Environment variables

Copy `.env.example` to `.env` and fill in your keys:

```
HF_TOKEN=your_huggingface_token
OPENAI_API_KEY=your_openai_api_key
```

Or export them in your shell:

```bash
export HF_TOKEN=your_huggingface_token
export OPENAI_API_KEY=your_openai_api_key
```
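
To confirm the keys are visible before launching a run, here is a minimal check in Python. Loading `.env` via `python-dotenv` is an assumption (the repository's scripts may read the variables differently); plain exported shell variables work regardless.

```python
# Minimal sketch: confirm the required keys are visible to Python.
# The python-dotenv dependency is an assumption; exported shell variables
# are picked up through os.environ either way.
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
    load_dotenv()                   # read a local .env file if one exists
except ImportError:
    pass                            # fall back to variables exported in the shell

for key in ("HF_TOKEN", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise SystemExit(f"Missing {key}; set it in .env or export it in your shell.")
print("Environment variables look good.")
```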

## 🏋️‍♂️ Finetuning

Run fine-tuning on any supported dataset and model with:

```bash
python finetune/finetune.py --model <hf_model_name_or_path> --dataset <hf_dataset_name> [--subset <subset>] --output <output_dir>
```

Example:

```bash
python finetune/finetune.py --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --dataset UCSC-VLAA/STAR-1 --output ./finetuned_R1-8B-STAR1
```
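
To fine-tune several model/dataset combinations in one go, the same CLI can be driven from Python using only the flags documented above. The second entry in `runs` is a placeholder, not a model or dataset the repository necessarily supports.

```python
# Sketch: launch several fine-tuning runs through finetune/finetune.py,
# using only the command-line flags documented above.
import subprocess

runs = [
    ("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "UCSC-VLAA/STAR-1", "./finetuned_R1-8B-STAR1"),
    ("<another_hf_model>", "<hf_dataset_name>", "./finetuned_other"),  # placeholders
]

for model, dataset, output in runs:
    subprocess.run(
        ["python", "finetune/finetune.py",
         "--model", model, "--dataset", dataset,
         "--output", output],
        check=True,  # stop the sweep if a run fails
    )
```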

## 🧪 Evaluation (11 Benchmarks)

All evaluation scripts are in `evaluation/` and take `--model` plus any other relevant arguments. Example usage for each:

```bash
# 1. WildJailbreak
python evaluation/eval_wildjailbreak.py --model <model_path_or_name>

# 2. StrongReject
python evaluation/eval_strongreject.py --model <model_path_or_name>

# 3. TruthfulQA MC
python evaluation/eval_truthfulqa_mc.py --model <model_path_or_name>

# 4. TruthfulQA (requires test CSV)
python evaluation/eval_truthfulqa.py --model <model_path_or_name> --input_csv <truthfulqa_test.csv>

# 5. MBPP (requires test CSV)
python evaluation/eval_mbpp.py --model <model_path_or_name> --input_csv <mbpp_test.csv>

# 6. GSM8K
python evaluation/eval_gsm8k.py --model <model_path_or_name>

# 7. Alignment/Coherence (requires Emergent Misalignment YAMLs)
python evaluation/eval_alignment_coherence.py --model <model_path_or_name> --yaml_dir <yaml_dir>

# 8. WildChat
python evaluation/eval_wildchat.py --model <model_path_or_name>

# 9. JailbreakBench
python evaluation/eval_jailbreakbench.py --model <model_path_or_name>

# 10. MATH-500
python evaluation/eval_math500.py --model <model_path_or_name>

# 11. HumanEval
python evaluation/eval_humaneval.py --model <model_path_or_name>
```

Each script needs only the model path/name and the arguments noted above; evaluation then runs end-to-end.
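
To run the full suite for a single model, the eleven commands above can be scripted. This is only a convenience sketch; the CSV and YAML paths are placeholders you must supply.

```python
# Sketch: run all 11 benchmark scripts for one model, using only the
# command-line flags documented above. Paths in angle brackets are placeholders.
import subprocess

MODEL = "<model_path_or_name>"

commands = [
    ["python", "evaluation/eval_wildjailbreak.py", "--model", MODEL],
    ["python", "evaluation/eval_strongreject.py", "--model", MODEL],
    ["python", "evaluation/eval_truthfulqa_mc.py", "--model", MODEL],
    ["python", "evaluation/eval_truthfulqa.py", "--model", MODEL, "--input_csv", "<truthfulqa_test.csv>"],
    ["python", "evaluation/eval_mbpp.py", "--model", MODEL, "--input_csv", "<mbpp_test.csv>"],
    ["python", "evaluation/eval_gsm8k.py", "--model", MODEL],
    ["python", "evaluation/eval_alignment_coherence.py", "--model", MODEL, "--yaml_dir", "<yaml_dir>"],
    ["python", "evaluation/eval_wildchat.py", "--model", MODEL],
    ["python", "evaluation/eval_jailbreakbench.py", "--model", MODEL],
    ["python", "evaluation/eval_math500.py", "--model", MODEL],
    ["python", "evaluation/eval_humaneval.py", "--model", MODEL],
]

for cmd in commands:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop at the first failing benchmark
```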


## Notes

- All scripts use Hugging Face Datasets; no CSVs are required unless noted above.
- All prompts and hyperparameters match those used in the paper.
- Provide the OpenAI and Hugging Face API keys through environment variables, as described in Setup.

## Cite

If you use UnsafeChain in your research, please cite:

```bibtex
@article{tomar2025safechain++,
      title   = {UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases},
      author  = {Raj Vardhan Tomar and Preslav Nakov and Yuxia Wang},
      journal = {arXiv preprint arXiv:2507.21652},
      year    = {2025},
      url     = {https://doi.org/10.48550/arXiv.2507.21652}
}
```
