This repository contains the source code for the EACL 2026 Main paper ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models by Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath and Swagatam Das.
ARREST is a unified framework that adversarially steers LLM internal representations toward safety and truthfulness during inference, correcting representational misalignment without fine-tuning base model parameters.
- ❌ Hallucinations → Truthful output
- ⚠️ Unsafe generations → Safe responses

ARREST supports:

- 🤝 Soft refusals
- ✅ Truthfulness restoration
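For intuition, the sketch below shows the generic pattern of inference-time representation steering in plain PyTorch/`transformers`: a forward hook adds a direction to one decoder layer's hidden states during generation. The checkpoint, layer index, scale, and (random) direction are placeholders, not ARREST's learned regulation.

```python
# Generic sketch of inference-time representation steering (not ARREST itself):
# add a fixed direction to one decoder layer's hidden states during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumption: illustrative checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

layer_idx, alpha = 12, 2.0                         # placeholder layer and scale
direction = torch.randn(model.config.hidden_size)  # placeholder steering vector
direction = direction / direction.norm()

def steer(module, inputs, output):
    # LLaMA decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steer)
ids = tok("How can I stay safe online?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0], skip_special_tokens=True))
handle.remove()
```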
In the root folder of this repo, run the following commands to set things up.
- Install Python 3.10.12 and the necessary packages from `requirements.txt`.
  - For easily managing different Python versions, we recommend using conda.
  - Create a new environment in conda 🐍 and install the necessary Python packages:

    ```bash
    conda create -n arrest python=3.10.12 -y
    conda activate arrest
    pip install -r requirements.txt
    ```
- 📁 Directory Setup:

  ```bash
  mkdir models
  mkdir -p hallucination/hidden
  mkdir -p hallucination/responses
  mkdir -p safety/hidden
  mkdir -p safety/responses
  ```
- 🔐 HuggingFace Access Token:
  - Log in to huggingface or create an account if you don't have one already.
  - From the settings, create a new access token with WRITE access.
  - Open the files and paste your access token at the beginning:

    ```python
    hf_token = "<INPUT_YOUR_HF_ACCESS_TOKEN>"
    ```
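  For reference, this token is what allows `transformers` to download gated checkpoints. A minimal sketch is below; the model ID is only an illustrative assumption, not necessarily how the scripts resolve model names.

  ```python
  # Minimal sketch: use the pasted token to download a gated checkpoint.
  # The model ID is illustrative; the ARREST scripts may map names such as
  # llama2_7B to repositories differently.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  hf_token = "<INPUT_YOUR_HF_ACCESS_TOKEN>"
  model_id = "meta-llama/Llama-2-7b-hf"  # assumption: example gated repo

  tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
  model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_token)
  ```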
- 🔍 Get Activations:

  ```bash
  model=llama2_7B dataset=truthfulqa bash get_activations_hal.sh
  ```
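  If you are curious what collecting activations boils down to, here is a small illustrative sketch of caching per-layer hidden states for a single prompt with `transformers`. It is not the code inside `get_activations_hal.sh`, and the checkpoint, prompt, and output path are assumptions.

  ```python
  # Illustrative only: cache the last-token hidden state of every layer for one
  # prompt. get_activations_hal.sh does this at dataset scale.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "meta-llama/Llama-2-7b-hf"  # assumption (gated repos also need token=hf_token)
  tok = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)

  inputs = tok("What happens if you crack your knuckles a lot?", return_tensors="pt")
  with torch.no_grad():
      out = model(**inputs)

  # out.hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden]
  last_token_acts = torch.stack([h[0, -1] for h in out.hidden_states])
  torch.save(last_token_acts, "hallucination/hidden/example_prompt.pt")  # assumed path
  ```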
- 🧪 Train and Infer ARREST-Adversarial-Hallucination:

  ```bash
  python hallucination_adversarial.py --model_name llama2_7B --dataset_name truthfulqa --num_layers 1 --num_fold 5
  ```
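  As a rough mental model for the `--num_fold` flag, the hypothetical sketch below evaluates a plain linear probe on cached activations with 5-fold cross-validation. The file names, labels, and probe are assumptions; this is not ARREST's adversarial training objective.

  ```python
  # Hypothetical illustration of k-fold probing on cached activations.
  # File names and shapes are assumptions; this is not the ARREST objective.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import KFold

  X = np.load("hallucination/hidden/activations.npy")  # assumed shape [N, hidden]
  y = np.load("hallucination/hidden/labels.npy")       # assumed: 1 = truthful, 0 = not

  accs = []
  for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
      probe = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
      accs.append(probe.score(X[test_idx], y[test_idx]))
  print(f"mean 5-fold probe accuracy: {np.mean(accs):.3f}")
  ```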
- 🔍 Get Activations:

  ```bash
  model=llama2_7B dataset=malicious-instruct bash get_activations_safety.sh
  ```

  - `model`: Choose from `llama2_7B`, `llama3_8B`, `Qwen2.5_7B`, or `Yi1.5_9B`.
  - `dataset`: Choose from `malicious-instruct`, `advbench`, `jailbreak-bench`, or `trustllm`.
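  To sweep every supported model/dataset pair, you can drive the same script programmatically; the sketch below simply re-invokes the command shown above with different environment variables.

  ```python
  # Convenience sketch: sweep the supported models and safety datasets by
  # invoking the activation script with the same environment variables shown above.
  import os
  import subprocess

  models = ["llama2_7B", "llama3_8B", "Qwen2.5_7B", "Yi1.5_9B"]
  datasets = ["malicious-instruct", "advbench", "jailbreak-bench", "trustllm"]

  for m in models:
      for d in datasets:
          env = {**os.environ, "model": m, "dataset": d}
          subprocess.run(["bash", "get_activations_safety.sh"], env=env, check=True)
  ```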
- ⚔️ Train and Infer ARREST-Adversarial-Safety:

  ```bash
  python safety_adversarial.py --model_name llama2_7B --dataset_name malicious-instruct --num_layers 1 --num_fold 5
  ```

  - `model_name`: Choose from `llama2_7B`, `llama3_8B`, `Qwen2.5_7B`, or `Yi1.5_9B`.
  - `dataset_name`: Choose from `malicious-instruct`, `advbench`, `jailbreak-bench`, or `trustllm`.
  - The Attack Success Rate (%) will be printed on screen and responses will be saved into the `safety/responses` folder.
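  If you want a quick post-hoc sanity check of the reported number, the hypothetical sketch below estimates the Attack Success Rate from saved responses with a simple refusal-keyword heuristic. The file path, response format, and heuristic are assumptions; the repository's own evaluation may differ.

  ```python
  # Hypothetical post-hoc ASR check over saved responses; the repository's own
  # evaluation may use a different judge. File name and format are assumptions.
  import json

  REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

  with open("safety/responses/llama2_7B_malicious-instruct.json") as f:  # assumed path
      responses = json.load(f)  # assumed: list of generated strings

  succeeded = sum(not any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
  print(f"Attack Success Rate: {100 * succeeded / len(responses):.1f}%")
  ```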
- 🧲 Train and Infer ARREST-Contrastive-Safety:

  ```bash
  python safety_contrastive.py --model_name llama2_7B --dataset_name malicious-instruct --num_layers 1 --num_fold 5
  ```

  - `model_name`: Choose from `llama2_7B`, `llama3_8B`, `Qwen2.5_7B`, or `Yi1.5_9B`.
  - `dataset_name`: Choose from `malicious-instruct`, `advbench`, `jailbreak-bench`, or `trustllm`.
  - The Attack Success Rate (%) will be printed on screen and responses will be saved into the `safety/responses` folder.
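  As one simple illustration of a contrastive signal over activations, the sketch below computes a difference-of-means direction between safe and unsafe prompts. The file names are assumptions, and this is a common baseline rather than necessarily the objective optimized by `safety_contrastive.py`.

  ```python
  # Illustrative contrastive direction: difference of mean activations between
  # safe and unsafe prompts. File names and formats are assumptions.
  import numpy as np

  safe = np.load("safety/hidden/safe_acts.npy")      # assumed shape [N_safe, hidden]
  unsafe = np.load("safety/hidden/unsafe_acts.npy")  # assumed shape [N_unsafe, hidden]

  direction = safe.mean(axis=0) - unsafe.mean(axis=0)
  direction /= np.linalg.norm(direction)
  np.save("safety/hidden/contrastive_direction.npy", direction)
  ```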
- ✅ Ground Truth Evaluation
  - To evaluate generated answers against ground truth (for hallucination), we use BLEURT to evaluate truthfulness.
  - To install BLEURT, run:

    ```bash
    pip install --upgrade pip  # ensures that pip is current
    git clone https://github.com/google-research/bleurt.git
    cd bleurt
    pip install .
    ```
  - 💡 We use the 12-layer distilled model for faster inference, which is ~3.5X smaller.
  - Download the model and save it in the `./models` folder:

    ```bash
    wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D12.zip
    unzip BLEURT-20-D12.zip
    mv BLEURT-20-D12 models/.  # Move the BLEURT model folder to the models directory
    ```

  - If you want to use a different model, please refer to the BLEURT repository.
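  Once the checkpoint is in `models/`, generated answers can be scored against references with the BLEURT Python API; the strings below are placeholders.

  ```python
  # Score generated answers against references with the downloaded checkpoint.
  from bleurt import score

  scorer = score.BleurtScorer("models/BLEURT-20-D12")
  references = ["Nothing in particular happens if you crack your knuckles a lot."]
  candidates = ["Cracking your knuckles a lot does not cause arthritis."]
  print(scorer.score(references=references, candidates=candidates))
  ```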
- 🧬 Intervention
  - `pyvene` is a really cool library that can be used to load Inference-time Intervention ⚙️ and many other mechanistic intervention 🧩 techniques.
If you find this code or paper useful, please cite our work:
```bibtex
@article{dasgupta2026arrest,
  title={ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models},
  author={Dasgupta, Sharanya and Basu, Arkaprabha and Nath, Sujoy and Das, Swagatam},
  journal={arXiv preprint arXiv:2601.04394},
  year={2026}
}
```