
🚨 ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models

Paper · Python 3.10.12 · License: MIT

This repository contains the source code for the EACL 2026 main-conference paper ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models by Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, and Swagatam Das.


📜 Abstract

ARREST is a unified framework that adversarially steers LLM internal representations toward safety and truthfulness during inference, correcting representational misalignment without fine-tuning base-model parameters. (A minimal sketch of this kind of activation steering follows the lists below.)

  • Hallucinations → Truthful output
  • ⚠️ Unsafe generations → Safe responses

ARREST supports:

  • 🤝 Soft refusals
  • ✅ Truthfulness restoration
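
For intuition, here is a minimal, hypothetical sketch of inference-time activation steering with a PyTorch forward hook: a placeholder direction is added to one decoder layer's output during generation. The layer index, scale, and random vector below are illustrative stand-ins, not ARREST's learned parameters or objective.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical steering sketch: the layer index, scale, and direction
    # are placeholders; ARREST learns its interventions (see the paper).
    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

    layer_idx, scale = 14, 2.0
    steer = torch.randn(model.config.hidden_size, dtype=torch.float16)
    steer = steer / steer.norm()  # unit-norm stand-in for a learned direction

    def add_direction(module, inputs, output):
        # Decoder layers return a tuple; element 0 holds the hidden states.
        hidden = output[0] + scale * steer.to(output[0].device)
        return (hidden,) + output[1:]

    handle = model.model.layers[layer_idx].register_forward_hook(add_direction)
    ids = tokenizer("Is the Earth flat?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=32)
    handle.remove()
    print(tokenizer.decode(out[0], skip_special_tokens=True))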

🧭 Table of Contents

  1. ⚙️ Setup
  2. 🧠 Hallucination Evaluation
  3. 🛡️ Safety Evaluation
  4. 📝 Notes
  5. 📚 Citation

⚙️ Setup

In the root folder of this repo, run the following commands to set things up.

  • Install Python 3.10.12 and the necessary packages from requirements.txt.

  • To manage multiple Python versions easily, we recommend conda.

  • Create a new conda 🐍 environment and install the necessary Python packages:

    conda create -n arrest python=3.10.12 -y
    conda activate arrest
    pip install -r requirements.txt
  • 📁 Directory Setup:

    mkdir models
    mkdir -p hallucination/hidden
    mkdir -p hallucination/responses
    mkdir -p safety/hidden
    mkdir -p safety/responses
  • 🔐 Hugging Face Access Token:

    • Log in to Hugging Face, or create an account if you don't already have one.
    • From Settings, create a new access token with WRITE access.
    • Open the files and paste your access token near the top: hf_token = "<INPUT_YOUR_HF_ACCESS_TOKEN>"
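
For reference, a token pasted this way is typically used to authenticate downloads of gated checkpoints such as Llama-2. A minimal sketch (the variable name matches the repo's convention; the exact call sites live in the scripts themselves):

    from huggingface_hub import login
    from transformers import AutoModelForCausalLM

    hf_token = "<INPUT_YOUR_HF_ACCESS_TOKEN>"
    login(token=hf_token)  # authenticate the Hub client
    # Gated models such as Llama-2 also accept the token directly:
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", token=hf_token)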

🧠 Hallucination Evaluation

  1. 🔍 Get Activations (what this step caches is sketched after this list):

    model=llama2_7B dataset=truthfulqa bash get_activations_hal.sh
  2. 🧪 Train and Infer ARREST-Adversarial-Hallucination:

    python hallucination_adversarial.py --model_name llama2_7B --dataset_name truthfulqa --num_layers 1 --num_fold 5
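
As a rough picture of what the activation step produces (a hedged sketch, not the script's exact code): prompts are run through the model with hidden states enabled, and per-layer activations are cached under hallucination/hidden.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Sketch of activation caching via transformers' output_hidden_states;
    # the actual logic lives behind get_activations_hal.sh.
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)

    prompt = "What happens if you crack your knuckles a lot?"  # TruthfulQA-style question
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # out.hidden_states: tuple of (num_layers + 1) tensors, each (1, seq_len, hidden_size)
    acts = torch.stack([h[0, -1] for h in out.hidden_states])  # last-token activation per layer
    torch.save(acts, "hallucination/hidden/example.pt")  # hypothetical file name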

🛡️ Safety Evaluation

  1. 🔍 Get Activations:

    model=llama2_7B dataset=malicious-instruct bash get_activations_safety.sh
  2. ⚔️ Train and Infer ARREST-Adversarial-Safety:

    python safety_adversarial.py --model_name llama2_7B --dataset_name malicious-instruct --num_layers 1 --num_fold 5
  3. 🧲 Train and Infer ARREST-Contrastive-Safety (one common contrastive-direction recipe is sketched after this list):

    python safety_contrastive.py --model_name llama2_7B --dataset_name malicious-instruct --num_layers 1 --num_fold 5
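
For intuition on the contrastive variant: one common recipe for a safety steering direction is the difference of class means over cached activations. Whether ARREST-Contrastive uses exactly this objective is specified in the paper; the file names and recipe below are illustrative assumptions only.

    import torch

    # Illustrative difference-of-means direction; safety_contrastive.py
    # defines the actual training objective. File names are hypothetical.
    safe = torch.load("safety/hidden/safe_acts.pt")      # (n_safe, hidden_size)
    unsafe = torch.load("safety/hidden/unsafe_acts.pt")  # (n_unsafe, hidden_size)

    direction = safe.mean(dim=0) - unsafe.mean(dim=0)
    direction = direction / direction.norm()  # unit vector pointing toward the safe region
    torch.save(direction, "safety/hidden/steer_direction.pt")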

📝 Notes

  1. Ground Truth Evaluation
    To score generated answers against ground-truth references (for hallucination), we use BLEURT as the truthfulness metric; a minimal Python scoring example appears after these notes.

    • To install BLEURT, run:
    pip install --upgrade pip  # ensures that pip is current
    git clone https://github.com/google-research/bleurt.git
    cd bleurt
    pip install .
    cd ..  # return to the repo root so the checkpoint lands next to models/
    • 💡 We use the 12-layer distilled model for faster inference; it is ~3.5X smaller.
    • Download the checkpoint and save it in the ./models folder:
    wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D12.zip
    unzip BLEURT-20-D12.zip
    mv BLEURT-20-D12 models/  # move the BLEURT checkpoint into the models directory
    
  2. 🧬 Intervention
    pyvene is a really cool library that can be used to load Inference-Time Intervention ⚙️ and many other mechanistic intervention 🧩 techniques; a hedged usage sketch follows below.
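
BLEURT scoring can be driven from Python via its documented API (see the google-research/bleurt README); the checkpoint path assumes the download step above:

    from bleurt import score

    # Higher scores indicate candidates closer to the references.
    scorer = score.BleurtScorer("models/BLEURT-20-D12")
    scores = scorer.score(
        references=["The Earth is round."],
        candidates=["The Earth is approximately spherical."],
    )
    print(scores)  # list of floats, one per reference/candidate pair

And here is a hedged pyvene sketch, adapted from pyvene's README examples; check the docs of your installed version for the exact configuration keys:

    import torch
    import pyvene as pv

    # pyvene helper that returns (config, tokenizer, model) for GPT-2.
    _, tokenizer, gpt2 = pv.create_gpt2()

    # Wrap the model with a simple intervention: replace the layer-0 MLP
    # output at one token position with a zero vector.
    pv_gpt2 = pv.IntervenableModel({
        "layer": 0,
        "component": "mlp_output",
        "source_representation": torch.zeros(gpt2.config.n_embd),
    }, model=gpt2)

    base = tokenizer("The capital of Spain is", return_tensors="pt")
    intervened_outputs = pv_gpt2(base=base, unit_locations={"base": 3})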


📚 Citation

If you find this code or paper useful, please cite our work:

@article{dasgupta2026arrest,
  title={ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models},
  author={Dasgupta, Sharanya and Basu, Arkaprabha and Nath, Sujoy and Das, Swagatam},
  journal={arXiv preprint arXiv:2601.04394},
  year={2026}
}
