
🚨 ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models

Paper · Python 3.10.12 · License: MIT

This repository contains the source code for the EACL 2026 main-conference paper ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models by Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, and Swagatam Das.


📜 Abstract

ARREST is a unified framework that adversarially steers LLM internal representations toward safety and truthfulness during inference, correcting representational misalignment without fine-tuning base-model parameters. (A minimal sketch of this kind of activation steering follows the lists below.)

  • Hallucinations → Truthful output
  • ⚠️ Unsafe generations → Safe responses

ARREST supports:

  • 🤝 Soft refusals
  • ✅ Truthfulness restoration
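
For intuition, here is a minimal, hypothetical sketch of inference-time activation steering with a PyTorch forward hook: a placeholder direction is added to one decoder layer's output during generation. The layer index, scale, and random vector below are illustrative stand-ins, not ARREST's learned parameters or objective.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical steering sketch: the layer index, scale, and direction
    # are placeholders; ARREST learns its interventions (see the paper).
    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

    layer_idx, scale = 14, 2.0
    steer = torch.randn(model.config.hidden_size, dtype=torch.float16)
    steer = steer / steer.norm()  # unit-norm stand-in for a learned direction

    def add_direction(module, inputs, output):
        # Decoder layers return a tuple; element 0 holds the hidden states.
        hidden = output[0] + scale * steer.to(output[0].device)
        return (hidden,) + output[1:]

    handle = model.model.layers[layer_idx].register_forward_hook(add_direction)
    ids = tokenizer("Is the Earth flat?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=32)
    handle.remove()
    print(tokenizer.decode(out[0], skip_special_tokens=True))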

🧭 Table of Contents

  1. ⚙️ Setup
  2. 🧠 Hallucination Evaluation
  3. 🛡️ Safety Evaluation
  4. 📝 Notes
  5. 📚 Citation

⚙️ Setup

In the root folder of this repo, run the following commands to set things up.

  • Install Python 3.10.12 and the necessary packages from requirements.txt.

  • To manage multiple Python versions easily, we recommend conda.

  • Create a new conda 🐍 environment and install the necessary Python packages:

    conda create -n arrest python=3.10.12 -y
    conda activate arrest
    pip install -r requirements.txt
  • 📁 Directory Setup:

    mkdir models
    mkdir -p hallucination/hidden
    mkdir -p hallucination/responses
    mkdir -p safety/hidden
    mkdir -p safety/responses
  • 🔐 Hugging Face Access Token:

    • Log in to Hugging Face, or create an account if you don't already have one.
    • From Settings, create a new access token with WRITE access.
    • Open the files and paste your access token near the top: hf_token = "<INPUT_YOUR_HF_ACCESS_TOKEN>"
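
For reference, a token pasted this way is typically used to authenticate downloads of gated checkpoints such as Llama-2. A minimal sketch (the variable name matches the repo's convention; the exact call sites live in the scripts themselves):

    from huggingface_hub import login
    from transformers import AutoModelForCausalLM

    hf_token = "<INPUT_YOUR_HF_ACCESS_TOKEN>"
    login(token=hf_token)  # authenticate the Hub client
    # Gated models such as Llama-2 also accept the token directly:
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", token=hf_token)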

🧠 Hallucination Evaluation

  1. 🔍 Get Activations (what this step caches is sketched after this list):

    model=llama2_7B dataset=truthfulqa bash get_activations_hal.sh
  2. 🧪 Train and Infer ARREST-Adversarial-Hallucination:

    python hallucination_adversarial.py --model_name llama2_7B --dataset_name truthfulqa --num_layers 1 --num_fold 5
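
As a rough picture of what the activation step produces (a hedged sketch, not the script's exact code): prompts are run through the model with hidden states enabled, and per-layer activations are cached under hallucination/hidden.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Sketch of activation caching via transformers' output_hidden_states;
    # the actual logic lives behind get_activations_hal.sh.
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)

    prompt = "What happens if you crack your knuckles a lot?"  # TruthfulQA-style question
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # out.hidden_states: tuple of (num_layers + 1) tensors, each (1, seq_len, hidden_size)
    acts = torch.stack([h[0, -1] for h in out.hidden_states])  # last-token activation per layer
    torch.save(acts, "hallucination/hidden/example.pt")  # hypothetical file name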

🛡️ Safety Evaluation

  1. 🔍 Get Activations:

    model=llama2_7B dataset=malicious-instruct bash get_activations_safety.sh
  2. ⚔️ Train and Infer ARREST-Adversarial-Safety:

    python safety_adversarial.py --model_name llama2_7B --dataset_name malicious-instruct --num_layers 1 --num_fold 5
  3. 🧲 Train and Infer ARREST-Contrastive-Safety (one common contrastive-direction recipe is sketched after this list):

    python safety_contrastive.py --model_name llama2_7B --dataset_name malicious-instruct --num_layers 1 --num_fold 5
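
For intuition on the contrastive variant: one common recipe for a safety steering direction is the difference of class means over cached activations. Whether ARREST-Contrastive uses exactly this objective is specified in the paper; the file names and recipe below are illustrative assumptions only.

    import torch

    # Illustrative difference-of-means direction; safety_contrastive.py
    # defines the actual training objective. File names are hypothetical.
    safe = torch.load("safety/hidden/safe_acts.pt")      # (n_safe, hidden_size)
    unsafe = torch.load("safety/hidden/unsafe_acts.pt")  # (n_unsafe, hidden_size)

    direction = safe.mean(dim=0) - unsafe.mean(dim=0)
    direction = direction / direction.norm()  # unit vector pointing toward the safe region
    torch.save(direction, "safety/hidden/steer_direction.pt")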

📝 Notes

  1. Ground Truth Evaluation
    To score generated answers against ground-truth references (for hallucination), we use BLEURT as the truthfulness metric; a minimal Python scoring example appears after these notes.

    • To install BLEURT, run:
    pip install --upgrade pip  # ensures that pip is current
    git clone https://github.com/google-research/bleurt.git
    cd bleurt
    pip install .
    cd ..  # return to the repo root so the checkpoint lands next to models/
    • 💡 We use the 12-layer distilled model for faster inference; it is ~3.5X smaller.
    • Download the checkpoint and save it in the ./models folder:
    wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D12.zip
    unzip BLEURT-20-D12.zip
    mv BLEURT-20-D12 models/  # move the BLEURT checkpoint into the models directory
    
  2. 🧬 Intervention
    pyvene is a really cool library that can be used to load Inference-Time Intervention ⚙️ and many other mechanistic intervention 🧩 techniques; a hedged usage sketch follows below.
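
BLEURT scoring can be driven from Python via its documented API (see the google-research/bleurt README); the checkpoint path assumes the download step above:

    from bleurt import score

    # Higher scores indicate candidates closer to the references.
    scorer = score.BleurtScorer("models/BLEURT-20-D12")
    scores = scorer.score(
        references=["The Earth is round."],
        candidates=["The Earth is approximately spherical."],
    )
    print(scores)  # list of floats, one per reference/candidate pair

And here is a hedged pyvene sketch, adapted from pyvene's README examples; check the docs of your installed version for the exact configuration keys:

    import torch
    import pyvene as pv

    # pyvene helper that returns (config, tokenizer, model) for GPT-2.
    _, tokenizer, gpt2 = pv.create_gpt2()

    # Wrap the model with a simple intervention: replace the layer-0 MLP
    # output at one token position with a zero vector.
    pv_gpt2 = pv.IntervenableModel({
        "layer": 0,
        "component": "mlp_output",
        "source_representation": torch.zeros(gpt2.config.n_embd),
    }, model=gpt2)

    base = tokenizer("The capital of Spain is", return_tensors="pt")
    intervened_outputs = pv_gpt2(base=base, unit_locations={"base": 3})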


📚 Citation

If you find this code or paper useful, please cite our work:

@article{dasgupta2026arrest,
  title={ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models},
  author={Dasgupta, Sharanya and Basu, Arkaprabha and Nath, Sujoy and Das, Swagatam},
  journal={arXiv preprint arXiv:2601.04394},
  year={2026}
}
