arxiv2026_instruction_vectors

This is the accompanying code repository for the paper Patches of Nonlinearity: Instruction Vectors in Large Language Models.

Abstract: Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known about how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representations are fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability and non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method for localizing information processing in language models that is free from the implicit linearity assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.

Contact person: Irina Bigoulaeva

UKP Lab | TU Darmstadt

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

Getting Started

Install the necessary dependencies from uv.lock:

uv sync

Optional vLLM

A subset of our experiments (i.e. experiments/basic_inference.py) requires vLLM as a dependency. Because vLLM pins specific CUDA versions that may introduce incompatibilities, our uv.lock file excludes it. We recommend installing vLLM in a separate, clean environment for running the inference experiments.

uv pip install vllm

Usage

The paper's experiments and graphs can be reproduced by running the scripts under the experiments folder. The outputs of the scripts are saved under experiments/output.

From the project root directory:

source .venv/bin/activate
uv run python -m experiments.[MODULE_NAME] [--ARGS]

Substitute the module name and arguments of the experiment you want to run; the available modules and their arguments are described below.

Datasets included in the repository

This repository contains the datasets of our contrastive tasks, which were automatically generated and manually verified for quality by the authors (see paper for details). The tasks are located under src/data/local_tasks.

The tasks include:

  • adjectives.csv
  • animals.csv
  • math.csv -- (not used in paper)
  • mushrooms.csv -- (not used in paper)

Each csv file has the following header: query,subtask_1,subtask_2.

Additionally, we include the datasets of our instruction rephrasals. These datasets were also automatically generated and manually verified by the authors.

The tasks include:

  • adj_ant_rephrased.csv
  • adj_comp_rephrased.csv
  • anim_color_rephrased.csv
  • can_fly_rephrased.csv
  • implicatures_rephrased.csv
  • metaphor_boolean_rephrased.csv
  • object_counting_rephrased.csv
  • snarks_rephrased.csv

Each csv file has the following header: index,instruction.
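
As a quick illustration (not repository code), both file formats can be inspected with pandas. The contrastive-task path follows src/data/local_tasks as described above; the location of the rephrasal files is an assumption here.

import pandas as pd

# Contrastive task file, header: query,subtask_1,subtask_2
tasks = pd.read_csv("src/data/local_tasks/adjectives.csv")
print(tasks.columns.tolist())       # ['query', 'subtask_1', 'subtask_2']

# Instruction rephrasal file, header: index,instruction
# (path is an assumption; adjust to wherever the rephrasal CSVs live in the repo)
rephrasals = pd.read_csv("src/data/local_tasks/adj_ant_rephrased.csv")
print(rephrasals.columns.tolist())  # ['index', 'instruction']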

Running experiments

Before running experiments, familiarize yourself with the hyperparameters defined in args.py. The most important ones, passed on the command line as in the examples below, are:

--model_idx: Specify the index of the model (as listed in config.py).

--task: Name of the main task (corresponds to dataset/task in data/local_tasks/).

  • adjectives
  • animals
  • metaphor_boolean
  • object_counting
  • implicatures
  • snarks

--subtask: Name of the subtask if the main task is either adjectives or animals.

  • adjectives: adj_comp or adj_ant
  • animals: anim_color or can_fly

--num_samples: Number of task samples to use in the experiment.

Additionally, there are some parameters that are required for specific experiments.

Activation Patching

--num_choices: When conducting activation patching, define whether to do 2-layer patching (num_choices=2) or 3-layer patching (num_choices=3). This corresponds to a combinatorial n-choose-k search over layer pairs/triplets among the n model layers (see the sketch after the example commands).

# 2-layer patching on Olmo-2 1B, on 100 samples from the Adjective: Comparative task
uv run python -m experiments.activation_patching --model_idx 17 --task "adjectives" --subtask "adj_comp" --num_choices 2 --num_samples 100
# 3-layer patching on Olmo-2 7B DPO, on 100 samples from the Metaphor Boolean task
uv run python -m experiments.activation_patching --model_idx 19 --task "metaphor_boolean" --num_choices 3 --num_samples 100
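
For intuition, the n-choose-k search can be sketched as follows (a minimal illustration, not repository code; the layer count is a placeholder):

from itertools import combinations
from math import comb

num_layers = 16    # placeholder; in practice, the number of layers of the chosen model
num_choices = 2    # 2 -> layer pairs, 3 -> layer triplets

# Every candidate tuple of layers to patch, i.e. C(n, k) combinations
layer_tuples = list(combinations(range(num_layers), num_choices))
assert len(layer_tuples) == comb(num_layers, num_choices)  # C(16, 2) = 120 pairs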

Statistical Test

After activation patching is done, we perform a t-test on the results. Note that for this script to run, output files (.pt) from activation patching must exist.

For conciseness, hyperparameters (models, tasks) are specified in the file directly.

Example command:

uv run python -m experiments.statistical_test
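
For intuition, the kind of test involved can be sketched as follows (a minimal illustration, not the repository's implementation; the file names and tensor shapes are assumptions):

import torch
from scipy.stats import ttest_rel

# Hypothetical file names; the actual .pt files are produced by
# experiments.activation_patching under experiments/output
patched = torch.load("experiments/output/patched_scores.pt")    # shape: (num_samples,)
baseline = torch.load("experiments/output/baseline_scores.pt")  # shape: (num_samples,)

# Paired t-test: does patching significantly shift the per-sample metric?
t_stat, p_value = ttest_rel(patched.numpy(), baseline.numpy())
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")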

Linear Probe and Dimensionality Reduction

For our linear probe experiments, we must first produce, and then load from, a dataset of varied instruction samples. These variations of the instruction can either keep the label space constant (e.g. keep a yes/no question a yes/no question) or change the label space (e.g. turn a yes/no question into a T/F question).

--layer_for_dataset: Optionally, specify a single model layer index from which to save representations. If specified, representations are saved only for that layer.

--varying:

  • instructions - if label space is not changed
  • labels - if label space is changed.

Note that in our paper we only vary instructions, so only a minimal selection of tasks has a varied label space predefined in the codebase.
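
For intuition, a linear probe of this kind can be sketched with scikit-learn (a minimal illustration, not the repository's implementation; the arrays below are random stand-ins for the saved resid_post representations and their instruction labels):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: one resid_post vector per prompt, labeled by instruction variant
representations = np.random.randn(200, 4096)
labels = np.random.randint(0, 4, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    representations, labels, test_size=0.2, random_state=0)

# A linear probe is a linear classifier trained on frozen activations
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))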

Example commands:

Make the representations dataset:

The tasks must be specified directly in linear_probe.py.

# Use Olmo-2 7B to create varied instruction representations (involves activation patching)

uv run python -m experiments.linear_probe --model_idx 20 --varying "instructions" --model_component "resid_post" --make_dataset

Run probe on the dataset:

# Run probe on the Olmo-2 7B representations. The remaining args ensure that the correct dataset is loaded

uv run python -m experiments.linear_probe --model_idx 20 --varying "instructions" --model_component "resid_post" --do_probe

Plot the clusters using LDA:

# Plot the LDA clusters for the Olmo-2 7B representations. The remaining args ensure that the correct dataset is loaded

uv run python -m experiments.linear_probe --model_idx 20 --varying "instructions" --model_component "resid_post" --plot_lda_clusters

Target Task Inference

--max_tokens: Specify the max number of output tokens allowed.

--test_batch_size: Data loader batch size.

Example command:

uv run python -m experiments.basic_inference --model_idx 19 --task "animals" --subtask "can_fly" --test_batch_size 5 --max_tokens 2

Path Tracing

--start_pos: The token position from which to start tracing. Because the number of paths grows exponentially toward the beginning of the prompt, leading to long runtimes, we recommend tracing from later token positions.

--end_pos: The token position at which to stop tracing (inclusive).

--threshold_rank: The rank threshold for saving paths. The default is 100 and can be lowered; higher values result in more paths being computed and greater runtime/memory load.

The code always assumes that we are tracing one data sample at a time. Therefore, iteration over many samples is done within the bash script.

Tracing is only done on contrastive tasks, and both subtasks are traced by default, so no --subtask parameter is passed.

# Do path tracing with Olmo-2 1B DPO on the Adjectives tasks, on samples 0 - 49 (inclusive).

for i in $(seq 0 49);
do
        uv run python -m experiments.path_tracing --model_idx 16 --task "adjectives" --tracing_sample_idx $i --start_pos 11 --end_pos -1 --threshold_rank 100
done

Cite

If you found our data or code helpful, please cite our paper:

@InProceedings{smith:20xx:CONFERENCE_TITLE,
  author    = {Smith, John},
  title     = {My Paper Title},
  booktitle = {Proceedings of the 20XX Conference on XXXX},
  month     = mmm,
  year      = {20xx},
  address   = {Gotham City, USA},
  publisher = {Association for XXX},
  pages     = {XXXX--XXXX},
  url       = {http://xxxx.xxx}
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
