This is the accompanying code repository for the paper Patches of Nonlinearity: Instruction Vectors in Large Language Models.
Abstract: Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.
Contact person: Irina Bigoulaeva
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
Install the necessary dependencies listed in uv.lock:
uv sync

A subset of our experiments (i.e. experiments/basic_inference.py) requires vLLM as a dependency. It's recommended to install this in a clean environment first, as specific CUDA versions are required that may introduce incompatibilities. For this reason, our uv.lock file excludes vLLM.
We recommend installing vLLM in a separate environment for running the inference experiments.
uv pip install vllm

The paper's experiments and graphs can be reproduced by running the scripts under the experiments folder. The outputs of the scripts are saved under experiments/output.
From the project root directory:
source .venv/bin/activate
uv run python -m experiments.[MODULE_NAME] [--ARGS]

Now, the scripts can be run from the command line.
This repository contains the datasets of our contrastive tasks, which were automatically generated and manually verified for quality by the authors (see paper for details). The tasks are located under src/data/local_tasks.
The tasks include:
adjectives.csv
animals.csv
math.csv -- (not used in paper)
mushrooms.csv -- (not used in paper)
Each csv file has the following header: query,subtask_1,subtask_2.
Additionally, we include the datasets of our instruction rephrasals. These datasets were also automatically generated and manually verified by the authors.
The tasks include:
adj_ant_rephrased.csv
adj_comp_rephrased.csv
anim_color_rephrased.csv
can_fly_rephrased.csv
implicatures_rephrased.csv
metaphor_boolean_rephrased.csv
object_counting_rephrased.csv
snarks_rephrased.csv
Each csv file has the following header: index,instruction.
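For reference, a minimal sketch of loading these files with pandas (the file paths and the pandas-based loading are assumptions based on the headers above, not the repository's own data-loading code):

import pandas as pd

# Contrastive task file: one query plus the answers for both subtasks.
tasks = pd.read_csv("src/data/local_tasks/adjectives.csv")   # columns: query, subtask_1, subtask_2
print(tasks[["query", "subtask_1", "subtask_2"]].head())

# Instruction rephrasals: one rephrased instruction per row, keyed by index.
rephrased = pd.read_csv("src/data/local_tasks/adj_ant_rephrased.csv")  # columns: index, instruction
print(rephrased["instruction"].iloc[0])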
Before running experiments, specify necessary hyperparameters in args.py.
--model_idx: Specify the index of the model (as listed in config.py).
--task: Name of the main task (corresponds to dataset/task in data/local_tasks/).
adjectives
animals
metaphor_boolean
object_counting
implicatures
snarks
--subtask: Name of the subtask if the main task is either adjectives or animals.
adjectives: adj_comp or adj_ant
animals: anim_color or can_fly
--num_samples: Number of task samples to use in the experiment.
Additionally, there are some parameters that are required for specific experiments.
--num_choices: When conducting activation patching, specify whether to do 2-layer patching (num_choices=2) or 3-layer patching (num_choices=3). This corresponds to a combinatorial n-choose-k search over layer pairs/triplets among the n model layers.
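As a rough illustration of what this search enumerates (a sketch only; the layer count and variable names are assumptions, not the repository's patching code):

from itertools import combinations

n_layers = 32           # e.g. a 7B model; the actual count comes from the loaded model
num_choices = 2         # --num_choices: 2 for layer pairs, 3 for layer triplets

# Every unordered tuple of distinct layers that the patching search will try.
layer_tuples = list(combinations(range(n_layers), num_choices))
print(len(layer_tuples))  # C(32, 2) = 496 patched configurations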
# 2-layer patching on Olmo-2 1B, on 100 samples from the Adjective: Comparative task
uv run python -m experiments.activation_patching --model_idx 17 --task "adjectives" --subtask "adj_comp" --num_choices 2 --num_samples 100

# 3-layer patching on Olmo-2 7B DPO, on 100 samples from the Metaphor Boolean task
uv run python -m experiments.activation_patching --model_idx 19 --task "metaphor_boolean" --num_choices 3 --num_samples 100

After activation patching is done, we perform a t-test on the results. Note that for this script to run, output files (.pt) from activation patching must exist.
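A minimal sketch of this kind of comparison is below (the file names and tensor layout are assumptions for illustration; the actual statistical_test script reads its own output format):

import torch
from scipy.stats import ttest_rel

# Hypothetical per-sample scores saved by activation patching.
patched = torch.load("experiments/output/patched_scores.pt").flatten().numpy()
baseline = torch.load("experiments/output/baseline_scores.pt").flatten().numpy()

# Paired t-test: the same samples are compared with and without patching.
t_stat, p_value = ttest_rel(patched, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")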
For conciseness, hyperparameters (models, tasks) are specified in the file directly.
Example command:
uv run python -m experiments.statistical_test

For our linear probe experiments, we must first produce, and then load, a dataset of varied instruction samples. These variations of the instruction can either keep the label space constant (e.g. keep a yes/no question a yes/no question) or change the label space (e.g. turn a yes/no question into a T/F question).
--layer_for_dataset: Optionally, specify a specific model layer index from which to save representations. If specified, representations are saved only for that layer.
--varying:
instructions - if the label space is not changed
labels - if the label space is changed
Note that in our paper, we only vary instructions, so there is a minimal selection of tasks for which a varied label space is predefined in the codebase.
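To make the probing and cluster-plotting steps concrete, here is a minimal sketch of fitting a linear probe and an LDA projection on saved instruction representations (the file names, array shapes, and choice of probe are assumptions; the repository's actual implementation lives in linear_probe.py):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical dataset: one residual-stream vector per rephrased instruction,
# labeled by which instruction/task it encodes.
X = np.load("experiments/output/iv_representations.npy")  # shape (n_samples, d_model)
y = np.load("experiments/output/iv_labels.npy")           # shape (n_samples,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear probe: can a linear classifier separate the instruction classes?
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# 2D LDA projection, analogous to what --plot_lda_clusters visualizes.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
projected = lda.transform(X_test)  # shape (n_test, 2)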
Example commands:
The tasks must be specified directly in linear_probe.py.
# Use Olmo-2 7B to create varied instruction representations (involves activation patching)
uv run python -m experiments.linear_probe --model_idx 20 --varying "instructions" --model_component "resid_post" --make_dataset

# Run probe on the Olmo-2 7B representations. The remaining args ensure that the correct dataset is loaded
uv run python -m experiments.linear_probe --model_idx 20 --varying "instructions" --model_component "resid_post" --do_probe

# Plot LDA clusters of the Olmo-2 7B representations. The remaining args ensure that the correct dataset is loaded
uv run python -m experiments.linear_probe --model_idx 20 --varying "instructions" --model_component "resid_post" --plot_lda_clusters

--max_tokens: Specify the max number of output tokens allowed.
--test_batch_size: Data loader batch size.
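For orientation, a minimal vLLM sketch of how --max_tokens maps onto generation (the model id and prompt are assumptions; the actual script selects the model via --model_idx and builds prompts from the task data):

from vllm import LLM, SamplingParams

llm = LLM(model="allenai/OLMo-2-1124-7B-DPO")           # assumed model id, for illustration only
params = SamplingParams(max_tokens=2, temperature=0.0)   # --max_tokens caps the generated tokens
outputs = llm.generate(["Can a penguin fly? Answer yes or no."], params)
print(outputs[0].outputs[0].text)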
Example command:
uv run python -m experiments.basic_inference --model_idx 19 --task "animals" --subtask "can_fly" --test_batch_size 5 --max_tokens 2

--start_pos: The token position from which to start tracing. Due to long runtimes and an exponentially larger number of paths towards the beginning of the prompt, we recommend tracing from later token positions.
--end_pos: The token position at which to stop tracing (inclusive).
--threshold_rank: The rank threshold for saving paths. The default is 100, but it can be set lower. Setting it higher results in more paths being computed and greater runtime/memory load.
The code always assumes that we are tracing one data sample at a time. Therefore, iteration over many samples is done within the bash script.
Tracing is only done on contrastive tasks, and both subtasks are done by default. So, no --subtask parameter is passed.
# Do path tracing with Olmo-2 1B DPO on the Adjectives tasks, on samples 0 - 49 (inclusive).
for i in $(seq 0 49);
do
uv run python -m experiments.path_tracing --model_idx 16 --task "adjectives" --tracing_sample_idx $i --start_pos 11 --end_pos -1 --threshold_rank 100
done
If you found our data or code helpful, please cite our paper:
@InProceedings{smith:20xx:CONFERENCE_TITLE,
author = {Smith, John},
title = {My Paper Title},
booktitle = {Proceedings of the 20XX Conference on XXXX},
month = mmm,
year = {20xx},
address = {Gotham City, USA},
publisher = {Association for XXX},
pages = {XXXX--XXXX},
url = {http://xxxx.xxx}
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.