This project evaluates several failure modes in open-weight large language models by systematically probing how safety mechanisms break under adversarial but realistic interaction patterns.
Rather than focusing on novelty, the goal is to understand where guardrails fail, why they fail, and what signals a defender could use to detect or mitigate these failures in production systems.
The experiments were conducted using locally hosted models (Gemma 270M and GPT-OSS 20B) via Ollama to eliminate confounding effects from API-level filtering.
- Gemma 270M: lightweight, text-only model
- GPT-OSS 20B: larger open-weight model with reasoning and tool-use capabilities
- All experiments were run offline with controlled context length and reasoning settings.
This setup allows direct observation of model behavior without upstream moderation layers.
In multiple scenarios, the model initially refuses to provide medical or legal advice. By reframing the prompt using emotional role-play (for example, a grandmother telling a story) or humor, the refusal is bypassed without explicitly asking for prohibited advice.
The model does not become “unsafe” in a binary sense. Instead, it prioritizes narrative continuation over safety classification, producing content that functionally answers the restricted question.
Key observation:
- refusal logic is sensitive to narrative framing
- safety intent is inferred from tone as much as semantics
A separate class of prompts attempts to override safety behavior directly by instructing the model not to refuse and to answer “directly.”
In GPT-OSS 20B with low reasoning, this succeeds in eliciting disallowed operational details (for example, tools required to damage public infrastructure).
This highlights the risk of instruction hierarchies being flattened when system-level constraints are not enforced externally.
The model correctly refuses to generate explicitly vulnerable code when prompted directly.
However, a prompt-level transformation that encodes the same request through obfuscation or staged decoding results in the model producing C++ code that meets the original objective.
This demonstrates that safety generalization is brittle when intent is distributed across multiple transformation steps.
By flooding the context window with low-information text, earlier safety constraints are pushed out of the active attention window.
Once this occurs, the same disallowed prompt that previously triggered a refusal results in actionable guidance.
The failure is not malicious intent recognition, but loss of constraint visibility due to context limits.
Using a retrieval-style setup, the model is asked:
“What city was the 7th most populous in 2024?”
With the original dataset, the model answers correctly.
After modifying only the rank field in a duplicate CSV (while leaving population numbers unchanged), the model returns Columbus as the answer. The model does not recompute rankings and instead trusts the poisoned metadata.
This demonstrates:
- overreliance on retrieved structure
- lack of cross-field consistency checks
In a second experiment, the original dataset is left unmodified. A separate text file containing authoritative-sounding instructions is uploaded alongside the CSV.
The model consistently outputs “Columbus” for city-related questions, even when the query is unrelated to population ranking.
The instruction file overrides task intent despite being non-data.
Several patterns repeat across attack classes:
- safety decisions are highly context-dependent
- models trust structure and authority cues over semantic verification
- retrieval pipelines expand the attack surface beyond prompts alone
Effective defenses must operate outside the LLM, including:
- output-level safety classifiers
- instruction provenance enforcement
- schema validation for retrieved data
- isolation between data and instruction channels
- Open the notebook.
- Ensure the datasets are located in the
data/directory. - Execute cells sequentially to reproduce each attack and observe model outputs.
All attacks are implemented using black-box interaction only.
This repository is intended for defensive research and evaluation. The goal is to surface failure modes so they can be mitigated, not to operationalize misuse.