𝕏-Teaming

This website showcases our work on 𝕏-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents. The website is adapted from the Nerfies website.

Project Overview

𝕏-Teaming is a scalable framework that systematically explores how seemingly harmless interactions with language models can escalate into harmful outcomes. Our framework achieves state-of-the-art multi-turn jailbreak effectiveness, with attack success rates of up to 98.1% across leading models.

Key Features

  • Multi-turn jailbreak framework with adaptive multi-agents
  • State-of-the-art attack success rates on various models
  • XGuard-Train: A comprehensive safety training dataset
  • Systematic approach to understanding and mitigating conversational attacks

Resources

Website License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.