
𝕏-Teaming

This website showcases our work on 𝕏-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents. The website is adapted from the Nerfies website.

Project Overview

𝕏-Teaming is a scalable framework that systematically explores how seemingly harmless interactions with language models can escalate into harmful outcomes. Our framework achieves state-of-the-art multi-turn jailbreak effectiveness, with attack success rates of up to 98.1% across leading models.

Key Features

Multi-turn jailbreak framework with adaptive multi-agents
State-of-the-art attack success rates on various models
XGuard-Train: A comprehensive safety training dataset
Systematic approach to understanding and mitigating conversational attacks

Resources

Paper
Code
Dataset

Website License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
