By Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, and Guang Lin
- 2026-02-05: We released the training code.
- 2026-01-25: DiDi-Instruct was accepted to ICLR 2026.
- 2026-01-16: Invited talk on DiDi-Instruct is now available on YouTube.
- 2025-10-06: We updated the blog.
- 2025-10-05: We released the checkpoint on Hugging Face.
- 2025-10-03: We updated the evaluation code and released the model checkpoint.
- 2025-09-29: We uploaded our work to arXiv.
Fast and high-quality language generation is the holy grail of modern AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that initializes from a pre-trained (masked) discrete diffusion language model (dLLM) and distills a few-step student for fast generation. The resulting DiDi-Instruct model matches or surpasses its dLLM teacher and the GPT-2 baseline while enabling up to 64× acceleration. The theoretical foundation of DiDi-Instruct is a novel framework based on integral KL-divergence minimization, which yields a practical training algorithm. We further introduce grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler that significantly improve training stability, model coverage, and inference quality. On OpenWebText, DiDi-Instruct achieves perplexities from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline. These gains come with negligible entropy loss (around 1%) and more than 20× less additional training wall-clock time than competing dLLM distillation methods. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, downstream tasks, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method that enables language generation in the blink of an eye.
DiDi-Instruct (distilled from a 169M MDLM)
Distilled few-step student → up to 64× speedup with matched or better quality.
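To make the speed claim concrete, the sketch below shows plain few-step ancestral sampling for a masked dLLM, which is the kind of sampler a distilled student runs with 8 to 128 NFEs. It is a simplified illustration under assumed interfaces (a student(x) call returning per-token logits and a hypothetical mask_id), and it omits the reward guidance used by the paper's reward-guided ancestral sampler.

```python
# Simplified few-step ancestral sampling for a masked dLLM.
# Assumptions (not taken from this repo): `student(x)` returns per-token logits
# of shape (batch, seq_len, vocab_size); `mask_id` is the [MASK] token id.
# The paper's reward-guided sampler adds reward guidance on top of this loop.
import torch

@torch.no_grad()
def few_step_sample(student, seq_len, num_steps=8, mask_id=0, device="cpu"):
    # Start from an all-[MASK] sequence and reveal tokens over `num_steps` passes.
    x = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        logits = student(x)
        proposal = torch.distributions.Categorical(logits=logits).sample()
        still_masked = x == mask_id
        # Unmask roughly 1/(num_steps - step) of the remaining masked positions,
        # so every position is revealed by the final step.
        reveal = still_masked & (
            torch.rand(x.shape, device=device) < 1.0 / (num_steps - step)
        )
        x = torch.where(reveal, proposal, x)
    return x
```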
Before first use, create and activate the conda environment from the provided environment.yml:
conda env create -f environment.yml
conda activate mask_model
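If you want a quick sanity check that the environment is usable (optional, and not part of the repository's instructions), the following assumes the environment ships PyTorch:

```python
# Optional sanity check: confirm PyTorch is importable and whether CUDA is visible.
import torch

print(torch.__version__, torch.cuda.is_available())
```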
You need a pre-trained discrete diffusion language model (dLLM) as the teacher. You have two options:
- Option A (Train from scratch):
  - Refer to this script from DUO to train your own teacher model on OpenWebText.
  - This produces a checkpoint (e.g., mdlm.ckpt).
- Option B (Use a pre-trained checkpoint):
  - Download a pre-trained checkpoint from Google Drive (mdlm.ckpt).
  - Place it in the ./out/ directory for later use (a quick load check is sketched below).
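Before launching distillation, you can optionally confirm that the teacher checkpoint is readable. The snippet below is a minimal sketch that assumes a standard PyTorch/Lightning-style .ckpt containing a state_dict entry; it is not part of the repository.

```python
# Optional load check for the teacher checkpoint (assumes a Lightning-style .ckpt).
import torch

ckpt = torch.load("./out/mdlm.ckpt", map_location="cpu", weights_only=False)
state_dict = ckpt.get("state_dict", ckpt)  # fall back to a bare state dict
print(f"Loaded {len(state_dict)} tensors from the teacher checkpoint.")
```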
Once you have the teacher model checkpoint, distill a few-step student model for fast inference:
bash ./scripts/distill-didi-instruct-owt.sh
The script will look for the teacher checkpoint and begin the distillation process.
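For intuition about the grouped reward normalization mentioned in the abstract, here is a minimal sketch assuming rewards are standardized within each group of student samples; the actual objective and implementation live in algo.py and the training script and may differ in detail.

```python
# Minimal sketch of group-wise reward standardization (an assumption about the
# method's structure, not the repository's exact implementation).
import torch

def group_normalize_rewards(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Assumed shape: (num_groups, group_size), one row per group of student samples.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 4 groups of 8 sampled sequences each.
advantages = group_normalize_rewards(torch.randn(4, 8))
```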
Pre-trained checkpoint options:
- Option 1 (from Google Drive):
  - Download the distilled DiDi-Instruct checkpoint from Google Drive (didi-instruct.ckpt).
  - Place the .ckpt file in the ./out/ directory.
- Option 2 (from Hugging Face):
  - We provide the distilled model on Hugging Face.
  - Convert it to .ckpt format (a rough sketch of what the conversion involves is shown below) using:
    python ./models/hf_to_ckpt.py --hf_repo_id "haoyangzheng/didi-instruct-small" --output_dir "./out/didi-instruct.ckpt"
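To see roughly what the conversion amounts to, the sketch below downloads weights from the Hugging Face repo and repackages them as a Lightning-style .ckpt. The filename and checkpoint layout here are assumptions; ./models/hf_to_ckpt.py is the authoritative script.

```python
# Illustrative only: assumed weight filename and checkpoint layout.
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

weights_path = hf_hub_download(
    repo_id="haoyangzheng/didi-instruct-small",
    filename="model.safetensors",  # assumed filename
)
state_dict = load_file(weights_path)

# Wrap the weights the way a Lightning-style .ckpt is expected to look (assumed).
torch.save({"state_dict": state_dict}, "./out/didi-instruct.ckpt")
```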
Evaluate the distilled student model's performance by measuring perplexity and entropy compared to the teacher and baseline models:
bash ./scripts/eval-didi-instruct.sh
This produces performance metrics on the OpenWebText validation set.
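For reference, the perplexity reported for accelerated dLLMs is typically a generative perplexity: samples from the student are scored by an off-the-shelf autoregressive evaluator such as GPT-2. The sketch below shows only that scoring step, with an assumed evaluator choice; the repository's metrics come from metrics.py via the script above.

```python
# Score a generated sample with a GPT-2 evaluator (the evaluator choice is an
# assumption; this is not the repository's evaluation pipeline).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
evaluator = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

sample = "The distilled student generated this sentence in a handful of steps."
inputs = tokenizer(sample, return_tensors="pt")
with torch.no_grad():
    loss = evaluator(**inputs, labels=inputs["input_ids"]).loss
print("generative perplexity:", torch.exp(loss).item())
```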
didi-instruct-train/
├── configs/ # Configuration files
├── models/ # Model implementations
├── scripts/ # Training and evaluation scripts
├── out/ # Checkpoints and logs
├── algo.py # Core algorithm implementations
├── dataloader.py
├── main.py
├── metrics.py
├── trainer_base.py
├── utils.py
├── environment.yml
├── README.md
└── LICENSE.md
This repository is built upon DUO: "The Diffusion Duality. ICML 2025".
We also adopt ideas from DiMO, MDLM, SDTT, and nanoGPT.
If you find this repository useful, please cite the following work:
@inproceedings{zheng2025ultra,
title={{Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct}},
author={Zheng, Haoyang and Liu, Xinyang and Kong, Cindy Xiangrui and Jiang, Nan and Hu, Zheyuan and Luo, Weijian and Deng, Wei and Lin, Guang},
booktitle={{Proceedings of the International Conference on Learning Representations (ICLR)}},
year={2026}
}


