By Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, and Guang Lin
- 2026-02-05: We released the training code.
- 2026-01-25: DiDi-Instruct was accepted to ICLR 2026.
- 2026-01-16: Invited talk on DiDi-Instruct is now available on YouTube.
- 2025-10-06: We updated the blog.
- 2025-10-05: We released the checkpoint on Hugging Face.
- 2025-10-03: We updated the evaluation code and released the model checkpoint.
- 2025-09-29: We uploaded our work to arXiv.
Fast and high-quality language generation is the holy grail of modern AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that initializes from a pre-trained (masked) discrete diffusion language model (dLLM) and distills a few-step student for fast generation. The resulting DiDi-Instruct model matches or surpasses its dLLM teacher and the GPT-2 baseline while enabling up to 64× acceleration. The theoretical foundation of DiDi-Instruct is a novel framework based on integral KL-divergence minimization, which yields a practical training algorithm. We further introduce grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler that significantly improve training stability, model coverage, and inference quality. On OpenWebText, DiDi-Instruct achieves perplexities from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline. These gains come with negligible entropy loss (around 1%) and more than 20× less additional training wall-clock time than competing dLLM distillation methods. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, downstream tasks, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method that enables language generation in the blink of an eye.
DiDi-Instruct (distilled from a 169M MDLM)
Distilled few-step student → up to 64× speedup with matched or better quality.
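To make the speed claim concrete, the sketch below shows plain few-step ancestral sampling for a masked dLLM, which is the kind of sampler a distilled student runs with 8 to 128 NFEs. It is a simplified illustration under assumed interfaces (a student(x) call returning per-token logits and a hypothetical mask_id), and it omits the reward guidance used by the paper's reward-guided ancestral sampler.

```python
# Simplified few-step ancestral sampling for a masked dLLM.
# Assumptions (not taken from this repo): `student(x)` returns per-token logits
# of shape (batch, seq_len, vocab_size); `mask_id` is the [MASK] token id.
# The paper's reward-guided sampler adds reward guidance on top of this loop.
import torch

@torch.no_grad()
def few_step_sample(student, seq_len, num_steps=8, mask_id=0, device="cpu"):
    # Start from an all-[MASK] sequence and reveal tokens over `num_steps` passes.
    x = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        logits = student(x)
        proposal = torch.distributions.Categorical(logits=logits).sample()
        still_masked = x == mask_id
        # Unmask roughly 1/(num_steps - step) of the remaining masked positions,
        # so every position is revealed by the final step.
        reveal = still_masked & (
            torch.rand(x.shape, device=device) < 1.0 / (num_steps - step)
        )
        x = torch.where(reveal, proposal, x)
    return x
```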
Before first use, create and activate the conda environment from the provided environment.yml:
conda env create -f environment.yml
conda activate mask_model
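If you want a quick sanity check that the environment is usable (optional, and not part of the repository's instructions), the following assumes the environment ships PyTorch:

```python
# Optional sanity check: confirm PyTorch is importable and whether CUDA is visible.
import torch

print(torch.__version__, torch.cuda.is_available())
```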
You need a pre-trained discrete diffusion language model (dLLM) as the teacher. You have two options:
- Option A (Train from scratch):
  - Refer to this script from DUO to train your own teacher model on OpenWebText.
  - This produces a checkpoint (e.g., mdlm.ckpt).
- Option B (Use a pre-trained checkpoint):
  - Download a pre-trained checkpoint from Google Drive (mdlm.ckpt).
  - Place it in the ./out/ directory for later use (a quick load check is sketched below).
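Before launching distillation, you can optionally confirm that the teacher checkpoint is readable. The snippet below is a minimal sketch that assumes a standard PyTorch/Lightning-style .ckpt containing a state_dict entry; it is not part of the repository.

```python
# Optional load check for the teacher checkpoint (assumes a Lightning-style .ckpt).
import torch

ckpt = torch.load("./out/mdlm.ckpt", map_location="cpu", weights_only=False)
state_dict = ckpt.get("state_dict", ckpt)  # fall back to a bare state dict
print(f"Loaded {len(state_dict)} tensors from the teacher checkpoint.")
```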
Once you have the teacher model checkpoint, distill a few-step student model for fast inference:
bash ./scripts/distill-didi-instruct-owt.sh
The script will look for the teacher checkpoint and begin the distillation process.
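For intuition about the grouped reward normalization mentioned in the abstract, here is a minimal sketch assuming rewards are standardized within each group of student samples; the actual objective and implementation live in algo.py and the training script and may differ in detail.

```python
# Minimal sketch of group-wise reward standardization (an assumption about the
# method's structure, not the repository's exact implementation).
import torch

def group_normalize_rewards(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Assumed shape: (num_groups, group_size), one row per group of student samples.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 4 groups of 8 sampled sequences each.
advantages = group_normalize_rewards(torch.randn(4, 8))
```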
Pre-trained checkpoint options:
- Option 1 (from Google Drive):
  - Download the distilled DiDi-Instruct checkpoint from Google Drive (didi-instruct.ckpt).
  - Place the .ckpt file in the ./out/ directory.
- Option 2 (from Hugging Face):
  - We provide the distilled model on Hugging Face.
  - Convert it to .ckpt format (a rough sketch of what the conversion involves is shown below) using:
    python ./models/hf_to_ckpt.py --hf_repo_id "haoyangzheng/didi-instruct-small" --output_dir "./out/didi-instruct.ckpt"
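To see roughly what the conversion amounts to, the sketch below downloads weights from the Hugging Face repo and repackages them as a Lightning-style .ckpt. The filename and checkpoint layout here are assumptions; ./models/hf_to_ckpt.py is the authoritative script.

```python
# Illustrative only: assumed weight filename and checkpoint layout.
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

weights_path = hf_hub_download(
    repo_id="haoyangzheng/didi-instruct-small",
    filename="model.safetensors",  # assumed filename
)
state_dict = load_file(weights_path)

# Wrap the weights the way a Lightning-style .ckpt is expected to look (assumed).
torch.save({"state_dict": state_dict}, "./out/didi-instruct.ckpt")
```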
Evaluate the distilled student model's performance by measuring perplexity and entropy compared to the teacher and baseline models:
bash ./scripts/eval-didi-instruct.sh
This produces performance metrics on the OpenWebText validation set.
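For reference, the perplexity reported for accelerated dLLMs is typically a generative perplexity: samples from the student are scored by an off-the-shelf autoregressive evaluator such as GPT-2. The sketch below shows only that scoring step, with an assumed evaluator choice; the repository's metrics come from metrics.py via the script above.

```python
# Score a generated sample with a GPT-2 evaluator (the evaluator choice is an
# assumption; this is not the repository's evaluation pipeline).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
evaluator = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

sample = "The distilled student generated this sentence in a handful of steps."
inputs = tokenizer(sample, return_tensors="pt")
with torch.no_grad():
    loss = evaluator(**inputs, labels=inputs["input_ids"]).loss
print("generative perplexity:", torch.exp(loss).item())
```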
didi-instruct-train/
├── configs/ # Configuration files
├── models/ # Model implementations
├── scripts/ # Training and evaluation scripts
├── out/ # Checkpoints and logs
├── algo.py # Core algorithm implementations
├── dataloader.py
├── main.py
├── metrics.py
├── trainer_base.py
├── utils.py
├── environment.yml
├── README.md
└── LICENSE.md
This repository is built upon DUO: "The Diffusion Duality. ICML 2025".
We also adopt ideas from DiMO, MDLM, SDTT, and nanoGPT.
If you find this repository useful, please cite the following work:
@inproceedings{zheng2025ultra,
title={{Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct}},
author={Zheng, Haoyang and Liu, Xinyang and Kong, Cindy Xiangrui and Jiang, Nan and Hu, Zheyuan and Luo, Weijian and Deng, Wei and Lin, Guang},
booktitle={{Proceedings of the International Conference on Learning Representations (ICLR)}},
year={2026}
}


