Fine-tuning a Diffusion Language Model from MiniMind-LLM #618
jingyaogong announced in Announcements
Replies: 2 comments · 2 replies
- Bro, what's your research direction? My own coding skills feel pretty weak; what should I do?
0 replies
- Sir, I'm a newbie. Where is the dataset for training the LLM? I found the generated text is not coherent; is that due to a lack of training data or training steps?
2 replies
SFT AR-2-MaskedDiffusion Model
dLLMs (diffusion large language models) have been quite hot lately. Let me say this upfront, though: they are almost entirely outside my research scope, and I will probably not follow the field in the future.

This post is a minimal implementation: with the smallest possible set of changes, the existing AR MiniMind model is continue-trained into a (not yet very logical) dLLM.
Why many features (including this one) are not merged into the main branch:
What can this Discussion help you do?
The only 3 core changes
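The actual diffs live in the complete code below. As a rough illustration of what an AR-to-masked-diffusion conversion typically touches on the training side (causal attention relaxed to full attention, and the next-token loss swapped for a masked-token loss at a random ratio), here is a minimal LLaDA-style sketch. All names (`model`, `prompt_lens`, `mask_token_id`) and the 1/t weighting are assumptions for illustration, not this repo's exact code; the `mask_ratio` column in the training log under Step 2 corresponds to the sampled `t`:

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, input_ids, prompt_lens, mask_token_id, eps=1e-3):
    """Illustrative LLaDA-style objective (an assumption, not the repo's
    exact code): sample a mask ratio t ~ U(0,1), mask that fraction of the
    response tokens, and train the model to recover them with full
    (non-causal) attention."""
    B, L = input_ids.shape
    device = input_ids.device
    t = torch.rand(B, 1, device=device).clamp(min=eps)   # per-sample mask ratio
    pos = torch.arange(L, device=device).expand(B, L)
    is_response = pos >= prompt_lens.unsqueeze(1)        # the prompt stays visible
    masked = (torch.rand(B, L, device=device) < t) & is_response
    noisy = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(noisy).logits  # assumes bidirectional attention inside the model
    # cross-entropy only on masked positions, reweighted by 1/t (diffusion ELBO)
    ce = F.cross_entropy(logits[masked], input_ids[masked], reduction="none")
    return (ce / t.expand(B, L)[masked]).mean()
```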
Complete code
Just create the files following the directory tree structure below:
model_minimind_dllm.py
train_full_sft_dllm.py
eval_dllm.py
dllm_demo.py
lm_dataset.py
For repo versions before 260115, this file needs to be replaced with:
🚀 Quick Start
Step 1: Put the code shown above into the corresponding locations.
Directory tree for reference:
Step 2: Training (initialized from AR weights by default)
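"Initialized from AR weights" plausibly amounts to loading the AR SFT checkpoint into an architecturally near-identical model and designating a [MASK] token. A minimal sketch, where the `MiniMindDLLM` class name and its no-argument constructor are hypothetical placeholders based on the file names above:

```python
import torch

# Hypothetical import based on the file list above; the real class and
# config in model_minimind_dllm.py may differ.
from model_minimind_dllm import MiniMindDLLM

model = MiniMindDLLM()
state_dict = torch.load("full_sft_512.pth", map_location="cpu")
# strict=False tolerates any diffusion-specific parameters (e.g. a [MASK]
# embedding row) that the AR checkpoint does not contain
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {missing}\nunexpected keys: {unexpected}")
```

One common choice is to reuse a reserved, unused token id as [MASK], so the embedding table needs no resizing and the AR checkpoint loads essentially unchanged.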
Fine-tune from the minimind2 full_sft_512.pth checkpoint (takes only about 5-10 minutes):
```
(gongjy) root@ubuntu:~/minimind$ CUDA_VISIBLE_DEVICES=0 python train_sft_dllm.py
Epoch:[1/2](100/37961), loss: 4.0809, mask_ratio: 0.31, lr: 0.00005000, eta: 46.0min
Epoch:[1/2](200/37961), loss: 3.8554, mask_ratio: 0.27, lr: 0.00005000, eta: 34.0min
Epoch:[1/2](300/37961), loss: 4.0565, mask_ratio: 0.30, lr: 0.00005000, eta: 30.0min
Epoch:[1/2](400/37961), loss: 4.0782, mask_ratio: 0.26, lr: 0.00005000, eta: 28.0min
Epoch:[1/2](500/37961), loss: 3.8238, mask_ratio: 0.30, lr: 0.00005000, eta: 27.0min
Epoch:[1/2](600/37961), loss: 4.1514, mask_ratio: 0.35, lr: 0.00004999, eta: 26.0min
Epoch:[1/2](700/37961), loss: 3.4831, mask_ratio: 0.28, lr: 0.00004999, eta: 25.0min
Epoch:[1/2](800/37961), loss: 3.6387, mask_ratio: 0.36, lr: 0.00004999, eta: 25.0min
Epoch:[1/2](900/37961), loss: 3.1424, mask_ratio: 0.30, lr: 0.00004998, eta: 25.0min
Epoch:[1/2](1000/37961), loss: 3.2463, mask_ratio: 0.25, lr: 0.00004998, eta: 24.0min
Epoch:[1/2](1100/37961), loss: 3.6453, mask_ratio: 0.29, lr: 0.00004998, eta: 24.0min
Epoch:[1/2](1200/37961), loss: 3.2778, mask_ratio: 0.24, lr: 0.00004997, eta: 24.0min
Epoch:[1/2](1300/37961), loss: 3.3814, mask_ratio: 0.29, lr: 0.00004997, eta: 24.0min
Epoch:[1/2](1400/37961), loss: 2.9186, mask_ratio: 0.22, lr: 0.00004996, eta: 24.0min
Epoch:[1/2](1500/37961), loss: 3.3576, mask_ratio: 0.33, lr: 0.00004996, eta: 24.0min
Epoch:[1/2](1600/37961), loss: 3.6901, mask_ratio: 0.32, lr: 0.00004995, eta: 23.0min
Epoch:[1/2](1700/37961), loss: 2.1673, mask_ratio: 0.21, lr: 0.00004994, eta: 23.0min
Epoch:[1/2](1800/37961), loss: 3.3618, mask_ratio: 0.28, lr: 0.00004994, eta: 22.0min
Epoch:[1/2](1900/37961), loss: 2.7446, mask_ratio: 0.24, lr: 0.00004993, eta: 22.0min
Epoch:[1/2](2000/37961), loss: 2.3897, mask_ratio: 0.22, lr: 0.00004992, eta: 22.0min
Epoch:[1/2](2100/37961), loss: 3.2938, mask_ratio: 0.30, lr: 0.00004992, eta: 22.0min
Epoch:[1/2](2200/37961), loss: 2.9948, mask_ratio: 0.24, lr: 0.00004991, eta: 22.0min
Epoch:[1/2](2300/37961), loss: 3.0744, mask_ratio: 0.29, lr: 0.00004990, eta: 22.0min
Epoch:[1/2](2400/37961), loss: 3.2724, mask_ratio: 0.32, lr: 0.00004989, eta: 22.0min
Epoch:[1/2](2500/37961), loss: 2.6671, mask_ratio: 0.22, lr: 0.00004988, eta: 22.0min
Epoch:[1/2](2600/37961), loss: 3.1767, mask_ratio: 0.30, lr: 0.00004987, eta: 22.0min
Epoch:[1/2](2700/37961), loss: 2.3531, mask_ratio: 0.19, lr: 0.00004986, eta: 22.0min
Epoch:[1/2](2800/37961), loss: 3.2795, mask_ratio: 0.34, lr: 0.00004985, eta: 22.0min
Epoch:[1/2](2900/37961), loss: 2.5039, mask_ratio: 0.21, lr: 0.00004984, eta: 22.0min
Epoch:[1/2](3000/37961), loss: 2.5798, mask_ratio: 0.24, lr: 0.00004983, eta: 22.0min
Epoch:[1/2](3100/37961), loss: 2.9274, mask_ratio: 0.29, lr: 0.00004982, eta: 22.0min
Epoch:[1/2](3200/37961), loss: 2.6174, mask_ratio: 0.25, lr: 0.00004980, eta: 22.0min
Epoch:[1/2](3300/37961), loss: 3.1982, mask_ratio: 0.30, lr: 0.00004979, eta: 21.0min
Epoch:[1/2](3400/37961), loss: 2.9121, mask_ratio: 0.27, lr: 0.00004978, eta: 21.0min
Epoch:[1/2](3500/37961), loss: 2.9347, mask_ratio: 0.23, lr: 0.00004976, eta: 21.0min
Epoch:[1/2](3600/37961), loss: 3.2623, mask_ratio: 0.31, lr: 0.00004975, eta: 21.0min
...
```

Step 3: Evaluation:
After 1,000 training steps:
After 10,000 training steps (final export):
The exported [512*8] model weights can be downloaded from this link:
https://huggingface.co/jingyaogong/MiniMind2-Pytorch/blob/main/dllm_sft_512.pth
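Decoding in the evaluation/demo scripts works by iterative unmasking rather than left-to-right sampling. Below is a generic low-confidence-remasking sampler in the style of LLaDA; the linear unmasking schedule, step count, and model interface are assumptions, not necessarily what eval_dllm.py or dllm_demo.py implement:

```python
import torch

@torch.no_grad()
def diffusion_generate(model, prompt_ids, mask_token_id, gen_len=64, steps=8):
    """Sketch of iterative unmasking: start from an all-[MASK] response and,
    over a fixed number of steps, commit the most confident predictions
    while the remaining positions stay masked."""
    device = prompt_ids.device
    prompt_len = prompt_ids.shape[1]
    x = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_token_id, device=device)], dim=1
    )
    for step in range(steps):
        logits = model(x).logits[:, prompt_len:]      # response positions only
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        still_masked = x[:, prompt_len:] == mask_token_id
        conf = conf.masked_fill(~still_masked, -1.0)  # never re-pick committed slots
        # linear schedule: unmask roughly gen_len/steps fresh positions per pass
        k = int(gen_len * (step + 1) / steps) - int(gen_len * step / steps)
        idx = conf.topk(k, dim=-1).indices
        x[:, prompt_len:].scatter_(1, idx, pred.gather(1, idx))
    return x[:, prompt_len:]
```

The response is produced in `steps` parallel forward passes instead of `gen_len` sequential ones, which is the usual efficiency argument for dLLMs. Note that this naive loop recomputes the full sequence on every pass, which is exactly the KV Cache incompatibility discussed below.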
This is a proof of concept: it amounts to a minimal A2D (AR-to-Diffusion) fine-tune on only 60M of data. At this point the model seems to have learned to talk; it knows how to answer questions and has basic grammar, but coherence and logic are still quite poor. That's fine, though: for a dLLM fine-tuned in under 10 minutes, this is already quite good.
For this toy model, there are countless places to improve:
The industry is generally not optimistic about the development of dLLMs. In terms of quality, the ceiling still appears to be chasing AR (and trailing by a clear margin). Discrete diffusion is not backed by the elegant mathematical framework that continuous diffusion enjoys; frankly, the current path of a scaled-up "Masked BERT" is hard to make truly work. Test-time compute scaling paradigms like CoT also seem out of reach, KV Cache does not apply, and there is a pile of downstream application problems such as concurrent inference. From theory to ecosystem to engineering, it is all still a long way off.
Finally: diffusion language models are an interesting direction, but given the problems above, without some new catalyst I will probably not keep researching them.
This post is just meant to spark discussion; if you are interested, feel free to fork and tinker!
Any questions can be discussed here in the Discussion (though replies are not guaranteed) 🤗😉🫣
Happy!