feat(checkpoint): support universal checkpoint #394
Open
li126com wants to merge 12 commits into InternLM:develop from
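Special note: the technical approach of this feature module is based on veScale checkpoint and ByteCheckpoint.
veScale: https://github.com/volcengine/veScale/tree/main
ByteCheckpoint: https://arxiv.org/abs/2407.20143
Universal checkpoint system
The universal ckpt system is independent of the original ckpt system, and the two are not compatible with each other.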
Basic functionality
For dense models, model ckpt and optimizer ckpt can be loaded dynamically across different parallel configurations.
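To illustrate what loading across different parallel configurations involves, here is a minimal sketch of offset-based resharding. It assumes each rank saves a contiguous slice of a flat parameter together with its global offset. The names `save_sharded` and `load_resharded` are hypothetical and are not taken from this PR, veScale, or ByteCheckpoint.

```python
from typing import List, Tuple

# Hypothetical layout: (global_offset, values) for each saved shard.
Shard = Tuple[int, List[float]]


def save_sharded(full: List[float], world_size: int) -> List[Shard]:
    """Split a flat parameter evenly and tag each shard with its global offset."""
    assert len(full) % world_size == 0, "sketch assumes an even split"
    chunk = len(full) // world_size
    return [(r * chunk, full[r * chunk:(r + 1) * chunk]) for r in range(world_size)]


def load_resharded(shards: List[Shard], total: int, new_world_size: int, rank: int) -> List[float]:
    """Rebuild the slice owned by `rank` under a new world size.

    Only the overlap between the requested range and each saved shard is copied.
    """
    assert total % new_world_size == 0, "sketch assumes an even split"
    chunk = total // new_world_size
    start, end = rank * chunk, (rank + 1) * chunk
    out: List[float] = []
    for offset, values in sorted(shards, key=lambda s: s[0]):
        lo = max(start, offset)
        hi = min(end, offset + len(values))
        if lo < hi:
            out.extend(values[lo - offset:hi - offset])
    return out


# Save with 2 ranks, then reload with 4 ranks.
full = [float(i) for i in range(8)]
saved = save_sharded(full, world_size=2)
reloaded = [load_resharded(saved, len(full), new_world_size=4, rank=r) for r in range(4)]
```

Because every shard carries its global offset, the loader does not need to know the configuration used at save time. Real tensors are multi-dimensional and sharded along several axes, so a production system tracks offsets and lengths per dimension.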
Optimizations
Accuracy verification
Resuming training from the step-100 ckpt saved under the dp4_zero2_tp2_pp2 configuration:

Performance comparison
7B model, 16 GPUs, dp4_zero2_tp2_pp2 configuration
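As a quick sanity check on the configuration label, the sketch below parses `dp4_zero2_tp2_pp2` and confirms it matches 16 GPUs. It assumes that ZeRO sharding happens inside the data-parallel group and so does not multiply into the device count. The helper names are hypothetical.

```python
import re
from typing import Dict


def parse_parallel_config(name: str) -> Dict[str, int]:
    """Parse a label such as 'dp4_zero2_tp2_pp2' into a name -> size mapping."""
    sizes: Dict[str, int] = {}
    for part in name.split("_"):
        match = re.fullmatch(r"([a-z]+)(\d+)", part)
        if match is None:
            raise ValueError(f"unrecognized component: {part!r}")
        sizes[match.group(1)] = int(match.group(2))
    return sizes


def gpu_count(sizes: Dict[str, int]) -> int:
    # Assumption: ZeRO shards state within the data-parallel group,
    # so only dp, tp and pp contribute to the number of devices.
    return sizes.get("dp", 1) * sizes.get("tp", 1) * sizes.get("pp", 1)


config = parse_parallel_config("dp4_zero2_tp2_pp2")
total_gpus = gpu_count(config)  # 4 * 2 * 2
```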
Ckpt save time:
Original ckpt: 38.8s
Universal ckpt: 18.5s for the first save, 0.88s for subsequent saves (save cache + async save)
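The gap between the first save and later saves could come from two effects, sketched below under stated assumptions. First, "save cache" is read here as reusing a save plan computed on the first call. Second, "async save" is read as writing files in a background thread after taking a snapshot. The class `AsyncCheckpointer` and its methods are hypothetical and do not reflect the code in this PR.

```python
import os
import pickle
import tempfile
import threading
from typing import Dict, List, Optional, Tuple

Plan = List[Tuple[str, str]]  # (tensor_name, file_name) pairs


class AsyncCheckpointer:
    """Toy checkpointer: caches the save plan and writes in a background thread."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self._plan: Optional[Plan] = None
        self._worker: Optional[threading.Thread] = None
        self.plan_builds = 0  # counts how often planning actually ran

    def _build_plan(self, state: Dict[str, List[float]]) -> Plan:
        # Stand-in for planning, which in a real system may need cross-rank coordination.
        self.plan_builds += 1
        return [(name, f"{name}.pkl") for name in sorted(state)]

    def save(self, state: Dict[str, List[float]], step: int) -> None:
        self.wait()  # allow only one in-flight save
        if self._plan is None or {n for n, _ in self._plan} != set(state):
            self._plan = self._build_plan(state)
        # Snapshot values so the caller may keep mutating `state` during the write.
        snapshot = {name: list(state[name]) for name, _ in self._plan}
        step_dir = os.path.join(self.directory, f"step_{step}")
        self._worker = threading.Thread(
            target=self._write, args=(step_dir, list(self._plan), snapshot)
        )
        self._worker.start()

    @staticmethod
    def _write(step_dir: str, plan: Plan, snapshot: Dict[str, List[float]]) -> None:
        os.makedirs(step_dir, exist_ok=True)
        for name, file_name in plan:
            with open(os.path.join(step_dir, file_name), "wb") as f:
                pickle.dump(snapshot[name], f)

    def wait(self) -> None:
        """Block until the in-flight save, if any, has finished."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None


ckpt_dir = tempfile.mkdtemp()
ckpt = AsyncCheckpointer(ckpt_dir)
state = {"weight": [1.0, 2.0], "bias": [0.5]}
ckpt.save(state, step=100)
state["bias"][0] = 9.0  # mutation after save() must not leak into step_100
ckpt.save(state, step=200)
ckpt.wait()

with open(os.path.join(ckpt_dir, "step_100", "bias.pkl"), "rb") as f:
    bias_at_100 = pickle.load(f)
```

In this sketch the caller only pays for the snapshot copy on later saves, while planning is skipped and disk I/O overlaps with training. Calling `wait()` before exit ensures the last checkpoint is fully written.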
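Ckpt load time:
When reloading under the same configuration, universal ckpt and original ckpt take about the same time, roughly 22s each.
Under a changed configuration, load time depends on the specific new configuration; across the experiment groups in the accuracy verification above, universal ckpt load time ranged from 22s to 70s.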