Major Features and Improvements
Train/Eval/Predict/Export
- Support training with dynamic batch size by sample cost in #343
- Support logging train metrics in #310
- Support predicting with a checkpoint in #320 #322 #324
- [EXPERIMENTAL] Support exporting with AOTInductor in #239 #274
- Support exporting with TensorRT in #318
- Support exporting the best model in #294
- Support exporting to RTP in #298 #307 #329 #332 #339
- Support AdamW optimizer and label smoothing in #297
- Support setting an optimizer for a subset of parameters in #297
- Support PanguDFS in #311 #348 #349 #350
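To illustrate the idea behind dynamic batch size by sample cost, here is a minimal sketch and not TorchEasyRec's actual implementation. It assumes each sample has an integer cost (here, the length of a hypothetical behavior sequence) and greedily closes a batch once the cost budget would be exceeded. The names `batch_by_cost`, `cost_fn`, and `max_cost` are illustrative only.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_by_cost(
    samples: Iterable[T],
    cost_fn: Callable[[T], int],
    max_cost: int,
) -> Iterator[List[T]]:
    """Greedily group samples so each batch's total cost stays within max_cost.

    A single sample whose cost alone exceeds max_cost still forms its own batch.
    """
    batch: List[T] = []
    batch_cost = 0
    for sample in samples:
        cost = cost_fn(sample)
        # Close the current batch if adding this sample would exceed the budget.
        if batch and batch_cost + cost > max_cost:
            yield batch
            batch, batch_cost = [], 0
        batch.append(sample)
        batch_cost += cost
    if batch:
        yield batch


# Hypothetical samples: sequences of varying length, with cost = sequence length.
sequences = [[1] * 5, [1] * 3, [1] * 4, [1] * 9, [1] * 2]
batches = list(batch_by_cost(sequences, cost_fn=len, max_cost=10))
batch_sizes = [len(b) for b in batches]
```

With a cost budget instead of a fixed sample count, batches of short sequences hold more samples and batches of long sequences hold fewer, which keeps per-step memory use more even.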
Embedding
- Support dynamic embedding in #279 #281 #283 #286 #289 #316
- Support initializing dynamic embeddings from tables in #282 #288
- MLPEmbedding supports features with value_dim > 1 in #331
Model
- Optimize and refactor the DlrmHSTU preprocessor to support MTGR-style preprocessing in #290 #296 #300 #314
- Decouple contextual feature dimension from sequence id embedding dimension in DlrmHSTU in #302
- DlrmHSTU supports sharing embeddings between uih and contextual features in #337
- DlrmHSTU supports a global average loss option in #334
- Add TMA support for HSTU attention in #336
- Optimize GPU memory usage of the GAUC metric in #312
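For readers unfamiliar with GAUC (group AUC), the sketch below shows one common definition: compute AUC separately within each group (for example, per user) and average the results weighted by group size, skipping groups that contain only one class. This is an illustrative pure-Python version under those assumptions; it is not the memory-optimized implementation referenced above, and the library's weighting convention may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def gauc(
    group_ids: Sequence[Hashable],
    labels: Sequence[int],
    scores: Sequence[float],
) -> float:
    """Sample-count-weighted average of per-group AUC (assumed weighting)."""
    groups: Dict[Hashable, Tuple[List[int], List[float]]] = defaultdict(
        lambda: ([], [])
    )
    for g, y, s in zip(group_ids, labels, scores):
        groups[g][0].append(y)
        groups[g][1].append(s)
    weighted_sum = 0.0
    total_weight = 0
    for ys, ss in groups.values():
        # AUC is undefined for groups with only positives or only negatives.
        if 0 < sum(ys) < len(ys):
            weighted_sum += len(ys) * auc(ys, ss)
            total_weight += len(ys)
    return weighted_sum / total_weight if total_weight else 0.0


# Hypothetical data: u3 has only positive labels and is skipped.
group_ids = ["u1", "u1", "u1", "u2", "u2", "u3", "u3"]
labels = [1, 0, 0, 1, 0, 1, 1]
scores = [0.9, 0.2, 0.95, 0.3, 0.6, 0.5, 0.4]
result = gauc(group_ids, labels, scores)
```

In this example u1 has AUC 0.5 over 3 samples and u2 has AUC 0.0 over 2 samples, so the weighted average is 1.5 / 5 = 0.3.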
Feature
Upgrade
- Upgrade PyTorch to v2.9 and TorchRec to v1.4.0 in #345
Note
For TorchEasyRec 1.0.x, you should use Docker image version 1.0.
- For the GPU version (CUDA 12.6):
mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel:1.0-cu126
- PyTorch: v2.9, CUDA: v12.6, FBGEMM: v1.4.0, TorchRec: v1.4.0, Python: v3.11
- Support for GPU driver version 470 has been dropped. If you still need to use it, you can set LD_LIBRARY_PATH=/usr/local/cuda-12.6/compat
- For the CPU version:
mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel:1.0-cpu
- PyTorch: v2.9, FBGEMM: v1.4.0, TorchRec: v1.4.0, Python: v3.11
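For the 470 driver workaround mentioned in the GPU note, the dynamic loader generally reads LD_LIBRARY_PATH when a process starts, so the variable needs to be in place before the job is launched. The sketch below builds an environment for a child process with the compat directory prepended. The helper name `env_with_cuda_compat` is hypothetical, and prepending while keeping existing entries is an assumption; the note itself only says to set the variable to the compat path.

```python
import os
from typing import Dict, Mapping, Optional

# Compat directory taken from the GPU note above.
CUDA_COMPAT_DIR = "/usr/local/cuda-12.6/compat"


def env_with_cuda_compat(
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the CUDA compat dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Keep existing entries, dropping empties and any duplicate of the compat dir.
    paths = [p for p in existing.split(":") if p and p != CUDA_COMPAT_DIR]
    env["LD_LIBRARY_PATH"] = ":".join([CUDA_COMPAT_DIR] + paths)
    return env


# Example with an explicit base environment (values are illustrative).
child_env = env_with_cuda_compat({"LD_LIBRARY_PATH": "/opt/lib"})
```

The resulting mapping can be passed as `env=child_env` to `subprocess.run` when starting the training or prediction command.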
Bug Fixes and Other Changes
- [feat] make bash as default shell by @tiankongdeguiji in #273
- [feat] add benchmark odps quota and skip trt test when trt not available by @tiankongdeguiji in #278
- [feat] add rdma addons into dockerfile by @tiankongdeguiji in #280
- [feat] clean up fg_encoded in docs by @tiankongdeguiji in #287
- support create tzrec config based on pyfg json by @chengaofei in #284
- [bugfix] fix finetune checkpoint path runtime error print when path not exist by @tiankongdeguiji in #291
- [feat] refactor export model by @tiankongdeguiji in #293
- fix sequence raw feature pyfg sub_type not effective by @chengaofei in #292
- [feat] optimize hstu triton op warning by @tiankongdeguiji in #301
- [bugfix] improve create init ckpt for dynamic embedding when certain id_feature in the config lack embedding_dim by @tiankongdeguiji in #303
- [feat] bump up tzrec version to 0.9.7 by @tiankongdeguiji in #305
- [bugfix] fix dlrm hstu gauc and l2 loss support by @tiankongdeguiji in #306
- [bugfix] fix content encoder with additional_content_features and target_enrich_features by @tiankongdeguiji in #304
- [bugfix] fix create dynamic embedding ckpt when raw feature in config by @tiankongdeguiji in #308
- Add evaluation metrics documentation by @yanzhen1233 in #309
- [bugfix] fix fsspec ci test by @tiankongdeguiji in #317
- add train_metric docs by @chengaofei in #315
- Update custom development model documentation by @yanzhen1233 in #313
- [feat] Adapt integration test config to local cuda device count by @eric-gecheng in #319
- [bugfix] fix dlrm hstu preprocessor doc by @tiankongdeguiji in #321
- [feat] add dlrm hstu demo data by @tiankongdeguiji in #326
- [bugfix] avoid jit convert error when using large number by @eric-gecheng in #328
- [feat] improve prune_unused_param_and_buffer when export model by @tiankongdeguiji in #327
- [feat] refactor ops directory to fix import triton error by @tiankongdeguiji in #335
- [feat] add assert to avoid using ckpt predict for two tower models by @eric-gecheng in #333
- [feat] upgrade 2025 dingtalk qrcode by @tiankongdeguiji in #340
- [bugfix] always lazy init predict checkpoint writer by @tiankongdeguiji in #341
- Feature/dynamic routing support zero init by @eric-gecheng in #342
- [bugfix] fix parse batch empty MapArray error in NegativeSampler by @tiankongdeguiji in #344
- [feat] add doc for dynamic batch by @tiankongdeguiji in #346
- [bugfix] fix array type in feature doc by @tiankongdeguiji in #347
Full Changelog: v0.9.0...v1.0.0