v1.5.0

@TroyGarden released this 15 Feb 20:00

Announcement

Starting with the next release, TorchRec will migrate from Pyre to Pyrefly for type checking. Contributors should be aware that type checking workflows and configurations will change accordingly.

New Features

Fully Sharded 2D Parallelism

Fully Sharded 2D Parallelism introduces a new sharding strategy that combines fully sharded parallelism with TorchRec's 2D (replication × sharding) distribution, enabling more efficient utilization of GPU resources for large-scale embedding tables. It includes support for uneven shard sizes, dynamic 2D sharding, and annotated collectives; a conceptual sketch of the 2D layout follows the list below.

  • Fully Sharded 2D Parallelism [#3558]
  • Uneven shard sizes support for Fully Sharded 2D collectives [#3584]
  • Dynamic 2D + Fully Sharded 2D [#3600]
  • Fix padding logic for Fully Sharded 2D [#3626]
  • Add annotations to Fully Sharded 2D collectives [#3678]
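
For orientation, the sketch below shows the two process-group dimensions that a 2D layout implies, using PyTorch's standard DeviceMesh API. The mesh shape and dimension names are illustrative assumptions, not the Fully Sharded 2D implementation itself, which lives behind TorchRec's sharding entry points.

```python
from torch.distributed.device_mesh import init_device_mesh

# Illustrative only: 16 ranks factored into a 2 x 8 mesh (run under a
# distributed launcher). Each group of 8 ranks along "shard" holds one full
# copy of the sharded embedding tables; copies are replicated along
# "replicate".
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("replicate", "shard"))

shard_pg = mesh.get_group("shard")          # all-to-all embedding exchange within a copy
replicate_pg = mesh.get_group("replicate")  # gradient all-reduce across copies
```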

Train Pipeline Enhancements

Several new pipeline capabilities have been added, including fused sparse-dist enhancements for higher training throughput and an in-place batch copy that reduces HBM usage; a minimal pipeline usage sketch follows the list below.

  • Add inplace_copy_batch_to_gpu in TrainPipeline, enabling non-blocking host-to-device batch transfer to save HBM usage [#3526, #3532, #3641]
  • Add device parameter to KeyedJaggedTensor.empty_like and copy_ methods [#3510]
  • Refactor TrainPipelineBase to clean input batch after the forward pass [#3530]
  • Support enqueue_batch_after_forward in TrainPipelineFusedSparseDist [#3675]
  • Train pipeline with FP own bucket feature [#3683]
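
For context, here is a minimal training loop built on TrainPipelineSparseDist, the long-standing pipeline these changes extend. The fused sparse-dist and in-place copy behaviors above are opt-in variants not shown here, so treat this as a sketch rather than a complete configuration.

```python
import torch
from torchrec.distributed.train_pipeline import TrainPipelineSparseDist

def train_one_epoch(model, optimizer, dataloader, device: torch.device) -> None:
    # The pipeline overlaps host-to-device copies, the sparse input
    # all-to-all, and dense compute across CUDA streams.
    pipeline = TrainPipelineSparseDist(model, optimizer, device)
    batches = iter(dataloader)
    while True:
        try:
            # One call advances copy, input dist, forward, backward, and step.
            pipeline.progress(batches)
        except StopIteration:
            break
```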

Benchmark

Continued expansion of the unified benchmarking framework with new benchmark scenarios, memory profiling, and a GitHub benchmark workflow; a memory-snapshot sketch follows the list below.

  • Memory analysis and profiling: CUDA memory footprint in multi-stream scenario, memory snapshot for non-blocking copy [#3480, #3485, #3504]
  • Device-to-Host LazyAwaitable with knowledge sharing, demonstrating host-device comms [#3477, #3492]
  • Add new benchmark scenarios: base pipeline light, KV-ZCH, MP-ZCH, VBE [#3580, #3540, #3604, #3642, #3585]
  • Benchmark infrastructure improvements: ModelSelectionConfig, prettified output, log level [#3467, #3639, #3494]
  • Create GitHub benchmark workflow [#3631]
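
As a taste of the memory-snapshot tooling referenced above, the sketch below records an allocator trace around a stand-in non-blocking copy using PyTorch's semi-private (underscore-prefixed) memory-history APIs. The workload, file name, and entry budget are arbitrary; this is not the TorchRec benchmark harness itself.

```python
import torch

# Start recording allocator events (allocations, frees, stream assignments).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# Stand-in for the pipelined copies under test: a non-blocking
# device-to-host copy into pinned memory.
x = torch.randn(1 << 20, device="cuda")
host = torch.empty(x.shape, dtype=x.dtype, pin_memory=True)
host.copy_(x, non_blocking=True)
torch.cuda.synchronize()

# Dump a snapshot viewable at https://pytorch.org/memory_viz, then stop recording.
torch.cuda.memory._dump_snapshot("nonblocking_copy_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```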

RecMetrics

New metrics have been added to expand TorchRec's metric coverage for multi-label, regression, and serving use cases; an illustrative per-label precision computation follows the list below.

  • Per-label precision metric for multi-label tasks [#3661]
  • Lifetime AUPRC [#3674]
  • Serving AE loss metric [#3681]
  • Label average metric for regression APS model [#3650]
  • NMSE metric for APS model [#3489 (internal)]
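
To make the per-label precision addition concrete, here is the underlying math sketched in plain PyTorch. TorchRec's RecMetric implementations are windowed, stateful, and distributed-aware; this shows the computation only, and all names here are hypothetical.

```python
import torch

def per_label_precision(
    preds: torch.Tensor,     # [batch, num_labels] predicted probabilities
    labels: torch.Tensor,    # [batch, num_labels] binary ground truth
    threshold: float = 0.5,
) -> torch.Tensor:           # [num_labels] precision for each label
    pred_pos = preds >= threshold
    true_pos = (pred_pos & (labels > 0)).sum(dim=0).to(torch.float32)
    total_pos = pred_pos.sum(dim=0).to(torch.float32)
    # Define precision as 0 for labels with no predicted positives.
    return torch.where(total_pos > 0, true_pos / total_pos, torch.zeros_like(true_pos))
```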

Delta Tracking and Publishing

The ModelDeltaTracker and DeltaStore APIs have been generalized and extended with raw-ID tracking capabilities, enabling more flexible delta-based model publishing workflows; a conceptual sketch follows the list below.

  • Update ModelDeltaTracker and DeltaStore to be Generic [#3468, #3469, #3470]
  • Update DeltaCheckpointing and DeltaPublish with generic model tracker [#3543]
  • ModelDeltaTracker improvements: post-init initialization, optim state tracking, optim state init bug fix [#3472, #3143, #3476]
  • Raw ID tracker: add tracker, wrapper, post lookup function, DMP integration, and hash_zch_runtime_meta support [#3500, #3501, #3502, #3506, #3527, #3541, #3542, #3545, #3598, #3599]
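
Conceptually, delta tracking records which embedding rows changed since the last publish so that only those rows need to be shipped. The toy class below illustrates the idea with hypothetical names; it is not the ModelDeltaTracker or DeltaStore API.

```python
from collections import defaultdict
import torch

class TinyDeltaStore:
    """Toy delta store: tracks dirty row IDs per table between publishes."""

    def __init__(self) -> None:
        self._dirty: dict[str, set[int]] = defaultdict(set)

    def record(self, table: str, ids: torch.Tensor) -> None:
        # Called on each lookup/update with the raw IDs that were touched.
        self._dirty[table].update(ids.tolist())

    def publish(self, table: str, weights: torch.Tensor) -> dict[int, torch.Tensor]:
        # Return only the changed rows, then reset the dirty set.
        rows = {i: weights[i].clone() for i in sorted(self._dirty[table])}
        self._dirty[table].clear()
        return rows
```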

KVZCH Enhancements

Continued improvements to Key-Value Zero-Collision Hashing (KVZCH), including automatic feature-score collection and eviction-policy updates; a conceptual eviction sketch follows the list below.

  • Enable feature score auto collection in EBC and EC [#3475, #3474]
  • Eviction policy improvements: no-eviction support, free-memory trigger with all2all, skip feature-score threshold for TTL, config rename [#3488, #3490, #3552, #3514]
  • Per-feature ZCH lookup support for memory layer [#3618]
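
For intuition, score-based eviction in a key-value embedding store keeps a running feature score per ID and drops the lowest-scoring IDs under memory pressure. The sketch below is purely conceptual, with hypothetical names; it is not the KVZCH implementation.

```python
import heapq

class ScoredKVStore:
    """Toy KV embedding index with score-based eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.scores: dict[int, float] = {}

    def touch(self, key: int, score: float) -> None:
        # Auto-collected feature score accumulates per ID on each access.
        self.scores[key] = self.scores.get(key, 0.0) + score
        if len(self.scores) > self.capacity:
            self._evict(len(self.scores) - self.capacity)

    def _evict(self, n: int) -> None:
        # Drop the n lowest-score IDs (a no-eviction policy would skip this).
        lowest = heapq.nsmallest(n, ((s, k) for k, s in self.scores.items()))
        for _, key in lowest:
            del self.scores[key]
```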

Python Free-Threading Support

TorchRec now supports Python free-threading (PEP 703) on Python 3.14, enabling better performance in multi-threaded environments; a quick interpreter check follows the list below.

  • Support python free-threading [#3684, #3686]
  • Update supported Python version in setup.py, unittest workflow, and documentation [#3596, #3662, #3483]
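
A quick way to confirm which interpreter you are running on, using standard-library calls available on CPython 3.13+ builds:

```python
import sys
import sysconfig

# 1 on free-threaded (PEP 703) builds, 0 or None otherwise.
print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# On 3.13+ builds, reports whether the GIL is actually active at runtime.
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled:", sys._is_gil_enabled())
```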

Change Log

  • Direct MX4→BF16 dequantization to reduce memory [#3620]
  • Input distribution latency estimations [#3575]
  • Enable specifying output dtype for fp8 quantized communication [#3568]
  • Add custom all2all interface [#3454]
  • Add MC-EBC quantized embedding modules for inference [#3572]
  • VBE improvements: KJT validator, pre-allocated output tensor & offsets for TBE, MC-EBC support [#3645, #3624, #3617]
  • PT2 compatibility: TBE serialization support to IR, EBC short-circuit kwargs support, Dynamo pruning logic update, generate a single acc graph by removing the FX wrapper for KJT [#3637, #3557, #3566, #3582]
  • Rowwise sharding support for feature processors [#3606]
  • Fix grad clipping compatibility with CPU training [#3679]
  • Fix NaN handling in AUPRC metric calculation [#3523]
  • FQN/checkpointing tests for RecMetrics [#3612]
  • Add Metric compatibility test for RecMetricsModule [#3586]
  • Shard plan validation: shard to rank assignment [#3495]
  • Debug embedding modules for NaN detection in backward [#3519]
  • Enable logging for plan(), ShardEstimators, and TrainingPipeline constructors [#3576]
  • Deduplicate by object_id in fused optimizer [#3666]
  • Cache weight/optimizer tensor mappings for efficient sync() [#3610]
  • Test fixes and stability improvements [#3592, #3672, #3644, #3621, #3590, #3589, #3528]
  • Full change log

Compatibility

  • fbgemm-gpu==1.5.0
  • torch==2.10.0
