Open
Labels
dynamicemb (Related with dynamicemb), enhancement (Improvement for existing feature)
Milestone
Description
Background
Currently, DynamicEmb has a custom input_dist implementation (RwSparseFeaturesDist in input_dist.py) but still relies on TorchRec's original output_dist implementation. This causes:
- Performance issue: The `unbucketize_permute` operation in TorchRec's output distribution is slow, especially for non-contiguous distribution patterns (e.g., round-robin)
- Limited customization: Cannot optimize the output distribution without modifying TorchRec source code
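To make the cost concrete, here is a minimal pure-Python sketch of what an unbucketize permute does. It is an illustration only, not TorchRec's or DynamicEmb's implementation: the helper names `bucketize` and `unbucketize` are hypothetical, plain lists stand in for tensors, and the convention `output[i] = bucketized[permute[i]]` is an assumption made for this example.

```python
def bucketize(ids, bucket_of):
    """Group ids by destination bucket (stable within each bucket).

    Returns the bucketized ids and a permute list such that
    bucketized[permute[i]] == ids[i] for every original position i.
    Hypothetical helper for illustration only.
    """
    # Original positions, ordered by destination bucket; sorted() is stable.
    order = sorted(range(len(ids)), key=lambda i: bucket_of(ids[i]))
    bucketized = [ids[i] for i in order]
    permute = [0] * len(ids)
    for new_pos, old_pos in enumerate(order):
        permute[old_pos] = new_pos
    return bucketized, permute


def unbucketize(values, permute):
    """Restore the original order with a gather: output[i] = values[permute[i]]."""
    return [values[p] for p in permute]


ids = [0, 1, 2, 3, 4, 5]

# Round-robin over 2 buckets: neighbouring ids land in different buckets,
# so the gather that restores the order jumps around in memory.
rr_bucketized, rr_permute = bucketize(ids, lambda x: x % 2)

# Contiguous blocks of 3 ids per bucket: the permute is the identity.
blk_bucketized, blk_permute = bucketize(ids, lambda x: x // 3)
```

With round-robin bucketing the permute is `[0, 3, 1, 4, 2, 5]`, a strided access pattern, while contiguous bucketing yields the identity `[0, 1, 2, 3, 4, 5]`. This scattered gather is the pattern the issue singles out as slow.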
Objective
Port TorchRec's output distribution classes to the DynamicEmb library, enabling future performance optimizations.
Tasks
PR 1: Port output distribution classes to DynamicEmb
- Create `dynamicemb/output_dist.py` with:
  - `RwSequenceEmbeddingDist`
  - `RwPooledEmbeddingDist`
- Update `dynamicemb/planner/rw_sharding.py` to override `create_output_dist()` methods
- Verify with existing tests (`test_sequence_embedding_fw.py`, `test_pooled_embedding_fw.py`)
PR 2: Optimize unbucketize permute with custom kernel
- Design optimized data format for permute tensor
- Implement CUDA kernel for efficient unbucketize operation
- Integrate with `output_dist.py`
- Benchmark and validate performance improvement
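One possible shape for the benchmark-and-validate step is a harness that checks a candidate unbucketize routine against a simple reference gather before timing it. The sketch below is pure Python and CPU-only, and all names in it (`reference_unbucketize`, `check_and_time`) are hypothetical. A real GPU benchmark would run on tensors and would also need device synchronization around the timers, since CUDA kernel launches are asynchronous.

```python
import random
import time


def reference_unbucketize(values, permute):
    """Reference gather: output[i] = values[permute[i]] (assumed convention)."""
    return [values[p] for p in permute]


def check_and_time(candidate, num_rows=10000, trials=5, seed=0):
    """Validate a candidate against the reference on random permutations.

    Returns (best_reference_seconds, best_candidate_seconds).
    Raises AssertionError if any output differs from the reference.
    """
    rng = random.Random(seed)
    best_ref = best_cand = float("inf")
    for _ in range(trials):
        permute = list(range(num_rows))
        rng.shuffle(permute)
        values = [rng.random() for _ in range(num_rows)]

        t0 = time.perf_counter()
        expected = reference_unbucketize(values, permute)
        t1 = time.perf_counter()
        actual = candidate(values, permute)
        t2 = time.perf_counter()

        assert actual == expected, "candidate disagrees with reference"
        best_ref = min(best_ref, t1 - t0)
        best_cand = min(best_cand, t2 - t1)
    return best_ref, best_cand


def map_based_unbucketize(values, permute):
    """Example candidate: the same gather expressed with map()."""
    return list(map(values.__getitem__, permute))


ref_seconds, cand_seconds = check_and_time(map_based_unbucketize, num_rows=1000, trials=3)
```

Checking correctness inside the timing loop keeps a fast but wrong candidate from being reported as an improvement.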