Built on LTX-Video-2B's lightweight architecture, LTX-MultiMotion re-engineers the encoder and decoder so that the spatial width dimension represents the number of people in the scene. This enables motion generation for varying character counts without a pre-defined limit.
- Person-Aware Architecture: We repurpose the spatial width dimension to dynamically represent varying numbers of characters in a unified latent space (see the sketch after this list).
- Specialized Pathways: A triple-branch decoder with dedicated networks isolates and generates translation, orientation, and pose components in parallel.
- Progressive Expansion: The encoder and decoder networks begin with a shallow structure and deepen layer by layer during training, adapting their capacity to the complexity of the motion data.
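
As a rough illustration of the person-aware layout (the tensor names, shapes, and the per-person slot width of 32 below are assumptions inferred from the inference arguments later in this README, not the model's internal code), the width axis of a video-style latent can be read as a concatenation of per-person slots:

```python
import torch

# Hypothetical sketch: treat the latent width axis as N person slots of a fixed
# per-person width (32 here, mirroring the `--height 32 --width 32*N` inference
# arguments). Shapes follow the common [batch, channels, frames, height, width]
# video-latent convention; the real layout inside LTX-MultiMotion may differ.
PER_PERSON_WIDTH = 32

def split_persons(latent: torch.Tensor) -> torch.Tensor:
    """[B, C, F, H, N * W_p] -> [B, N, C, F, H, W_p], one slot per person."""
    b, c, f, h, w = latent.shape
    num_persons = w // PER_PERSON_WIDTH
    latent = latent.view(b, c, f, h, num_persons, PER_PERSON_WIDTH)
    return latent.permute(0, 4, 1, 2, 3, 5)

# Two-person example: width = 2 * 32 = 64.
dummy = torch.randn(1, 128, 24, 32, 2 * PER_PERSON_WIDTH)
print(split_persons(dummy).shape)  # torch.Size([1, 2, 128, 24, 32, 32])
```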
- LTX-Video First-Stage Model: Download `ltxv-2b-0.9.8-distilled.safetensors` from `Lightricks/LTX-Video` on Hugging Face.
- Text Encoder: Use the same text encoder as LTX-Video: `PixArt-alpha/PixArt-XL-2-1024-MS` from Hugging Face.
- Motion Data: Place raw motion data (standardized to SMPL format, similar to AMASS) in `./motions/`. Data can be obtained from InterGen.
- Annotations: Place per-person motion annotations in `./separate_annots/`. Data can be obtained from FreeMotion.
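
One way to fetch the two checkpoints above is with `huggingface_hub`; this is only a sketch (files land in the default Hugging Face cache, so pass `local_dir=...` if you want an explicit location):

```python
from huggingface_hub import hf_hub_download, snapshot_download

# First-stage LTX-Video checkpoint (a single safetensors file).
ltxv_ckpt = hf_hub_download(
    repo_id="Lightricks/LTX-Video",
    filename="ltxv-2b-0.9.8-distilled.safetensors",
)

# Text encoder repository used by LTX-Video.
text_encoder_dir = snapshot_download(repo_id="PixArt-alpha/PixArt-XL-2-1024-MS")

print(ltxv_ckpt)
print(text_encoder_dir)
```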
First, prepare latents from the motion data (stage 1):

```bash
python latent_prepare.py
```

Train separate decoders for translation, orientation, and pose:
```bash
python decoder_train.py \
  --features_dir [path_to_prepared_latents_from_stage1] \
  --gt_dir ./motions \
  --branch [trans/root/pose] \
  --target_loss Loss \
  --max_depth 20 \
  --initial_depth 2 \
  --patience 10 \
  --min_improvement 0.001 \
  --batch_size 1 \
  --num_epochs 500 \
  --learning_rate 1e-4 \
  --num_workers 8 \
  --save_freq 10 \
  --save_dir [save_directory] \
  --device cuda \
  --min_epochs_after_depth_increase 30
```

We provide pre-trained decoder models on Hugging Face at `Jonnty/LTX-MultiMotion` for direct use. All models were trained for 500 epochs with progressive depth expansion (a sketch of the expansion schedule follows the list below):
- Translation Decoder: loss improved from 0.19 to 0.015; final depth: 16 layers
- Root (Orientation) Decoder: loss improved from 0.64 to 0.10; final depth: 13 layers
- Pose Decoder: loss improved from 0.037 to 0.018; final depth: 17 layers
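
The exact expansion rule lives in `decoder_train.py`; the following is only a minimal sketch of the plateau-based schedule suggested by the flags above (`--initial_depth`, `--max_depth`, `--patience`, `--min_improvement`, `--min_epochs_after_depth_increase`), not the project's actual implementation:

```python
class DepthScheduler:
    """Hypothetical plateau-based depth expansion: start shallow, and when the
    loss stops improving by at least `min_improvement` for `patience` epochs,
    add one layer (up to `max_depth`), then wait a cooldown period before the
    next expansion."""

    def __init__(self, initial_depth=2, max_depth=20, patience=10,
                 min_improvement=1e-3, min_epochs_after_depth_increase=30):
        self.depth = initial_depth
        self.max_depth = max_depth
        self.patience = patience
        self.min_improvement = min_improvement
        self.cooldown = min_epochs_after_depth_increase
        self.best_loss = float("inf")
        self.stale_epochs = 0          # epochs without sufficient improvement
        self.epochs_since_growth = 0   # epochs since the last depth increase

    def step(self, epoch_loss: float) -> int:
        """Call once per epoch with the training loss; returns the depth to use next."""
        self.epochs_since_growth += 1
        if epoch_loss < self.best_loss - self.min_improvement:
            self.best_loss = epoch_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1

        if (self.depth < self.max_depth
                and self.stale_epochs >= self.patience
                and self.epochs_since_growth >= self.cooldown):
            self.depth += 1
            self.stale_epochs = 0
            self.epochs_since_growth = 0
        return self.depth
```

Under this reading of the flags, at most one layer can be added per 30-epoch cooldown window, so 500 epochs allow roughly 16 expansions at most, which is consistent with the reported final depths of 13 to 17 layers starting from `--initial_depth 2`.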
Generate motion from a text prompt:
```bash
python inference.py \
  --prompt "Promote" \
  --height 32 \
  --width [32*number_of_person] \
  --num_frames [number_of_frames] \
  --pipeline_config configs/ltxv-2b-0.9.8-distilled.yaml \
  --motion_mode \
  --root_checkpoint_path [path_to_root_decoder_checkpoint] \
  --trans_checkpoint_path [path_to_translation_decoder_checkpoint] \
  --pose_checkpoint_path [path_to_pose_decoder_checkpoint]
```

Set `--width` to 32 times the number of characters (for example, `--width 64` for two characters).
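
To use the pre-trained decoders mentioned above, one option is to download the `Jonnty/LTX-MultiMotion` repository and point the three checkpoint arguments at the downloaded files. The snippet below is only a sketch; the checkpoint file names inside the repository are not listed in this README, so inspect the downloaded files for the actual names:

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the LTX-MultiMotion decoder repository (repo id from this README).
decoder_dir = Path(snapshot_download(repo_id="Jonnty/LTX-MultiMotion"))

# List the downloaded files to locate the translation / root / pose checkpoints.
for path in sorted(decoder_dir.rglob("*")):
    if path.is_file():
        print(path)
```

Pass the resulting translation, root, and pose checkpoint paths to `--trans_checkpoint_path`, `--root_checkpoint_path`, and `--pose_checkpoint_path`, respectively.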
This project is modified from LTX-Video.