Welcome to the LLM Training Cookbook! This repository serves as a collection of recipes, acceleration techniques, and best practices for training Large Language Models (LLMs) effectively and efficiently.
Training large models requires a deep understanding of distributed systems, memory management, and optimization techniques. This cookbook aims to demystify these concepts and provide actionable guides for researchers and engineers.
- Megatron-LM: Best for pre-training massive models from scratch.
- DeepSpeed: Excellent for fine-tuning and mixed-precision training.
- FSDP (Fully Sharded Data Parallel): PyTorch native solution for distributed training.
- Hugging Face Trainer: High-level API for quick experimentation.
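As a concrete starting point for the Hugging Face Trainer entry above, here is a minimal sketch; the checkpoint (`gpt2`), dataset, and hyperparameters are placeholders chosen for brevity, not recommendations:

```python
# Minimal Hugging Face Trainer sketch for quick experimentation.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # placeholder checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Small public dataset slice as stand-in training data.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
).filter(lambda ex: len(ex["input_ids"]) > 1)  # drop empty lines

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=4,
        gradient_checkpointing=True,  # trade compute for memory
        bf16=True,                    # mixed precision, if the GPU supports it
        logging_steps=10,
        max_steps=50,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```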
- Qwen2 / Qwen3: Tips for handling Qwen's specific architecture (e.g., dynamic NTK RoPE scaling; see the sketch after this list).
- Qwen3-VL-MoE: Handling MoE (Mixture of Experts) models efficiently.
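The dynamic NTK trick mentioned above boils down to rescaling the RoPE base with the current sequence length instead of keeping it fixed. A rough sketch of one common formulation follows; exact constants and behaviour differ between implementations, so treat it as illustrative rather than the Qwen-specific recipe:

```python
# Dynamic NTK idea: grow the RoPE base ("theta") as the sequence length
# exceeds the trained context window, so long-context quality degrades
# gracefully. Formula follows a common dynamic-NTK formulation.
def dynamic_ntk_base(
    base: float,        # original RoPE base, e.g. 10_000 or 1_000_000
    dim: int,           # per-head dimension
    seq_len: int,       # current sequence length
    max_pos: int,       # max_position_embeddings the model was trained with
    factor: float = 2.0,
) -> float:
    if seq_len <= max_pos:
        return base  # inside the trained window: leave RoPE untouched
    scale = (factor * seq_len / max_pos) - (factor - 1)
    return base * scale ** (dim / (dim - 2))


# Example: a 128-dim head trained for 4k context, evaluated at 16k tokens.
print(dynamic_ntk_base(base=10_000, dim=128, seq_len=16_384, max_pos=4_096))
```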
- Gradient Checkpointing: Trade compute for memory by recomputing activations in the backward pass (see the sketch after this list).
- Flash Attention 2 / 3: Faster, IO-aware attention kernels whose memory footprint scales linearly with sequence length (compute remains quadratic).
- Mixed Precision (BF16/FP16): Reduce memory usage and increase throughput.
- Quantization (QLoRA, AWQ): Fit larger models on consumer hardware; QLoRA fine-tunes LoRA adapters over 4-bit base weights, while AWQ is primarily post-training quantization for inference.
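A minimal sketch of gradient checkpointing with `torch.utils.checkpoint`; the toy `Block`/`Net` modules and tensor sizes are placeholders:

```python
# Activations inside each checkpointed block are discarded in the forward
# pass and recomputed during backward, cutting activation memory.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)


class Net(nn.Module):
    def __init__(self, dim: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            # Recompute this block's activations during backward to save memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x


x = torch.randn(2, 128, 1024, requires_grad=True)
Net()(x).sum().backward()
```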
- Data Parallel (DP / DDP): Replicate the model across GPUs and synchronize gradients (see the DDP sketch after this list).
- Tensor Parallel (TP): Split individual layers across GPUs (intra-layer).
- Pipeline Parallel (PP): Split model layers across GPUs (inter-layer).
- Context Parallel (CP): Split sequence dimension for ultra-long context training.
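A minimal DDP sketch, launched with `torchrun --nproc_per_node=<gpus> train.py`; the `nn.Linear` stand-in model and the dummy training loop are placeholders:

```python
# Each rank holds a full model replica; DDP all-reduces gradients so every
# replica stays in sync after optimizer.step().
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a real model
model = DDP(model, device_ids=[local_rank])  # replicate + gradient all-reduce

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(10):
    x = torch.randn(8, 4096, device="cuda")  # each rank sees its own shard of data
    loss = model(x).pow(2).mean()
    loss.backward()                           # gradients averaged across ranks here
    opt.step()
    opt.zero_grad()

dist.destroy_process_group()
```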
- Docker: Always use containerized environments for reproducibility.
- Conda/Mamba: Manage Python dependencies strictly.
- Flash Attention Installation: Often tricky; compiling from source against your exact PyTorch/CUDA versions is recommended for best performance.
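A tiny sanity check, assuming flash-attn was installed as the `flash_attn` package, to confirm the kernel imports cleanly against your current torch/CUDA build before launching a long job:

```python
# Verify flash-attn is importable and report the versions it was paired with.
import torch

try:
    import flash_attn
    print(
        "flash-attn", flash_attn.__version__,
        "| torch", torch.__version__,
        "| CUDA", torch.version.cuda,
    )
except ImportError as err:
    print("flash-attn not importable; rebuild against this torch/CUDA:", err)
```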
- WandB / MLflow: Vital for tracking loss curves and system metrics (see the logging sketch after this list).
- TensorBoard: Good for visualizing graph execution.
- NCCL Tests: Verify inter-GPU communication bandwidth before training.
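A minimal logging sketch with Weights & Biases; the project and run names and the simulated loss are placeholders standing in for a real training loop:

```python
# Log the training loss plus a simple GPU memory metric each step.
import math

import torch
import wandb

wandb.init(project="llm-cookbook", name="demo-run")  # placeholder names
for step in range(100):
    loss = 1.0 / math.sqrt(step + 1)  # stand-in for your real training loss
    wandb.log(
        {
            "train/loss": loss,
            "sys/gpu_mem_gib": (
                torch.cuda.max_memory_allocated() / 2**30
                if torch.cuda.is_available()
                else 0.0
            ),
        },
        step=step,
    )
wandb.finish()
```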
- OOM Troubleshooting:
  - Check `PYTORCH_CUDA_ALLOC_CONF` (e.g. `max_split_size_mb`) to reduce allocator fragmentation (see the sketch below).
  - Reduce batch size.
  - Enable CPU offloading (if the performance hit is acceptable).
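For the allocator knob above, a quick illustrative snippet; in practice you would export `PYTORCH_CUDA_ALLOC_CONF` in your launch script rather than set it from Python, and the value shown is just an example:

```python
import os

# Must be set before the first CUDA allocation, so export it in the launch
# script for real jobs; shown inline here for illustration only.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

x = torch.randn(4096, 4096, device="cuda")
y = x @ x
# Reserved vs. allocated memory and fragmentation stats from the caching allocator.
print(torch.cuda.memory_summary(abbreviated=True))
```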
- PyTorch FSDP Documentation
- DeepSpeed Configuration Guide
- Megatron-LM Repository
- Flash Attention Repository
Contributions are welcome! Please open a PR to add your favorite recipe.