
LLM Training Cookbook 🍳

Welcome to the LLM Training Cookbook! This repository serves as a collection of recipes, acceleration techniques, and best practices for training Large Language Models (LLMs) effectively and efficiently.

📚 Introduction

Training large models requires a deep understanding of distributed systems, memory management, and optimization techniques. This cookbook aims to demystify these concepts and provide actionable guides for researchers and engineers.

🍲 Recipes

Frameworks & Tools

  • Megatron-LM: Best for pre-training massive models from scratch.
  • DeepSpeed: ZeRO-based memory partitioning; excellent for fine-tuning and mixed-precision training.
  • FSDP (Fully Sharded Data Parallel): PyTorch-native sharded data parallelism (minimal sketch after this list).
  • Hugging Face Trainer: High-level API for quick experimentation.
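
As a concrete starting point, here is a minimal FSDP sketch (our own illustration, not taken from any framework's docs). It assumes PyTorch >= 2.0 and a `torchrun --nproc_per_node=<gpus>` launch; the model, loss, and hyperparameters are placeholders.

```python
# Minimal FSDP sketch -- assumes PyTorch >= 2.0, launched via torchrun.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Transformer(d_model=512, nhead=8).cuda()  # placeholder model
model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # build AFTER wrapping
src = torch.randn(10, 4, 512, device="cuda")
tgt = torch.randn(10, 4, 512, device="cuda")
loss = model(src, tgt).pow(2).mean()  # dummy loss, just to drive backward
loss.backward()
optimizer.step()

dist.destroy_process_group()
```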

Model Specifics

  • Qwen2 / Qwen3: Tips for handling Qwen's specific architecture (e.g., dynamic NTK RoPE scaling; hedged loading sketch after this list).
  • Qwen3-VL-MoE: Handling MoE (Mixture of Experts) models efficiently.
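
As a hedged illustration of the dynamic-NTK tip, the snippet below passes transformers' `rope_scaling` override at load time. Whether a given Qwen checkpoint accepts this kwarg, and which scaling types it supports, depends on your transformers version, so treat it as a sketch to verify against your setup.

```python
# Hedged sketch: requesting dynamic-NTK RoPE scaling via transformers'
# rope_scaling override. Support varies by model class and transformers
# version -- verify against the Qwen config class you actually have.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    rope_scaling={"type": "dynamic", "factor": 2.0},  # stretch context ~2x
    device_map="auto",
)
```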

🚀 Acceleration Techniques

Memory Optimization

  • Gradient Checkpointing: Trade compute for memory by recomputing activations during the backward pass (combined sketch after this list).
  • FlashAttention 2 / 3: Fused attention kernels that avoid materializing the full attention matrix, cutting attention memory from quadratic to linear in sequence length.
  • Mixed Precision (BF16/FP16): Reduce memory usage and increase throughput.
  • Quantization (QLoRA, AWQ): QLoRA enables fine-tuning larger models on consumer hardware; AWQ targets efficient inference.
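
The first three techniques can be stacked in one place when loading a model with Hugging Face transformers. A sketch (the checkpoint name is an example, and `flash_attention_2` requires the flash-attn package to be installed):

```python
# Sketch: stacking the memory optimizations above via Hugging Face transformers.
# The checkpoint is an example; flash-attn 2.x must be installed separately.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B",
    torch_dtype=torch.bfloat16,               # mixed precision (BF16)
    attn_implementation="flash_attention_2",  # FlashAttention-2 kernels
)
model.gradient_checkpointing_enable()  # recompute activations during backward
```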

Distributed Training Strategies

  • Data Parallel (DP / DDP): Replicate the full model on each GPU (DDP sketch after this list).
  • Tensor Parallel (TP): Split individual layers across GPUs (intra-layer).
  • Pipeline Parallel (PP): Split model layers across GPUs (inter-layer).
  • Context Parallel (CP): Split sequence dimension for ultra-long context training.
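
For reference, the simplest of these strategies, DDP, takes only a few lines in PyTorch. A self-contained sketch, assuming a `torchrun` launch (model and batch are placeholders):

```python
# Minimal DDP sketch -- assumes torchrun has set RANK/LOCAL_RANK/WORLD_SIZE.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
model = DDP(model, device_ids=[local_rank])  # each rank holds a full replica

x = torch.randn(8, 1024, device="cuda")
model(x).sum().backward()  # backward triggers the gradient all-reduce

dist.destroy_process_group()
```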

πŸ› οΈ Best Practices

Environment Setup

  • Docker: Always use containerized environments for reproducibility.
  • Conda/Mamba: Pin and manage Python dependencies strictly.
  • Flash Attention Installation: Often tricky; compiling from source is recommended for performance, and the wheel must match your CUDA and PyTorch versions (quick sanity check below).
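
After installing, a quick sanity check like the one below (our own sketch, using flash-attn 2.x's public `flash_attn_func`) confirms the build actually loads and runs on your GPU:

```python
# Sanity check that flash-attn imported and runs (assumes flash-attn 2.x + a GPU).
import torch

try:
    from flash_attn import flash_attn_func
    # Shape convention: (batch, seqlen, nheads, headdim); fp16/bf16 required.
    q = k = v = torch.randn(1, 128, 8, 64, dtype=torch.bfloat16, device="cuda")
    out = flash_attn_func(q, k, v, causal=True)
    print("flash-attn OK:", out.shape)
except ImportError:
    print("flash-attn not installed; falling back to PyTorch SDPA.")
```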

Monitoring & Debugging

  • WandB / MLflow: Vital for tracking loss curves and system metrics.
  • TensorBoard: Good for visualizing training curves and execution traces locally.
  • NCCL Tests: Verify inter-GPU communication bandwidth before training.
  • OOM Troubleshooting:
    1. Tune the caching allocator, e.g. set max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF (sketch below).
    2. Reduce the micro-batch size and/or raise gradient accumulation steps.
    3. Enable CPU offloading (if the performance hit is acceptable).
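
A sketch of step 1 plus a quick memory report. The allocator setting must be in place before the first CUDA allocation, and 128 MiB is just a common starting value, not a recommendation from this repo:

```python
# Sketch: tune the CUDA caching allocator and inspect memory when debugging OOMs.
# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # fight fragmentation

import torch

x = torch.randn(1024, 1024, device="cuda")
print(f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated")
print(torch.cuda.memory_summary(abbreviated=True))
```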

🔗 Resources


Contributions are welcome! Please open a PR to add your favorite recipe.
