31 commits
b6dd251
Test checkpoint 1
nalexand Sep 9, 2025
1d22e77
Run localy on 8GB VRAM
nalexand Sep 10, 2025
df454ae
Merge pull request #1 from nalexand/optimizations
nalexand Sep 10, 2025
e5146b2
Update README.md
nalexand Sep 10, 2025
7b0be1a
Update README.md
nalexand Sep 11, 2025
f5a31f1
fix layers loading
nalexand Sep 11, 2025
cd6fe12
Merge pull request #2 from nalexand/optimizations
nalexand Sep 11, 2025
328b1fb
Use free mem to load more blocks and speed up
nalexand Sep 14, 2025
7aa1425
Merge pull request #3 from nalexand/optimizations
nalexand Sep 14, 2025
3bb8968
Update README.md
nalexand Sep 14, 2025
582af3d
Add image to video long loop generation
nalexand Oct 2, 2025
83fcaad
Merge pull request #4 from nalexand/optimizations
nalexand Oct 2, 2025
1641bfb
Update README.md
nalexand Oct 2, 2025
40889d5
Update README.md
nalexand Oct 2, 2025
18772af
Update README.md
nalexand Oct 5, 2025
903e080
Update README.md
nalexand Oct 5, 2025
002a6b6
add optimizations
nalexand Oct 15, 2025
aed71d5
More optimizations
nalexand Oct 18, 2025
1fbb01d
Merge pull request #5 from nalexand/optimizations
nalexand Oct 18, 2025
31abbee
Update README.md
nalexand Oct 18, 2025
a45b9a1
Update README.md
nalexand Oct 18, 2025
678034b
Update README.md
nalexand Oct 18, 2025
b013a70
Update README.md
nalexand Oct 18, 2025
d53ac57
Update README.md
nalexand Oct 18, 2025
cd7d842
Update README.md
nalexand Oct 22, 2025
ff5b96a
Update README.md
nalexand Oct 22, 2025
5fb697c
speedup vae decode 5x on low vram by offloading cache tensor to cpu
nalexand Oct 23, 2025
683d988
Merge pull request #6 from nalexand/optimizations
nalexand Oct 23, 2025
0827c07
Update README.md
nalexand Oct 23, 2025
8e1db60
remove debug
nalexand Oct 25, 2025
9bfc11c
Merge pull request #7 from nalexand/optimizations
nalexand Oct 25, 2025
105 changes: 105 additions & 0 deletions README.md
@@ -10,6 +10,110 @@
📕 <a href="https://alidocs.dingtalk.com/i/nodes/jb9Y4gmKWrx9eo4dCql9LlbYJGXn6lpz">使用指南(中文)</a>&nbsp&nbsp | &nbsp&nbsp 📘 <a href="https://alidocs.dingtalk.com/i/nodes/EpGBa2Lm8aZxe5myC99MelA2WgN7R35y">User Guide(English)</a>&nbsp&nbsp | &nbsp&nbsp💬 <a href="https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg">WeChat(微信)</a>&nbsp&nbsp
<br>


<h1>How to run Wan2.2 locally on 8 GB VRAM (note: the model code has been modified)</h1>
<ol>
<li> huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B</li>
<li> convert <strong>high_noise_model</strong> and <strong>low_noise_model</strong> to float16 with <strong>convert_safetensors.py</strong> so that one block fits in 8 GB VRAM (a helper sketch for converting all shards is shown after this list)</li>
<li> run <strong>optimize_files.py</strong> to split the safetensors files by module (run it after convert_safetensors.py)</li>
<li> python <strong>generate_local.py</strong> --task t2v-A14B --size "1280*720" --ckpt_dir ./Wan2.2-T2V-A14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."</li>
</ol>
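convert_safetensors.py handles one shard at a time; if you want to convert every shard of both expert models in one go, a minimal helper along these lines can drive it. This is a sketch only: the shard glob, the float16 target, and the `_`-suffixed output naming are assumptions based on the example at the bottom of convert_safetensors.py.

```python
# convert_all.py (hypothetical helper, not part of the repository)
# Converts every .safetensors shard of both expert models to float16 by reusing
# convert_file() from convert_safetensors.py. Adjust CKPT_DIR to your download.
import glob
import os

from convert_safetensors import convert_file

CKPT_DIR = "./Wan2.2-T2V-A14B"

for sub in ("high_noise_model", "low_noise_model"):
    for src in sorted(glob.glob(os.path.join(CKPT_DIR, sub, "*.safetensors"))):
        if src.endswith("_.safetensors"):
            continue  # this file is already a converted output
        dst = src[:-len(".safetensors")] + "_.safetensors"
        if not os.path.exists(dst):
            convert_file(src, dst, "float16")
```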

<p>Generated frames are limited to 21 (about 1.3 seconds at 1280*704) to fit within 8 GB VRAM.</p>
<p>For a full 5-second video, the maximum resolution is 720*405 (on 8 GB VRAM).</p>
<p>* Tested on a <b>HELIOS PREDATOR 300</b> laptop (3070 Ti, 8 GB): at 1280*704, 72.22 s/it for 21 frames, 60.74 s/it for 17 frames, 48.70 s/it for 13 frames.</p>

Video guide: [Wan2.2 on 8GB VRAM: Run Advanced AI Video Generation Locally! (Optimization Guide)](https://youtu.be/LlqnghCNxXM)

## UPDATE 10/02/2025
- **Optimized I2V-A14B**: run a long video generation loop with **loop.bat**

https://github.com/user-attachments/assets/154df173-88d3-4ad1-b543-f7410380b13a



## How it works
- The setup is the same as for the T2V model: huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B, then run the convert and optimize scripts.
- run example: **python generate_local.py --task i2v-A14B --size "1280*720" --image=./last_frame.png --ckpt_dir ./Wan2.2-I2V-A14B --prompt "In close-up, a cheetah runs at full speed in a narrow canyon, its golden fur gleaming in the sun, and its black tear marks clearly visible. Shot from a low angle, the cheetah's body is close to the ground, its muscles flowing, and its limbs alternately and powerfully step over stones and soil, stirring up dust. The cheetah's eyes are sharp, staring at the target in front of it, showing unparalleled speed and strength. The camera follows the cheetah's running trajectory, capturing every moment of leaping and turning, showing its amazing agility. The whole scene unfolds in a tense chase rhythm, full of wild charm and competition for survival."**
- or edit the prompt in **loop.bat** and run it (the command runs in a loop; each iteration does one step: create a latent from the image -> y_latents.pt, run inference -> final_latents.pt, decode the video final_latents.pt -> last_frame_latents.pt, create a latent from the last frame last_frame_latents.pt -> y_latents.pt, run inference, and so on; a Python sketch of an equivalent driver loop is shown after this list)
- **to start a new generation loop** with a new image / prompt / frame count / size, delete **y_latents.pt**, **final_latents.pt**, and **last_frame_latents.pt**
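For reference, here is a minimal Python equivalent of the loop.bat idea (a sketch, not the shipped batch file). It simply re-runs generate_local.py; which stage runs is decided by the script itself, based on which of the .pt files already exist.

```python
# loop.py (hypothetical sketch of what loop.bat does, not the shipped script)
import subprocess

CMD = [
    "python", "generate_local.py",
    "--task", "i2v-A14B",
    "--size", "1280*720",
    "--image=./last_frame.png",
    "--ckpt_dir", "./Wan2.2-I2V-A14B",
    "--prompt", "<your prompt here>",
]

# Each invocation performs one step (image -> y_latents.pt, inference ->
# final_latents.pt, decode -> last_frame_latents.pt, ...), so just rerun it.
for i in range(100):  # pick however many iterations you want
    print(f"--- loop iteration {i} ---")
    subprocess.run(CMD, check=True)
```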

## Results on a 3070 Ti laptop GPU with 8 GB VRAM + 25 GB free RAM (some layers are loaded from the NVMe drive; 30 GB of free RAM is needed to fit everything in RAM)
**640*352**
- 81 frames: 58.23 s/it, 51.32 s/it (*FP8)
- 33 frames: 23.75 s/it, VAE decode 4.5 sec

**704*396, sampling_steps 25+**
- 49 frames: 24.72 s/it (FP16)
- 81 frames: 77.50 s/it (FP16)

**720*405, sampling_steps 20+**
- 17 frames: 21.23 s/it (FP16), VAE decode 5.4 sec
- 77 frames: 82.11 s/it (FP16)
- 81 frames (best): 70.74 s/it (*FP8), VAE decode 12.2 sec

**832*464 / 848*448, sampling_steps 20+**
- 17 frames: 23.68 s/it, VAE decode 3.54 sec
- 53 frames: 74.34 s/it
- 65 frames: 79.73 s/it

**960*540, sampling_steps 16+**
- 17 frames: 34.30 s/it (FP16)
- 41 frames: 75.02 s/it (FP16)
- 45 frames: 72.35 s/it (*FP8), VAE decode 11.7 sec

**1120*630**
- 13 frames: 29.24 s/it (*FP8)
- 17 frames: 37.90 s/it (*FP8), VAE decode 13.6 sec
- 33 frames (max): 85.10 s/it (FP16)
- 33 frames: 76.49 s/it (*FP8)
- 37 frames: 85.16 s/it (*FP8)

On 8 GB VRAM, at sizes above 1120*630 the VAE has to use slow shared video memory:

**1280*720, sampling_steps 16+**
- 13 frames: 48.70 s/it (FP16)
- 13 frames: 39.61 s/it (*FP8), VAE decode 17.4 sec
- 17 frames: 60.74 s/it (FP16)
- 17 frames: 54.02 s/it (*FP8)
- 21 frames (max): 72.22 s/it (FP16)
- 21 frames: 66.18 s/it (*FP8), VAE decode 28 sec

**1600*896 / 1568*896, sampling_steps 15+**
- 13 frames (max): 85.47 s/it (FP16)
- 13 frames (max): 63.88 s/it (*FP8), VAE decode ~115 sec

Large-tensor offloading (`self.offload_large_tensors = True` in the model code) makes inference roughly 20% slower but allows more frames per video:

**1280*720**
- 33 frames: 118.83 s/it (*FP8), VAE decode 38 sec

**1568*896**
- 21 frames: 127.01 s/it (*FP8), VAE decode 182 sec
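The general pattern behind this flag (and the VAE cache offload mentioned in the commit history) is to park large intermediate tensors in CPU RAM and bring them back to the GPU only when they are consumed. A minimal sketch of that idea follows; it is not the repository's exact implementation.

```python
import torch

class TensorOffloader:
    """Minimal sketch: keep large intermediates (e.g. a VAE decode cache) in CPU
    RAM between uses and restore them to the GPU only when they are needed."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self._store = {}

    def put(self, name: str, t: torch.Tensor) -> None:
        # pin_memory() makes the later host-to-device copy faster and async-capable
        self._store[name] = t.detach().to("cpu").pin_memory()

    def get(self, name: str) -> torch.Tensor:
        return self._store[name].to(self.device, non_blocking=True)

# usage sketch: free VRAM between decoder blocks, restore right before use
# offloader.put("cache", cache); del cache
# ...
# cache = offloader.get("cache")
```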

## Compared to ComfyUI

**1120*630, 33 frames × 16 steps:**

| | ComfyUI (fp8) | This repo (fp16) + optimized VAE | This repo (*fp8) + optimized VAE |
|---|---|---|---|
| Sampling | 1470 sec | 85.10 s/it × 16 = 1362 sec | 76.49 s/it × 16 = 1224 sec |
| VAE decode | +117 sec | +26 sec | +26 sec |
| Total | 1587 sec | 1388 sec (1.14x faster) | 1282 sec (1.27x faster) |

**1568*896, 13 frames × 10 steps:**

| | ComfyUI (fp8) | This repo (*fp8) + optimized VAE |
|---|---|---|
| Sampling | 69.31 s/it | 63.88 s/it × 10 = 638.8 sec |
| VAE decode | OOM | +115 sec |
| Total | n/a (OOM) | 922.8 sec |

*fp8: the 3070 Ti does not support fp8 computation; the weights are loaded in fp8 and converted to fp16 "on the fly" for computation.

Visually, it is hard to notice a difference in quality between fp8 and fp16.
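As an illustration of this "*fp8" mode, the sketch below stores a linear layer's weight as 8-bit floats and casts it back to fp16 just before the matmul, since the 3070 Ti has no native fp8 compute. This is an assumed simplification, not the repository's actual code; a real implementation would also keep a scale factor to preserve dynamic range.

```python
import torch
import torch.nn.functional as F

class Fp8Linear(torch.nn.Module):
    """Store weights in float8_e4m3fn to halve weight memory; compute in fp16."""

    def __init__(self, weight_fp16: torch.Tensor, bias: torch.Tensor = None):
        super().__init__()
        # requires PyTorch >= 2.1 for the float8_e4m3fn dtype
        self.register_buffer("weight_fp8", weight_fp16.to(torch.float8_e4m3fn))
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight_fp8.to(torch.float16)  # dequantize "on the fly"
        return F.linear(x, w, self.bias)
```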


!!! The original documentation is below. This version is optimized for speed on GPUs with low VRAM. The following text is for reference only. !!!
-----

[**Wan: Open and Advanced Large-Scale Video Generative Models**](https://arxiv.org/abs/2503.20314) <br>
@@ -32,6 +136,7 @@ We are excited to introduce **Wan2.2**, a major upgrade to our foundational vide
<video src="https://github.com/user-attachments/assets/b63bfa58-d5d7-4de6-a1a2-98970b06d9a7" width="70%" poster=""> </video>
</div>


## 🔥 Latest News!!

* Aug 26, 2025: 🎵 We introduce **[Wan2.2-S2V-14B](https://humanaigc.github.io/wan-s2v-webpage)**, an audio-driven cinematic video generation model, including [inference code](#run-speech-to-video-generation), [model weights](#model-download), and [technical report](https://humanaigc.github.io/wan-s2v-webpage/content/wan-s2v.pdf)! Now you can try it on [wan.video](https://wan.video/), [ModelScope Gradio](https://www.modelscope.cn/studios/Wan-AI/Wan2.2-S2V) or [HuggingFace Gradio](https://huggingface.co/spaces/Wan-AI/Wan2.2-S2V)!
97 changes: 97 additions & 0 deletions convert_safetensors.py
@@ -0,0 +1,97 @@
# convert high_noise_model and low_noise_model to float16 or bfloat16 to fit one block in 8GB VRAM

import argparse
import torch
from safetensors import safe_open
from safetensors.torch import save_file
from tqdm import tqdm
import os


def convert_file(input_path: str, output_path: str, target_dtype_str: str):
"""
Loads a safetensors file, converts all floating-point tensors to a new
data type, and saves the result to a new file.

Args:
input_path (str): Path to the source .safetensors file.
output_path (str): Path to save the converted .safetensors file.
target_dtype_str (str): The target dtype as a string (e.g., 'bfloat16').
"""
# Mapping from string representation to torch.dtype object
dtype_map = {
'float32': torch.float32,
'float16': torch.float16,
'bfloat16': torch.bfloat16
}

target_dtype = dtype_map.get(target_dtype_str)
if target_dtype is None:
raise ValueError(
f"Unsupported dtype '{target_dtype_str}'. "
f"Supported dtypes are: {list(dtype_map.keys())}"
)

print(f"Loading safetensors file from: {input_path}")
print(f"Converting floating-point tensors to: {target_dtype_str}")

# Ensure the output directory exists
output_dir = os.path.dirname(output_path)
if output_dir:
os.makedirs(output_dir, exist_ok=True)

converted_tensors = {}

# Use safe_open for memory-efficient loading
with safe_open(input_path, framework="pt", device="cpu") as f:
# Get all tensor keys and wrap with tqdm for a progress bar
tensor_keys = f.keys()
for key in tqdm(tensor_keys, desc="Converting tensors"):
tensor = f.get_tensor(key)

# Check if the tensor's dtype is a floating point type
if tensor.dtype.is_floating_point:
# Convert the tensor to the target dtype
converted_tensors[key] = tensor.to(target_dtype)
else:
# If not a float, keep the original tensor
converted_tensors[key] = tensor

print("Saving converted tensors...")
# Save the new dictionary of tensors to the output file
save_file(converted_tensors, output_path)

print(f"\nSuccessfully saved converted file to: {output_path}")


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Convert a .safetensors file to a new data type, "
"only affecting floating-point tensors."
)
parser.add_argument(
'--input',
type=str,
required=True,
help='Path to the input .safetensors file.'
)
parser.add_argument(
'--output',
type=str,
required=True,
help='Path to save the new converted .safetensors file.'
)
parser.add_argument(
'--dtype',
type=str,
required=True,
choices=['float32', 'float16', 'bfloat16'],
help="The target data type for floating-point tensors."
)

args = parser.parse_args()

convert_file(args.input, args.output, args.dtype)


# python convert_safetensors.py --input "./low_noise_model/diffusion_pytorch_model-00001-of-00006.safetensors" --output "./low_noise_model/diffusion_pytorch_model-00001-of-00006_.safetensors" --dtype "bfloat16"