31 commits
b6dd251
Test checkpoint 1
nalexand Sep 9, 2025
1d22e77
Run localy on 8GB VRAM
nalexand Sep 10, 2025
df454ae
Merge pull request #1 from nalexand/optimizations
nalexand Sep 10, 2025
e5146b2
Update README.md
nalexand Sep 10, 2025
7b0be1a
Update README.md
nalexand Sep 11, 2025
f5a31f1
fix layers loading
nalexand Sep 11, 2025
cd6fe12
Merge pull request #2 from nalexand/optimizations
nalexand Sep 11, 2025
328b1fb
Use free mem to load more blocks and speed up
nalexand Sep 14, 2025
7aa1425
Merge pull request #3 from nalexand/optimizations
nalexand Sep 14, 2025
3bb8968
Update README.md
nalexand Sep 14, 2025
582af3d
Add image to video long loop generation
nalexand Oct 2, 2025
83fcaad
Merge pull request #4 from nalexand/optimizations
nalexand Oct 2, 2025
1641bfb
Update README.md
nalexand Oct 2, 2025
40889d5
Update README.md
nalexand Oct 2, 2025
18772af
Update README.md
nalexand Oct 5, 2025
903e080
Update README.md
nalexand Oct 5, 2025
002a6b6
add optimizations
nalexand Oct 15, 2025
aed71d5
More optimizations
nalexand Oct 18, 2025
1fbb01d
Merge pull request #5 from nalexand/optimizations
nalexand Oct 18, 2025
31abbee
Update README.md
nalexand Oct 18, 2025
a45b9a1
Update README.md
nalexand Oct 18, 2025
678034b
Update README.md
nalexand Oct 18, 2025
b013a70
Update README.md
nalexand Oct 18, 2025
d53ac57
Update README.md
nalexand Oct 18, 2025
cd7d842
Update README.md
nalexand Oct 22, 2025
ff5b96a
Update README.md
nalexand Oct 22, 2025
5fb697c
speedup vae decode 5x on low vram by offloading cache tensor to cpu
nalexand Oct 23, 2025
683d988
Merge pull request #6 from nalexand/optimizations
nalexand Oct 23, 2025
0827c07
Update README.md
nalexand Oct 23, 2025
8e1db60
remove debug
nalexand Oct 25, 2025
9bfc11c
Merge pull request #7 from nalexand/optimizations
nalexand Oct 25, 2025
105 changes: 105 additions & 0 deletions README.md
@@ -10,6 +10,110 @@
📕 <a href="https://alidocs.dingtalk.com/i/nodes/jb9Y4gmKWrx9eo4dCql9LlbYJGXn6lpz">使用指南(中文)</a>&nbsp&nbsp | &nbsp&nbsp 📘 <a href="https://alidocs.dingtalk.com/i/nodes/EpGBa2Lm8aZxe5myC99MelA2WgN7R35y">User Guide(English)</a>&nbsp&nbsp | &nbsp&nbsp💬 <a href="https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg">WeChat(微信)</a>&nbsp&nbsp
<br>


<h1>How to run Wan2.2 locally on 8 GB VRAM (note: the model code has been modified)</h1>
<ol>
<li> huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B</li>
<li> convert <strong>high_noise_model</strong> and <strong>low_noise_model</strong> to float16 with <strong>convert_safetensors.py</strong> so that one block fits in 8 GB VRAM (a helper sketch for converting all shards is shown after this list)</li>
<li> run <strong>optimize_files.py</strong> to split the safetensors files by module (run it after convert_safetensors.py)</li>
<li> python <strong>generate_local.py</strong> --task t2v-A14B --size "1280*720" --ckpt_dir ./Wan2.2-T2V-A14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."</li>
</ol>
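convert_safetensors.py handles one shard at a time; if you want to convert every shard of both expert models in one go, a minimal helper along these lines can drive it. This is a sketch only: the shard glob, the float16 target, and the `_`-suffixed output naming are assumptions based on the example at the bottom of convert_safetensors.py.

```python
# convert_all.py (hypothetical helper, not part of the repository)
# Converts every .safetensors shard of both expert models to float16 by reusing
# convert_file() from convert_safetensors.py. Adjust CKPT_DIR to your download.
import glob
import os

from convert_safetensors import convert_file

CKPT_DIR = "./Wan2.2-T2V-A14B"

for sub in ("high_noise_model", "low_noise_model"):
    for src in sorted(glob.glob(os.path.join(CKPT_DIR, sub, "*.safetensors"))):
        if src.endswith("_.safetensors"):
            continue  # this file is already a converted output
        dst = src[:-len(".safetensors")] + "_.safetensors"
        if not os.path.exists(dst):
            convert_file(src, dst, "float16")
```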

<p>Generated frames are limited to 21 (about 1.3 seconds at 1280*704) to fit within 8 GB VRAM.</p>
<p>For a full 5-second video, the maximum resolution is 720*405 (on 8 GB VRAM).</p>
<p>* Tested on a <b>HELIOS PREDATOR 300</b> laptop (3070 Ti, 8 GB): at 1280*704, 72.22 s/it for 21 frames, 60.74 s/it for 17 frames, 48.70 s/it for 13 frames.</p>

Video guide: [Wan2.2 on 8GB VRAM: Run Advanced AI Video Generation Locally! (Optimization Guide)](https://youtu.be/LlqnghCNxXM)

## UPDATE 10/02/2025
- **Optimized I2V-A14B**: run a long video generation loop with **loop.bat**

https://github.com/user-attachments/assets/154df173-88d3-4ad1-b543-f7410380b13a



## How it works
- The setup is the same as for the T2V model: huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B, then run the convert and optimize scripts.
- run example: **python generate_local.py --task i2v-A14B --size "1280*720" --image=./last_frame.png --ckpt_dir ./Wan2.2-I2V-A14B --prompt "In close-up, a cheetah runs at full speed in a narrow canyon, its golden fur gleaming in the sun, and its black tear marks clearly visible. Shot from a low angle, the cheetah's body is close to the ground, its muscles flowing, and its limbs alternately and powerfully step over stones and soil, stirring up dust. The cheetah's eyes are sharp, staring at the target in front of it, showing unparalleled speed and strength. The camera follows the cheetah's running trajectory, capturing every moment of leaping and turning, showing its amazing agility. The whole scene unfolds in a tense chase rhythm, full of wild charm and competition for survival."**
- or edit the prompt in **loop.bat** and run it (the command runs in a loop; each iteration does one step: create a latent from the image -> y_latents.pt, run inference -> final_latents.pt, decode the video final_latents.pt -> last_frame_latents.pt, create a latent from the last frame last_frame_latents.pt -> y_latents.pt, run inference, and so on; a Python sketch of an equivalent driver loop is shown after this list)
- **to start a new generation loop** with a new image / prompt / frame count / size, delete **y_latents.pt**, **final_latents.pt**, and **last_frame_latents.pt**
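For reference, here is a minimal Python equivalent of the loop.bat idea (a sketch, not the shipped batch file). It simply re-runs generate_local.py; which stage runs is decided by the script itself, based on which of the .pt files already exist.

```python
# loop.py (hypothetical sketch of what loop.bat does, not the shipped script)
import subprocess

CMD = [
    "python", "generate_local.py",
    "--task", "i2v-A14B",
    "--size", "1280*720",
    "--image=./last_frame.png",
    "--ckpt_dir", "./Wan2.2-I2V-A14B",
    "--prompt", "<your prompt here>",
]

# Each invocation performs one step (image -> y_latents.pt, inference ->
# final_latents.pt, decode -> last_frame_latents.pt, ...), so just rerun it.
for i in range(100):  # pick however many iterations you want
    print(f"--- loop iteration {i} ---")
    subprocess.run(CMD, check=True)
```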

## Results on a 3070 Ti laptop GPU with 8 GB VRAM + 25 GB free RAM (some layers are loaded from the NVMe drive; 30 GB of free RAM is needed to fit everything in RAM)
**640*352**
- 81 frames: 58.23 s/it, 51.32 s/it (*FP8)
- 33 frames: 23.75 s/it, VAE decode 4.5 sec

**704*396, sampling_steps 25+**
- 49 frames: 24.72 s/it (FP16)
- 81 frames: 77.50 s/it (FP16)

**720*405, sampling_steps 20+**
- 17 frames: 21.23 s/it (FP16), VAE decode 5.4 sec
- 77 frames: 82.11 s/it (FP16)
- 81 frames (best): 70.74 s/it (*FP8), VAE decode 12.2 sec

**832*464 / 848*448, sampling_steps 20+**
- 17 frames: 23.68 s/it, VAE decode 3.54 sec
- 53 frames: 74.34 s/it
- 65 frames: 79.73 s/it

**960*540, sampling_steps 16+**
- 17 frames: 34.30 s/it (FP16)
- 41 frames: 75.02 s/it (FP16)
- 45 frames: 72.35 s/it (*FP8), VAE decode 11.7 sec

**1120*630**
- 13 frames: 29.24 s/it (*FP8)
- 17 frames: 37.90 s/it (*FP8), VAE decode 13.6 sec
- 33 frames (max): 85.10 s/it (FP16)
- 33 frames: 76.49 s/it (*FP8)
- 37 frames: 85.16 s/it (*FP8)

On 8 GB VRAM, at sizes above 1120*630 the VAE has to use slow shared video memory:

**1280*720, sampling_steps 16+**
- 13 frames: 48.70 s/it (FP16)
- 13 frames: 39.61 s/it (*FP8), VAE decode 17.4 sec
- 17 frames: 60.74 s/it (FP16)
- 17 frames: 54.02 s/it (*FP8)
- 21 frames (max): 72.22 s/it (FP16)
- 21 frames: 66.18 s/it (*FP8), VAE decode 28 sec

**1600*896 / 1568*896, sampling_steps 15+**
- 13 frames (max): 85.47 s/it (FP16)
- 13 frames (max): 63.88 s/it (*FP8), VAE decode ~115 sec

Large-tensor offloading (`self.offload_large_tensors = True` in the model code) makes inference roughly 20% slower but allows more frames per video:

**1280*720**
- 33 frames: 118.83 s/it (*FP8), VAE decode 38 sec

**1568*896**
- 21 frames: 127.01 s/it (*FP8), VAE decode 182 sec
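The general pattern behind this flag (and the VAE cache offload mentioned in the commit history) is to park large intermediate tensors in CPU RAM and bring them back to the GPU only when they are consumed. A minimal sketch of that idea follows; it is not the repository's exact implementation.

```python
import torch

class TensorOffloader:
    """Minimal sketch: keep large intermediates (e.g. a VAE decode cache) in CPU
    RAM between uses and restore them to the GPU only when they are needed."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self._store = {}

    def put(self, name: str, t: torch.Tensor) -> None:
        # pin_memory() makes the later host-to-device copy faster and async-capable
        self._store[name] = t.detach().to("cpu").pin_memory()

    def get(self, name: str) -> torch.Tensor:
        return self._store[name].to(self.device, non_blocking=True)

# usage sketch: free VRAM between decoder blocks, restore right before use
# offloader.put("cache", cache); del cache
# ...
# cache = offloader.get("cache")
```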

## Compared to ComfyUI

**1120*630, 33 frames × 16 steps:**

| | ComfyUI (fp8) | This repo (fp16) + optimized VAE | This repo (*fp8) + optimized VAE |
|---|---|---|---|
| Sampling | 1470 sec | 85.10 s/it × 16 = 1362 sec | 76.49 s/it × 16 = 1224 sec |
| VAE decode | +117 sec | +26 sec | +26 sec |
| Total | 1587 sec | 1388 sec (1.14x faster) | 1282 sec (1.27x faster) |

**1568*896, 13 frames × 10 steps:**

| | ComfyUI (fp8) | This repo (*fp8) + optimized VAE |
|---|---|---|
| Sampling | 69.31 s/it | 63.88 s/it × 10 = 638.8 sec |
| VAE decode | OOM | +115 sec |
| Total | n/a (OOM) | 922.8 sec |

*fp8: the 3070 Ti does not support fp8 computation; the weights are loaded in fp8 and converted to fp16 "on the fly" for computation.

Visually, it is hard to notice a difference in quality between fp8 and fp16.
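As an illustration of this "*fp8" mode, the sketch below stores a linear layer's weight as 8-bit floats and casts it back to fp16 just before the matmul, since the 3070 Ti has no native fp8 compute. This is an assumed simplification, not the repository's actual code; a real implementation would also keep a scale factor to preserve dynamic range.

```python
import torch
import torch.nn.functional as F

class Fp8Linear(torch.nn.Module):
    """Store weights in float8_e4m3fn to halve weight memory; compute in fp16."""

    def __init__(self, weight_fp16: torch.Tensor, bias: torch.Tensor = None):
        super().__init__()
        # requires PyTorch >= 2.1 for the float8_e4m3fn dtype
        self.register_buffer("weight_fp8", weight_fp16.to(torch.float8_e4m3fn))
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight_fp8.to(torch.float16)  # dequantize "on the fly"
        return F.linear(x, w, self.bias)
```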


!!! The original documentation is below. This version is optimized for speed on GPUs with low VRAM. The following text is for reference only. !!!
-----

[**Wan: Open and Advanced Large-Scale Video Generative Models**](https://arxiv.org/abs/2503.20314) <br>
@@ -32,6 +136,7 @@ We are excited to introduce **Wan2.2**, a major upgrade to our foundational vide
<video src="https://github.com/user-attachments/assets/b63bfa58-d5d7-4de6-a1a2-98970b06d9a7" width="70%" poster=""> </video>
</div>


## 🔥 Latest News!!

* Aug 26, 2025: 🎵 We introduce **[Wan2.2-S2V-14B](https://humanaigc.github.io/wan-s2v-webpage)**, an audio-driven cinematic video generation model, including [inference code](#run-speech-to-video-generation), [model weights](#model-download), and [technical report](https://humanaigc.github.io/wan-s2v-webpage/content/wan-s2v.pdf)! Now you can try it on [wan.video](https://wan.video/), [ModelScope Gradio](https://www.modelscope.cn/studios/Wan-AI/Wan2.2-S2V) or [HuggingFace Gradio](https://huggingface.co/spaces/Wan-AI/Wan2.2-S2V)!
97 changes: 97 additions & 0 deletions convert_safetensors.py
@@ -0,0 +1,97 @@
# convert high_noise_model and low_noise_model to float16 or bfloat16 to fit one block in 8GB VRAM

import argparse
import torch
from safetensors import safe_open
from safetensors.torch import save_file
from tqdm import tqdm
import os


def convert_file(input_path: str, output_path: str, target_dtype_str: str):
"""
Loads a safetensors file, converts all floating-point tensors to a new
data type, and saves the result to a new file.

Args:
input_path (str): Path to the source .safetensors file.
output_path (str): Path to save the converted .safetensors file.
target_dtype_str (str): The target dtype as a string (e.g., 'bfloat16').
"""
# Mapping from string representation to torch.dtype object
dtype_map = {
'float32': torch.float32,
'float16': torch.float16,
'bfloat16': torch.bfloat16
}

target_dtype = dtype_map.get(target_dtype_str)
if target_dtype is None:
raise ValueError(
f"Unsupported dtype '{target_dtype_str}'. "
f"Supported dtypes are: {list(dtype_map.keys())}"
)

print(f"Loading safetensors file from: {input_path}")
print(f"Converting floating-point tensors to: {target_dtype_str}")

# Ensure the output directory exists
output_dir = os.path.dirname(output_path)
if output_dir:
os.makedirs(output_dir, exist_ok=True)

converted_tensors = {}

# Use safe_open for memory-efficient loading
with safe_open(input_path, framework="pt", device="cpu") as f:
# Get all tensor keys and wrap with tqdm for a progress bar
tensor_keys = f.keys()
for key in tqdm(tensor_keys, desc="Converting tensors"):
tensor = f.get_tensor(key)

# Check if the tensor's dtype is a floating point type
if tensor.dtype.is_floating_point:
# Convert the tensor to the target dtype
converted_tensors[key] = tensor.to(target_dtype)
else:
# If not a float, keep the original tensor
converted_tensors[key] = tensor

print("Saving converted tensors...")
# Save the new dictionary of tensors to the output file
save_file(converted_tensors, output_path)

print(f"\nSuccessfully saved converted file to: {output_path}")


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Convert a .safetensors file to a new data type, "
"only affecting floating-point tensors."
)
parser.add_argument(
'--input',
type=str,
required=True,
help='Path to the input .safetensors file.'
)
parser.add_argument(
'--output',
type=str,
required=True,
help='Path to save the new converted .safetensors file.'
)
parser.add_argument(
'--dtype',
type=str,
required=True,
choices=['float32', 'float16', 'bfloat16'],
help="The target data type for floating-point tensors."
)

args = parser.parse_args()

convert_file(args.input, args.output, args.dtype)


# python convert_safetensors.py --input "./low_noise_model/diffusion_pytorch_model-00001-of-00006.safetensors" --output "./low_noise_model/diffusion_pytorch_model-00001-of-00006_.safetensors" --dtype "bfloat16"