Can we benefit from sparse sage attention (spargeattn)? #485
Replies: 6 comments 33 replies
-
I'm planning to integrate this and run tests today. Theoretically, sparse attention on top of SageAttention2 (SA2) is the fastest combination; let's see if it's as fast as they claim.
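To make the idea concrete, here is a toy NumPy sketch of block-sparse attention: score blocks that look negligible for a given query block are skipped entirely. This only illustrates the pruning idea behind SpargeAttn; the skip criterion below (a mean-score threshold) is a made-up stand-in, not the paper's method, and the real speedup comes from a fused CUDA kernel, not Python loops.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=64, threshold=-np.inf):
    """Toy block-sparse attention; threshold=-inf degenerates to dense."""
    n, d = q.shape
    out = np.zeros_like(q)
    used = total = 0
    for i in range(0, n, block):
        qi = q[i:i + block]
        kept_scores, kept_vals = [], []
        for j in range(0, n, block):
            total += 1
            s = qi @ k[j:j + block].T / np.sqrt(d)
            # Always keep the diagonal block so every row attends to something;
            # otherwise skip blocks whose scores look negligible (toy criterion).
            if j != i and s.mean() < threshold:
                continue
            used += 1
            kept_scores.append(s)
            kept_vals.append(v[j:j + block])
        s = np.concatenate(kept_scores, axis=1)
        p = np.exp(s - s.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        out[i:i + block] = p @ np.concatenate(kept_vals, axis=0)
    return out, used / total

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)).astype(np.float32) for _ in range(3))
dense, _ = block_sparse_attention(q, k, v)                    # keeps all blocks
sparse, frac = block_sparse_attention(q, k, v, threshold=0.0)  # prunes some
print(f"blocks kept: {frac:.2f}, max drift vs dense: {np.abs(dense - sparse).max():.3f}")
```

The trade-off this exposes is exactly the one being tested in this thread: the fewer blocks you keep, the faster the kernel, but the further the output drifts from dense attention.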
-
Nice work! I see you updated the ComfyUI GUI too.
-
Actually, logically, the 3B fp16 model was converted to 3B Q8. Q8 means quality is preserved essentially 1:1, while K6 retains slightly less quality in exchange for a slightly smaller file. So in principle, 3B fp16 and 3B Q8 should have the same quality.

My daily ComfyUI install runs PyTorch 2.7.1 + cu128 on an older ComfyUI version, while the one I use for testing runs a newer ComfyUI with PyTorch 2.9.1 + cu130. Theoretically, the newer stack should be faster, but in reality the opposite is true: the old one is faster, so I use the old one every day and keep the new one only for testing.

This is my old system: (screenshot not captured in this export)
This is my new system for testing: (screenshot not captured in this export)
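The Q8-vs-K6 quality claim above can be illustrated numerically. The sketch below uses toy round-to-nearest symmetric quantization with a single per-tensor scale; it is not llama.cpp's actual Q8_0/K-quant block formats (those use per-block and super-block scales), but it shows why 8-bit retains measurably more precision than 6-bit:

```python
import numpy as np

def fake_quantize(x, bits):
    """Toy symmetric quantization: scale to the integer grid, round, dequantize."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit, 31 for 6-bit
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in for a weight tensor

err8 = np.abs(w - fake_quantize(w, 8)).mean()
err6 = np.abs(w - fake_quantize(w, 6)).mean()
print(f"mean abs error  8-bit: {err8:.5f}   6-bit: {err6:.5f}")
```

Each bit removed doubles the grid spacing, so the 6-bit error lands roughly 4x higher than the 8-bit error, matching the intuition that Q8 is near-lossless while K6 gives up a little quality for a smaller file.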
-
Is it possible you could share the NVFP4 3B model? And would it be possible to create an NVFP4 7B model for educational purposes?

Edit: Also, this repo has some NVFP4 models: https://huggingface.co/Nexus24/vaeGGUF/tree/main I've downloaded them and placed them in the seedvr2 folder under models, but they are not showing up. Any idea how I can make them available in the drop-down list?

Edit2:

-
Ref: https://github.com/thu-ml/SpargeAttn