Commit 7adbe5f

m96-chan and claude authored
Release v0.2.19 (#188)
* feat: add cuBLAS dynamic loader and C++ kernel profiler (#134, #150)

  cuBLAS Dynamic Loader (Issue #134):
  - Dynamic loading of the cuBLAS library (cublas64_13.dll / libcublas.so)
  - Supports GEMM: sgemm, dgemm, hgemm, gemm_ex (mixed precision)
  - Supports GEMV: sgemv, dgemv
  - Row-major convenience wrappers for the Python API
  - Python bindings: cublas_is_available, cublas_get_version, cublas_test_*

  C++ Kernel Profiler (Issue #150):
  - Native C++ profiler using the CUDA Driver API (cuEvent*)
  - ScopedTimer class for RAII-based timing
  - KernelProfiler for aggregating multiple kernel records
  - Python bindings with automatic native backend detection
  - Chrome trace export support

  Test results (RTX 5090, CUDA 13.1):
  - cuBLAS loaded: cublas64_13.dll v13.2.0
  - SGEMM/HGEMM/DGEMM: all pass
  - Profiler: native C++ backend active

* feat(llm): add lazy model loading with streaming strategies (#159)

  Memory-mapped model loading with on-demand GPU loading for large models (70B+).

  Core implementation (Rust):
  - LazyTensor: GPU caching with LRU eviction
  - LazyModelLoader: multi-file SafeTensors loader with memory budgeting
  - TensorState enum: OnDisk, Loading, OnGpu, Evicted
  - Layer management: get_layer_tensors, layer_size, is_layer_loaded, layer_state

  Loading strategies (Python):
  - SimpleStreaming: load/unload each layer (minimal VRAM)
  - SlidingWindow: keep N layers, prefetch ahead (balanced)
  - AutoLRU: automatic LRU eviction (best performance)

  API:
  - LazyModelLoader(memory_budget, enable_eviction)
  - LayerStreamingContext for managed streaming
  - create_streaming_context() factory function

  Usage:

  ```python
  loader = LazyModelLoader(memory_budget=8 * 1024**3)
  loader.load_file("model.safetensors")

  with LayerStreamingContext(loader, SlidingWindow(4), num_layers=32) as ctx:
      for i in range(32):
          ctx.prepare(i)
          hidden = layers[i](hidden)
  ```

* fix(lint): resolve ruff B027 and UP037 errors in streaming.py

* style: apply ruff format to streaming.py

* fix(lint): resolve ruff errors in the profiling module
  - Remove unused imports (F401)
  - Fix f-string without placeholders (F541)
  - Organize imports (I001)
  - Remove unnecessary mode argument (UP015)
  - Fix redefinition of unused import (F811)

* fix(tests): add skip markers for profiling tests requiring CUDA

  Tests that require the native CUDA module are now skipped when running in a CI environment without GPU support.

* feat(diffusion): add image generation module for SD3, Flux, PixArt (#177)

  Implements complete diffusion model support for text-to-image generation.

  Models:
  - DiT (Diffusion Transformer) with AdaLN conditioning
  - SD3Transformer (MMDiT architecture)
  - FluxTransformer with guidance embedding
  - VAE encoder/decoder with SafeTensors loading

  Schedulers:
  - EulerDiscreteScheduler (SDXL-style)
  - DDIMScheduler (deterministic/stochastic)
  - FlowMatchingScheduler (Rectified Flow for SD3/Flux)

  Operations:
  - GroupNorm (CPU fallback)
  - Cross-attention (non-causal)
  - Conv2D / Conv2DTranspose (im2col)
  - AdaLN / AdaLN-Zero
  - Sinusoidal timestep embedding

  Text encoders:
  - CLIPTextEncoder (OpenCLIP-style)
  - T5Encoder (T5-XXL for SD3/Flux)

  Pipeline:
  - Text2ImagePipeline with a unified interface
  - Demo mode (works without model weights)
  - Batch generation support

  Example:
  - examples/image_generate.py with a CLI interface

* fix(diffusion): resolve mypy type errors in text encoders

  Fix a variable-shadowing issue where input_ids/attention_mask were first defined as lists, then reassigned to numpy arrays, confusing mypy.
  - Add explicit type annotations for input_ids and attention_mask
  - Rename intermediate list variables to ids_list and mask_list

* feat(diffusion): add native CUDA kernels for image generation ops
  - GroupNorm: F32/BF16/FP16 variants for VAE/UNet
  - AdaLN/AdaLN-Zero: adaptive layer normalization for DiT
  - Cross-attention: non-causal attention for text-to-image
  - Conv2D: im2col, col2im, 1x1 and 3x3 direct convolutions

  Files added:
  - native/ops/nn/diffusion/: groupnorm, adaln, cross_attention, conv2d kernels
  - native/bindings/nn/diffusion.cpp: pybind11 bindings

  Python ops updated to use native kernels when available:
  - group_norm.py, adaln.py, cross_attention.py, conv2d.py

* fix(diffusion): fix PixArt-Sigma model loading and inference
  - Fix out_channels from 4 to 8 for PixArt-Sigma (noise + variance)
  - Add transformer subdirectory detection for the HuggingFace diffusers format
  - Add sharded T5 encoder detection with fallback to random embeddings
  - Extract the first 4 channels from the 8-channel noise prediction

  Tested with PixArt-Sigma-XL-2-512-MS:
  - 10 steps in 24.49s (2.449s/step)
  - Output: output/pixart_test.png

* feat(diffusion): add HuggingFace T5 encoder with sharded safetensors support
  - Add HFT5Encoder class using the transformers library for proper T5 encoding
  - Support sharded safetensors loading via the Python safetensors library
  - Auto-detect the tokenizer in the parent/tokenizer directory
  - CPU fallback when PyTorch doesn't support the GPU (e.g., RTX 5090)
  - Update the pipeline to prefer HFT5Encoder over the simple T5Encoder

  Tested with PixArt-Sigma + T5-XXL:
  - T5 encoder on CPU (PyTorch lacks SM120 support)
  - Diffusion model on GPU via PyGPUkit
  - 20 steps in 55.9s (2.795s/step)

* feat(diffusion): add batched_matmul loop fallback for SM120
  - Add _batched_matmul_loop() for when CUTLASS fails (SM120)
  - Use batched_matmul in T5 self-attention (80s -> 30s)
  - Remove HFT5Encoder (PyTorch dependency)
  - T5 now uses native GPU matmul operations

  Performance (RTX 5090, SM120):
  - T5-XXL encoding: 80s -> 30s (2.7x speedup)
  - batched_matmul [64,512,64] @ [64,64,512]: 45 ms

* feat(diffusion): add FLUX.1 transformer implementation

  Implements FLUX.1-schnell text-to-image generation:
  - FluxTransformer with 19 joint + 38 single blocks
  - Joint attention (image-text cross-attention)
  - Single attention (self-attention on the concatenated sequence)
  - Flow matching Euler scheduler
  - GPU-native ops for linear, transpose, matmul, softmax

  Optimizations:
  - GPU-native transpose_4d_0213 (18x faster than numpy)
  - GPU-native transpose_3d_012 for K^T (22x faster)
  - RoPE frequency caching to avoid recomputation

  Known limitations:
  - Modulation, layer_norm, gated_residual use a numpy fallback
  - Generation time ~420s (vs ~3s for diffusers); needs broadcast kernels

* fix(lint): remove unused variables in DiT and FLUX models
  - Remove unused N variable in dit/model.py
  - Fix unused conditioning variable in dit/adaln.py
  - Remove unused imports in flux/blocks.py
  - Remove unused x_np in flux/model.py
  - Add DiT transformer components (PixArt architecture)

* fix(cmake): remove orphaned #endif in diffusion kernels

  The files use #pragma once but had orphaned #endif statements, causing compilation errors.

* fix(cmake): use nbytes() instead of size_bytes() in diffusion.inl

  GPUArray uses the nbytes() method, not size_bytes().

* fix(cmake): use the device_memset wrapper instead of cudaMemset

  Use the project's device_memset wrapper for CUDA API abstraction.

* docs: update README for v0.2.19
  - Add FLUX.1 image generation section
  - Add DiT architecture support documentation
  - Add new GPU operations for diffusion
  - Update roadmap with v0.2.19

* docs: expand v0.2.19 release notes

  Add missing features:
  - Lazy model loading with streaming strategies
  - cuBLAS dynamic loader
  - C++ kernel profiler
  - HuggingFace T5 encoder support
  - Additional GPU operations (cross_attention, conv2d, group_norm)

* chore: bump version to 0.2.19
  - pyproject.toml: 0.2.18 -> 0.2.19
  - benchmark/results.py: 0.2.18 -> 0.2.19
  - Apply ruff format to diffusion modules

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
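
For reference, the loop fallback described in the batched_matmul commit above reduces to one plain GEMM per batch element. A minimal numpy sketch of the idea, not the project's CUDA implementation (`batched_matmul_loop` is an illustrative name):

```python
import numpy as np

def batched_matmul_loop(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """[B, M, K] @ [B, K, N] -> [B, M, N], one GEMM per batch element.

    Mirrors the strategy behind _batched_matmul_loop(): when a fused
    batched kernel (CUTLASS) is unavailable, fall back to a per-batch loop.
    """
    B, M, _ = a.shape
    N = b.shape[2]
    out = np.empty((B, M, N), dtype=a.dtype)
    for i in range(B):
        out[i] = a[i] @ b[i]  # plain per-batch matrix multiply
    return out
```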
1 parent 0b5c13e · commit 7adbe5f

File tree

7 files changed (+148, -13 lines)


README.md

Lines changed: 109 additions & 0 deletions
@@ -99,6 +99,114 @@ They were all observed in production or real benchmarks.

---

## What's New in v0.2.19

### FLUX.1 Image Generation

Text-to-image generation with Black Forest Labs' FLUX.1 model:

```python
from pygpukit.diffusion import FluxPipeline

# Load FLUX.1-schnell (fast variant)
pipeline = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell")

# Generate image
image = pipeline.generate(
    prompt="a photo of a cat sitting on a windowsill",
    num_inference_steps=4,  # schnell uses few steps
    guidance_scale=0.0,  # schnell doesn't use CFG
)
image.save("output.png")
```

| Component | Description |
|-----------|-------------|
| **FluxTransformer** | 19 joint blocks + 38 single blocks |
| **FluxScheduler** | Flow matching Euler scheduler |
| **GPU-native ops** | Transpose, batched matmul, RoPE on GPU |
| **RoPE frequencies** | Cached on GPU for efficient reuse |

### Lazy Model Loading with Streaming

Memory-efficient model loading strategies for large models:

```python
from pygpukit.llm import QwenModel, StreamingStrategy

# Progressive loading - load layers as needed
model = QwenModel.from_safetensors(
    "path/to/model",
    streaming=StreamingStrategy.PROGRESSIVE,
)

# Layer-by-layer streaming for memory-constrained environments
model = QwenModel.from_safetensors(
    "path/to/model",
    streaming=StreamingStrategy.LAYER_BY_LAYER,
)
```

| Strategy | Description |
|----------|-------------|
| `EAGER` | Load all weights at once (default) |
| `PROGRESSIVE` | Load weights progressively during first forward |
| `LAYER_BY_LAYER` | Stream one layer at a time, minimal memory |

### cuBLAS Dynamic Loader

Runtime cuBLAS/cuBLASLt loading without a compile-time CUDA Toolkit dependency (see the availability check sketched below the table):

| Feature | Description |
|---------|-------------|
| **Dynamic DLL loading** | Searches CUDA_PATH, system PATH |
| **Version detection** | Auto-selects cublasLt64_13/12/11.dll |
| **Graceful fallback** | Uses native kernels if cuBLAS unavailable |
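
The release commit lists `cublas_is_available`, `cublas_get_version`, and `cublas_test_*` as the loader's Python bindings. A minimal availability check, assuming these are exported from the top-level package (the exact import path is not confirmed by the release notes):

```python
# Assumed import path; the binding names come from the release commit.
from pygpukit import cublas_is_available, cublas_get_version

if cublas_is_available():
    # The loader found cublas64_13.dll / libcublas.so at runtime
    print(f"cuBLAS loaded, version {cublas_get_version()}")
else:
    # Graceful fallback path: native kernels are used instead
    print("cuBLAS not available; using native kernels")
```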

### C++ Kernel Profiler

Built-in CUDA kernel profiling with minimal overhead:

```python
from pygpukit import enable_profiling, get_profile_stats

enable_profiling(True)
# ... run your code ...
stats = get_profile_stats()
for name, info in stats.items():
    print(f"{name}: {info['avg_ms']:.3f} ms ({info['count']} calls)")
```

### HuggingFace T5 Encoder Support

T5 text encoder with sharded safetensors for FLUX/SD3:

| Feature | Description |
|---------|-------------|
| **Sharded loading** | Supports `model-00001-of-00002.safetensors` format |
| **T5EncoderModel** | Full T5 encoder implementation |
| **Automatic detection** | Finds encoder in model directories |
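
Sharded checkpoints are simply multiple safetensors files whose tensors are merged into one weight dict. A sketch of that idea using the `safetensors` library (`load_sharded` is a hypothetical helper, not the PyGPUkit API):

```python
from pathlib import Path

from safetensors.numpy import load_file

def load_sharded(model_dir: str) -> dict:
    """Merge model-XXXXX-of-YYYYY.safetensors shards into one weight dict."""
    weights = {}
    for shard in sorted(Path(model_dir).glob("model-*-of-*.safetensors")):
        weights.update(load_file(shard))  # each shard holds disjoint tensors
    return weights
```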

### DiT Architecture Support

Diffusion Transformer (DiT) components for PixArt and similar models:

| Module | Description |
|--------|-------------|
| `dit/model.py` | PixArt transformer with AdaLN-Zero |
| `dit/attention.py` | Self/cross attention with GQA |
| `dit/embeddings.py` | Patch embed, timestep embed, 2D sincos pos |
| `dit/adaln.py` | Adaptive LayerNorm modulation |
| `dit/ffn.py` | GEGLU feed-forward network |
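
For readers new to DiT, the modulation in `dit/adaln.py` boils down to scaling and shifting normalized activations with parameters regressed from the conditioning signal. A numpy sketch of the core operation (illustrative only, not the GPU kernel):

```python
import numpy as np

def adaln_modulate(x_norm: np.ndarray, shift: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """AdaLN: modulate LayerNorm output with conditioning-derived shift/scale.

    x_norm: [B, N, D] normalized tokens; shift, scale: [B, D] regressed from
    the timestep/class embedding, broadcast across the N sequence positions.
    AdaLN-Zero additionally gates the block output with a learned gate
    initialized to zero.
    """
    return x_norm * (1.0 + scale[:, None, :]) + shift[:, None, :]
```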

### New GPU Operations

| Operation | Description |
|-----------|-------------|
| `transpose_4d_0213` | GPU-native 4D transpose [B,S,H,D] -> [B,H,S,D] |
| `transpose_3d_012` | GPU-native 3D transpose [B,S,D] -> [B,D,S] |
| `gpu_batched_matmul` | Batched matrix multiplication |
| `gpu_softmax` | GPU-native softmax |
| `gpu_apply_rope` | Apply rotary position embedding |
| `cross_attention` | Cross-attention for text conditioning |
| `conv2d` | 2D convolution for VAE/UNet |
| `group_norm` | Group normalization |
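
The shape conventions in this table map directly onto numpy permutations, which is a convenient way to sanity-check GPU results. A reference sketch (numpy only; the actual ops run on the GPU):

```python
import numpy as np

B, S, H, D = 2, 512, 16, 64
x = np.random.rand(B, S, H, D).astype(np.float32)

# transpose_4d_0213 reference: [B, S, H, D] -> [B, H, S, D]
ref_4d = x.transpose(0, 2, 1, 3)

y = np.random.rand(B, S, D).astype(np.float32)

# transpose_3d_012 reference: [B, S, D] -> [B, D, S]
ref_3d = y.transpose(0, 2, 1)
```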
---

## What's New in v0.2.18

### Major Codebase Refactoring

@@ -595,6 +703,7 @@ PyGPUkit/

| **v0.2.16** | **MoE support** (Mixtral), Thinking models (Qwen3), W8A8/W4A4 GEMV, W8A16/Int8/Int4 GEMM, Kernel restructure |
| **v0.2.17** | **Triton backend** MVP, hybrid execution (Triton + Native CUDA), TritonArray wrapper |
| **v0.2.18** | **Codebase refactoring**, Kokoro TTS, Positional encoding (PoPE/ALiBi/YaRN/NTK), ReLU², Unified benchmark, BF16 GEMV (98% BW), W8A16 fix |
| **v0.2.19** | **FLUX.1 image generation**, Lazy model loading (streaming), cuBLAS dynamic loader, C++ kernel profiler, T5 encoder, DiT architecture, GPU-native diffusion ops |

### Planned

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ build-backend = "scikit_build_core.build"

 [project]
 name = "PyGPUkit"
-version = "0.2.18"
+version = "0.2.19"
 description = "A lightweight GPU runtime for Python with Rust-powered scheduler, NVRTC JIT compilation, and NumPy-like API"
 readme = "README.md"
 license = "MIT"

src/pygpukit/benchmark/results.py

Lines changed: 1 addition & 1 deletion

@@ -57,7 +57,7 @@ class BenchmarkReport:

     gpu: GPUInfo
     results: list[BenchmarkResult] = field(default_factory=list)
     timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
-    version: str = "0.2.18"
+    version: str = "0.2.19"

     def add(self, result: BenchmarkResult) -> None:
         self.results.append(result)

src/pygpukit/diffusion/models/dit/embeddings.py

Lines changed: 3 additions & 3 deletions

@@ -64,10 +64,10 @@ def get_2d_sincos_pos_embed(embed_dim: int, grid_size: int | tuple[int, int]) ->

     # Create 2D grid in column-major order (h varies first)
     # This matches diffusers: for each column, iterate through rows
-    h_grid, w_grid = np.meshgrid(grid_h_pos, grid_w_pos, indexing='ij')
+    h_grid, w_grid = np.meshgrid(grid_h_pos, grid_w_pos, indexing="ij")
     # Flatten in Fortran order (column-major) to match diffusers patch ordering
-    h_flat = h_grid.flatten('F')  # [H*W]
-    w_flat = w_grid.flatten('F')  # [H*W]
+    h_flat = h_grid.flatten("F")  # [H*W]
+    w_flat = w_grid.flatten("F")  # [H*W]

     # Get embeddings for each dimension
     emb_h = sinusoidal_embedding(h_flat, embed_dim // 2)  # height embedding
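
The column-major flatten this hunk normalizes is easy to sanity-check: with `indexing="ij"` plus `flatten("F")`, h varies fastest, matching the diffusers patch ordering the comments describe. A tiny illustration:

```python
import numpy as np

h_grid, w_grid = np.meshgrid(np.arange(2), np.arange(3), indexing="ij")
print(h_grid.flatten("F"))  # [0 1 0 1 0 1] - h varies first
print(w_grid.flatten("F"))  # [0 0 1 1 2 2]
```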

src/pygpukit/diffusion/models/dit/model.py

Lines changed: 30 additions & 7 deletions

@@ -98,7 +98,11 @@ def from_safetensors(

     # Detect spec from weights
     hidden_size = weights["pos_embed.proj.bias"].shape[0]
-    num_blocks = sum(1 for k in weights if k.startswith("transformer_blocks.") and k.endswith(".attn1.to_q.weight"))
+    num_blocks = sum(
+        1
+        for k in weights
+        if k.startswith("transformer_blocks.") and k.endswith(".attn1.to_q.weight")
+    )

     spec = PixArtSpec(
         name="pixart_sigma",

@@ -179,7 +183,9 @@ def _patch_embed(self, x: GPUArray) -> GPUArray:
     # Add 2D sinusoidal positional embedding
     pos_embed = get_2d_sincos_pos_embed(self.hidden_size, (h_patches, w_patches))
     x_proj_np = x_proj.to_numpy()
-    x_proj_np = x_proj_np + pos_embed[None, :, :]  # [1, num_patches, D] broadcast to [B, num_patches, D]
+    x_proj_np = (
+        x_proj_np + pos_embed[None, :, :]
+    )  # [1, num_patches, D] broadcast to [B, num_patches, D]

     return from_numpy(x_proj_np.astype(np.float32))

@@ -325,8 +331,15 @@ def _self_attention(self, x: GPUArray, layer_idx: int) -> GPUArray:
         return x

     return self_attention(
-        x, q_w, k_w, v_w, out_w,
-        q_b, k_b, v_b, out_b,
+        x,
+        q_w,
+        k_w,
+        v_w,
+        out_w,
+        q_b,
+        k_b,
+        v_b,
+        out_b,
         num_heads=self.num_heads,
     )

@@ -348,8 +361,16 @@ def _cross_attention(self, x: GPUArray, context: GPUArray, layer_idx: int) -> GP
         return from_numpy(np.zeros_like(x.to_numpy()))

     return cross_attention(
-        x, context, q_w, k_w, v_w, out_w,
-        q_b, k_b, v_b, out_b,
+        x,
+        context,
+        q_w,
+        k_w,
+        v_w,
+        out_w,
+        q_b,
+        k_b,
+        v_b,
+        out_b,
         num_heads=self.num_heads,
     )

@@ -398,7 +419,9 @@ def _final_layer(self, x: GPUArray, t_emb: GPUArray, H: int, W: int) -> GPUArray

     if proj_w is not None:
         return unpatchify(
-            x, H, W,
+            x,
+            H,
+            W,
             out_channels=self.spec.out_channels,
             patch_size=self.patch_size,
             proj_weight=proj_w,

src/pygpukit/diffusion/models/flux/blocks.py

Lines changed: 1 addition & 0 deletions

@@ -161,6 +161,7 @@ def joint_block(
     Returns:
         Tuple of (image_output, text_output).
     """
+
     # Get weights helper
     def get_weight(name: str) -> GPUArray | None:
         return weights.get(f"{prefix}.{name}")

src/pygpukit/diffusion/models/flux/model.py

Lines changed: 3 additions & 1 deletion

@@ -270,7 +270,9 @@ def forward(
     # [B, txt_seq_len, 4096] -> [B, txt_seq_len, hidden_size]
     txt_2d = encoder_hidden_states.reshape(B * txt_seq_len, self.config.joint_attention_dim)
     txt = gpu_linear(
-        txt_2d, self.weights["context_embedder.weight"], self.weights.get("context_embedder.bias")
+        txt_2d,
+        self.weights["context_embedder.weight"],
+        self.weights.get("context_embedder.bias"),
     )
     txt = txt.reshape(B, txt_seq_len, self.config.hidden_size)