Commit 7adbe5f

m96-chan and claude authored
Release v0.2.19 (#188)
* feat: add cuBLAS dynamic loader and C++ kernel profiler (#134, #150)

  cuBLAS Dynamic Loader (Issue #134):
  - Dynamic loading of the cuBLAS library (cublas64_13.dll / libcublas.so)
  - Supports GEMM: sgemm, dgemm, hgemm, gemm_ex (mixed precision)
  - Supports GEMV: sgemv, dgemv
  - Row-major convenience wrappers for the Python API
  - Python bindings: cublas_is_available, cublas_get_version, cublas_test_*

  C++ Kernel Profiler (Issue #150):
  - Native C++ profiler using the CUDA Driver API (cuEvent*)
  - ScopedTimer class for RAII-based timing
  - KernelProfiler for aggregating multiple kernel records
  - Python bindings with automatic native backend detection
  - Chrome trace export support

  Test results (RTX 5090, CUDA 13.1):
  - cuBLAS loaded: cublas64_13.dll v13.2.0
  - SGEMM/HGEMM/DGEMM: all pass
  - Profiler: native C++ backend active

* feat(llm): add lazy model loading with streaming strategies (#159)

  Memory-mapped model loading with on-demand GPU loading for large models (70B+).

  Core implementation (Rust):
  - LazyTensor: GPU caching with LRU eviction
  - LazyModelLoader: multi-file SafeTensors loader with memory budgeting
  - TensorState enum: OnDisk, Loading, OnGpu, Evicted
  - Layer management: get_layer_tensors, layer_size, is_layer_loaded, layer_state

  Loading strategies (Python):
  - SimpleStreaming: load/unload each layer (minimal VRAM)
  - SlidingWindow: keep N layers, prefetch ahead (balanced)
  - AutoLRU: automatic LRU eviction (best performance)

  API:
  - LazyModelLoader(memory_budget, enable_eviction)
  - LayerStreamingContext for managed streaming
  - create_streaming_context() factory function

  Usage:

  ```python
  loader = LazyModelLoader(memory_budget=8 * 1024**3)
  loader.load_file("model.safetensors")

  with LayerStreamingContext(loader, SlidingWindow(4), num_layers=32) as ctx:
      for i in range(32):
          ctx.prepare(i)
          hidden = layers[i](hidden)
  ```

* fix(lint): resolve ruff B027 and UP037 errors in streaming.py

* style: apply ruff format to streaming.py

* fix(lint): resolve ruff errors in the profiling module
  - Remove unused imports (F401)
  - Fix f-string without placeholders (F541)
  - Organize imports (I001)
  - Remove unnecessary mode argument (UP015)
  - Fix redefinition of unused import (F811)

* fix(tests): add skip markers for profiling tests requiring CUDA

  Tests that require the native CUDA module are now skipped when running in a CI environment without GPU support.

* feat(diffusion): add image generation module for SD3, Flux, PixArt (#177)

  Implements complete diffusion model support for text-to-image generation.

  Models:
  - DiT (Diffusion Transformer) with AdaLN conditioning
  - SD3Transformer (MMDiT architecture)
  - FluxTransformer with guidance embedding
  - VAE encoder/decoder with SafeTensors loading

  Schedulers:
  - EulerDiscreteScheduler (SDXL-style)
  - DDIMScheduler (deterministic/stochastic)
  - FlowMatchingScheduler (Rectified Flow for SD3/Flux)

  Operations:
  - GroupNorm (CPU fallback)
  - Cross-attention (non-causal)
  - Conv2D / Conv2DTranspose (im2col)
  - AdaLN / AdaLN-Zero
  - Sinusoidal timestep embedding

  Text encoders:
  - CLIPTextEncoder (OpenCLIP-style)
  - T5Encoder (T5-XXL for SD3/Flux)

  Pipeline:
  - Text2ImagePipeline with a unified interface
  - Demo mode (works without model weights)
  - Batch generation support

  Example:
  - examples/image_generate.py with a CLI interface

* fix(diffusion): resolve mypy type errors in text encoders

  Fix a variable-shadowing issue where input_ids/attention_mask were first defined as lists, then reassigned to numpy arrays, confusing mypy.
  - Add explicit type annotations for input_ids and attention_mask
  - Rename intermediate list variables to ids_list and mask_list

* feat(diffusion): add native CUDA kernels for image generation ops
  - GroupNorm: F32/BF16/FP16 variants for VAE/UNet
  - AdaLN/AdaLN-Zero: adaptive layer normalization for DiT
  - Cross-attention: non-causal attention for text-to-image
  - Conv2D: im2col, col2im, 1x1 and 3x3 direct convolutions

  Files added:
  - native/ops/nn/diffusion/: groupnorm, adaln, cross_attention, conv2d kernels
  - native/bindings/nn/diffusion.cpp: pybind11 bindings

  Python ops updated to use native kernels when available:
  - group_norm.py, adaln.py, cross_attention.py, conv2d.py

* fix(diffusion): fix PixArt-Sigma model loading and inference
  - Fix out_channels from 4 to 8 for PixArt-Sigma (noise + variance)
  - Add transformer subdirectory detection for the HuggingFace diffusers format
  - Add sharded T5 encoder detection with fallback to random embeddings
  - Extract the first 4 channels from the 8-channel noise prediction

  Tested with PixArt-Sigma-XL-2-512-MS:
  - 10 steps in 24.49s (2.449s/step)
  - Output: output/pixart_test.png

* feat(diffusion): add HuggingFace T5 encoder with sharded safetensors support
  - Add HFT5Encoder class using the transformers library for proper T5 encoding
  - Support sharded safetensors loading via the Python safetensors library
  - Auto-detect the tokenizer in the parent/tokenizer directory
  - CPU fallback when PyTorch doesn't support the GPU (e.g., RTX 5090)
  - Update the pipeline to prefer HFT5Encoder over the simple T5Encoder

  Tested with PixArt-Sigma + T5-XXL:
  - T5 encoder on CPU (PyTorch lacks SM120 support)
  - Diffusion model on GPU via PyGPUkit
  - 20 steps in 55.9s (2.795s/step)

* feat(diffusion): add batched_matmul loop fallback for SM120
  - Add _batched_matmul_loop() for when CUTLASS fails (SM120)
  - Use batched_matmul in T5 self-attention (80s -> 30s)
  - Remove HFT5Encoder (PyTorch dependency)
  - T5 now uses native GPU matmul operations

  Performance (RTX 5090, SM120):
  - T5-XXL encoding: 80s -> 30s (2.7x speedup)
  - batched_matmul [64,512,64] @ [64,64,512]: 45 ms

* feat(diffusion): add FLUX.1 transformer implementation

  Implements FLUX.1-schnell text-to-image generation:
  - FluxTransformer with 19 joint + 38 single blocks
  - Joint attention (image-text cross-attention)
  - Single attention (self-attention on the concatenated sequence)
  - Flow matching Euler scheduler
  - GPU-native ops for linear, transpose, matmul, softmax

  Optimizations:
  - GPU-native transpose_4d_0213 (18x faster than numpy)
  - GPU-native transpose_3d_012 for K^T (22x faster)
  - RoPE frequency caching to avoid recomputation

  Known limitations:
  - Modulation, layer_norm, gated_residual use a numpy fallback
  - Generation time ~420s (vs ~3s for diffusers); needs broadcast kernels

* fix(lint): remove unused variables in DiT and FLUX models
  - Remove unused N variable in dit/model.py
  - Fix unused conditioning variable in dit/adaln.py
  - Remove unused imports in flux/blocks.py
  - Remove unused x_np in flux/model.py
  - Add DiT transformer components (PixArt architecture)

* fix(cmake): remove orphaned #endif in diffusion kernels

  The files use #pragma once but had orphaned #endif statements, causing compilation errors.

* fix(cmake): use nbytes() instead of size_bytes() in diffusion.inl

  GPUArray uses the nbytes() method, not size_bytes().

* fix(cmake): use the device_memset wrapper instead of cudaMemset

  Use the project's device_memset wrapper for CUDA API abstraction.

* docs: update README for v0.2.19
  - Add FLUX.1 image generation section
  - Add DiT architecture support documentation
  - Add new GPU operations for diffusion
  - Update roadmap with v0.2.19

* docs: expand v0.2.19 release notes

  Add missing features:
  - Lazy model loading with streaming strategies
  - cuBLAS dynamic loader
  - C++ kernel profiler
  - HuggingFace T5 encoder support
  - Additional GPU operations (cross_attention, conv2d, group_norm)

* chore: bump version to 0.2.19
  - pyproject.toml: 0.2.18 -> 0.2.19
  - benchmark/results.py: 0.2.18 -> 0.2.19
  - Apply ruff format to diffusion modules

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
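
For reference, the loop fallback described in the batched_matmul commit above reduces to one plain GEMM per batch element. A minimal numpy sketch of the idea, not the project's CUDA implementation (`batched_matmul_loop` is an illustrative name):

```python
import numpy as np

def batched_matmul_loop(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """[B, M, K] @ [B, K, N] -> [B, M, N], one GEMM per batch element.

    Mirrors the strategy behind _batched_matmul_loop(): when a fused
    batched kernel (CUTLASS) is unavailable, fall back to a per-batch loop.
    """
    B, M, _ = a.shape
    N = b.shape[2]
    out = np.empty((B, M, N), dtype=a.dtype)
    for i in range(B):
        out[i] = a[i] @ b[i]  # plain per-batch matrix multiply
    return out
```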
1 parent 0b5c13e · commit 7adbe5f

File tree

7 files changed (+148, -13 lines)


README.md

Lines changed: 109 additions & 0 deletions
@@ -99,6 +99,114 @@ They were all observed in production or real benchmarks.

---

## What's New in v0.2.19

### FLUX.1 Image Generation

Text-to-image generation with Black Forest Labs' FLUX.1 model:

```python
from pygpukit.diffusion import FluxPipeline

# Load FLUX.1-schnell (fast variant)
pipeline = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell")

# Generate image
image = pipeline.generate(
    prompt="a photo of a cat sitting on a windowsill",
    num_inference_steps=4,  # schnell uses few steps
    guidance_scale=0.0,  # schnell doesn't use CFG
)
image.save("output.png")
```

| Component | Description |
|-----------|-------------|
| **FluxTransformer** | 19 joint blocks + 38 single blocks |
| **FluxScheduler** | Flow matching Euler scheduler |
| **GPU-native ops** | Transpose, batched matmul, RoPE on GPU |
| **RoPE frequencies** | Cached on GPU for efficient reuse |

### Lazy Model Loading with Streaming

Memory-efficient model loading strategies for large models:

```python
from pygpukit.llm import QwenModel, StreamingStrategy

# Progressive loading - load layers as needed
model = QwenModel.from_safetensors(
    "path/to/model",
    streaming=StreamingStrategy.PROGRESSIVE,
)

# Layer-by-layer streaming for memory-constrained environments
model = QwenModel.from_safetensors(
    "path/to/model",
    streaming=StreamingStrategy.LAYER_BY_LAYER,
)
```

| Strategy | Description |
|----------|-------------|
| `EAGER` | Load all weights at once (default) |
| `PROGRESSIVE` | Load weights progressively during first forward |
| `LAYER_BY_LAYER` | Stream one layer at a time, minimal memory |

### cuBLAS Dynamic Loader

Runtime cuBLAS/cuBLASLt loading without a compile-time CUDA Toolkit dependency (see the availability check sketched below the table):

| Feature | Description |
|---------|-------------|
| **Dynamic DLL loading** | Searches CUDA_PATH, system PATH |
| **Version detection** | Auto-selects cublasLt64_13/12/11.dll |
| **Graceful fallback** | Uses native kernels if cuBLAS unavailable |
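
The release commit lists `cublas_is_available`, `cublas_get_version`, and `cublas_test_*` as the loader's Python bindings. A minimal availability check, assuming these are exported from the top-level package (the exact import path is not confirmed by the release notes):

```python
# Assumed import path; the binding names come from the release commit.
from pygpukit import cublas_is_available, cublas_get_version

if cublas_is_available():
    # The loader found cublas64_13.dll / libcublas.so at runtime
    print(f"cuBLAS loaded, version {cublas_get_version()}")
else:
    # Graceful fallback path: native kernels are used instead
    print("cuBLAS not available; using native kernels")
```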

### C++ Kernel Profiler

Built-in CUDA kernel profiling with minimal overhead:

```python
from pygpukit import enable_profiling, get_profile_stats

enable_profiling(True)
# ... run your code ...
stats = get_profile_stats()
for name, info in stats.items():
    print(f"{name}: {info['avg_ms']:.3f} ms ({info['count']} calls)")
```

### HuggingFace T5 Encoder Support

T5 text encoder with sharded safetensors for FLUX/SD3:

| Feature | Description |
|---------|-------------|
| **Sharded loading** | Supports `model-00001-of-00002.safetensors` format |
| **T5EncoderModel** | Full T5 encoder implementation |
| **Automatic detection** | Finds encoder in model directories |
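
Sharded checkpoints are simply multiple safetensors files whose tensors are merged into one weight dict. A sketch of that idea using the `safetensors` library (`load_sharded` is a hypothetical helper, not the PyGPUkit API):

```python
from pathlib import Path

from safetensors.numpy import load_file

def load_sharded(model_dir: str) -> dict:
    """Merge model-XXXXX-of-YYYYY.safetensors shards into one weight dict."""
    weights = {}
    for shard in sorted(Path(model_dir).glob("model-*-of-*.safetensors")):
        weights.update(load_file(shard))  # each shard holds disjoint tensors
    return weights
```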

### DiT Architecture Support

Diffusion Transformer (DiT) components for PixArt and similar models:

| Module | Description |
|--------|-------------|
| `dit/model.py` | PixArt transformer with AdaLN-Zero |
| `dit/attention.py` | Self/cross attention with GQA |
| `dit/embeddings.py` | Patch embed, timestep embed, 2D sincos pos |
| `dit/adaln.py` | Adaptive LayerNorm modulation |
| `dit/ffn.py` | GEGLU feed-forward network |
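
For readers new to DiT, the modulation in `dit/adaln.py` boils down to scaling and shifting normalized activations with parameters regressed from the conditioning signal. A numpy sketch of the core operation (illustrative only, not the GPU kernel):

```python
import numpy as np

def adaln_modulate(x_norm: np.ndarray, shift: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """AdaLN: modulate LayerNorm output with conditioning-derived shift/scale.

    x_norm: [B, N, D] normalized tokens; shift, scale: [B, D] regressed from
    the timestep/class embedding, broadcast across the N sequence positions.
    AdaLN-Zero additionally gates the block output with a learned gate
    initialized to zero.
    """
    return x_norm * (1.0 + scale[:, None, :]) + shift[:, None, :]
```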

### New GPU Operations

| Operation | Description |
|-----------|-------------|
| `transpose_4d_0213` | GPU-native 4D transpose [B,S,H,D] -> [B,H,S,D] |
| `transpose_3d_012` | GPU-native 3D transpose [B,S,D] -> [B,D,S] |
| `gpu_batched_matmul` | Batched matrix multiplication |
| `gpu_softmax` | GPU-native softmax |
| `gpu_apply_rope` | Apply rotary position embedding |
| `cross_attention` | Cross-attention for text conditioning |
| `conv2d` | 2D convolution for VAE/UNet |
| `group_norm` | Group normalization |
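
The shape conventions in this table map directly onto numpy permutations, which is a convenient way to sanity-check GPU results. A reference sketch (numpy only; the actual ops run on the GPU):

```python
import numpy as np

B, S, H, D = 2, 512, 16, 64
x = np.random.rand(B, S, H, D).astype(np.float32)

# transpose_4d_0213 reference: [B, S, H, D] -> [B, H, S, D]
ref_4d = x.transpose(0, 2, 1, 3)

y = np.random.rand(B, S, D).astype(np.float32)

# transpose_3d_012 reference: [B, S, D] -> [B, D, S]
ref_3d = y.transpose(0, 2, 1)
```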
---

## What's New in v0.2.18

### Major Codebase Refactoring

@@ -595,6 +703,7 @@ PyGPUkit/

| **v0.2.16** | **MoE support** (Mixtral), Thinking models (Qwen3), W8A8/W4A4 GEMV, W8A16/Int8/Int4 GEMM, Kernel restructure |
| **v0.2.17** | **Triton backend** MVP, hybrid execution (Triton + Native CUDA), TritonArray wrapper |
| **v0.2.18** | **Codebase refactoring**, Kokoro TTS, Positional encoding (PoPE/ALiBi/YaRN/NTK), ReLU², Unified benchmark, BF16 GEMV (98% BW), W8A16 fix |
| **v0.2.19** | **FLUX.1 image generation**, Lazy model loading (streaming), cuBLAS dynamic loader, C++ kernel profiler, T5 encoder, DiT architecture, GPU-native diffusion ops |

### Planned

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ build-backend = "scikit_build_core.build"

 [project]
 name = "PyGPUkit"
-version = "0.2.18"
+version = "0.2.19"
 description = "A lightweight GPU runtime for Python with Rust-powered scheduler, NVRTC JIT compilation, and NumPy-like API"
 readme = "README.md"
 license = "MIT"

src/pygpukit/benchmark/results.py

Lines changed: 1 addition & 1 deletion

@@ -57,7 +57,7 @@ class BenchmarkReport:

     gpu: GPUInfo
     results: list[BenchmarkResult] = field(default_factory=list)
     timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
-    version: str = "0.2.18"
+    version: str = "0.2.19"

     def add(self, result: BenchmarkResult) -> None:
         self.results.append(result)

src/pygpukit/diffusion/models/dit/embeddings.py

Lines changed: 3 additions & 3 deletions

@@ -64,10 +64,10 @@ def get_2d_sincos_pos_embed(embed_dim: int, grid_size: int | tuple[int, int]) ->

     # Create 2D grid in column-major order (h varies first)
     # This matches diffusers: for each column, iterate through rows
-    h_grid, w_grid = np.meshgrid(grid_h_pos, grid_w_pos, indexing='ij')
+    h_grid, w_grid = np.meshgrid(grid_h_pos, grid_w_pos, indexing="ij")
     # Flatten in Fortran order (column-major) to match diffusers patch ordering
-    h_flat = h_grid.flatten('F')  # [H*W]
-    w_flat = w_grid.flatten('F')  # [H*W]
+    h_flat = h_grid.flatten("F")  # [H*W]
+    w_flat = w_grid.flatten("F")  # [H*W]

     # Get embeddings for each dimension
     emb_h = sinusoidal_embedding(h_flat, embed_dim // 2)  # height embedding
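
The column-major flatten this hunk normalizes is easy to sanity-check: with `indexing="ij"` plus `flatten("F")`, h varies fastest, matching the diffusers patch ordering the comments describe. A tiny illustration:

```python
import numpy as np

h_grid, w_grid = np.meshgrid(np.arange(2), np.arange(3), indexing="ij")
print(h_grid.flatten("F"))  # [0 1 0 1 0 1] - h varies first
print(w_grid.flatten("F"))  # [0 0 1 1 2 2]
```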

src/pygpukit/diffusion/models/dit/model.py

Lines changed: 30 additions & 7 deletions

@@ -98,7 +98,11 @@ def from_safetensors(

     # Detect spec from weights
     hidden_size = weights["pos_embed.proj.bias"].shape[0]
-    num_blocks = sum(1 for k in weights if k.startswith("transformer_blocks.") and k.endswith(".attn1.to_q.weight"))
+    num_blocks = sum(
+        1
+        for k in weights
+        if k.startswith("transformer_blocks.") and k.endswith(".attn1.to_q.weight")
+    )

     spec = PixArtSpec(
         name="pixart_sigma",

@@ -179,7 +183,9 @@ def _patch_embed(self, x: GPUArray) -> GPUArray:
     # Add 2D sinusoidal positional embedding
     pos_embed = get_2d_sincos_pos_embed(self.hidden_size, (h_patches, w_patches))
     x_proj_np = x_proj.to_numpy()
-    x_proj_np = x_proj_np + pos_embed[None, :, :]  # [1, num_patches, D] broadcast to [B, num_patches, D]
+    x_proj_np = (
+        x_proj_np + pos_embed[None, :, :]
+    )  # [1, num_patches, D] broadcast to [B, num_patches, D]

     return from_numpy(x_proj_np.astype(np.float32))

@@ -325,8 +331,15 @@ def _self_attention(self, x: GPUArray, layer_idx: int) -> GPUArray:
         return x

     return self_attention(
-        x, q_w, k_w, v_w, out_w,
-        q_b, k_b, v_b, out_b,
+        x,
+        q_w,
+        k_w,
+        v_w,
+        out_w,
+        q_b,
+        k_b,
+        v_b,
+        out_b,
         num_heads=self.num_heads,
     )

@@ -348,8 +361,16 @@ def _cross_attention(self, x: GPUArray, context: GPUArray, layer_idx: int) -> GP
         return from_numpy(np.zeros_like(x.to_numpy()))

     return cross_attention(
-        x, context, q_w, k_w, v_w, out_w,
-        q_b, k_b, v_b, out_b,
+        x,
+        context,
+        q_w,
+        k_w,
+        v_w,
+        out_w,
+        q_b,
+        k_b,
+        v_b,
+        out_b,
         num_heads=self.num_heads,
     )

@@ -398,7 +419,9 @@ def _final_layer(self, x: GPUArray, t_emb: GPUArray, H: int, W: int) -> GPUArray

     if proj_w is not None:
         return unpatchify(
-            x, H, W,
+            x,
+            H,
+            W,
             out_channels=self.spec.out_channels,
             patch_size=self.patch_size,
             proj_weight=proj_w,

src/pygpukit/diffusion/models/flux/blocks.py

Lines changed: 1 addition & 0 deletions

@@ -161,6 +161,7 @@ def joint_block(
     Returns:
         Tuple of (image_output, text_output).
     """
+
     # Get weights helper
     def get_weight(name: str) -> GPUArray | None:
         return weights.get(f"{prefix}.{name}")

src/pygpukit/diffusion/models/flux/model.py

Lines changed: 3 additions & 1 deletion

@@ -270,7 +270,9 @@ def forward(
     # [B, txt_seq_len, 4096] -> [B, txt_seq_len, hidden_size]
     txt_2d = encoder_hidden_states.reshape(B * txt_seq_len, self.config.joint_attention_dim)
     txt = gpu_linear(
-        txt_2d, self.weights["context_embedder.weight"], self.weights.get("context_embedder.bias")
+        txt_2d,
+        self.weights["context_embedder.weight"],
+        self.weights.get("context_embedder.bias"),
     )
     txt = txt.reshape(B, txt_seq_len, self.config.hidden_size)