Commit 7adbe5f
Release v0.2.19 (#188)
* feat: add cuBLAS dynamic loader and C++ kernel profiler (#134, #150)
## cuBLAS Dynamic Loader (Issue #134)
- Dynamic loading of cuBLAS library (cublas64_13.dll / libcublas.so)
- Supports GEMM: sgemm, dgemm, hgemm, gemm_ex (mixed precision)
- Supports GEMV: sgemv, dgemv
- Row-major convenience wrappers for Python API
- Python bindings: cublas_is_available, cublas_get_version, cublas_test_*
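A minimal usage sketch of the Python bindings (the flat import path is an assumption; only the function names listed above come from the loader itself):
```python
# Hypothetical import location; the bindings may live in a submodule instead.
from pygpukit import cublas_is_available, cublas_get_version

if cublas_is_available():
    # Reports the version of the dynamically loaded cublas64_13.dll / libcublas.so.
    print("cuBLAS version:", cublas_get_version())
else:
    print("cuBLAS not found; GEMM/GEMV wrappers are unavailable")
```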
## C++ Kernel Profiler (Issue #150)
- Native C++ profiler using CUDA Driver API (cuEvent*)
- ScopedTimer class for RAII-based timing
- KernelProfiler for aggregating multiple kernel records
- Python bindings with automatic native backend detection
- Chrome trace export support
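A hedged usage sketch of the Python side (module path and method names are assumptions, not the confirmed API):
```python
# Illustrative only: the module path, scope(), and export_chrome_trace() are assumed names.
from pygpukit.profiling import KernelProfiler

def launch_kernel():
    ...  # placeholder for an actual kernel launch

profiler = KernelProfiler()  # picks the native C++ backend when present, else a Python timer
with profiler.scope("sgemm_1024"):  # context-manager analogue of the C++ ScopedTimer
    launch_kernel()
profiler.export_chrome_trace("trace.json")  # Chrome trace export mentioned above
```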
Test results (RTX 5090, CUDA 13.1):
- cuBLAS loaded: cublas64_13.dll v13.2.0
- SGEMM/HGEMM/DGEMM: all pass
- Profiler: native C++ backend active
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(llm): add lazy model loading with streaming strategies (#159)
Add memory-mapped model loading with on-demand GPU transfer for large models (70B+).
## Core Implementation (Rust)
- LazyTensor: GPU caching with LRU eviction
- LazyModelLoader: Multi-file SafeTensors loader with memory budgeting
- TensorState enum: OnDisk, Loading, OnGpu, Evicted
- Layer management: get_layer_tensors, layer_size, is_layer_loaded, layer_state
## Loading Strategies (Python)
- SimpleStreaming: Load/unload each layer (minimal VRAM)
- SlidingWindow: Keep N layers, prefetch ahead (balanced)
- AutoLRU: Automatic LRU eviction (best performance)
## API
- LazyModelLoader(memory_budget, enable_eviction)
- LayerStreamingContext for managed streaming
- create_streaming_context() factory function
## Usage
```python
# LazyModelLoader, LayerStreamingContext, and SlidingWindow come from the API above.
loader = LazyModelLoader(memory_budget=8 * 1024**3)  # 8 GiB GPU budget
loader.load_file("model.safetensors")
with LayerStreamingContext(loader, SlidingWindow(4), num_layers=32) as ctx:
    for i in range(32):
        ctx.prepare(i)  # ensure layer i (plus the prefetch window) is resident on the GPU
        hidden = layers[i](hidden)  # `layers` / `hidden` come from the surrounding model code
```
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(lint): resolve ruff B027 and UP037 errors in streaming.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* style: apply ruff format to streaming.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(lint): resolve ruff errors in profiling module
- Remove unused imports (F401)
- Fix f-string without placeholders (F541)
- Organize imports (I001)
- Remove unnecessary mode argument (UP015)
- Fix redefinition of unused import (F811)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(tests): add skip markers for profiling tests requiring CUDA
Tests that require the native CUDA module are now skipped when running
in a CI environment without GPU support.
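The guard follows the standard pytest skip-marker pattern; a generic sketch (the native module name is an assumption):
```python
import pytest

try:
    from pygpukit import _native  # hypothetical name for the native CUDA extension
    HAS_CUDA = True
except ImportError:
    HAS_CUDA = False

# Applied to the profiling tests so CI runners without a GPU skip them cleanly.
requires_cuda = pytest.mark.skipif(not HAS_CUDA, reason="native CUDA module not available")

@requires_cuda
def test_scoped_timer_records_kernel():
    ...
```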
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(diffusion): add image generation module for SD3, Flux, PixArt (#177)
Implements complete diffusion model support for text-to-image generation:
Models:
- DiT (Diffusion Transformer) with AdaLN conditioning
- SD3Transformer (MMDiT architecture)
- FluxTransformer with guidance embedding
- VAE encoder/decoder with SafeTensors loading
Schedulers:
- EulerDiscreteScheduler (SDXL-style)
- DDIMScheduler (deterministic/stochastic)
- FlowMatchingScheduler (Rectified Flow for SD3/Flux)
Operations:
- GroupNorm (CPU fallback)
- Cross-Attention (non-causal)
- Conv2D / Conv2DTranspose (im2col)
- AdaLN / AdaLN-Zero
- Sinusoidal timestep embedding
Text Encoders:
- CLIPTextEncoder (OpenCLIP-style)
- T5Encoder (T5-XXL for SD3/Flux)
Pipeline:
- Text2ImagePipeline with unified interface
- Demo mode (works without model weights)
- Batch generation support
Example:
- examples/image_generate.py with CLI interface
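A minimal sketch of driving the pipeline in demo mode (constructor and method names beyond Text2ImagePipeline are illustrative assumptions; examples/image_generate.py is the real entry point):
```python
# Illustrative only: generate() arguments and the return type are assumptions.
from pygpukit.diffusion import Text2ImagePipeline

pipe = Text2ImagePipeline()              # demo mode: runs without model weights
images = pipe.generate(
    prompt="a watercolor fox in the snow",
    num_inference_steps=20,              # scheduler steps (Euler / DDIM / flow matching)
    batch_size=2,                        # batch generation support
)
images[0].save("output/demo.png")        # assumes PIL-style image objects
```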
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(diffusion): resolve mypy type errors in text encoders
Fix variable shadowing issue where input_ids/attention_mask were first
defined as lists then reassigned to numpy arrays, confusing mypy.
- Add explicit type annotations for input_ids and attention_mask
- Rename intermediate list variables to ids_list and mask_list
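The pattern behind the fix, in a simplified sketch (tokens assumed pre-padded to equal length):
```python
import numpy as np

def encode(tokens: list[list[int]]) -> tuple[np.ndarray, np.ndarray]:
    # Build plain Python lists first, under distinct names...
    ids_list: list[list[int]] = list(tokens)
    mask_list: list[list[int]] = [[1] * len(t) for t in tokens]
    # ...then convert once, so input_ids / attention_mask only ever hold arrays.
    input_ids: np.ndarray = np.asarray(ids_list, dtype=np.int64)
    attention_mask: np.ndarray = np.asarray(mask_list, dtype=np.int64)
    return input_ids, attention_mask
```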
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(diffusion): add native CUDA kernels for image generation ops
Implement CUDA kernels for diffusion model operations:
- GroupNorm: F32/BF16/FP16 variants for VAE/UNet
- AdaLN/AdaLN-Zero: Adaptive Layer Normalization for DiT
- Cross-Attention: Non-causal attention for text-to-image
- Conv2D: im2col, col2im, 1x1 and 3x3 direct convolutions
Files added:
- native/ops/nn/diffusion/: groupnorm, adaln, cross_attention, conv2d kernels
- native/bindings/nn/diffusion.cpp: pybind11 bindings
Python ops updated to use native kernels when available:
- group_norm.py, adaln.py, cross_attention.py, conv2d.py
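The dispatch in the updated ops is essentially "native kernel when available, otherwise the existing CPU path"; a hedged sketch for group_norm (the _native module name and signature are assumptions):
```python
import numpy as np

try:
    from pygpukit import _native  # hypothetical native extension name
    _HAS_NATIVE = hasattr(_native, "group_norm")
except ImportError:
    _HAS_NATIVE = False

def group_norm(x, num_groups, weight, bias, eps=1e-5):
    if _HAS_NATIVE:
        return _native.group_norm(x, num_groups, weight, bias, eps)  # CUDA kernel path
    # CPU fallback matching the pre-existing numpy implementation.
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w) * weight.reshape(1, c, 1, 1) + bias.reshape(1, c, 1, 1)
```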
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(diffusion): fix PixArt-Sigma model loading and inference
- Fix out_channels from 4 to 8 for PixArt-Sigma (noise + variance)
- Add transformer subdirectory detection for HuggingFace diffusers format
- Add sharded T5 encoder detection with fallback to random embeddings
- Extract first 4 channels from 8-channel noise prediction
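The channel extraction is a plain slice of the noise half of the prediction; a minimal numpy sketch:
```python
import numpy as np

# PixArt-Sigma predicts 8 latent channels: 4 noise + 4 learned variance.
model_out = np.zeros((1, 8, 64, 64), dtype=np.float32)  # stand-in for the transformer output
noise_pred, _variance = np.split(model_out, 2, axis=1)  # keep only the first 4 channels
assert noise_pred.shape[1] == 4  # only the noise part feeds the scheduler step
```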
Tested with PixArt-Sigma-XL-2-512-MS:
- 10 steps in 24.49s (2.449s/step)
- Output: output/pixart_test.png
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(diffusion): add HuggingFace T5 encoder with sharded safetensors support
- Add HFT5Encoder class using transformers library for proper T5 encoding
- Support sharded safetensors loading via Python safetensors library
- Auto-detect tokenizer in parent/tokenizer directory
- CPU fallback when PyTorch doesn't support GPU (e.g., RTX 5090)
- Update pipeline to prefer HFT5Encoder over simple T5Encoder
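Sharded checkpoints are described by an index JSON that maps each tensor to its shard file; a generic sketch of the loading step with the safetensors library (file and directory names are examples):
```python
import json
from pathlib import Path
from safetensors.numpy import load_file

def load_sharded(model_dir: str) -> dict:
    """Load every shard listed in model.safetensors.index.json into one tensor dict."""
    index = json.loads((Path(model_dir) / "model.safetensors.index.json").read_text())
    tensors: dict = {}
    for shard in sorted(set(index["weight_map"].values())):
        tensors.update(load_file(str(Path(model_dir) / shard)))
    return tensors
```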
Tested with PixArt-Sigma + T5-XXL:
- T5 encoder on CPU (PyTorch lacks SM120 support)
- Diffusion model on GPU via PyGPUkit
- 20 steps in 55.9s (2.795s/step)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(diffusion): add batched_matmul loop fallback for SM120
- Add _batched_matmul_loop() for when CUTLASS fails (SM120)
- Use batched_matmul in T5 self-attention (80s -> 30s)
- Remove HFT5Encoder (PyTorch dependency)
- T5 now uses native GPU matmul operations
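Conceptually the fallback is just a per-batch 2-D matmul; a simplified numpy sketch (in the real op each iteration dispatches a single GPU matmul):
```python
import numpy as np

def _batched_matmul_loop(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Fallback for [B, M, K] @ [B, K, N] when the CUTLASS batched kernel fails (e.g. SM120)."""
    batch, m, _ = a.shape
    _, _, n = b.shape
    out = np.empty((batch, m, n), dtype=a.dtype)
    for i in range(batch):
        out[i] = a[i] @ b[i]  # one plain matmul per batch element
    return out
```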
Performance (RTX 5090, SM120):
- T5-XXL encoding: 80s -> 30s (2.7x speedup)
- batched_matmul [64,512,64]@[64,64,512]: 45ms
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(diffusion): add FLUX.1 transformer implementation
Implements FLUX.1-schnell text-to-image generation:
- FluxTransformer with 19 joint + 38 single blocks
- Joint attention (image-text cross-attention)
- Single attention (self-attention on concatenated sequence)
- Flow matching Euler scheduler
- GPU-native ops for linear, transpose, matmul, softmax
Optimizations:
- GPU-native transpose_4d_0213 (18x faster than numpy)
- GPU-native transpose_3d_012 for K^T (22x faster)
- RoPE frequency caching to avoid recomputation
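The frequency caching is a standard memoization of the position-dependent cos/sin tables so repeated denoising steps reuse them; a hedged sketch (function name and defaults are illustrative):
```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=8)
def rope_freqs(seq_len: int, head_dim: int, theta: float = 10000.0):
    """Compute and cache the RoPE cos/sin tables for a given sequence length."""
    inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)  # [seq_len, head_dim // 2]
    return np.cos(angles), np.sin(angles)
```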
Known limitations:
- Modulation, layer_norm, gated_residual use numpy fallback
- Generation time ~420s (vs ~3s with diffusers); needs broadcast kernels
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(lint): remove unused variables in DiT and FLUX models
- Remove unused N variable in dit/model.py
- Fix unused conditioning variable in dit/adaln.py
- Remove unused imports in flux/blocks.py
- Remove unused x_np in flux/model.py
- Add DiT transformer components (PixArt architecture)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(cmake): remove orphaned #endif in diffusion kernels
The files use #pragma once but had orphaned #endif statements
causing compilation errors.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(cmake): use nbytes() instead of size_bytes() in diffusion.inl
GPUArray uses nbytes() method, not size_bytes().
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(cmake): use device_memset wrapper instead of cudaMemset
Use the project's device_memset wrapper for CUDA API abstraction.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: update README for v0.2.19
- Add FLUX.1 image generation section
- Add DiT architecture support documentation
- Add new GPU operations for diffusion
- Update roadmap with v0.2.19
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: expand v0.2.19 release notes
Add missing features:
- Lazy model loading with streaming strategies
- cuBLAS dynamic loader
- C++ kernel profiler
- HuggingFace T5 encoder support
- Additional GPU operations (cross_attention, conv2d, group_norm)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* chore: bump version to 0.2.19
- pyproject.toml: 0.2.18 -> 0.2.19
- benchmark/results.py: 0.2.18 -> 0.2.19
- Apply ruff format to diffusion modules
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

1 parent 0b5c13e, commit 7adbe5f
File tree (7 files changed, +148 / -13 lines)
- src/pygpukit
  - benchmark
  - diffusion/models
    - dit
    - flux