Skip to content

UPSTREAM PR #1233: LoRA: Optimise LoKr at runtime#42

Open
loci-dev wants to merge 13 commits intomainfrom
loci/pr-1233-lokr-forward
Open

UPSTREAM PR #1233: LoRA: Optimise LoKr at runtime#42
loci-dev wants to merge 13 commits intomainfrom
loci/pr-1233-lokr-forward

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Feb 2, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1233

Tested with https://civitai.green/models/344873/plana-blue-archivelokr

Hip (ROCm 6.2)

Master:

512x512 1024x1024 1024x1536
Unet compute buffer 3 295.83MB 3 993.52MB 4 860.33MB
Average time per step 0.89s 2.14s 3.2s

PR:

512x512 1024x1024 1024x1536
Unet compute buffer 137.05MB 830.86MB 1 701.55MB
Average time per step 1.3s 4.8s 7.44s

Vulkan (AMD propreitary driver):

###Master:

512x512 1024x1024 1024x1536
Unet compute buffer 3 363.80MB 4056.49MB 4986.61MB
Average time per step 1.02s 2.38s 3.57s

PR:

512x512 1024x1024 1024x1536
Unet compute buffer 137.05MB 830.86MB 1 746.55MB
Average time per step 0.92s 2.98s 4.5s

TLDR: significant VRAM savings across the board. Somehow a big performance hit across all resolutions on ROCm backend (that needs some more investigation), Vulkan backend going faster at smaller resolutions, but slower at high res.

EDIT:

Not long after doing these measurments I found a way to massively reduce the performance gap (with the same compute buffer size). It's now consistently better than master on Vulkan,
I'm too lazy to do the all tests again for now, but for example,1024x1536 now takes 3.84s per step on ROCm, and 3.07s on Vulkan

@loci-review
Copy link

loci-review bot commented Feb 2, 2026

Overview

Analysis of stable-diffusion.cpp compared 48,132 total functions across two binaries, identifying 72 modified, 38 new, and 35 removed functions. Power consumption increased minimally: build.bin.sd-cli (+0.074%, +356.22 nJ) and build.bin.sd-server (+0.105%, +539.41 nJ).

The performance profile shows intentional feature-driven changes rather than unexpected degradations. The dominant factor is LoKr (Kronecker Product LoRA) support added to enable more parameter-efficient model adaptations, accounting for the majority of performance changes.

Function Analysis

LoraModel::get_out_diff (both binaries) experienced the most significant changes:

  • sd-server: Response time increased from 269,665ns to 476,494ns (+76.7%, +206,829ns); throughput time increased from 1,723ns to 3,860ns (+124.0%, +2,137ns)
  • sd-cli: Response time increased from 270,505ns to 477,821ns (+76.6%, +207,316ns); throughput time increased from 1,725ns to 3,871ns (+124.4%, +2,146ns)

These regressions result from 115 lines of new code implementing LoKr support, including detection logic for 6 weight tensors (vs. 3 for standard LoRA), type casting for Conv2D operations, rank-based scaling, and ggml_ext_lokr_forward() calls. The function executes during every forward pass of adapted layers, making it performance-critical. However, the overhead is justified by the added functionality, and an early exit mechanism ensures standard LoRA operations remain unaffected when LoKr isn't used.

ggml_e8m0_to_fp32_half (sd-cli) showed significant improvement: response time decreased from 153ns to 118ns (-23.0%, -35ns); throughput time decreased from 145ns to 110ns (-24.3%, -35ns). This quantization conversion function benefits all inference workloads through more efficient dequantization operations.

std::vector<ggml_tensor> copy constructor* (sd-server) improved: response time decreased from 991ns to 917ns (-7.5%, -74ns); throughput time decreased from 219ns to 145ns (-33.9%, -74ns), benefiting tensor management throughout the codebase.

std::_Hashtable::_M_insert_unique_node (sd-server) improved: response time decreased from 1,489ns to 1,473ns (-1.1%, -16ns); throughput time decreased from 157ns to 140ns (-10.4%, -16ns), accelerating LoRA weight lookups.

Multiple STL functions showed regressions (std::vector::begin +214% response time, std::vector::operator[] +135% throughput time, hashtable allocation +61.8% throughput time, shared_ptr operations +103% throughput time), but these stem from compiler/library version differences rather than application code changes. Their absolute impacts are minimal (7-181ns) and occur in non-critical paths.

Additional Findings

The commit history reveals focused development on LoKr optimization across 9 commits, with explicit cross-platform fixes for CUDA, HIP, CPU, and Vulkan backends. The LoKr implementation trades increased CPU-side processing time for more parameter-efficient GPU adaptations using Kronecker products, which can provide better memory bandwidth and computational efficiency for model fine-tuning. The quantization improvements (23% faster e8m0 conversion) partially offset LoKr overhead, and the minimal power consumption increase (<0.11%) confirms the changes maintain energy efficiency suitable for edge-device deployment.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 67ccc74 to 73f4b3e Compare February 2, 2026 15:21
@loci-review
Copy link

loci-review bot commented Feb 2, 2026

Overview

Analysis of 48,124 functions across two binaries reveals moderate performance impact from LoKR (Kronecker product-based LoRA) implementation. 65 functions modified (0.14%), 30 new, 35 removed, 47,994 unchanged. Power consumption increased negligibly: build.bin.sd-cli (+0.071%, 479,167→479,506 nJ) and build.bin.sd-server (+0.074%, 512,977→513,357 nJ).

Function Analysis

LoraModel::get_out_diff (both binaries) shows the primary impact:

  • sd-cli: Response time +202,431ns (+74.83%), throughput time +1,791ns (+103.78%)
  • sd-server: Response time +199,809ns (+74.10%), throughput time +1,792ns (+104.03%)

Source changes added ~100 lines implementing LokR support: detection of LokR weights, loading 6 tensors (vs 3 for standard LoRA), F16 type casting for conv2d, runtime rank computation, and specialized ggml_ext_lokr_forward() calls. The regression is justified as intentional feature enhancement with early-exit optimization for non-LokR workloads.

Positive changes: ggml_e8m0_to_fp32_half (sd-cli) improved -35ns (-23%) from GGML upstream optimization; std::vector<ggml_tensor*> copy constructor (sd-server) improved -74ns (-34%) from compiler optimization; apply_unary_op<hardsigmoid> (sd-server) improved -71ns (-9%).

Minor regressions: Hash table operations (+40ns, +54% throughput) and other standard library functions show small increases from build environment differences rather than code changes.

Additional Findings

The implementation includes comprehensive multi-backend support (CPU, CUDA, HIP, Vulkan) with specific Vulkan workgroup optimizations. Real-world impact: ~2.5% end-to-end latency increase for LokR-based inference, <1% for standard workflows. The 12-commit development pattern demonstrates careful iterative refinement with attention to correctness, backend compatibility, and performance optimization.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-review
Copy link

loci-review bot commented Feb 2, 2026

Overview

Analysis of stable-diffusion.cpp across 13 commits implementing LoKR (Low-Rank Kronecker product) support for LoRA operations. Total functions analyzed: 48,124 (65 modified, 30 new, 35 removed, 47,994 unchanged).

Binaries analyzed:

  • build.bin.sd-cli: +0.07% power consumption (479,167 → 479,505 nJ)
  • build.bin.sd-server: +0.074% power consumption (512,977 → 513,357 nJ)

Function Analysis

LoraModel::get_out_diff (both binaries): Primary impact from 106 lines of new LoKR functionality. Response time increased +74% (270µs → 473µs in sd-cli, 270µs → 469µs in sd-server). Throughput time doubled +104% (1.7µs → 3.5µs). Changes add LoKR tensor detection, loading of 6 weight matrices (vs 2 for standard LoRA), F16 type casting for Conv2D operations, and Kronecker product computation via new ggml_ext_lokr_forward function. Regression justified by expanded model adaptation capabilities.

Hash table operations (sd-cli): _M_bucket_index throughput time +54% (+40ns), _M_deallocate_buckets +39% (+37ns). Increased call frequency from LoKR's 7+ tensor lookups per layer (vs 2-3 for standard LoRA) amplifies per-operation overhead. Absolute impact minimal (~40ns per call).

Quantization functions (sd-cli): ggml_e8m0_to_fp32_half improved -24% (-35ns), beneficial for model loading. validate_float regressed +9% (+13ns), acceptable for correctness validation.

Memory management (sd-server): Vector copy constructor improved -34% throughput time (-74ns). make_shared<PhiloxRNG> throughput +113% (+83ns) but called only twice during initialization. _M_insert_unique_node improved -10% (-16ns) despite increased usage.

Unary operations: Mixed results with apply_unary_op showing +10% regression for negation (sd-cli) and -9% improvement for hardsigmoid (sd-server), both ~71ns changes.

Other analyzed functions showed negligible changes.

Additional Findings

The LoKR implementation demonstrates careful multi-backend optimization (Vulkan, CUDA, HIP, CPU) with 4 commits addressing Vulkan compute workgroup limitations. The 200µs per-layer CPU overhead is negligible compared to millisecond-scale GPU operations dominating inference. For typical models with 20-50 LoRA layers, total overhead is 4-10ms in 5-30 second generation workflows (<0.2% impact). Changes enable parameter-efficient model adaptation while maintaining <0.1% power consumption increase.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 8e873b7 to a3c6fdc Compare February 3, 2026 03:08
@loci-dev loci-dev force-pushed the main branch 12 times, most recently from a234621 to d762b55 Compare February 5, 2026 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants