UPSTREAM PR #1233: LoRA: Optimise LoKr at runtime by loci-dev · Pull Request #42 · auroralabs-loci/stable-diffusion.cpp

loci-dev · 2026-02-02T11:44:07Z

Note

Source pull request: leejet/stable-diffusion.cpp#1233

Tested with https://civitai.green/models/344873/plana-blue-archivelokr

Hip (ROCm 6.2)

Master:

	512x512	1024x1024	1024x1536
Unet compute buffer	3 295.83MB	3 993.52MB	4 860.33MB
Average time per step	0.89s	2.14s	3.2s

PR:

	512x512	1024x1024	1024x1536
Unet compute buffer	137.05MB	830.86MB	1 701.55MB
Average time per step	1.3s	4.8s	7.44s

Vulkan (AMD propreitary driver):

###Master:

	512x512	1024x1024	1024x1536
Unet compute buffer	3 363.80MB	4056.49MB	4986.61MB
Average time per step	1.02s	2.38s	3.57s

PR:

	512x512	1024x1024	1024x1536
Unet compute buffer	137.05MB	830.86MB	1 746.55MB
Average time per step	0.92s	2.98s	4.5s

TLDR: significant VRAM savings across the board. Somehow a big performance hit across all resolutions on ROCm backend (that needs some more investigation), Vulkan backend going faster at smaller resolutions, but slower at high res.

EDIT:

Not long after doing these measurments I found a way to massively reduce the performance gap (with the same compute buffer size). It's now consistently better than master on Vulkan,
I'm too lazy to do the all tests again for now, but for example,1024x1536 now takes 3.84s per step on ROCm, and 3.07s on Vulkan

loci-review · 2026-02-02T12:32:40Z

Overview

Analysis of stable-diffusion.cpp compared 48,132 total functions across two binaries, identifying 72 modified, 38 new, and 35 removed functions. Power consumption increased minimally: build.bin.sd-cli (+0.074%, +356.22 nJ) and build.bin.sd-server (+0.105%, +539.41 nJ).

The performance profile shows intentional feature-driven changes rather than unexpected degradations. The dominant factor is LoKr (Kronecker Product LoRA) support added to enable more parameter-efficient model adaptations, accounting for the majority of performance changes.

Function Analysis

LoraModel::get_out_diff (both binaries) experienced the most significant changes:

sd-server: Response time increased from 269,665ns to 476,494ns (+76.7%, +206,829ns); throughput time increased from 1,723ns to 3,860ns (+124.0%, +2,137ns)
sd-cli: Response time increased from 270,505ns to 477,821ns (+76.6%, +207,316ns); throughput time increased from 1,725ns to 3,871ns (+124.4%, +2,146ns)

These regressions result from 115 lines of new code implementing LoKr support, including detection logic for 6 weight tensors (vs. 3 for standard LoRA), type casting for Conv2D operations, rank-based scaling, and ggml_ext_lokr_forward() calls. The function executes during every forward pass of adapted layers, making it performance-critical. However, the overhead is justified by the added functionality, and an early exit mechanism ensures standard LoRA operations remain unaffected when LoKr isn't used.

ggml_e8m0_to_fp32_half (sd-cli) showed significant improvement: response time decreased from 153ns to 118ns (-23.0%, -35ns); throughput time decreased from 145ns to 110ns (-24.3%, -35ns). This quantization conversion function benefits all inference workloads through more efficient dequantization operations.

std::vector<ggml_tensor> copy constructor* (sd-server) improved: response time decreased from 991ns to 917ns (-7.5%, -74ns); throughput time decreased from 219ns to 145ns (-33.9%, -74ns), benefiting tensor management throughout the codebase.

std::_Hashtable::_M_insert_unique_node (sd-server) improved: response time decreased from 1,489ns to 1,473ns (-1.1%, -16ns); throughput time decreased from 157ns to 140ns (-10.4%, -16ns), accelerating LoRA weight lookups.

Multiple STL functions showed regressions (std::vector::begin +214% response time, std::vector::operator[] +135% throughput time, hashtable allocation +61.8% throughput time, shared_ptr operations +103% throughput time), but these stem from compiler/library version differences rather than application code changes. Their absolute impacts are minimal (7-181ns) and occur in non-critical paths.

Additional Findings

The commit history reveals focused development on LoKr optimization across 9 commits, with explicit cross-platform fixes for CUDA, HIP, CPU, and Vulkan backends. The LoKr implementation trades increased CPU-side processing time for more parameter-efficient GPU adaptations using Kronecker products, which can provide better memory bandwidth and computational efficiency for model fine-tuning. The quantization improvements (23% faster e8m0 conversion) partially offset LoKr overhead, and the minimal power consumption increase (<0.11%) confirms the changes maintain energy efficiency suitable for edge-device deployment.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-review · 2026-02-02T17:50:26Z

Overview

Analysis of 48,124 functions across two binaries reveals moderate performance impact from LoKR (Kronecker product-based LoRA) implementation. 65 functions modified (0.14%), 30 new, 35 removed, 47,994 unchanged. Power consumption increased negligibly: build.bin.sd-cli (+0.071%, 479,167→479,506 nJ) and build.bin.sd-server (+0.074%, 512,977→513,357 nJ).

Function Analysis

LoraModel::get_out_diff (both binaries) shows the primary impact:

sd-cli: Response time +202,431ns (+74.83%), throughput time +1,791ns (+103.78%)
sd-server: Response time +199,809ns (+74.10%), throughput time +1,792ns (+104.03%)

Source changes added ~100 lines implementing LokR support: detection of LokR weights, loading 6 tensors (vs 3 for standard LoRA), F16 type casting for conv2d, runtime rank computation, and specialized ggml_ext_lokr_forward() calls. The regression is justified as intentional feature enhancement with early-exit optimization for non-LokR workloads.

Positive changes: ggml_e8m0_to_fp32_half (sd-cli) improved -35ns (-23%) from GGML upstream optimization; std::vector<ggml_tensor*> copy constructor (sd-server) improved -74ns (-34%) from compiler optimization; apply_unary_op<hardsigmoid> (sd-server) improved -71ns (-9%).

Minor regressions: Hash table operations (+40ns, +54% throughput) and other standard library functions show small increases from build environment differences rather than code changes.

Additional Findings

The implementation includes comprehensive multi-backend support (CPU, CUDA, HIP, Vulkan) with specific Vulkan workgroup optimizations. Real-world impact: ~2.5% end-to-end latency increase for LokR-based inference, <1% for standard workflows. The 12-commit development pattern demonstrates careful iterative refinement with attention to correctness, backend compatibility, and performance optimization.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-review · 2026-02-02T18:51:40Z

Overview

Analysis of stable-diffusion.cpp across 13 commits implementing LoKR (Low-Rank Kronecker product) support for LoRA operations. Total functions analyzed: 48,124 (65 modified, 30 new, 35 removed, 47,994 unchanged).

Binaries analyzed:

build.bin.sd-cli: +0.07% power consumption (479,167 → 479,505 nJ)
build.bin.sd-server: +0.074% power consumption (512,977 → 513,357 nJ)

Function Analysis

LoraModel::get_out_diff (both binaries): Primary impact from 106 lines of new LoKR functionality. Response time increased +74% (270µs → 473µs in sd-cli, 270µs → 469µs in sd-server). Throughput time doubled +104% (1.7µs → 3.5µs). Changes add LoKR tensor detection, loading of 6 weight matrices (vs 2 for standard LoRA), F16 type casting for Conv2D operations, and Kronecker product computation via new ggml_ext_lokr_forward function. Regression justified by expanded model adaptation capabilities.

Hash table operations (sd-cli): _M_bucket_index throughput time +54% (+40ns), _M_deallocate_buckets +39% (+37ns). Increased call frequency from LoKR's 7+ tensor lookups per layer (vs 2-3 for standard LoRA) amplifies per-operation overhead. Absolute impact minimal (~40ns per call).

Quantization functions (sd-cli): ggml_e8m0_to_fp32_half improved -24% (-35ns), beneficial for model loading. validate_float regressed +9% (+13ns), acceptable for correctness validation.

Memory management (sd-server): Vector copy constructor improved -34% throughput time (-74ns). make_shared<PhiloxRNG> throughput +113% (+83ns) but called only twice during initialization. _M_insert_unique_node improved -10% (-16ns) despite increased usage.

Unary operations: Mixed results with apply_unary_op showing +10% regression for negation (sd-cli) and -9% improvement for hardsigmoid (sd-server), both ~71ns changes.

Other analyzed functions showed negligible changes.

Additional Findings

The LoKR implementation demonstrates careful multi-backend optimization (Vulkan, CUDA, HIP, CPU) with 4 commits addressing Vulkan compute workgroup limitations. The 200µs per-layer CPU overhead is negligible compared to millisecond-scale GPU operations dominating inference. For typical models with 20-50 LoRA layers, total overhead is 4-10ms in 5-30 second generation workflows (<0.2% impact). Changes enable parameter-efficient model adaptation while maintaining <0.1% power consumption increase.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

stduhpf added 9 commits February 1, 2026 17:53

LoRA: Optimise LoKr at runtime

b4db4be

lokr: fix convs

d608b37

lokr: fix lienar forward for CUDA/HIP and CPU backends

b486097

lokr: disable "optimization" for convolutions

8553862

LoKR: re-implement conv

2430989

lokr: fix conv bypass implementation

fbf401b

lokr: cleanup linear path code

04f9b1f

reshape to 2d before mat_mul

5b67c4b

maxComputeWorkGroupCount workaround for vulkan

f7d53b6

loci-dev temporarily deployed to stable-diffusion-cpp-prod February 2, 2026 11:44 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from 473a170 to 32e2075 Compare February 2, 2026 12:21

loci-dev force-pushed the main branch 3 times, most recently from 67ccc74 to 73f4b3e Compare February 2, 2026 15:21

stduhpf added 3 commits February 2, 2026 17:13

Avoid too large tensors dims in matmul for smaller vk workgroups

244480e

make it vk only

1ab9ed2

remove unncesary casts for non-conv weights

c7629d9

loci-dev force-pushed the main branch from 73f4b3e to 818be20 Compare February 2, 2026 16:14

loci-dev temporarily deployed to stable-diffusion-cpp-prod February 2, 2026 16:46 — with GitHub Actions Inactive

fix wrong flag (oops)

30051a2

loci-dev force-pushed the main branch from 818be20 to 457d775 Compare February 2, 2026 17:20

loci-dev temporarily deployed to stable-diffusion-cpp-prod February 2, 2026 17:46 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from 457d775 to cbb7447 Compare February 2, 2026 18:20

loci-dev force-pushed the main branch 4 times, most recently from 8e873b7 to a3c6fdc Compare February 3, 2026 03:08

loci-dev force-pushed the main branch 12 times, most recently from a234621 to d762b55 Compare February 5, 2026 04:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #1233: LoRA: Optimise LoKr at runtime#42

UPSTREAM PR #1233: LoRA: Optimise LoKr at runtime#42
loci-dev wants to merge 13 commits intomainfrom
loci/pr-1233-lokr-forward

loci-dev commented Feb 2, 2026

Uh oh!

loci-review bot commented Feb 2, 2026

Uh oh!

loci-review bot commented Feb 2, 2026

Uh oh!

loci-review bot commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Feb 2, 2026

Hip (ROCm 6.2)

Master:

PR:

Vulkan (AMD propreitary driver):

PR:

EDIT:

Uh oh!

loci-review bot commented Feb 2, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

loci-review bot commented Feb 2, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

loci-review bot commented Feb 2, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants