Add quantized fusion patterns for dense, QKV, and SDPA#60

Open
ajroetker wants to merge 11 commits into gomlx:main from ajroetker:quantized-fused-ops

Conversation

@ajroetker
Contributor

Summary

Adds ONNX fusion pattern detectors for quantized operations, built on the fusion framework from the add-fusion-support branch:

  • Quantized Dense: Detects DynamicQuantizeLinear → MatMulInteger → DequantizeLinear chains and emits nn.QuantizedDense with Int8 format (score: 40)
  • Quantized QKV: Merges three quantized dense Q/K/V projections into a single batched QuantizedDense call (score: 70)
  • Quantized SDPA: Detects full quantized attention pattern (DQL → MatMulInteger(Q,K^T) → Cast → Scale → Softmax → MatMulInteger(.,V)) and emits BackendFusedQuantizedScaledDotProductAttention (score: 90)
  • Fixes GQA head mismatch in convertGroupQueryAttention — replicates KV heads when kvNumHeads < numHeads before calling attention.Core
  • Adds isZeroInitializer helper and updates convertDequantizeLinear to take Model receiver

Dependencies

  • Requires the fusion framework from add-fusion-support branch (included in this PR)
  • Requires gomlx#350 (adds QuantFormat, FusedQuantizedDense, FusedQuantizedScaledDotProductAttention to gomlx backends)

Test plan

  • All existing fusion tests pass (SDPA, DenseGelu, QKVDense detection + integration)
  • All existing onnx op tests pass
  • TestGroupQueryAttention/GQA-basic now passes (was failing due to head count mismatch)
  • Full go build ./... compiles cleanly

…U patterns

Introduces a graph fusion framework that detects common subgraph patterns
(scaled dot-product attention, QKV dense projections, dense+GELU activations)
and replaces them with fused ops when the backend supports them. Adds a
capability-gated fused SDPA fast path in the MultiHeadAttention op converter.
# Conflicts:
#	go.mod
#	go.sum
#	onnx/ops.go

Bump github.com/gomlx/gomlx from v0.26.1-0.20260211111746-dd3d906b02a6
to v0.26.1-0.20260215082710-429182c8560c.

Replace FusionType enum, FusionGroup struct, and switch-based dispatch
with a FusionCandidate interface and RegisterFusionDetector registration
pattern. Each fusion (SDPA, QKVDense, DenseGelu) is now self-contained
with its own candidate type, detector init(), and emit method.
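The interface-plus-registration pattern described above can be sketched as follows. Names mirror the commit message (FusionCandidate, RegisterFusionDetector); the method set and signatures are assumptions for illustration, not the exact gomlx-onnx API:

```go
package main

import "fmt"

// Model stands in for the onnx Model type; fields elided.
type Model struct{}

// FusionCandidate is what each self-contained fusion implements:
// it reports its priority score, the nodes it covers, and knows
// how to emit the fused op in place of the matched subgraph.
type FusionCandidate interface {
	Score() int          // higher-scoring fusions win overlap conflicts
	Nodes() []string     // node names covered by this candidate
	Emit(m *Model) error // replace the matched subgraph with the fused op
}

// FusionDetector scans a model and returns zero or more candidates.
type FusionDetector func(m *Model) []FusionCandidate

var fusionDetectors []FusionDetector

// RegisterFusionDetector is called from each fusion file's init().
func RegisterFusionDetector(d FusionDetector) {
	fusionDetectors = append(fusionDetectors, d)
}

// sdpaCandidate sketches one candidate type, as in the SDPA fusion file.
type sdpaCandidate struct{ nodes []string }

func (c *sdpaCandidate) Score() int          { return 90 }
func (c *sdpaCandidate) Nodes() []string     { return c.nodes }
func (c *sdpaCandidate) Emit(m *Model) error { return nil }

func init() {
	RegisterFusionDetector(func(m *Model) []FusionCandidate {
		// A real detector would walk m's graph; here we return a fixed match.
		return []FusionCandidate{&sdpaCandidate{nodes: []string{"MatMul_0", "Softmax_0"}}}
	})
}

func main() {
	m := &Model{}
	for _, det := range fusionDetectors {
		for _, c := range det(m) {
			fmt.Println(c.Score(), c.Nodes())
		}
	}
}
```

Replacing the old enum-and-switch dispatch with this shape lets each fusion file register itself without touching a central switch statement.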

Detection uses score-based greedy selection for non-overlapping fusions.
Backend capability checks are removed since the GoMLX wrapper functions
(attention.Core, attention.QKVProjection, nn.Dense) handle fused-vs-
decomposed fallback internally via InternalFusedOpCaller.
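The score-based greedy selection mentioned above can be sketched as: sort candidates by descending score, then accept each one only if none of its nodes was already claimed by a higher-scoring candidate. This is an illustrative reduction of candidates to score plus node names, not the PR's actual data types:

```go
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	score int
	nodes []string
}

// selectFusions greedily picks non-overlapping candidates, preferring
// higher scores (e.g. quantized SDPA at 90 beats quantized dense at 40
// when both cover the same DynamicQuantizeLinear node).
func selectFusions(cands []candidate) []candidate {
	sort.SliceStable(cands, func(i, j int) bool { return cands[i].score > cands[j].score })
	claimed := map[string]bool{}
	var selected []candidate
	for _, c := range cands {
		overlaps := false
		for _, n := range c.nodes {
			if claimed[n] {
				overlaps = true
				break
			}
		}
		if overlaps {
			continue // a higher-scoring fusion already owns one of these nodes
		}
		for _, n := range c.nodes {
			claimed[n] = true
		}
		selected = append(selected, c)
	}
	return selected
}

func main() {
	cands := []candidate{
		{40, []string{"dql0", "mmi0"}},          // quantized dense
		{90, []string{"dql0", "mmi0", "soft0"}}, // quantized SDPA over the same nodes
	}
	for _, c := range selectFusions(cands) {
		fmt.Println(c.score) // prints 90: the dense candidate overlaps and is dropped
	}
}
```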

Fixes 4 compiler errors from undefined symbols:
- FusedMultiHeadSDPA → attention.Core
- FusedQKVDense → attention.QKVProjection
- backends.OpTypeFusedMultiHeadSDPA → removed
- backends.OpTypeFusedQKVDense → removed

Add ONNX fusion pattern detectors and emitters for quantized operations:

- fusion_quantized_dense.go: Detects DynamicQuantizeLinear + MatMulInteger
  chains and emits nn.QuantizedDense with Int8 format
- fusion_quantized_qkv.go: Merges three quantized dense projections (Q, K, V)
  into a single batched QuantizedDense call
- fusion_quantized_sdpa.go: Detects quantized scaled dot-product attention
  patterns and emits BackendFusedQuantizedScaledDotProductAttention

Also fixes:
- convertDequantizeLinear now takes Model receiver for fusion-aware access
- Add isZeroInitializer helper for detecting zero-valued ONNX tensors
- Fix GQA (Grouped Query Attention) head mismatch: replicate KV heads
  before calling attention.Core when kvNumHeads < numHeads
- Create internal/onnxgraph package for graph helpers (BuildConsumerMap,
  SoleConsumer, OtherBinaryOpInput, HasExternalConsumers)
- Remove redundant graph/consumers params; store consumers on Model
- Consolidate shape helpers into Model.ShapeForName (new shapes.go)
- Rename DenseGeluParams → DenseActivationParams
- Prefix all SDPA-specific helpers with sdpa, add model-family docs
- Move TensorProtoToScalar/ConstantNodeToScalar to tensor.go, use
  float16 package instead of manual half-float conversion
- Pre-concatenate QKV weights during fusion detection, add
  FreeUnusedVariables method
- Add tests for tensorProtoRawBytes, concatenateTensorProtos,
  TensorProtoToScalar, ConstantNodeToScalar, FreeUnusedVariables,
  and Mul-scaled SDPA pattern

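The graph helpers listed above (BuildConsumerMap, SoleConsumer) can be sketched as follows. The node representation and signatures are assumptions; the real internal/onnxgraph package operates on ONNX protos:

```go
package main

import "fmt"

// node is a stand-in for an ONNX NodeProto: a name plus input tensor names.
type node struct {
	name   string
	inputs []string
}

// BuildConsumerMap maps each tensor name to the nodes that consume it.
func BuildConsumerMap(nodes []node) map[string][]node {
	consumers := map[string][]node{}
	for _, n := range nodes {
		for _, in := range n.inputs {
			consumers[in] = append(consumers[in], n)
		}
	}
	return consumers
}

// SoleConsumer returns the single consumer of a tensor, or false when the
// tensor feeds zero or multiple nodes. Fusion patterns need this check:
// an intermediate tensor with an external consumer cannot be fused away.
func SoleConsumer(consumers map[string][]node, tensor string) (node, bool) {
	c := consumers[tensor]
	if len(c) != 1 {
		return node{}, false
	}
	return c[0], true
}

func main() {
	nodes := []node{
		{name: "MatMulInteger_0", inputs: []string{"q8", "w8"}},
		{name: "DequantizeLinear_0", inputs: []string{"acc32"}},
		{name: "Add_0", inputs: []string{"q8", "bias"}},
	}
	cm := BuildConsumerMap(nodes)
	if n, ok := SoleConsumer(cm, "acc32"); ok {
		fmt.Println(n.name) // DequantizeLinear_0
	}
	_, ok := SoleConsumer(cm, "q8") // two consumers, so not sole
	fmt.Println(ok)                 // false
}
```

Storing the consumer map on Model, as the commit does, avoids rebuilding it in every detector.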
Resolve go.sum conflict and update quantized fusion files to use the new
interface-based fusion API:
- FusionDetector signature: func(m *Model) []FusionCandidate
- Helper functions moved to onnxgraph package (SoleConsumer,
  OtherBinaryOpInput, HasExternalConsumers)
- Drop sdpa prefix from shared utility methods (tryGetConstantScalar,
  isMaskRankAcceptable, extractHeadCounts, matchKTranspose, etc.)