[ET-VK][quantization] Add layout-flexible clone for int8x4 tensors #17171

SS-JIA · 2026-02-03T21:27:27Z

Stack from ghstack (oldest at bottom):

Implements q8ta_clone, a block-based shader for copying data between
int8x4 (quantized) tensors with potentially different memory layouts.

This is needed when quantized activations need to be copied between
tensors with different packed int8 layouts (e.g., from kPackedInt8_4W4C
to kPackedInt8_4C) without going through dequantize/requantize.

Implementation details:

GLSL Shader (q8ta_clone.glsl):
- Uses block-based dispatch pattern matching q8ta_quantize/dequantize
- Each thread processes a 4x4 block of int8 values (16 elements)
- Uses linear dispatch via gl_GlobalInvocationID.x for buffer tensors
- Loads int8x4 blocks using load_int8x4_block_from_t_inp()
- Transposes block via transpose_int8x4_block() when input/output
  have different packed dimensions
- Stores using store_int8x4_block_to_t_outp()
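
To illustrate the transpose step, here is a minimal CPU-side sketch in C++ of transposing a 4x4 block of int8 values, assuming the block is held as four 32-bit words with four int8 lanes each. The names `Int8x4Block`, `get_lane`, and `transpose_block` are hypothetical; they only mirror the idea behind transpose_int8x4_block(), and the actual shader code may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumption: a 4x4 int8 block is stored as four 32-bit words,
// each word packing four int8 lanes (lane 0 in the low byte).
using Int8x4Block = std::array<uint32_t, 4>;

// Extract one 8-bit lane from a packed word (hypothetical helper).
inline uint32_t get_lane(uint32_t word, int lane) {
  assert(lane >= 0 && lane < 4);
  return (word >> (8 * lane)) & 0xFFu;
}

// Transpose the block: lane `col` of input word `row` becomes
// lane `row` of output word `col`.
inline Int8x4Block transpose_block(const Int8x4Block& in) {
  Int8x4Block out{0u, 0u, 0u, 0u};
  for (int row = 0; row < 4; ++row) {
    for (int col = 0; col < 4; ++col) {
      out[col] |= get_lane(in[row], col) << (8 * row);
    }
  }
  return out;
}
```

Transposing a block twice returns the original block, which makes a cheap sanity check for this kind of helper.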

C++ Dispatch (Q8taClone.cpp):
- Creates BlockConfig for both input and output tensors using
  create_block_config_from_io_packed_dims() and
  create_block_config_from_other()
- Uses pick_linear_global_wg_with_block_config for workgroup sizing
- Passes hashed layouts and packed block configs as specialization
  constants
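
A rough sketch of the linear workgroup sizing idea, assuming each thread handles one 4x4 block and the block spans four elements along each of two tensor dimensions. The `BlockDims` struct and the `div_up` and `count_blocks` functions are hypothetical stand-ins; they are not the signatures of BlockConfig or pick_linear_global_wg_with_block_config, whose actual behavior may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a block config: the two tensor dims
// (indices into an NCHW size array) that the 4x4 block spans.
struct BlockDims {
  int dim_a;
  int dim_b;
};

// Integer division rounding up.
constexpr uint32_t div_up(uint32_t n, uint32_t d) {
  return (n + d - 1) / d;
}

// Number of 4x4 blocks needed to cover a tensor; under the stated
// assumption this is also the x extent of a linear global workgroup.
inline uint32_t count_blocks(
    const std::array<uint32_t, 4>& sizes,
    BlockDims cfg) {
  assert(cfg.dim_a != cfg.dim_b);
  uint32_t total = 1;
  for (int d = 0; d < 4; ++d) {
    const bool blocked = (d == cfg.dim_a || d == cfg.dim_b);
    total *= blocked ? div_up(sizes[d], 4u) : sizes[d];
  }
  return total;
}
```

For example, with a block spanning the channel and width dims, a 1x3x16x16 tensor needs 1 * 1 * 16 * 4 = 64 blocks, since the 3 channels are padded up to one block of 4.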

Clone.cpp Integration:
- Added check for kInt8x4 dtype on both input and output tensors
- Routes to add_q8ta_clone_node() for int8x4 tensor cloning
- Preserves existing behavior for all other tensor types

Test Infrastructure:
- TestQ8taClone.cpp: Custom op that chains quantize -> clone -> dequantize
- test_q8ta_clone.cpp: Test driver with 800 test cases
- Tests all 25 combinations of input/output quantized layouts
- Tests multiple tensor shapes from 1x3x16x16 to 1x128x56x56
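
The quantize -> clone -> dequantize chain can be checked against a simple CPU reference. The sketch below assumes per-tensor affine quantization with a `scale` and `zero_point`; those parameters and the function names are illustrative assumptions, not the actual test code. Because a clone only changes the memory layout, the logical int8 values, and therefore the dequantized output, should match a plain quantize -> dequantize round trip.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed per-tensor affine quantization to int8.
inline int8_t quantize(float x, float scale, int zero_point) {
  assert(scale > 0.0f);
  const int q = static_cast<int>(std::nearbyint(x / scale)) + zero_point;
  return static_cast<int8_t>(std::min(127, std::max(-128, q)));
}

inline float dequantize(int8_t q, float scale, int zero_point) {
  return static_cast<float>(static_cast<int>(q) - zero_point) * scale;
}

// Reference for the chained op: quantize, copy, dequantize.
inline std::vector<float> quantize_clone_dequantize(
    const std::vector<float>& in,
    float scale,
    int zero_point) {
  std::vector<int8_t> q(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    q[i] = quantize(in[i], scale, zero_point);
  }
  // The clone must preserve logical values regardless of layout,
  // so the reference models it as a plain copy.
  const std::vector<int8_t> cloned = q;
  std::vector<float> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = dequantize(cloned[i], scale, zero_point);
  }
  return out;
}
```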

Differential Revision: D92196648

pytorch-bot · 2026-02-03T21:27:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17171

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 75a2b57 with merge base 593775b:

NEW FAILURES - The following jobs have failed:

- Lint / android-java-format / linux-job (gh)
- pull / android / build-llm-demo / linux-job (gh)
  RuntimeError: Command docker exec -t da53f4dd25a60762010c27721990d1d798bec60a3c9caf2edb1c75eaa9ca945d /exec failed with exit code 1
- pull / unittest-arm-backend-with-no-deps (test_pytest_models_tosa) / linux-job (gh)
  RuntimeError: Command docker exec -t 3f120b61a758027649426df2cdaaa34cd331fb6e78a384b80698cadc7984a091 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 337986304
Pull Request resolved: #17171

github-actions · 2026-02-03T21:28:18Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Pull Request resolved: #17171
ghstack-source-id: 337988012 @exported-using-ghexport

This was referenced Feb 2, 2026

[ET-VK][testing] Add per-shader timing breakdown to benchmark output #17105

Open

[ET-VK] Add alignment fields to PackedDimInfo for padded size calculation #17170

Open

This was referenced Feb 2, 2026

[ET-VK][quantization] Implement layout-flexible quantize/dequantize operators #17106

Open

[ET-VK][ez] Implement helper functions to get fastest moving dim #17107

Open

[ET-VK][qconv] Add layout-flexible impl of quantized depthwise conv2d #17108

Open

meta-cla bot added the CLA Signed label Feb 3, 2026

meta-codesync bot added the fb-exported and meta-exported labels Feb 3, 2026
