[ET-VK][quantization] Add layout-flexible clone for int8x4 tensors #17171
base: gh/SS-JIA/406/base
Implements q8ta_clone, a block-based shader for copying data between
int8x4 (quantized) tensors with potentially different memory layouts.
This is needed when quantized activations need to be copied between
tensors with different packed int8 layouts (e.g., from kPackedInt8_4W4C
to kPackedInt8_4C) without going through dequantize/requantize.
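To make "int8x4" concrete, here is a minimal host-side sketch of packing four signed 8-bit values into one 32-bit word and unpacking them again. The helper names `pack_int8x4` and `unpack_int8x4` and the low-byte-first lane order are assumptions for illustration, not the identifiers or byte order used by this PR.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: pack four int8 values into one 32-bit word.
// Lane 0 goes into the lowest byte (an assumed byte order).
inline uint32_t pack_int8x4(const std::array<int8_t, 4>& v) {
  uint32_t word = 0;
  for (int lane = 0; lane < 4; ++lane) {
    // Go through uint8_t first so negative values do not sign-extend.
    word |= static_cast<uint32_t>(static_cast<uint8_t>(v[lane])) << (8 * lane);
  }
  return word;
}

// Hypothetical helper: recover the four int8 values from a packed word.
inline std::array<int8_t, 4> unpack_int8x4(uint32_t word) {
  std::array<int8_t, 4> v{};
  for (int lane = 0; lane < 4; ++lane) {
    v[lane] = static_cast<int8_t>(
        static_cast<uint8_t>((word >> (8 * lane)) & 0xFFu));
  }
  return v;
}
```

Because a layout change only regroups which four values share a word, the quantized bytes themselves are never modified, which is why no dequantize/requantize round trip is required.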
Implementation details:
1. GLSL Shader (q8ta_clone.glsl):
- Uses block-based dispatch pattern matching q8ta_quantize/dequantize
- Each thread processes a 4x4 block of int8 values (16 elements)
- Uses linear dispatch via gl_GlobalInvocationID.x for buffer tensors
- Loads int8x4 blocks using load_int8x4_block_from_t_inp()
- Transposes block via transpose_int8x4_block() when input/output have different packed dimensions
- Stores using store_int8x4_block_to_t_outp()
2. C++ Dispatch (Q8taClone.cpp):
- Creates BlockConfig for both input and output tensors using create_block_config_from_io_packed_dims() and create_block_config_from_other()
- Uses pick_linear_global_wg_with_block_config for workgroup sizing
- Passes hashed layouts and packed block configs as specialization constants
3. Clone.cpp Integration:
- Added check for kInt8x4 dtype on both input and output tensors
- Routes to add_q8ta_clone_node() for int8x4 tensor cloning
- Preserves existing behavior for all other tensor types
4. Test Infrastructure:
- TestQ8taClone.cpp: Custom op that chains quantize -> clone -> dequantize
- test_q8ta_clone.cpp: Test driver with 800 test cases
- Tests all 25 combinations of input/output quantized layouts
- Tests multiple tensor shapes from 1x3x16x16 to 1x128x56x56
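The per-thread work described in item 1 can be modeled on the host as follows. This is a rough sketch, assuming a 4x4 block is held as four packed 32-bit words with one row per word and lane 0 in the low byte. The names `Int8x4Block`, `transpose_block_sketch`, and `clone_block_sketch` are hypothetical and are not the shader's actual helpers.

```cpp
#include <array>
#include <cstdint>

// Assumed representation: 4 rows, each row is 4 int8 values in one word.
using Int8x4Block = std::array<uint32_t, 4>;

// Extract byte `lane` (0..3) from a packed word.
inline uint32_t get_lane(uint32_t word, int lane) {
  return (word >> (8 * lane)) & 0xFFu;
}

// Transpose the 4x4 byte matrix: element (row, col) moves to (col, row).
inline Int8x4Block transpose_block_sketch(const Int8x4Block& in) {
  Int8x4Block out{};
  for (int row = 0; row < 4; ++row) {
    for (int col = 0; col < 4; ++col) {
      out[col] |= get_lane(in[row], col) << (8 * row);
    }
  }
  return out;
}

// Model of one invocation: load a block, transpose it only when the
// input and output packed dimensions differ, then hand it to the store.
inline Int8x4Block clone_block_sketch(
    const Int8x4Block& in, int inp_packed_dim, int outp_packed_dim) {
  return inp_packed_dim == outp_packed_dim ? in : transpose_block_sketch(in);
}
```

Transposing twice returns the original block, so a clone from layout A to layout B followed by a clone back to A should reproduce the input bytes exactly under this model.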
Differential Revision: [D92196648](https://our.internmc.facebook.com/intern/diff/D92196648/)
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17171
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures as of commit 75a2b57 with merge base 593775b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 337986304
Pull Request resolved: #17171
ghstack-source-id: 337988012
@exported-using-ghexport