@SS-JIA SS-JIA commented Feb 3, 2026

Stack from ghstack (oldest at bottom):

Implements q8ta_clone, a block-based shader for copying data between
int8x4 (quantized) tensors with potentially different memory layouts.

This is needed when quantized activations must be copied between
tensors with different packed int8 layouts (e.g., from kPackedInt8_4W4C
to kPackedInt8_4C) without a dequantize/requantize round trip.
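
To illustrate why no dequantize/requantize step is needed, here is a minimal sketch of int8x4 packing. It assumes four int8 values share one 32-bit word with the first value in the lowest byte; the byte order and the helper names `pack_int8x4` / `unpack_int8x4` are hypothetical and not taken from this PR. Under that assumption, moving data between packed layouts only rearranges bytes, so the quantized values are never changed.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: pack four int8 values into one 32-bit word.
// Assumption for this sketch: element i lives in byte i (lowest byte first).
inline uint32_t pack_int8x4(const std::array<int8_t, 4>& v) {
  uint32_t word = 0;
  for (int i = 0; i < 4; ++i) {
    // Go through uint8_t so negative values do not sign-extend into
    // neighbouring bytes.
    word |= static_cast<uint32_t>(static_cast<uint8_t>(v[i])) << (8 * i);
  }
  return word;
}

// Hypothetical helper: inverse of pack_int8x4.
inline std::array<int8_t, 4> unpack_int8x4(uint32_t word) {
  std::array<int8_t, 4> v{};
  for (int i = 0; i < 4; ++i) {
    const uint8_t byte = static_cast<uint8_t>((word >> (8 * i)) & 0xFFu);
    v[i] = static_cast<int8_t>(byte);
  }
  return v;
}
```

Packing and then unpacking returns the original four values bit for bit, which is the property a layout-to-layout clone relies on.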

Implementation details:

1. GLSL Shader (q8ta_clone.glsl):
   - Uses the same block-based dispatch pattern as q8ta_quantize/dequantize
   - Each thread processes a 4x4 block of int8 values (16 elements)
   - Uses linear dispatch via gl_GlobalInvocationID.x for buffer tensors
   - Loads int8x4 blocks using load_int8x4_block_from_t_inp()
   - Transposes the block via transpose_int8x4_block() when input and
     output have different packed dimensions
   - Stores using store_int8x4_block_to_t_outp()

2. C++ Dispatch (Q8taClone.cpp):
   - Creates a BlockConfig for both input and output tensors using
     create_block_config_from_io_packed_dims() and
     create_block_config_from_other()
   - Uses pick_linear_global_wg_with_block_config for workgroup sizing
   - Passes hashed layouts and packed block configs as specialization
     constants

3. Clone.cpp Integration:
   - Adds a check for kInt8x4 dtype on both input and output tensors
   - Routes to add_q8ta_clone_node() for int8x4 tensor cloning
   - Preserves existing behavior for all other tensor types

4. Test Infrastructure:
   - TestQ8taClone.cpp: Custom op that chains quantize -> clone -> dequantize
   - test_q8ta_clone.cpp: Test driver with 800 test cases
   - Tests all 25 combinations of input/output quantized layouts
   - Tests multiple tensor shapes from 1x3x16x16 to 1x128x56x56
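
The per-thread work described in item 1 can be sketched on the CPU as follows. This is an illustrative sketch, not the shader's code: it assumes a 4x4 block is held as four 32-bit words with element (r, c) in byte c of word r, and that blocks span 4 elements along each of two dimensions (taken here to be channels and width). The names `Int8x4Block`, `transpose_block_sketch`, and `count_blocks_sketch` are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Assumed block representation: 4 words x 4 bytes = 16 int8 values.
using Int8x4Block = std::array<uint32_t, 4>;

// Transpose a 4x4 block of bytes: element (r, c) moves to (c, r).
// Bytes are moved as raw bits, so no dequantization is involved.
inline Int8x4Block transpose_block_sketch(const Int8x4Block& in) {
  Int8x4Block out{0, 0, 0, 0};
  for (int r = 0; r < 4; ++r) {
    for (int c = 0; c < 4; ++c) {
      const uint32_t byte = (in[r] >> (8 * c)) & 0xFFu;
      out[c] |= byte << (8 * r);
    }
  }
  return out;
}

// Round up to the number of 4-element groups along one dimension.
inline uint32_t div_up_4(uint32_t n) {
  return (n + 3u) / 4u;
}

// Number of 4x4 blocks for an NCHW shape, assuming blocks cover 4 channels
// by 4 width elements. With linear dispatch, one thread would handle one
// block, so this is also the assumed thread count.
inline uint32_t count_blocks_sketch(
    uint32_t n, uint32_t c, uint32_t h, uint32_t w) {
  return n * div_up_4(c) * h * div_up_4(w);
}
```

Under these assumptions, the smallest tested shape, 1x3x16x16, needs 64 blocks, and applying the transpose twice returns the original block.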

Differential Revision: D92196648

pytorch-bot bot commented Feb 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17171

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 75a2b57 with merge base 593775b (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Feb 3, 2026
SS-JIA pushed a commit that referenced this pull request Feb 3, 2026
ghstack-source-id: 337986304
Pull Request resolved: #17171

github-actions bot commented Feb 3, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

SS-JIA pushed a commit that referenced this pull request Feb 3, 2026
Pull Request resolved: #17171

ghstack-source-id: 337988012
@exported-using-ghexport

Differential Revision: [D92196648](https://our.internmc.facebook.com/intern/diff/D92196648/)