@seyeong-han
Contributor

Summary

  • Add pybinding_run.py example script for running Gemma 3 multimodal inference with Python bindings
  • Update examples/models/gemma3/README.md with Python binding documentation
  • Add Gemma 3 example to extension/llm/runner/README.md Python API section

Key Changes

New Example Script (examples/models/gemma3/pybinding_run.py)

  • Complete CLI tool with argparse for model path, tokenizer, image, and prompt
  • Demonstrates loading quantized and custom operator libraries
  • Image preprocessing matching C++ runner (896x896, CHW, float32 normalized)
  • Gemma 3 chat template format for multimodal inputs
  • Token and stats callbacks with performance metrics
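The preprocessing bullet above (896x896, CHW, float32 normalized) can be sketched roughly as follows. This is a minimal illustration and not the code from pybinding_run.py: the function name `preprocess_image` is hypothetical, the nearest-neighbour resize is an assumption chosen to keep the sketch dependency-light (the actual script may use a different interpolation), and NumPy stands in for the torch tensor shown in the log.

```python
import numpy as np

TARGET_SIZE = 896  # side length stated in the PR description


def preprocess_image(hwc_uint8: np.ndarray, size: int = TARGET_SIZE) -> np.ndarray:
    """Resize an HWC uint8 RGB image to size x size; return CHW float32 in [0, 1]."""
    if hwc_uint8.ndim != 3 or hwc_uint8.shape[2] != 3:
        raise ValueError("expected an HWC array with 3 channels")
    height, width, _ = hwc_uint8.shape
    # Nearest-neighbour resize (assumption): pick a source row/column per target pixel.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = hwc_uint8[rows[:, None], cols[None, :], :]
    # HWC -> CHW, then scale 0..255 down to 0..1 as float32.
    chw = np.transpose(resized, (2, 0, 1)).astype(np.float32) / 255.0
    return np.ascontiguousarray(chw)


# Usage with a synthetic all-white image instead of a real file.
dummy = np.full((100, 200, 3), 255, dtype=np.uint8)
tensor = preprocess_image(dummy)
print(tensor.shape, tensor.dtype)
```

The resulting shape corresponds to the "Image tensor shape: torch.Size([3, 896, 896])" line in the Result log.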

Documentation Updates

  • Added "Running with Python Bindings" section to gemma3 README
  • Added "Gemma 3 Multimodal Example" to LLM runner README
  • Documents key implementation details: operator loading, image preprocessing, chat template
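For the chat template mentioned above, the log in the Result section shows the runner prefilling three inputs in the order text, image, text. A rough sketch of how such segments could be assembled is given below. The helper `build_segments` is hypothetical, and the exact placement of the `<start_of_image>` and `<end_of_image>` markers is an assumption based on Gemma 3's turn-based prompt format, so check it against the script before relying on it. Wrapping the segments into runner inputs is left out on purpose.

```python
from typing import List, Tuple

# Marker strings used by Gemma 3; their exact placement below is an assumption.
START_OF_TURN = "<start_of_turn>"
END_OF_TURN = "<end_of_turn>"
START_OF_IMAGE = "<start_of_image>"
END_OF_IMAGE = "<end_of_image>"


def build_segments(prompt: str) -> List[Tuple[str, str]]:
    """Return (kind, payload) pairs in the order text, image, text."""
    before = f"{START_OF_TURN}user\n{START_OF_IMAGE}"
    after = f"{END_OF_IMAGE}{prompt}{END_OF_TURN}\n{START_OF_TURN}model\n"
    # The image payload is only a placeholder; a real run passes the image tensor.
    return [("text", before), ("image", "<image tensor goes here>"), ("text", after)]


for kind, payload in build_segments("What is in this image?"):
    print(kind, repr(payload))
```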

Test plan

python examples/models/gemma3/pybinding_run.py \
  --model_path /path/to/gemma3_model.pte \
  --tokenizer_path /path/to/tokenizer.json \
  --image_path /path/to/image.png \
  --prompt "What is in this image?"
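The four flags in this command suggest an argparse setup along the following lines. This is a hedged sketch, not the parser from pybinding_run.py: the flag names come from the command above, while the help strings, the `required=True` settings, and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the four flags used in the test plan."""
    parser = argparse.ArgumentParser(description="Gemma 3 multimodal inference (sketch)")
    parser.add_argument("--model_path", required=True, help="Path to the exported .pte model")
    parser.add_argument("--tokenizer_path", required=True, help="Path to tokenizer.json")
    parser.add_argument("--image_path", required=True, help="Path to the input image")
    parser.add_argument("--prompt", required=True, help="Text prompt about the image")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    [
        "--model_path", "/path/to/gemma3_model.pte",
        "--tokenizer_path", "/path/to/tokenizer.json",
        "--image_path", "/path/to/image.png",
        "--prompt", "What is in this image?",
    ]
)
print(args.model_path, args.prompt)
```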

Result

python pybinding_run.py \
  --model_path /Users/younghan/project/executorch/gemma-3/gemma-3-4b-it-HQQ-INT8-INT4/model.pte \
  --tokenizer_path /Users/younghan/project/executorch/gemma-3/tokenizer.json \
  --image_path /Users/younghan/project/executorch/docs/source/_static/img/et-logo.png \
  --prompt "What is in this image?"

W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.597000 17672 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
objc[17672]: Class ETCoreMLModelManagerDelegate is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2b08) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2b08). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLAsset is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2b58) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2b58). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLAssetManager is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2ba8) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2ba8). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLDefaultModelExecutor is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2bf8) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2bf8). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLModelLoader is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2c70) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2c70). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLModelCompiler is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2cc0) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2cc0). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLErrorUtils is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2d10) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2d10). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLMultiArrayDescriptor is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2d38) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2d38). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLModel is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2d88) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2d88). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLModelManager is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2dd8) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2dd8). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLStrings is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2e50) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2e50). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
I tokenizers:regex.cpp:27] Registering override fallback regex
Loading model from: /Users/younghan/project/executorch/gemma-3/gemma-3-4b-it-HQQ-INT8-INT4/model.pte
Loading tokenizer from: /Users/younghan/project/executorch/gemma-3/tokenizer.json
I tokenizers:hf_tokenizer.cpp:142] Setting up normalizer...
I tokenizers:hf_tokenizer.cpp:146] Normalizer set up
I tokenizers:hf_tokenizer.cpp:160] Setting up pretokenizer...
I tokenizers:hf_tokenizer.cpp:164] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:180] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:240] Loaded 513511 BPE merge rules
I tokenizers:hf_tokenizer.cpp:252] Built merge ranks map with 236249 entries
[llm_runner_helper.cpp:54] Loaded json tokenizer
[llm_runner_helper.cpp:293] Reading metadata from model
[llm_runner_helper.cpp:133] Metadata: use_sdpa_with_kv_cache = 1
[llm_runner_helper.cpp:133] Metadata: use_kv_cache = 1
[llm_runner_helper.cpp:131] Method get_max_context_len not found, using the default value 128
[llm_runner_helper.cpp:133] Metadata: get_max_context_len = 128
[llm_runner_helper.cpp:133] Metadata: get_max_seq_len = 2048
[llm_runner_helper.cpp:131] Method enable_dynamic_shape not found, using the default value 0
[llm_runner_helper.cpp:133] Metadata: enable_dynamic_shape = 0
[llm_runner_helper.cpp:144] Setting kMaxContextLen to kMaxSeqLen value: 2048
Loading image from: /Users/younghan/project/executorch/docs/source/_static/img/et-logo.png
Image tensor shape: torch.Size([3, 896, 896])

Prompt: What is in this image?
--------------------------------------------------
Response: [cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
[cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
[multimodal_runner.cpp:122] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:148] Prefilling input 0/3, type: text
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1770165954.529078 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165954.529116 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165954.529129 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: '' -> ''
E0000 00:00:1770165954.529164 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165954.529171 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: 'user' -> 'user'
E0000 00:00:1770165954.529195 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: '' -> ''
[multimodal_runner.cpp:148] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:107] Image tensor dim: 4, dtype: Float
[multimodal_runner.cpp:148] Prefilling input 2/3, type: text
E0000 00:00:1770165965.090435 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090477 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090483 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090499 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: 'What is in this image?' -> 'What▁is▁in▁this▁image?'
E0000 00:00:1770165965.090704 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090711 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090722 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: '' -> ''
E0000 00:00:1770165965.090731 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090742 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: '' -> ''
E0000 00:00:1770165965.090748 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: 'model' -> 'model'
Okay[multimodal_runner.cpp:177] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:189] Max new tokens resolved: 100, pos_ 271, max_context_len 2048
, let's analyze the image!

The image contains a stylized representation of a **chip** or microcircuit. You can see a grid-like pattern with lines and squares, which is characteristic of electronic components.

Is there anything specific you'd like me to look for or explain about the image?<end_o
f_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn>                                                                   PyTorchObserver {"prompt_tokens":271,"generated_tokens":99,"model_load_start_ms":17701
65951280,"model_load_end_ms":1770165954528,"inference_start_ms":1770165954529,"inference_end_ms":1770165971601,"prompt_eval_end_ms":1770165965258,"first_token_ms":1770165965258,"aggregate_sampling_time_ms":25,"SCALING_FACTOR_UNITS_PER_SECOND":1000}          [stats.h:143]   Prompt Tokens: 271    Generated Tokens: 99
[stats.h:149]   Model Load Time:                3.248000 (seconds)
[stats.h:159]   Total inference time:           17.072000 (seconds)              Rate:   5.798969 (tokens/second)
[stats.h:167]           Prompt evaluation:      10.729000 (seconds)              Rate:   25.258645 (tokens/second)
[stats.h:178]           Generated 99 tokens:    6.343000 (seconds)               Rate:   15.607757 (tokens/second)
[stats.h:186]   Time to first generated token:  10.729000 (seconds)
[stats.h:193]   Sampling time over 370 tokens:  0.025000 (seconds)

--------------------------------------------------
Prompt tokens: 271
Generated tokens: 99
Time to first token: 10.729 s
Generation rate: 15.61 tokens/sec
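The summary figures above can be reproduced from the millisecond timestamps in the PyTorchObserver JSON line of the log. The sketch below is a worked check and not the runner's own code: the `summarize` helper and its output keys are hypothetical, and the formulas are inferred from the fact that they match the printed values.

```python
import json

# JSON copied from the PyTorchObserver line in the log above.
observer_json = (
    '{"prompt_tokens":271,"generated_tokens":99,'
    '"model_load_start_ms":1770165951280,"model_load_end_ms":1770165954528,'
    '"inference_start_ms":1770165954529,"inference_end_ms":1770165971601,'
    '"prompt_eval_end_ms":1770165965258,"first_token_ms":1770165965258,'
    '"aggregate_sampling_time_ms":25,"SCALING_FACTOR_UNITS_PER_SECOND":1000}'
)


def summarize(raw: str) -> dict:
    """Derive durations (seconds) and rates (tokens/second) from the timestamps."""
    s = json.loads(raw)
    scale = s["SCALING_FACTOR_UNITS_PER_SECOND"]  # timestamp units per second
    prefill_s = (s["prompt_eval_end_ms"] - s["inference_start_ms"]) / scale
    decode_s = (s["inference_end_ms"] - s["prompt_eval_end_ms"]) / scale
    total_s = (s["inference_end_ms"] - s["inference_start_ms"]) / scale
    return {
        "model_load_s": (s["model_load_end_ms"] - s["model_load_start_ms"]) / scale,
        "time_to_first_token_s": (s["first_token_ms"] - s["inference_start_ms"]) / scale,
        "prompt_eval_rate": s["prompt_tokens"] / prefill_s,
        "generation_rate": s["generated_tokens"] / decode_s,
        "overall_rate": s["generated_tokens"] / total_s,
    }


for name, value in summarize(observer_json).items():
    print(f"{name}: {value:.3f}")
```

With these inputs, the time to first token comes out at 10.729 s and the generation rate at about 15.61 tokens/second, in line with the summary above.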

Next Step

Add a complete Python example demonstrating how to run Gemma 3
vision-language inference using ExecuTorch's Python bindings.

- Add pybinding_run.py with CLI interface for multimodal inference
- Document required operator imports (quantized kernels, custom_sdpa)
- Show image preprocessing (resize, HWC->CHW, normalize to [0,1])
- Include Gemma 3 chat template format
- Update gemma3 README with Python binding usage section
- Add Gemma 3 example to LLM runner README Python API docs
@pytorch-bot

pytorch-bot bot commented Feb 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17190

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the "CLA Signed" label Feb 4, 2026 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
@github-actions

github-actions bot commented Feb 4, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.
