[LLM] Add Python binding example for Gemma 3 multimodal inference #17190

seyeong-han · 2026-02-04T00:50:54Z

Summary

Add pybinding_run.py example script for running Gemma 3 multimodal inference with Python bindings
Update examples/models/gemma3/README.md with Python binding documentation
Add Gemma 3 example to extension/llm/runner/README.md Python API section

Key Changes

New Example Script (`examples/models/gemma3/pybinding_run.py`)

Complete CLI tool with argparse for model path, tokenizer, image, and prompt
Demonstrates loading quantized and custom operator libraries
Image preprocessing matching C++ runner (896x896, CHW, float32 normalized)
Gemma 3 chat template format for multimodal inputs
Token and stats callbacks with performance metrics
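
The image preprocessing listed above (896x896, CHW, float32 normalized) can be sketched as follows. This is a minimal illustration, not the code from `pybinding_run.py`: the function name `preprocess_image` is hypothetical, NumPy is used instead of the tensor library the script uses, and nearest-neighbour resizing is an assumption, since the PR text does not state the interpolation method.

```python
import numpy as np

TARGET_SIZE = 896  # side length stated in the PR description


def preprocess_image(image_hwc: np.ndarray, size: int = TARGET_SIZE) -> np.ndarray:
    """Resize an HWC uint8 RGB image to size x size; return CHW float32 in [0, 1]."""
    if image_hwc.ndim != 3 or image_hwc.shape[2] != 3:
        raise ValueError("expected an HWC RGB image")
    height, width, _ = image_hwc.shape
    # Nearest-neighbour resize by index lookup (assumption: the real script
    # may use a different interpolation method).
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = image_hwc[rows[:, None], cols[None, :], :]
    # HWC -> CHW, then scale 0..255 to 0..1 as float32.
    chw = np.transpose(resized, (2, 0, 1))
    return chw.astype(np.float32) / 255.0


# Usage with a synthetic 480x640 image standing in for a decoded file.
example = np.zeros((480, 640, 3), dtype=np.uint8)
tensor = preprocess_image(example)
print(tensor.shape, tensor.dtype)
```

The resulting shape matches the `torch.Size([3, 896, 896])` reported in the run log further down.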

Documentation Updates

Added "Running with Python Bindings" section to gemma3 README
Added "Gemma 3 Multimodal Example" to LLM runner README
Documents key implementation details: operator loading, image preprocessing, chat template
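
As a rough sketch of the chat template detail mentioned above: the run log shows three prefill inputs in the order text, image, text, with the text pieces normalizing to `user`, the prompt, and `model`. The layout below follows Gemma's `<start_of_turn>` / `<end_of_turn>` turn markers. The `Segment` type and `build_gemma3_inputs` function are hypothetical names, and any image-specific marker tokens the real script may add around the image are not shown in this PR and are omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One multimodal input: either a text chunk or an image placeholder."""
    kind: str  # "text" or "image"
    text: Optional[str] = None


def build_gemma3_inputs(prompt: str) -> List[Segment]:
    """Lay out a single user turn with one image, then open the model turn."""
    return [
        Segment("text", "<start_of_turn>user\n"),
        Segment("image"),  # the preprocessed image tensor would go here
        Segment("text", f"{prompt}<end_of_turn>\n<start_of_turn>model\n"),
    ]


for segment in build_gemma3_inputs("What is in this image?"):
    print(segment.kind, repr(segment.text))
```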

Test plan

python examples/models/gemma3/pybinding_run.py \
  --model_path /path/to/gemma3_model.pte \
  --tokenizer_path /path/to/tokenizer.json \
  --image_path /path/to/image.png \
  --prompt "What is in this image?"
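
The four flags in the command above could be declared with `argparse` roughly as follows. The flag names come from the command itself. Which flags are required, the default prompt, and the help strings are assumptions, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI flags used in the test plan command."""
    parser = argparse.ArgumentParser(
        description="Run Gemma 3 multimodal inference (illustrative sketch)"
    )
    parser.add_argument("--model_path", required=True, help="Path to the exported .pte file")
    parser.add_argument("--tokenizer_path", required=True, help="Path to tokenizer.json")
    parser.add_argument("--image_path", required=True, help="Path to the input image")
    # Assumption: the prompt is optional and falls back to a default question.
    parser.add_argument("--prompt", default="What is in this image?", help="Text prompt")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    [
        "--model_path", "/path/to/gemma3_model.pte",
        "--tokenizer_path", "/path/to/tokenizer.json",
        "--image_path", "/path/to/image.png",
    ]
)
print(args.model_path, args.prompt)
```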

Result

python pybinding_run.py \
  --model_path /Users/younghan/project/executorch/gemma-3/gemma-3-4b-it-HQQ-INT8-INT4/model.pte \
  --tokenizer_path /Users/younghan/project/executorch/gemma-3/tokenizer.json \
  --image_path /Users/younghan/project/executorch/docs/source/_static/img/et-logo.png \
  --prompt "What is in this image?"

W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.186000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.187000 17672 site-packages/torch/utils/flop_counter.py:45] triton not found; flop counting will not work for triton kernels
W0203 16:45:48.597000 17672 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
objc[17672]: Class ETCoreMLModelManagerDelegate is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2b08) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2b08). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLAsset is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2b58) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2b58). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLAssetManager is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2ba8) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2ba8). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLDefaultModelExecutor is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2bf8) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2bf8). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLModelLoader is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2c70) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2c70). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLModelCompiler is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2cc0) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2cc0). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLErrorUtils is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2d10) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2d10). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLMultiArrayDescriptor is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2d38) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2d38). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLModel is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2d88) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2d88). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLModelManager is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2dd8) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2dd8). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
objc[17672]: Class ETCoreMLStrings is implemented in both /Users/younghan/project/executorch/pip-out/temp.macosx-11.0-arm64-cpython-311/cmake-out/_portable_lib.cpython-311-darwin.so (0x1176a2e50) and /Users/younghan/miniconda3/envs/executorch/lib/python3.11/site-packages/executorch/extension/pybindings/_portable_lib.cpython-311-darwin.so (0x140fa2e50). This may cause spurious casting failures and mysterious crashes. One of the duplicates must be removed or renamed.
I tokenizers:regex.cpp:27] Registering override fallback regex
Loading model from: /Users/younghan/project/executorch/gemma-3/gemma-3-4b-it-HQQ-INT8-INT4/model.pte
Loading tokenizer from: /Users/younghan/project/executorch/gemma-3/tokenizer.json
I tokenizers:hf_tokenizer.cpp:142] Setting up normalizer...
I tokenizers:hf_tokenizer.cpp:146] Normalizer set up
I tokenizers:hf_tokenizer.cpp:160] Setting up pretokenizer...
I tokenizers:hf_tokenizer.cpp:164] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:180] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:240] Loaded 513511 BPE merge rules
I tokenizers:hf_tokenizer.cpp:252] Built merge ranks map with 236249 entries
[llm_runner_helper.cpp:54] Loaded json tokenizer
[llm_runner_helper.cpp:293] Reading metadata from model
[llm_runner_helper.cpp:133] Metadata: use_sdpa_with_kv_cache = 1
[llm_runner_helper.cpp:133] Metadata: use_kv_cache = 1
[llm_runner_helper.cpp:131] Method get_max_context_len not found, using the default value 128
[llm_runner_helper.cpp:133] Metadata: get_max_context_len = 128
[llm_runner_helper.cpp:133] Metadata: get_max_seq_len = 2048
[llm_runner_helper.cpp:131] Method enable_dynamic_shape not found, using the default value 0
[llm_runner_helper.cpp:133] Metadata: enable_dynamic_shape = 0
[llm_runner_helper.cpp:144] Setting kMaxContextLen to kMaxSeqLen value: 2048
Loading image from: /Users/younghan/project/executorch/docs/source/_static/img/et-logo.png
Image tensor shape: torch.Size([3, 896, 896])

Prompt: What is in this image?
--------------------------------------------------
Response: [cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
[cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
[multimodal_runner.cpp:122] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:148] Prefilling input 0/3, type: text
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1770165954.529078 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165954.529116 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165954.529129 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: '' -> ''
E0000 00:00:1770165954.529164 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165954.529171 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: 'user' -> 'user'
E0000 00:00:1770165954.529195 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: '' -> ''
[multimodal_runner.cpp:148] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:107] Image tensor dim: 4, dtype: Float
[multimodal_runner.cpp:148] Prefilling input 2/3, type: text
E0000 00:00:1770165965.090435 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090477 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090483 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090499 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: 'What is in this image?' -> 'What▁is▁in▁this▁image?'
E0000 00:00:1770165965.090704 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090711 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090722 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: '' -> ''
E0000 00:00:1770165965.090731 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
E0000 00:00:1770165965.090742 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: '' -> ''
E0000 00:00:1770165965.090748 4408997 re2.cc:804] DFA out of memory: pattern length 96883, program size 17069, list count 10654, bytemap range 45
I tokenizers:hf_tokenizer.cpp:415] normalized input: 'model' -> 'model'
Okay[multimodal_runner.cpp:177] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:189] Max new tokens resolved: 100, pos_ 271, max_context_len 2048
, let's analyze the image!

The image contains a stylized representation of a **chip** or microcircuit. You can see a grid-like pattern with lines and squares, which is characteristic of electronic components.
Is there anything specific you'd like me to look for or explain about the image?<end_o
f_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn><end_of_turn>                                                                   PyTorchObserver {"prompt_tokens":271,"generated_tokens":99,"model_load_start_ms":17701
65951280,"model_load_end_ms":1770165954528,"inference_start_ms":1770165954529,"inference_end_ms":1770165971601,"prompt_eval_end_ms":1770165965258,"first_token_ms":1770165965258,"aggregate_sampling_time_ms":25,"SCALING_FACTOR_UNITS_PER_SECOND":1000}          [stats.h:143]   Prompt Tokens: 271    Generated Tokens: 99
[stats.h:149]   Model Load Time:                3.248000 (seconds)
[stats.h:159]   Total inference time:           17.072000 (seconds)              Rate:
        5.798969 (tokens/second)                                                      [stats.h:167]           Prompt evaluation:      10.729000 (seconds)              Rate:
        25.258645 (tokens/second)                                                     [stats.h:178]           Generated 99 tokens:    6.343000 (seconds)               Rate:
        15.607757 (tokens/second)                                                     [stats.h:186]   Time to first generated token:  10.729000 (seconds)
[stats.h:193]   Sampling time over 370 tokens:  0.025000 (seconds)

--------------------------------------------------
Prompt tokens: 271
Generated tokens: 99
Time to first token: 10.729 s
Generation rate: 15.61 tokens/sec

Next Step

Merge #16987 ([llama] Add chat format support for Llama 3 Instruct models) to fix EOS token handling in model export
Remove <end_of_turn> workaround from pybinding_run.py once models include correct EOS IDs
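
The PR text does not show how the `<end_of_turn>` workaround is implemented. One possible shape, given purely as an assumption, is a token callback that stops collecting text once the marker appears. `make_token_callback` is a hypothetical name, and this sketch only filters output. It does not stop generation itself, which is why the exported EOS IDs still need fixing.

```python
from typing import Callable, List

EOS_MARKER = "<end_of_turn>"


def make_token_callback(sink: List[str]) -> Callable[[str], None]:
    """Return a callback that collects token text until EOS_MARKER is seen."""
    done = False

    def on_token(token: str) -> None:
        nonlocal done
        if done:
            return  # ignore everything after the first marker
        if EOS_MARKER in token:
            # Keep any text that precedes the marker, then stop collecting.
            sink.append(token.split(EOS_MARKER, 1)[0])
            done = True
            return
        sink.append(token)

    return on_token


collected: List[str] = []
callback = make_token_callback(collected)
for piece in ["Okay", ", let's analyze", " the image!", "<end_of_turn>", "<end_of_turn>"]:
    callback(piece)
print("".join(collected))
```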

Add a complete Python example demonstrating how to run Gemma 3 vision-language inference using ExecuTorch's Python bindings.

- Add pybinding_run.py with CLI interface for multimodal inference
- Document required operator imports (quantized kernels, custom_sdpa)
- Show image preprocessing (resize, HWC->CHW, normalize to [0,1])
- Include Gemma 3 chat template format
- Update gemma3 README with Python binding usage section
- Add Gemma 3 example to LLM runner README Python API docs

pytorch-bot · 2026-02-04T00:50:57Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17190

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-02-04T00:51:38Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

seyeong-han requested review from larryliu0820, lucylq and mergennachin as code owners February 4, 2026 00:50

meta-cla bot added the CLA Signed label Feb 4, 2026
