Version: 1.0.8 License: GPL-3.0
Multimodal prompt generator nodes for ComfyUI, designed to generate prompts for Qwen-Image-Edit and Wan2.2.
Supports local GGUF models (Qwen2.5-VL, Qwen3-VL) and the Qwen cloud API for image and video prompt generation and enhancement.
Based on extensive testing, Wan2.2 and Qwen-Image-Edit respond significantly better to Chinese prompts than English prompts.
Recommendation: Set target_language to "zh" (Chinese) for best results with these models, even if your input is in English. The models will generate more coherent and instruction-following outputs.
Starting from v1.0.8, image input for Qwen2.5-VL is available with the official llama-cpp-python 0.3.16. Vision input support varies by model and llama-cpp-python version; see the Installation section for detailed compatibility information. Results may vary depending on your specific environment.
Recommendation: Use Qwen3-VL with Qwen-Image-Edit. Qwen2.5-VL currently shows insufficient adherence to user prompts under the existing system prompt configuration.
Starting from v1.0.6, internal GGUF model handling has been improved to ensure stable behavior when switching between different Qwen3-VL models (e.g. 8B ↔ 4B), with mmproj files now being properly reloaded as part of the model switching process.
These changes are internal and do not affect node interfaces or workflows.
- Flexible prompting styles:
  - `raw`: Direct LLM response without system prompt
  - `default`: Balanced prompt enhancement
  - `detailed`: Rich visual details (colors, textures, lighting, atmosphere)
  - `concise`: Minimal keywords, focused on core elements
  - `creative`: Artistic interpretation with unique perspectives
- Multi-image input: Support batch image input via ComfyUI's batch nodes (e.g., Images Batch Multiple)
- Local GGUF support: Run Qwen2.5-VL and Qwen3-VL models locally
- Auto-detect mmproj: Automatic detection or manual selection
- Image editing prompts: Specialized for Qwen-Image-Edit tasks
- Optimized for Chinese: Better performance with Chinese language prompts
- Multi-image support: Up to 3 images via optional inputs (image2/image3)
- Dynamic model selection: Auto-detect local GGUF models and cloud API models
- Auto-detect mmproj: Automatic detection or manual selection
- API key management: Centralized configuration via `api_key.txt`
- Video generation prompts: Optimized for Wan2.2 text-to-video and image-to-video
- Task-specific optimization: Separate prompts for T2V and I2V workflows
- Optimized for Chinese: Better performance with Chinese language prompts
- Extended token limit: 2048 tokens to support longer Chinese prompts (600+ characters)
- Dynamic model selection: Auto-detect local GGUF models and cloud API models
- Auto-detect mmproj: Automatic detection or manual selection
- API key management: Centralized configuration via `api_key.txt`
Clone this repository into your ComfyUI custom_nodes folder:
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/yourusername/ComfyUI-MultiModal-Prompt-Nodes.git
cd ComfyUI-MultiModal-Prompt-Nodes
pip install -r requirements.txt
```

Alternative manual installation:

```bash
pip install dashscope pillow numpy
```

Important: Model compatibility varies by llama-cpp-python version. Based on my testing environment:
| Version | Qwen2.5-VL | Qwen3-VL |
|---|---|---|
| 0.3.16 (official) | ✅ | ❌ |
| 0.3.21+ (JamePeng fork) | ✅ | ✅ |
*Note: Vision input support may vary depending on your environment and configuration.*
Recommended Installation (JamePeng fork for Qwen3-VL support):
Please follow the build and installation instructions provided in the JamePeng fork repository, as this fork requires a custom build and cannot be reliably installed via a simple pip install.
Source: https://github.com/JamePeng/llama-cpp-python
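To confirm which llama-cpp-python build is active, a quick check like the one below can be run with the same Python interpreter that ComfyUI uses. The `__version__` attribute is exposed by the package, though the exact string reported by the JamePeng fork may differ from the official release:

```python
# Check the installed llama-cpp-python version from the Python
# environment that ComfyUI actually runs in.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# 0.3.16 (official): Qwen2.5-VL vision input only
# 0.3.21+ (JamePeng fork): Qwen2.5-VL and Qwen3-VL vision input
```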
Place your GGUF models in ComfyUI/models/LLM/:
```
ComfyUI/models/LLM/
├── Qwen2.5VL-7B-F16_0.gguf
├── Qwen3VL-8B-Instruct-Q8_0.gguf
├── mmproj-Qwen2.5-VL-7B-Instruct-F16.gguf
└── mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf
```
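The `(Auto-detect)` option pairs a model with an mmproj file by filename. A hypothetical sketch of that kind of name matching is shown below; it is an illustration only, assuming the naming pattern above, and the node's actual detection logic may differ:

```python
# Hypothetical illustration of filename-based mmproj matching; not the
# node's actual auto-detection code.
from pathlib import Path
from typing import Optional

def find_mmproj(model_file: str, llm_dir: str = "ComfyUI/models/LLM") -> Optional[str]:
    """Pick the mmproj*.gguf whose name shares the most tokens with the model name."""
    tokens = set(Path(model_file).stem.lower().replace("-", "_").split("_"))
    best, best_score = None, 0
    for candidate in Path(llm_dir).glob("mmproj*.gguf"):
        cand_tokens = set(candidate.stem.lower().replace("-", "_").split("_"))
        score = len(tokens & cand_tokens)
        if score > best_score:
            best, best_score = candidate, score
    return str(best) if best else None

# e.g. pairs Qwen3VL-8B-Instruct-Q8_0.gguf with mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf
print(find_mmproj("Qwen3VL-8B-Instruct-Q8_0.gguf"))
```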
For cloud API usage, create api_key.txt in the node folder:
`ComfyUI/custom_nodes/ComfyUI-MultiModal-Prompt-Nodes/api_key.txt`
Add your Alibaba Cloud Dashscope API key to this file.
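For reference, the key is consumed by the Dashscope SDK roughly as in the sketch below. This is an illustration of the mechanism, not the node's exact loading code; the nodes perform an equivalent step internally when a cloud API model is selected:

```python
# Illustrative only: read the key from api_key.txt and hand it to the
# Dashscope SDK, as the nodes do when a cloud API model is selected.
from pathlib import Path
import dashscope

key_file = Path("ComfyUI/custom_nodes/ComfyUI-MultiModal-Prompt-Nodes/api_key.txt")
dashscope.api_key = key_file.read_text(encoding="utf-8").strip()
```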
Inputs:
- `prompt`: Text prompt to rewrite/enhance
- `style`: Prompt rewriting style
  - `raw`: Direct LLM response without system prompt (useful for custom prompting)
  - `default`: Balanced prompt enhancement
  - `detailed`: Rich visual details
  - `concise`: Minimal, focused keywords
  - `creative`: Artistic interpretation
- `target_language`: Output language (auto/en/zh)
- `model`: Select from auto-detected local GGUF models
- `mmproj`: mmproj file selection
  - `(Auto-detect)`: Automatically search for a matching mmproj
  - `(Not required)`: For text-only mode
  - Specific file: Manually select an mmproj file
- `max_tokens`: Maximum tokens to generate (default: 512)
- `temperature`: Sampling temperature (0.0-2.0, default: 0.7)
- `device`: CPU or GPU execution
- `image` (optional): Input image for vision-language processing
Example workflow:
- Load Vision LLM Node
- Enter basic prompt: "a cat sitting on a windowsill"
- Attach image via batch node (optional)
- Select model
- Choose `(Auto-detect)` for mmproj or select a specific file
- Select style: `default`
- Set device: `CPU` or `GPU`
- Run to get enhanced prompt
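Under the hood, local GGUF inference is driven by llama-cpp-python. The snippet below is a rough, text-only sketch of how the node's `max_tokens`, `temperature`, and `device` settings map onto that library; the actual node code, system prompts, and vision handling (which additionally requires the mmproj file) are more involved:

```python
# Rough sketch of the local text-only path; the node's real implementation,
# system prompts, and mmproj-based vision input are not shown here.
from llama_cpp import Llama

llm = Llama(
    model_path="ComfyUI/models/LLM/Qwen3VL-8B-Instruct-Q8_0.gguf",
    n_ctx=4096,          # context window
    n_gpu_layers=-1,     # -1 offloads all layers when device is GPU; 0 for CPU
    verbose=False,
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Rewrite the prompt with rich visual detail."},
        {"role": "user", "content": "a cat sitting on a windowsill"},
    ],
    max_tokens=512,      # node default
    temperature=0.7,     # node default
)
print(result["choices"][0]["message"]["content"])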
Inputs:
- `image`: Primary input image (required)
- `prompt`: Edit instruction or image description
- `prompt_style`:
  - `Qwen-Image-Edit`: For image editing tasks
  - `Qwen-Image`: For general image understanding
- `target_language`: Output language (auto/zh/en)
- `llm_model`: Model selection
  - `Local: xxx`: Local GGUF models (auto-detected)
  - API models: qwen-vl-max, qwen-plus, etc.
- `mmproj`: mmproj file (required for local models)
  - `(Auto-detect)`: Automatic detection
  - `(Not required)`: For API models or text-only mode
  - Specific file: Manual selection
- `max_retries`: Retry attempts for API calls (default: 3)
- `device`: CPU/GPU selection for local models
- `save_tokens`: Compress images to save API tokens
- `image2`/`image3` (optional): Additional context images
Use cases:
- Image editing prompt generation
- Multi-image context prompts
- Style transfer descriptions
- Visual question answering
Recommended settings:
- For best results: Set `target_language` to `zh` (Chinese)
- Use local models for privacy, API models for quality
- Enable `save_tokens` when using API models (see the sketch below)
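The `save_tokens` option reduces the size of images sent to the cloud API. The following is a minimal illustration of that kind of downscaling with Pillow; the resize target and JPEG quality here are assumptions, not the node's exact parameters:

```python
# Illustrative downscaling before an API call; the actual resize target and
# quality used by save_tokens may differ.
from io import BytesIO
from PIL import Image

def compress_for_api(image: Image.Image, max_side: int = 1024, quality: int = 85) -> bytes:
    """Downscale the longest side to max_side and re-encode as JPEG."""
    img = image.convert("RGB")
    img.thumbnail((max_side, max_side))  # preserves aspect ratio, resizes in place
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```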
Inputs:
- `prompt`: Video scene description
- `task_type`:
  - `Text-to-Video`: Generate video from text description
  - `Image-to-Video`: Generate video from image + text
- `target_language`: Output language (auto/zh/en)
- `llm_model`: Model selection
  - `Local: xxx`: Local GGUF models
  - API models: qwen-vl-max (for I2V), qwen-plus, etc.
- `mmproj`: mmproj selection (same as other nodes)
- `max_retries`: API retry attempts
- `device`: CPU/GPU for local models
- `save_tokens`: Image compression for API
- `image` (optional): Reference frame for I2V tasks
Optimized for:
- Wan2.2 video generation
- Temporal coherence descriptions
- Camera movement instructions
- Scene transitions
Important notes:
- Use Chinese prompts (`target_language: zh`) for best results
- Supports 600+ Chinese characters (2048 tokens)
- For I2V tasks, use `qwen-vl-*` models
Example T2V workflow:
- Enter prompt: "一只猫在窗台上看风景" (A cat looking at scenery on a windowsill)
- Set `task_type: Text-to-Video`
- Set `target_language: zh`
- Select model (local or API)
- Run to get optimized video prompt
Example I2V workflow:
- Attach input image
- Enter motion description: "镜头慢慢推进" (Camera slowly zooms in)
- Set `task_type: Image-to-Video`
- Set `target_language: zh`
- Ensure model supports vision (`qwen-vl-*`)
- Run to get I2V prompt
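When a cloud API model is selected, the node ends up making a Dashscope call conceptually similar to the sketch below, shown for the text-to-video case. The system prompt, retry handling, and the I2V image payload used by the real node are not shown; field names follow the Dashscope SDK's documented usage and may need adjusting for your SDK version:

```python
# Conceptual sketch of the cloud T2V path; the node's actual system prompt,
# retry logic, and I2V image payload are omitted.
import dashscope

dashscope.api_key = "YOUR_DASHSCOPE_API_KEY"  # the nodes load this from api_key.txt

response = dashscope.Generation.call(
    model="qwen-plus",
    messages=[
        {"role": "system", "content": "将用户的描述改写为适合 Wan2.2 的中文视频生成提示词。"},
        {"role": "user", "content": "一只猫在窗台上看风景"},
    ],
    result_format="message",
)
if response.status_code == 200:
    print(response.output.choices[0].message.content)
else:
    print("Dashscope error:", response.code, response.message)
```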
- ✅ Qwen2.5-VL (3B/7B): Full vision support
- ✅ Requires matching mmproj file
- ❌ Insufficient adherence to user prompts with Qwen-Image-Edit under the existing system prompt configuration
- ✅ Qwen3-VL (4B/7B): Full vision support with the JamePeng fork
- ✅ Requires matching mmproj file
- Q4_K_M: Balanced quality/size (recommended for most users)
- Q5_K_M: Higher quality, larger size
- Q8_0: Maximum quality, largest size
- Qwen models: https://huggingface.co/Qwen
- GGUF conversions: https://huggingface.co/models?search=qwen+gguf
- mmproj files: Usually bundled with GGUF conversions
- RAM: 8GB+ (16GB recommended for 7B models)
- Storage: 3-8GB per model (depending on quantization)
- GPU: Optional (CPU execution supported)
- NVIDIA GPU: CUDA support via llama-cpp-python
- AMD GPU: ROCm support (requires specific build)
- Intel Arc: Limited support, CPU recommended
- Use Q4_K_M quantization for faster inference and lower memory usage
- Reduce max_tokens if hitting memory limits
- Enable GPU if you have compatible hardware (select `GPU` in the device dropdown)
- Use CPU for stability if encountering GPU issues
- Batch multiple requests when possible for efficiency
- Close other applications to free up RAM during inference
| Model | Quantization | RAM Usage |
|---|---|---|
| Qwen3-VL-4B | Q4_K_M | ~4-5GB |
| Qwen3-VL-4B | Q8_0 | ~7-8GB |
| Qwen3-VL-7B | Q4_K_M | ~6-7GB |
| Qwen3-VL-7B | Q8_0 | ~12-14GB |
Q: "No module named 'llama_cpp'" error
A: Install llama-cpp-python: pip install llama-cpp-python==0.3.21 --break-system-packages
Q: pip install fails with "externally-managed-environment"
A: Use --break-system-packages flag or create a virtual environment
Q: "Failed to load model" with Qwen3-VL
A: Ensure you're using llama-cpp-python 0.3.21+ (JamePeng fork). Version 0.3.16 doesn't support Qwen3-VL.
Q: "mmproj not specified" error
A: Select an mmproj file (or choose (Auto-detect)) in the mmproj dropdown for local models
Q: "No models found" in model dropdown
A:
- Place GGUF models in `ComfyUI/models/LLM/`
- Restart ComfyUI
- Verify file extensions are `.gguf`
Q: Vision input not working with Qwen2.5-VL
A: Update to v1.0.8 or later; earlier versions contained a bug that loaded Qwen2.5-VL in text-only mode even when an mmproj file was specified.
Q: Out of memory errors
A:
- Use smaller quantization (Q4_K_M instead of Q8_0)
- Reduce the `max_tokens` parameter
- Close other applications
- Use a smaller model (4B instead of 7B)
Q: Slow inference on CPU
A: Normal for large models. Consider:
- Q4_K_M quantization (faster than Q8_0)
- Smaller models (4B faster than 7B)
- GPU acceleration if available
Q: "API_KEY is not set" error with local models
A: This error should only appear when using API models. If using local models (starting with "Local:"), this is a bug - please report it.
Q: Wan2.2 output is incoherent or doesn't follow instructions
A: Set target_language to zh (Chinese). Wan2.2 performs significantly better with Chinese prompts, even if your input is in English.
Q: Qwen-Image-Edit not understanding my edits
A:
- Use `target_language: zh` for better results
- Be specific in edit instructions
- Try using reference examples in your prompt
Q: Output is cut off or incomplete
A: For the Vision LLM Node, increase the `max_tokens` parameter. The other nodes use fixed limits (512 tokens for the Qwen node, 2048 for the Wan node).
Q: How to choose between CPU and GPU?
A:
- GPU: Faster inference, requires compatible hardware (NVIDIA with CUDA)
- CPU: Universal compatibility, slower but stable
- Recommendation: Start with CPU, switch to GPU if available and working
Q: GPU selected but still using CPU
A: Your GPU may not be compatible with llama-cpp-python. Check:
- NVIDIA GPU with CUDA support
- llama-cpp-python built with CUDA support
- Driver installation
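One quick way to see whether your llama-cpp-python build can offload to the GPU at all is the low-level `llama_supports_gpu_offload` binding; this check is an assumption based on the upstream llama.cpp API and may not be exposed identically in every build:

```python
# Returns False when llama-cpp-python was built without GPU (CUDA/Metal/etc.)
# support, in which case the GPU device setting silently falls back to CPU.
import llama_cpp

print("GPU offload available:", llama_cpp.llama_supports_gpu_offload())
```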
- Create `api_key.txt` in the node directory:
  `ComfyUI/custom_nodes/ComfyUI-MultiModal-Prompt-Nodes/api_key.txt`
- Add your Alibaba Cloud Dashscope API key (single line, no quotes)
- The key will be automatically loaded by the Qwen and Wan nodes when using cloud API models
- Never commit `api_key.txt` to version control
- The file is listed in `.gitignore` by default
- API keys are only loaded when using cloud API models
- Local models don't require API keys
See the examples/ directory for:
- Basic prompt enhancement workflows
- Multi-image vision processing
- Image editing prompt generation
- Video prompt generation (T2V and I2V)
- Style-specific optimizations
This project is licensed under the GNU General Public License v3.0.
Copyright (C) 2026 kantan-kanto
GitHub: https://github.com/kantan-kanto
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Note: GPL-3.0 is required due to llama-cpp-python dependency.
For full details, see the LICENSE file and AUTHORS.md.
This repository may introduce internal structural changes over time (e.g. extracting Local GGUF or Cloud API implementations into separate modules) to improve maintainability and stability.
- Node interfaces (INPUT / RETURN types) are intended to remain stable
- Internal refactors will be documented in the changelog
- The `backends/` directory added in v1.0.6 is a non-functional placeholder for future internal refactoring
No user action is required.
This project is a restructured and extended ComfyUI custom node collection, derived from GPL-3.0 licensed projects.
For detailed attribution, file-level mapping, and contribution notes, see AUTHORS.md.
- llama-cpp-python: Andrei Betlen
- Qwen3-VL support: JamePeng's llama-cpp-python fork
- Qwen models: Alibaba Cloud Qwen Team
- Dashscope API: Alibaba Cloud
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Areas needing help:
- Testing on different hardware configurations
- Documenting vision input compatibility across environments
- Additional workflow examples
- Performance optimizations
- Issues: Report bugs or request features via GitHub Issues
- Documentation: See CHANGELOG.md for version history
- Examples: Check examples/ for workflow templates
See CHANGELOG.md for detailed version history.
- Fixed an issue where Qwen2.5-VL was always loaded in text-only mode even when a valid mmproj file was specified.
- Improved mmproj auto-detection logic.