This suite of nodes unlocks high-performance parallel processing in ComfyUI by using Model Replication. Unlike standard offloading, which moves a single model instance between devices, these nodes create independent replicas of the model on each selected GPU/CPU, enabling true simultaneous batch processing.
- Tested on Z_IMAGE, FLUX.1, WAN2.2
- True Parallel Execution: Simultaneous forward passes on multiple GPUs using thread-safe model replicas
- Chainable Device Nodes: Connect multiple Parallel Device Config nodes to easily configure 2-8+ GPUs
- Auto Hardware Detection: Dropdown menus automatically populated with available CUDA GPUs, CPU, Apple MPS, and Intel XPU
- Dynamic Load Balancing: Percentage-based batch splitting (e.g., 70% on RTX 3090, 30% on RTX 3060)
- Cross-Platform: Works on Windows, Linux, and macOS (MPS)
The main orchestration node. It intercepts the diffusion model's forward pass and triggers simultaneous compute kernels across all available replicas.
Allows you to build a DEVICE_CHAIN. You can chain multiple GPUs together, or use the List node to quickly configure several devices and their respective workload percentages.
- Screenshots: Option 1 and Option 2 setups, console output, GPU utilization plot, and the example workflow
- Test case: batch_size = 21 (Z_Image Turbo, 1024x1024)
- Minimum: 2x GPUs or 1x GPU + CPU (for testing)
- Recommended: Identical GPUs for balanced loads
- VRAM: Each GPU must independently hold the full model (e.g., SDXL requires ~7GB per GPU)
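Since each GPU must hold its own full replica, it can help to confirm free VRAM per device before adding it to the chain. A minimal check, assuming PyTorch with CUDA (the 7 GB threshold is just the SDXL figure above, adjust for your model):

```python
# Sanity check: does each GPU have enough free VRAM for a full model replica?
import torch

REQUIRED_GB = 7  # e.g., rough SDXL footprint; adjust for your model

for i in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(i)
    free_gb = free_bytes / 1024**3
    status = "OK" if free_gb >= REQUIRED_GB else "too little free VRAM"
    print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): {free_gb:.1f} GB free -> {status}")
```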
- ComfyUI installed and functional
- PyTorch with appropriate backend:
# CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Apple Silicon (MPS)
pip install torch torchvision
# Intel XPU (experimental)
pip install intel-extension-for-pytorch
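To confirm which backends your PyTorch install can actually see (the same kinds of devices the dropdowns are populated from), a quick check like the following works; this is a generic sketch, not the node's own detection code:

```python
# List the compute devices a stock PyTorch install can see.
import torch

devices = ["cpu"]
devices += [f"cuda:{i}" for i in range(torch.cuda.device_count())]
if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    devices.append("mps")
if hasattr(torch, "xpu") and torch.xpu.is_available():  # needs Intel XPU support installed
    devices.append("xpu")
print("Available devices:", devices)
```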
- Navigate to your ComfyUI custom nodes directory:
cd ComfyUI/custom_nodes/
- Clone this repository:
git clone https://github.com/FearL0rd/ComfyUI-ParallelAnything.git
- Restart ComfyUI.
[Parallel Device Config]      [Parallel Device Config]
   │ cuda:0 (50%)                │ cuda:1 (50%)
   └──────────────→ [Parallel Anything] ←──────────────┘
                           │
                  [Your KSampler/etc]
- Add a Parallel Device Config node → select cuda:0 from the dropdown → set 50%
- Add another Parallel Device Config → connect the DEVICE_CHAIN output from the first node → select cuda:1 → set 50%
- Connect the final DEVICE_CHAIN to the Parallel Anything node
- Connect your MODEL from Load Checkpoint → Parallel Anything → KSampler
Chain 4 devices with different percentages based on GPU memory:
- cuda:0 (RTX 3090): 40%
- cuda:1 (RTX 3090): 40%
- cuda:2 (V100): 15%
- cuda:3 (P100): 5%
Use Parallel Device List (1-4x) if you prefer a single node with 4 dropdowns instead of chaining.
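For reference, here is a rough sketch of how percentage weights can be turned into per-device chunk sizes (a hypothetical helper, not the node's internal code); largest-remainder rounding keeps the chunks summing to the batch size:

```python
# Hypothetical sketch: split a batch of N samples across devices by weight.
# Remainders go to the largest fractional shares so the chunks sum to N.
def split_counts(batch_size, weights):
    total = sum(weights)
    raw = [batch_size * w / total for w in weights]
    counts = [int(r) for r in raw]
    leftover = batch_size - sum(counts)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# e.g., 21 images over the 40/40/15/5 chain above
print(split_counts(21, [40, 40, 15, 5]))  # -> [9, 8, 3, 1]
```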
[Parallel Device Config: cpu (20%)] → [Parallel Device Config: cuda:0 (80%)] → [Parallel Anything]
- PCIe Bandwidth Matters Ensure GPUs share the same PCIe switch or CPU root complex:
# Linux: Check PCIe topology
lspci -tv | grep -i nvidia
Avoid configurations where GPUs are on separate NUMA nodes with limited inter-socket bandwidth.
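On NVIDIA systems, `nvidia-smi topo -m` also prints the GPU interconnect matrix. From Python, a quick way to check whether two GPUs can use peer-to-peer transfers (which usually implies they share a reasonable PCIe path) is sketched below:

```python
# Check CUDA peer-to-peer access between every GPU pair.
# P2P-capable pairs typically sit on the same PCIe switch or root complex.
import torch
from itertools import combinations

n = torch.cuda.device_count()
for a, b in combinations(range(n), 2):
    p2p = torch.cuda.can_device_access_peer(a, b)
    print(f"cuda:{a} <-> cuda:{b}: peer access {'yes' if p2p else 'no'}")
```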
- Batch Size Optimization
- Minimum: Batch size ≥ Number of GPUs
- Sweet Spot: Batch size 8-16 for 2-4 GPUs
- Diminishing Returns: Very large batches may saturate PCIe transfer bandwidth
- Identical GPUs Preferred
- Mixing GPU architectures (e.g., RTX 4090 + RTX 3090) works, but the faster GPU will wait for the slower one at each step. Use percentage weights to compensate (e.g., a 60/40 split).
- Model Placement: Place Parallel Anything immediately before the KSampler, after all LoRA/weight modifications:
Load Checkpoint → Load LoRA → [Parallel Anything] → KSampler
- VRAM Usage: This node uses Model Replication. If you use 2 GPUs, you will use 2 times the VRAM (one copy per card).
- Batch Size: Parallelism only triggers if your Batch Size is > the number of devices in your chain.
- Inference Tensors: The node automatically clones and detaches tensors to bypass PyTorch's "Inference tensors do not track version counter" error common in multi-GPU workflows.
If you encounter a RuntimeError regarding "Inference Tensors":
- Ensure you are using a Batch Size large enough to split.
- The node uses a "Deep Detach" strategy (.detach().clone()) to satisfy the version counter requirements of the KSampler.
- If you see "RuntimeError: Expected all tensors to be on the same device, but got mat1 is on cuda:0, different from other tensors on cuda:1 (when checking argument in method wrapper_CUDA_addmm)" after changing the GPU percentages for a second run, restart ComfyUI.
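For context, a deep detach over nested outputs might look roughly like this (a conceptual sketch, not the node's actual implementation):

```python
# Conceptual "Deep Detach": walk nested containers and replace every tensor
# with a detached, cloned copy so downstream code (e.g., the KSampler) can
# mutate it without tripping inference-tensor version checks.
import torch

def deep_detach(obj):
    if isinstance(obj, torch.Tensor):
        return obj.detach().clone()
    if isinstance(obj, dict):
        return {k: deep_detach(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(deep_detach(v) for v in obj)
    return obj
```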
- PCIe Bottleneck: Data transfer overhead exceeds compute benefit (common with x4/x8 PCIe slots)
- Small Batch Size: Overhead of splitting/merging exceeds parallel benefit. Try batch size ≥ 8.
- Mixed GPUs: Fast GPU waiting for slow GPU. Adjust percentages.
- Apple MPS (Mac) Issues: The MPS backend does not support all operations needed for Stable Diffusion. If you encounter errors:
  - Use CPU exclusively on Mac for stability
  - MPS support is experimental
If you see CUDA error: invalid device ordinal or similar:
- Ensure you're not wrapping the model twice (check for nested Parallel Anything nodes)
- Verify all selected devices exist: torch.device('cuda:2') will fail if you only have 2 GPUs (indices 0 and 1)
This node implements Data Parallelism via model replication:
- Replication: On setup, the model state dict is deep-copied to N devices
- Batch Split: Input batch is divided by percentage weights
- Thread Pool: Each chunk is processed in parallel using ThreadPoolExecutor
- Synchronization: torch.cuda.synchronize() ensures computation completes before returning to lead device
- Concatenation: Results are gathered and concatenated on the lead device
- Trade-off: Uses N× VRAM for N× throughput (approximately). Best for multi-GPU workstations with identical GPUs.
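The overall flow can be pictured with the simplified sketch below. This is an illustration of the data-parallel pattern only, assuming a plain callable model; the actual node hooks this into ComfyUI's model objects rather than replicating per call:

```python
# Simplified data-parallel dispatch sketch (illustration, not the node's code):
# replicate the model, split the batch by weight, run each chunk in a thread,
# then gather results back on the lead device.
import copy
from concurrent.futures import ThreadPoolExecutor
import torch

def parallel_forward(model, x, devices, weights):
    # 1) Replication: one independent copy of the model per device
    replicas = [copy.deepcopy(model).to(d) for d in devices]

    # 2) Batch split: proportional chunk sizes (last chunk absorbs rounding)
    n = x.shape[0]
    sizes = [int(n * w / sum(weights)) for w in weights]
    sizes[-1] = n - sum(sizes[:-1])
    chunks = list(torch.split(x, sizes))

    # 3) Thread pool: simultaneous forward passes, one thread per replica
    def run(replica, chunk, device):
        out = replica(chunk.to(device))
        if device.startswith("cuda"):
            torch.cuda.synchronize(torch.device(device))  # 4) wait for kernels to finish
        return out.to(devices[0])                         # move result to the lead device

    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        outs = list(pool.map(run, replicas, chunks, devices))

    # 5) Concatenation on the lead device
    return torch.cat(outs, dim=0)
```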
- No model parallelism (splitting layers across GPUs) - Each GPU holds the full model
- No gradient synchronization - Inference only (no training/fine-tuning)
- Static load balancing - Percentages fixed per run, no dynamic adjustment based on queue depth
- Memory overhead - Briefly uses 2× model memory per GPU during the load_state_dict phase
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
