Hi everyone! 🦞
I'm working on making PicoClaw a true multimodal orchestrator and would love your input on an architectural pattern regarding task delegation.
The Context:
The goal is to allow the agent to automatically delegate specific tasks (like analyzing an image, transcribing audio, or writing production code) to the most suitable models (e.g., Claude 3.5 for vision, Whisper for audio, etc.).
What I've implemented so far:
I've added a `capabilities: ["vision", "audio", "coding"]` field to the models in `config.json`.
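For illustration, a `config.json` entry carrying the new field could look like this (the model names and the surrounding schema are assumptions for the example, not PicoClaw's actual format):

```json
{
  "models": [
    { "name": "claude-3-5-sonnet", "capabilities": ["vision", "coding"] },
    { "name": "whisper-1", "capabilities": ["audio"] }
  ]
}
```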
I've extended `pkg/providers/types.go` with capability-specific interfaces (`VisionProvider`, `AudioProvider`, etc.), implemented only by the providers that support them, keeping the base `LLMProvider` interface clean.
I've created a universal tool, `delegate_task(capability, prompt, file_path)`. This tool reads the request, uses the Factory to instantiate the correct model, performs a safe type assertion in Go, and returns the result.
The Architectural Dilemma (Agentic Design):
How do we let the agent know which "superpowers" it has available in its cluster? I've evaluated two paths:
Approach 1: Two separate Tools (Introspection + Delegation)
Provide the agent with a `list_capabilities` tool. When the user sends a photo, the agent first uses this tool to figure out whether it has vision-capable models among the configured ones, and then uses `delegate_task`.
Pros: Very clean and untouched System Prompt.
Cons: High latency. It requires two full LLM round trips (double the time and cost) for a single action.
Approach 2: Dynamic Injection into the System Prompt (The approach I chose)
The `ContextBuilder` extracts the unique capabilities from the loaded models and dynamically injects them into the System Prompt at startup (e.g., `Available capabilities: [vision, audio]`). The agent can call `delegate_task` immediately.
Pros: Maximum responsiveness (only 1 roundtrip) and deterministic behavior. Single source of truth in `config.json`.
Cons: Consumes a few more fixed tokens in the System Prompt.
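Approach 2 could be sketched like this; `ModelConfig` and the function names are assumptions standing in for PicoClaw's real `ContextBuilder`, but the dedup-and-inject step is the whole idea:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ModelConfig mirrors the capabilities field added to config.json.
type ModelConfig struct {
	Name         string
	Capabilities []string
}

// uniqueCapabilities collects the deduplicated, sorted capability set
// across all configured models (the single source of truth).
func uniqueCapabilities(models []ModelConfig) []string {
	seen := map[string]bool{}
	for _, m := range models {
		for _, c := range m.Capabilities {
			seen[c] = true
		}
	}
	out := make([]string, 0, len(seen))
	for c := range seen {
		out = append(out, c)
	}
	sort.Strings(out) // deterministic prompt text across restarts
	return out
}

// BuildCapabilityLine renders the line injected into the System Prompt
// at startup.
func BuildCapabilityLine(models []ModelConfig) string {
	return fmt.Sprintf("Available capabilities: [%s]",
		strings.Join(uniqueCapabilities(models), ", "))
}
```

Sorting the set keeps the injected line stable, which helps both prompt caching and reproducibility when debugging agent behavior.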
Or perhaps neither approach is optimal for solving this agentic routing intelligently?
Personally, I believe Approach 2 is the right compromise between speed (low latency for the user) and clean Go code (safe type assertions).
What do you guys think? Is this the right pattern for the future of PicoClaw, or do you suggest other paths (e.g., Model Context Protocol)?