Hi everyone! 🦞
I'm working on making PicoClaw a true multimodal orchestrator and would love your input on an architectural pattern regarding task delegation.
The Context:
The goal is to allow the agent to automatically delegate specific tasks (like analyzing an image, transcribing audio, or writing production code) to the most suitable models (e.g., Claude 3.5 for vision, Whisper for audio, etc.).
What I've implemented so far:
I've added a `capabilities: ["vision", "audio", "coding"]` field to the models in `config.json`.
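For illustration, a `config.json` entry carrying the new field could look like this (the model names and the surrounding schema are assumptions for the example, not PicoClaw's actual format):

```json
{
  "models": [
    { "name": "claude-3-5-sonnet", "capabilities": ["vision", "coding"] },
    { "name": "whisper-1", "capabilities": ["audio"] }
  ]
}
```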
I've extended `pkg/providers/types.go` with capability-specific interfaces (`VisionProvider`, `AudioProvider`, etc.), implemented only by the providers that support them, keeping the base `LLMProvider` interface clean.
I've created a universal tool, `delegate_task(capability, prompt, file_path)`. This tool reads the request, uses the Factory to instantiate the correct model, performs a safe type assertion in Go, and returns the result.
The Architectural Dilemma (Agentic Design):
How do we let the agent know which "superpowers" it has available in its cluster? I've evaluated two paths:
Approach 1: Two separate Tools (Introspection + Delegation)
Provide the agent with a `list_capabilities` tool. When the user sends a photo, the agent first uses this tool to figure out whether it has vision-capable models among the configured ones, and then uses `delegate_task`.
Pros: Very clean and untouched System Prompt.
Cons: High latency. It requires two full LLM round trips (double the time and cost) for a single action.
Approach 2: Dynamic Injection into the System Prompt (The approach I chose)
The `ContextBuilder` extracts the unique capabilities from the loaded models and dynamically injects them into the System Prompt at startup (e.g., `Available capabilities: [vision, audio]`). The agent can call `delegate_task` immediately.
Pros: Maximum responsiveness (only 1 roundtrip) and deterministic behavior. Single source of truth in `config.json`.
Cons: Consumes a few more fixed tokens in the System Prompt.
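Approach 2 could be sketched like this; `ModelConfig` and the function names are assumptions standing in for PicoClaw's real `ContextBuilder`, but the dedup-and-inject step is the whole idea:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ModelConfig mirrors the capabilities field added to config.json.
type ModelConfig struct {
	Name         string
	Capabilities []string
}

// uniqueCapabilities collects the deduplicated, sorted capability set
// across all configured models (the single source of truth).
func uniqueCapabilities(models []ModelConfig) []string {
	seen := map[string]bool{}
	for _, m := range models {
		for _, c := range m.Capabilities {
			seen[c] = true
		}
	}
	out := make([]string, 0, len(seen))
	for c := range seen {
		out = append(out, c)
	}
	sort.Strings(out) // deterministic prompt text across restarts
	return out
}

// BuildCapabilityLine renders the line injected into the System Prompt
// at startup.
func BuildCapabilityLine(models []ModelConfig) string {
	return fmt.Sprintf("Available capabilities: [%s]",
		strings.Join(uniqueCapabilities(models), ", "))
}
```

Sorting the set keeps the injected line stable, which helps both prompt caching and reproducibility when debugging agent behavior.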
Or perhaps neither approach is optimal for solving this agentic routing intelligently?
Personally, I believe Approach 2 is the right compromise between speed (low latency for the user) and clean Go code (safe type assertions).
What do you guys think? Is this the right pattern for the future of PicoClaw, or do you suggest other paths (e.g., Model Context Protocol)?