🎯 Objective
Extend the existing WebSearch functionality to support domain-level document crawling with LLM-optimized content discovery. This feature prioritizes AI-friendly indexes (like llms.txt) while providing fallback mechanisms for comprehensive content coverage.
📋 Background & Requirements
Why This Feature?
- LLM Application Needs: AI agents require domain-level knowledge extraction from websites
- Emerging Standards: Support for llms.txt and other AI-optimized content indexes
- Existing Pain Points: Current WebSearch only handles single queries, lacks domain-level discovery
- AI/MCP Priority: Designed primarily for AI agents and MCP tool calls
Design Principles
- Reuse Existing Infrastructure: Extend WebSearchRequest without new HTTP endpoints
- Minimize Parameters: Keep interface simple with smart defaults
- Progressive Discovery: Use priority-based content discovery strategies
- Flexible Control: Allow users to control domain scope precisely
🏗️ Technical Design
Discovery Strategy (Fixed Priority Order)
Level 1 - LLM-Optimized Indexes (Highest Priority):
├── /llms.txt
├── /.well-known/llms.txt
├── /llms-full.txt
└── /.well-known/llms-full.txt
Level 2 - Search Engine Discovery (Fallback):
├── DuckDuckGo site:domain search
├── DuckDuckGo site:domain + intent keywords
└── Comprehensive coverage including deep pages
Why This Two-Level Approach?
- Level 1: AI-era standards, highest content quality
- Level 2: Search engines already index based on sitemaps and more, broader coverage
- Simplified: Removed sitemap.xml level to avoid redundancy with search engines
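The fixed priority order above can be sketched as a small URL and query generator. This is only an illustration: the function names `candidate_index_urls` and `site_queries` are hypothetical, and defaulting to `https` is an assumption the issue does not state.

```python
from typing import List

# Level 1 paths, in the fixed priority order listed above.
LLMS_INDEX_PATHS = (
    "/llms.txt",
    "/.well-known/llms.txt",
    "/llms-full.txt",
    "/.well-known/llms-full.txt",
)


def candidate_index_urls(domain: str, scheme: str = "https") -> List[str]:
    """Return the Level 1 URLs to probe for a bare domain, highest priority first."""
    host = domain.strip().rstrip("/")
    return [f"{scheme}://{host}{path}" for path in LLMS_INDEX_PATHS]


def site_queries(domain: str, query: str = "") -> List[str]:
    """Return the Level 2 queries: a plain site: search, then site: plus keywords."""
    base = f"site:{domain.strip()}"
    queries = [base]
    if query.strip():
        queries.append(f"{base} {query.strip()}")
    return queries


print(candidate_index_urls("vercel.com"))
print(site_queries("vercel.com", "deployment guides"))
```

Keeping the paths in one ordered tuple means the priority order lives in a single place and can be asserted in a unit test.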
API Interface Extension
WebSearchRequest (New Fields):
# Existing fields remain unchanged
query: string
max_results: int = 5
search_engine: string = "duckduckgo"
timeout: int = 30
locale: string = "zh-CN"
# New fields
sources?: string[] # Domain and URL list for unified processing
use_sources_domain_only: bool = false # Strict domain limitation flag
Processing Logic
Unified Sources Processing
For each item in sources:
1. Extract domain:
   - If URL: extract domain part (e.g., "vercel.com" from "https://vercel.com/docs/llms.txt")
   - If domain: use directly
2. Content discovery (fixed order):
   - Try domain/llms.txt
   - Try domain/.well-known/llms.txt
   - Try domain/llms-full.txt
   - Try domain/.well-known/llms-full.txt
   - DuckDuckGo site:domain search
Domain Scope Control
use_sources_domain_only behavior:
- sources empty: use_sources_domain_only has no effect, normal search
- sources provided + use_sources_domain_only=false:
  * Process sources domains first
  * DuckDuckGo may return results from other domains
- sources provided + use_sources_domain_only=true:
  * Only process domains from sources
  * Filter out all results from non-sources domains
📝 Usage Examples
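Every example below relies on the same two steps described above: normalizing each `sources` entry to a domain, then applying the `use_sources_domain_only` rules. A minimal sketch follows. The helper names are hypothetical, and treating subdomains of a listed domain as in scope is an assumption, since the issue does not say whether matching is exact.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def extract_domain(source: str) -> str:
    """Return the lowercased host for either a full URL or a bare domain."""
    text = source.strip()
    # urlsplit only fills in .hostname when a scheme or leading "//" is present,
    # so bare domains such as "vercel.com" get a "//" prefix first.
    parts = urlsplit(text if "://" in text else f"//{text}")
    if not parts.hostname:
        raise ValueError(f"cannot extract a domain from {source!r}")
    return parts.hostname.lower()


def in_scope(url: str, allowed: Iterable[str]) -> bool:
    """True when the URL's host equals an allowed domain or is a subdomain of one.

    Subdomain matching is an assumption, not something the issue specifies.
    """
    host = extract_domain(url)
    return any(host == d or host.endswith("." + d) for d in allowed)


def apply_scope(
    result_urls: List[str], sources: List[str], use_sources_domain_only: bool
) -> List[str]:
    """Apply the three cases described above to a list of result URLs."""
    if not sources or not use_sources_domain_only:
        return list(result_urls)  # empty sources or flag off: nothing is filtered
    allowed = {extract_domain(s) for s in sources}
    return [u for u in result_urls if in_scope(u, allowed)]


results = ["https://docs.python.org/3/tutorial/", "https://example.com/python"]
print(apply_scope(results, ["fastapi.tiangolo.com", "docs.python.org"], True))
# → ['https://docs.python.org/3/tutorial/']
```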
Example 1: Mixed Domains and URLs with Expansion
Request:
query: "API documentation"
sources: [
"vercel.com",
"https://modelcontextprotocol.io/llms-full.txt",
"docs.anthropic.com"
]
use_sources_domain_only: false
Behavior:
- Process: vercel.com, modelcontextprotocol.io, docs.anthropic.com
- DuckDuckGo may return results from additional relevant domains
Example 2: Strict Domain Limitation
Request:
query: "Python tutorials"
sources: ["fastapi.tiangolo.com", "docs.python.org"]
use_sources_domain_only: true
Behavior:
- Only returns content from fastapi.tiangolo.com and docs.python.org
- All other domain results filtered out
Example 3: Backward Compatibility
Request:
query: "machine learning"
sources: []
use_sources_domain_only: false
Behavior:
- use_sources_domain_only has no effect when sources is empty
- Works exactly like current WebSearch behavior
Example 4: Direct llms.txt with Domain Discovery
Request:
query: "deployment guides"
sources: ["https://vercel.com/docs/llms.txt"]
use_sources_domain_only: false
Behavior:
- Read specified llms.txt URL directly
- Also perform full discovery on vercel.com domain
- DuckDuckGo may return results from other domains
🔧 Implementation Layers
Service Layer (WebSearchService)
from typing import List, Optional

class WebSearchService:
    async def search_with_domain_discovery(
        self,
        query: str,
        sources: Optional[List[str]] = None,
        use_sources_domain_only: bool = False,
        **kwargs
    ) -> WebSearchResponse:
        # 1. Extract domains from sources
        # 2. Try llms.txt discovery for each domain
        # 3. Fall back to DuckDuckGo site: search
        # 4. Apply domain filtering if use_sources_domain_only=True
        ...
HTTP Layer (WebSearchRequest)
class WebSearchRequest(BaseModel):
    # ... existing fields ...
    sources: Optional[List[str]] = None
    use_sources_domain_only: bool = False
MCP Layer
from typing import List, Optional

# MCP tool will automatically support new parameters
@tool
def web_search(
    query: str,
    sources: Optional[List[str]] = None,
    use_sources_domain_only: bool = False,
    **kwargs
) -> dict:
    # Direct mapping to service layer
    ...
✅ Acceptance Criteria
- Parameter Support: WebSearchRequest accepts sources and use_sources_domain_only
- Domain Extraction: Correctly extract domains from both URLs and domain strings
- llms.txt Discovery: Try all 4 llms.txt path patterns for each domain
- Search Engine Integration: Use existing DuckDuckGo provider for site: searches
- Domain Filtering: Respect use_sources_domain_only flag for result filtering
- Backward Compatibility: Existing WebSearch behavior unchanged when new parameters not used
- Error Handling: Graceful fallback when llms.txt files are not found or inaccessible
- Content Quality: Maintain existing content extraction and processing quality
- Rate Limiting: Respect existing rate limiting and timeout configurations
- MCP Integration: New parameters automatically available in MCP tools
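The discovery and error-handling criteria above could be met with a loop along these lines. This is a sketch under stated assumptions, not the project's implementation: `fetch` and `search` are hypothetical callables standing in for the real HTTP client and DuckDuckGo provider, and the site: search is assumed to run only when none of the four index paths yields content, which is one reading of "fallback".

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical callables standing in for the real HTTP client and search provider.
Fetch = Callable[[str], Optional[str]]  # returns body text, or None if missing
Search = Callable[[str], List[str]]     # returns result URLs for a query

INDEX_PATHS = (
    "/llms.txt",
    "/.well-known/llms.txt",
    "/llms-full.txt",
    "/.well-known/llms-full.txt",
)


def discover(domain: str, fetch: Fetch, search: Search) -> Tuple[str, List[str]]:
    """Probe all four index paths in order; fall back to a site: search if none exist."""
    found: List[str] = []
    for path in INDEX_PATHS:
        url = f"https://{domain}{path}"
        try:
            body = fetch(url)
        except Exception:
            body = None  # unreachable or timed out: treat as not found
        if body:
            found.append(url)
    if found:
        return "llms_index", found
    # Assumption: search runs only when no index was found.
    return "search", search(f"site:{domain}")
```

Injecting `fetch` and `search` keeps the fallback logic testable without network access, which also fits the unit-test items listed under Testing Strategy.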
🧪 Testing Strategy
Unit Tests
- Domain extraction from various URL formats
- llms.txt path generation and validation
- Domain filtering logic
- Error handling for unreachable domains
Integration Tests
- End-to-end domain discovery workflows
- DuckDuckGo site: search integration
- Content extraction and quality validation
- MCP tool parameter passing
Edge Cases
- Empty sources list behavior
- Invalid domain/URL formats
- Network timeouts and failures
- Large domain lists performance
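For the timeout and large-list cases above, one option is to cap concurrency and isolate failures per domain. The sketch below uses only the standard `asyncio` library. `process_sources` and `discover_one` are hypothetical names, the cap of 5 is an arbitrary assumption, and the 30-second default simply mirrors the existing `timeout` field.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


async def process_sources(
    domains: List[str],
    discover_one: Callable[[str], Awaitable[List[str]]],  # hypothetical per-domain coroutine
    max_concurrency: int = 5,
    timeout: float = 30.0,
) -> Dict[str, Optional[List[str]]]:
    """Run discovery per domain with a concurrency cap; failed domains map to None."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(domain: str) -> Optional[List[str]]:
        async with semaphore:
            try:
                return await asyncio.wait_for(discover_one(domain), timeout)
            except Exception:
                return None  # timeout or network error: skip this domain only

    unique = list(dict.fromkeys(domains))  # drop duplicates, keep first-seen order
    results = await asyncio.gather(*(guarded(d) for d in unique))
    return dict(zip(unique, results))
```

A test can pass a fake `discover_one` that sleeps or raises for chosen domains to confirm that one slow or failing domain does not block the rest.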
🚀 Benefits
- Enhanced AI Capabilities: Better domain-level knowledge extraction for LLM applications
- Future-Ready: Support for emerging AI content standards (llms.txt)
- Flexible Control: Users can precisely control content scope and sources
- Minimal Disruption: Extends existing interface without breaking changes
- Intelligent Fallbacks: Progressive discovery ensures comprehensive coverage
- Production Ready: Built on existing, battle-tested WebSearch infrastructure