[Features] Add Domain-Level Document Discovery for WebSearch #1097

@earayu

🎯 Objective

Extend the existing WebSearch functionality to support domain-level document crawling with LLM-optimized content discovery. This feature prioritizes AI-friendly indexes (like llms.txt) while providing fallback mechanisms for comprehensive content coverage.

📋 Background & Requirements

Why This Feature?

  • LLM Application Needs: AI agents require domain-level knowledge extraction from websites
  • Emerging Standards: Support for llms.txt and other AI-optimized content indexes
  • Existing Pain Points: Current WebSearch handles only single queries and lacks domain-level discovery
  • AI/MCP Priority: Designed primarily for AI agents and MCP tool calls

Design Principles

  • Reuse Existing Infrastructure: Extend WebSearchRequest without new HTTP endpoints
  • Minimize Parameters: Keep interface simple with smart defaults
  • Progressive Discovery: Use priority-based content discovery strategies
  • Flexible Control: Allow users to control domain scope precisely

🏗️ Technical Design

Discovery Strategy (Fixed Priority Order)

Level 1 - LLM-Optimized Indexes (Highest Priority):
├── /llms.txt
├── /.well-known/llms.txt  
├── /llms-full.txt
└── /.well-known/llms-full.txt

Level 2 - Search Engine Discovery (Fallback):
├── DuckDuckGo site:domain search
├── DuckDuckGo site:domain + intent keywords
└── Comprehensive coverage including deep pages

Why This Two-Level Approach?

  • Level 1: AI-era standards, highest content quality
  • Level 2: Search engines already index sites from sitemaps and other signals, so they provide broader coverage
  • Simplified: Removed sitemap.xml level to avoid redundancy with search engines
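
The two levels above can be sketched as a pair of small helpers. This is an illustration only, not the project's implementation: the helper names, the default https scheme, and the choice to use the user's query as the "intent keywords" are all assumptions.

```python
from typing import List

# Level 1 paths, in the fixed priority order listed above.
LLMS_TXT_PATHS = (
    "/llms.txt",
    "/.well-known/llms.txt",
    "/llms-full.txt",
    "/.well-known/llms-full.txt",
)


def llms_txt_candidates(domain: str, scheme: str = "https") -> List[str]:
    """Return the Level 1 URLs to probe for a domain, highest priority first."""
    host = domain.strip().rstrip("/")
    return [f"{scheme}://{host}{path}" for path in LLMS_TXT_PATHS]


def site_queries(domain: str, query: str = "") -> List[str]:
    """Return Level 2 search strings: a bare site: query, then one with keywords."""
    base = f"site:{domain}"
    keywords = query.strip()  # assumption: intent keywords come from the user's query
    return [base, f"{base} {keywords}"] if keywords else [base]
```

Keeping the path list in one constant makes the priority order explicit and easy to test.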

API Interface Extension

WebSearchRequest (New Fields):
  # Existing fields remain unchanged
  query: string
  max_results: int = 5
  search_engine: string = "duckduckgo"  
  timeout: int = 30
  locale: string = "zh-CN"
  
  # New fields
  sources?: string[]                    # Domain and URL list for unified processing
  use_sources_domain_only: bool = false # Strict domain limitation flag

Processing Logic

Unified Sources Processing

For each item in sources:
  1. Extract domain:
     - If URL: extract domain part (e.g., "vercel.com" from "https://vercel.com/docs/llms.txt");
       the URL itself is also read directly (see Example 4)
     - If domain: use directly
     
  2. Content discovery (fixed order):
     - Try domain/llms.txt
     - Try domain/.well-known/llms.txt  
     - Try domain/llms-full.txt
     - Try domain/.well-known/llms-full.txt
     - DuckDuckGo site:domain search
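
Step 1 (domain extraction) might look like the sketch below, which uses only the standard library. The function name extract_domain and the choice to return an empty string for unusable input are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def extract_domain(source: str) -> str:
    """Return the lowercase host for a URL or bare domain, or '' if none is found."""
    text = source.strip()
    if not text:
        return ""
    # urlsplit only populates hostname when a "//" authority marker is present,
    # so bare domains such as "vercel.com" get one prepended.
    if "://" not in text:
        text = "//" + text
    try:
        host = urlsplit(text).hostname
    except ValueError:
        # Raised for malformed authorities, e.g. an unbalanced IPv6 bracket.
        return ""
    return (host or "").lower()
```

With this helper, both "vercel.com" and "https://vercel.com/docs/llms.txt" map to the same domain, which is what lets sources mix URLs and bare domains.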

Domain Scope Control

use_sources_domain_only behavior:
  - sources empty: use_sources_domain_only has no effect, normal search
  - sources provided + use_sources_domain_only=false: 
    * Process sources domains first
    * DuckDuckGo may return results from other domains
  - sources provided + use_sources_domain_only=true:
    * Only process domains from sources
    * Filter out all results from non-sources domains
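
The filtering rule can be expressed as one small function. This sketch assumes each result is a dict with a "url" key and that matching is on the exact host (the issue does not say whether subdomains of a source domain should count); filter_results is a hypothetical name.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def filter_results(
    results: Iterable[Dict[str, str]],
    source_domains: Iterable[str],
    use_sources_domain_only: bool,
) -> List[Dict[str, str]]:
    """Apply the use_sources_domain_only rule to a list of search results."""
    allowed = {d.lower() for d in source_domains}
    items = list(results)
    # Empty sources, or the flag switched off: nothing is filtered.
    if not allowed or not use_sources_domain_only:
        return items
    kept = []
    for item in items:
        host = (urlsplit(item["url"]).hostname or "").lower()
        if host in allowed:  # assumption: exact host match, no subdomain expansion
            kept.append(item)
    return kept
```

Checking the empty-sources case first keeps the backward-compatible path trivially unchanged.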

📝 Usage Examples

Example 1: Mixed Domains and URLs with Expansion

Request:
  query: "API documentation"
  sources: [
    "vercel.com",
    "https://modelcontextprotocol.io/llms-full.txt",
    "docs.anthropic.com"
  ]
  use_sources_domain_only: false

Behavior:
  - Process: vercel.com, modelcontextprotocol.io, docs.anthropic.com
  - DuckDuckGo may return additional relevant domains

Example 2: Strict Domain Limitation

Request:
  query: "Python tutorials"
  sources: ["fastapi.tiangolo.com", "docs.python.org"]  
  use_sources_domain_only: true

Behavior:
  - Only returns content from fastapi.tiangolo.com and docs.python.org
  - All other domain results filtered out

Example 3: Backward Compatibility

Request:
  query: "machine learning"
  sources: []
  use_sources_domain_only: false

Behavior:
  - use_sources_domain_only has no effect when sources is empty
  - Works exactly like current WebSearch behavior

Example 4: Direct llms.txt URL with Domain Discovery

Request:
  query: "deployment guides"
  sources: ["https://vercel.com/docs/llms.txt"]
  use_sources_domain_only: false

Behavior:
  - Read specified llms.txt URL directly
  - Also perform full discovery on vercel.com domain
  - DuckDuckGo may return results from other domains

🔧 Implementation Layers

Service Layer (WebSearchService)

from typing import List, Optional

class WebSearchService:
    async def search_with_domain_discovery(
        self,
        query: str,
        sources: Optional[List[str]] = None,
        use_sources_domain_only: bool = False,
        **kwargs
    ) -> "WebSearchResponse":
        # 1. Extract domains from sources
        # 2. Try llms.txt discovery for each domain
        # 3. Fall back to DuckDuckGo site: search
        # 4. Apply domain filtering if use_sources_domain_only is True
        raise NotImplementedError  # design stub; the steps above are still to be implemented
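
Steps 2 and 3 for a single domain could be sketched as follows. This is not the actual service code: the fetch_text and site_search callables are hypothetical stand-ins for the existing HTTP client and DuckDuckGo provider, the return shape is invented for illustration, and stopping at the first llms.txt hit is an assumption (the issue only fixes the order).

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

LLMS_TXT_PATHS = ("/llms.txt", "/.well-known/llms.txt", "/llms-full.txt", "/.well-known/llms-full.txt")

# Hypothetical injected dependencies:
FetchText = Callable[[str], Awaitable[Optional[str]]]    # returns None when the file is missing
SiteSearch = Callable[[str, str], Awaitable[List[str]]]  # (domain, query) -> result URLs


async def discover_domain(
    domain: str, query: str, fetch_text: FetchText, site_search: SiteSearch
) -> dict:
    """Probe the llms.txt paths in order; fall back to a site: search if none respond."""
    for path in LLMS_TXT_PATHS:
        url = f"https://{domain}{path}"
        try:
            body = await fetch_text(url)
        except Exception:
            body = None  # treat network errors like a missing file and keep going
        if body:
            return {"domain": domain, "via": "llms.txt", "url": url, "content": body}
    urls = await site_search(domain, query)
    return {"domain": domain, "via": "search", "urls": urls}
```

Passing the fetcher and search provider in as arguments keeps the discovery order testable without network access, for example by running asyncio.run(discover_domain(...)) with in-memory fakes.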

HTTP Layer (WebSearchRequest)

from typing import List, Optional
from pydantic import BaseModel

class WebSearchRequest(BaseModel):
    # ... existing fields ...
    sources: Optional[List[str]] = None
    use_sources_domain_only: bool = False

MCP Layer

# MCP tool will automatically support new parameters
# (assumes `tool`, `List`, and `Optional` are already imported)
@tool
def web_search(
    query: str,
    sources: Optional[List[str]] = None,
    use_sources_domain_only: bool = False,
    **kwargs
) -> dict:
    # Direct mapping to service layer
    ...

Acceptance Criteria

  1. Parameter Support: WebSearchRequest accepts sources and use_sources_domain_only
  2. Domain Extraction: Correctly extract domains from both URLs and domain strings
  3. llms.txt Discovery: Try all 4 llms.txt path patterns for each domain
  4. Search Engine Integration: Use existing DuckDuckGo provider for site: searches
  5. Domain Filtering: Respect use_sources_domain_only flag for result filtering
  6. Backward Compatibility: Existing WebSearch behavior is unchanged when the new parameters are not used
  7. Error Handling: Graceful fallback when llms.txt files are not found or are inaccessible
  8. Content Quality: Maintain existing content extraction and processing quality
  9. Rate Limiting: Respect existing rate limiting and timeout configurations
  10. MCP Integration: New parameters automatically available in MCP tools

🧪 Testing Strategy

Unit Tests

  • Domain extraction from various URL formats
  • llms.txt path generation and validation
  • Domain filtering logic
  • Error handling for unreachable domains

Integration Tests

  • End-to-end domain discovery workflows
  • DuckDuckGo site: search integration
  • Content extraction and quality validation
  • MCP tool parameter passing

Edge Cases

  • Empty sources list behavior
  • Invalid domain/URL formats
  • Network timeouts and failures
  • Performance with large domain lists

🚀 Benefits

  1. Enhanced AI Capabilities: Better domain-level knowledge extraction for LLM applications
  2. Future-Ready: Support for emerging AI content standards (llms.txt)
  3. Flexible Control: Users can precisely control content scope and sources
  4. Minimal Disruption: Extends existing interface without breaking changes
  5. Intelligent Fallbacks: Progressive discovery ensures comprehensive coverage
  6. Production Ready: Built on existing, battle-tested WebSearch infrastructure
