[Features] Add Domain-Level Document Discovery for WebSearch #1097

## 🎯 **Objective**
Extend the existing WebSearch functionality to support domain-level document crawling with LLM-optimized content discovery. This feature prioritizes AI-friendly indexes (like llms.txt) while providing fallback mechanisms for comprehensive content coverage.

## 📋 **Background & Requirements**

### **Why This Feature?**
- **LLM Application Needs**: AI agents require domain-level knowledge extraction from websites
- **Emerging Standards**: Support for llms.txt and other AI-optimized content indexes
- **Existing Pain Points**: Current WebSearch handles only single queries and lacks domain-level discovery
- **AI/MCP Priority**: Designed primarily for AI agents and MCP tool calls

### **Design Principles**
- **Reuse Existing Infrastructure**: Extend WebSearchRequest without new HTTP endpoints
- **Minimize Parameters**: Keep interface simple with smart defaults
- **Progressive Discovery**: Use priority-based content discovery strategies
- **Flexible Control**: Allow users to control domain scope precisely

## 🏗️ **Technical Design**

### **Discovery Strategy (Fixed Priority Order)**
```
Level 1 - LLM-Optimized Indexes (Highest Priority):
├── /llms.txt
├── /.well-known/llms.txt  
├── /llms-full.txt
└── /.well-known/llms-full.txt

Level 2 - Search Engine Discovery (Fallback):
├── DuckDuckGo site:domain search
├── DuckDuckGo site:domain + intent keywords
└── Comprehensive coverage including deep pages
```

**Why This Two-Level Approach?**
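- **Level 1**: llms.txt-style indexes are written for LLM consumption, so they offer the highest content quality when a site provides them
- **Level 2**: Search engines already index sites from sitemaps and other signals, so they give broader coverage
- **Simplified**: A separate sitemap.xml level was removed to avoid redundancy with search engines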
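
As a rough illustration of Level 1, the four index paths can be expanded into candidate URLs in priority order. This is only a sketch: the helper name `candidate_index_urls` and the `https` default are assumptions, not part of the existing WebSearch code.

```python
from typing import List

# Level 1 paths, in the fixed priority order listed above.
LLMS_TXT_PATHS = (
    "/llms.txt",
    "/.well-known/llms.txt",
    "/llms-full.txt",
    "/.well-known/llms-full.txt",
)


def candidate_index_urls(domain: str, scheme: str = "https") -> List[str]:
    """Return the Level 1 index URLs for a domain, highest priority first.

    Hypothetical helper; the default scheme is an assumption.
    """
    host = domain.strip().rstrip("/")
    return [f"{scheme}://{host}{path}" for path in LLMS_TXT_PATHS]


for url in candidate_index_urls("vercel.com"):
    print(url)
```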

### **API Interface Extension**

```yaml
WebSearchRequest (New Fields):
  # Existing fields remain unchanged
  query: string
  max_results: int = 5
  search_engine: string = "duckduckgo"  
  timeout: int = 30
  locale: string = "zh-CN"
  
  # New fields
  sources?: string[]                    # Domain and URL list for unified processing
  use_sources_domain_only: bool = false # Strict domain limitation flag
```

### **Processing Logic**

#### **Unified Sources Processing**
```yaml
For each item in sources:
  1. Extract domain:
     - If URL: extract domain part (e.g., "vercel.com" from "https://vercel.com/docs/llms.txt"); a URL pointing directly at an llms.txt file is also read as-is (see Example 4)
     - If domain: use directly
     
  2. Content discovery (fixed order):
     - Try domain/llms.txt
     - Try domain/.well-known/llms.txt  
     - Try domain/llms-full.txt
     - Try domain/.well-known/llms-full.txt
     - DuckDuckGo site:domain search
```
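
Step 1 could be implemented with the standard library's `urllib.parse`. The sketch below is one possible approach; the function name `extract_domain` and the decision to raise `ValueError` on unusable input are assumptions.

```python
from urllib.parse import urlsplit


def extract_domain(source: str) -> str:
    """Return the lowercase host for a URL or a bare domain string.

    Hypothetical helper illustrating step 1 of the processing logic.
    """
    text = source.strip()
    # urlsplit only populates hostname when a scheme is present,
    # so bare domains get a placeholder scheme first.
    if "://" not in text:
        text = "https://" + text
    host = urlsplit(text).hostname
    if not host:
        raise ValueError(f"cannot extract a domain from {source!r}")
    return host


print(extract_domain("https://vercel.com/docs/llms.txt"))
print(extract_domain("docs.anthropic.com"))
```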

#### **Domain Scope Control**
```yaml
use_sources_domain_only behavior:
  - sources empty: use_sources_domain_only has no effect, normal search
  - sources provided + use_sources_domain_only=false: 
    * Process sources domains first
    * DuckDuckGo may return results from other domains
  - sources provided + use_sources_domain_only=true:
    * Only process domains from sources
    * Filter out all results from non-sources domains
```
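
The three cases above reduce to a small filter over the result list. The sketch assumes each result is a dict with a `url` key and that matching is by exact host; whether subdomains of a source should also pass is not specified in this issue and is left open.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def filter_by_domains(
    results: Iterable[Dict[str, str]],
    allowed_domains: Iterable[str],
    use_sources_domain_only: bool,
) -> List[Dict[str, str]]:
    """Drop results outside allowed_domains when strict mode applies.

    Hypothetical helper; the result shape ({"url": ...}) is an assumption.
    """
    allowed = {d.lower() for d in allowed_domains}
    items = list(results)
    # Empty sources or a false flag: behave like a normal search.
    if not allowed or not use_sources_domain_only:
        return items
    return [r for r in items if (urlsplit(r["url"]).hostname or "") in allowed]


hits = [
    {"url": "https://docs.python.org/3/tutorial/"},
    {"url": "https://example.com/python"},
]
print(filter_by_domains(hits, ["docs.python.org"], use_sources_domain_only=True))
```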

## 📝 **Usage Examples**

### **Example 1: Mixed Domains and URLs with Expansion**
```yaml
Request:
  query: "API documentation"
  sources: [
    "vercel.com",
    "https://modelcontextprotocol.io/llms-full.txt",
    "docs.anthropic.com"
  ]
  use_sources_domain_only: false

Behavior:
  - Process: vercel.com, modelcontextprotocol.io, docs.anthropic.com
  - DuckDuckGo may return additional relevant domains
```

### **Example 2: Strict Domain Limitation**
```yaml
Request:
  query: "Python tutorials"
  sources: ["fastapi.tiangolo.com", "docs.python.org"]  
  use_sources_domain_only: true

Behavior:
  - Only returns content from fastapi.tiangolo.com and docs.python.org
  - All other domain results filtered out
```
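
For a strict request like this one, the Level 2 fallback could issue the two query forms from the discovery strategy for each domain. The sketch assumes the request's `query` serves as the "intent keywords"; `site_queries` is a hypothetical name.

```python
from typing import List


def site_queries(domain: str, intent_keywords: str = "") -> List[str]:
    """Return the Level 2 queries for one domain.

    Always includes a bare site: query, plus a keyword-qualified
    query when keywords are given. Hypothetical helper.
    """
    queries = [f"site:{domain}"]
    keywords = intent_keywords.strip()
    if keywords:
        queries.append(f"site:{domain} {keywords}")
    return queries


for domain in ["fastapi.tiangolo.com", "docs.python.org"]:
    print(site_queries(domain, "Python tutorials"))
```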

### **Example 3: Backward Compatibility**
```yaml
Request:
  query: "machine learning"
  sources: []
  use_sources_domain_only: false

Behavior:
  - use_sources_domain_only has no effect when sources is empty
  - Works exactly like current WebSearch behavior
```

### **Example 4: Direct llms.txt URL with Domain Discovery**
```yaml
Request:
  query: "deployment guides"
  sources: ["https://vercel.com/docs/llms.txt"]
  use_sources_domain_only: false

Behavior:
  - Read specified llms.txt URL directly
  - Also perform full discovery on vercel.com domain
  - DuckDuckGo may return results from other domains
```
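
One way to get this behavior is to split `sources` into index URLs that are read directly and domains that go through full discovery. The sketch below is an assumption about how that split might look; `split_sources` and the file-name check are illustrative only.

```python
from typing import List, Tuple
from urllib.parse import urlsplit

INDEX_FILENAMES = ("llms.txt", "llms-full.txt")


def split_sources(sources: List[str]) -> Tuple[List[str], List[str]]:
    """Return (direct index URLs, domains to run full discovery on).

    Hypothetical helper; entries without a recognizable host are skipped.
    """
    direct_urls: List[str] = []
    domains: List[str] = []
    for source in sources:
        text = source.strip()
        has_scheme = "://" in text
        parts = urlsplit(text if has_scheme else "https://" + text)
        if not parts.hostname:
            continue
        # A full URL whose last path segment is an index file is read as-is.
        if has_scheme and parts.path.rsplit("/", 1)[-1] in INDEX_FILENAMES:
            direct_urls.append(text)
        # Every source also contributes its domain, without duplicates.
        if parts.hostname not in domains:
            domains.append(parts.hostname)
    return direct_urls, domains


print(split_sources(["https://vercel.com/docs/llms.txt"]))
```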

## 🔧 **Implementation Layers**

### **Service Layer (WebSearchService)**
```python
from typing import List, Optional


class WebSearchService:
    async def search_with_domain_discovery(
        self,
        query: str,
        sources: Optional[List[str]] = None,
        use_sources_domain_only: bool = False,
        **kwargs,
    ) -> "WebSearchResponse":  # existing response model
        # 1. Extract domains from sources
        # 2. Try llms.txt discovery for each domain
        # 3. Fall back to DuckDuckGo site: search
        # 4. Apply domain filtering if use_sources_domain_only is True
        raise NotImplementedError
```

### **HTTP Layer (WebSearchRequest)**
```python
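from typing import List, Optional
from pydantic import BaseModel  # assumes pydantic, as implied by BaseModel

class WebSearchRequest(BaseModel):
    # ... existing fields ...
    sources: Optional[List[str]] = None  # domains and/or URLs
    use_sources_domain_only: bool = False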
```

### **MCP Layer**
```python
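from typing import List, Optional

# MCP tool will automatically support new parameters.
# `tool` is the decorator provided by the project's MCP framework.
@tool
def web_search(
    query: str,
    sources: Optional[List[str]] = None,
    use_sources_domain_only: bool = False,
    **kwargs,
) -> dict:
    # Direct mapping to service layer
    ...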
```

## ✅ **Acceptance Criteria**

1. **Parameter Support**: WebSearchRequest accepts `sources` and `use_sources_domain_only`
2. **Domain Extraction**: Correctly extract domains from both URLs and domain strings
3. **llms.txt Discovery**: Try all 4 llms.txt path patterns for each domain
4. **Search Engine Integration**: Use existing DuckDuckGo provider for site: searches
5. **Domain Filtering**: Respect `use_sources_domain_only` flag for result filtering
6. **Backward Compatibility**: Existing WebSearch behavior is unchanged when the new parameters are not used
7. **Error Handling**: Graceful fallback when llms.txt files are missing or inaccessible
8. **Content Quality**: Maintain existing content extraction and processing quality
9. **Rate Limiting**: Respect existing rate limiting and timeout configurations
10. **MCP Integration**: New parameters automatically available in MCP tools

## 🧪 **Testing Strategy**

### **Unit Tests**
- Domain extraction from various URL formats
- llms.txt path generation and validation
- Domain filtering logic
- Error handling for unreachable domains

### **Integration Tests**
- End-to-end domain discovery workflows
- DuckDuckGo site: search integration
- Content extraction and quality validation
- MCP tool parameter passing

### **Edge Cases**
- Empty sources list behavior
- Invalid domain/URL formats
- Network timeouts and failures
- Performance with large domain lists

## 🚀 **Benefits**

1. **Enhanced AI Capabilities**: Better domain-level knowledge extraction for LLM applications
2. **Future-Ready**: Support for emerging AI content standards (llms.txt)
3. **Flexible Control**: Users can precisely control content scope and sources
4. **Minimal Disruption**: Extends existing interface without breaking changes
5. **Intelligent Fallbacks**: Progressive discovery ensures comprehensive coverage
6. **Production Ready**: Built on existing, battle-tested WebSearch infrastructure
