Skip to content

Releases: spider-rs/spider

v2.45.20

05 Feb 21:04

Choose a tag to compare

What's New

Relevance Gate for Remote Multimodal Crawling

Added a relevance_gate config that instructs the LLM to return a "relevant": true|false field in its JSON response. When a page is deemed irrelevant, its wildcard budget credit is refunded so the crawler discovers more relevant content.

New config fields:

  • relevance_gate: bool — enables the feature
  • relevance_prompt: Option<String> — optional custom relevance criteria

How it works:

  1. When enabled, the system prompt instructs the LLM to include "relevant": true|false
  2. If the model returns false, a budget credit is atomically accumulated
  3. Credits are drained in the crawl loop to restore the wildcard budget
  4. Default fallback is true (assume relevant) if the model omits the field

Example:

let cfgs = RemoteMultimodalConfigs::new(api_url, model)
    .with_relevance_gate(Some("Only pages about Rust programming".into()));

Full Changelog

  • feat(agent): add relevance_gate and relevance_prompt to RemoteMultimodalConfig
  • feat(agent): add atomic relevance_credits counter to RemoteMultimodalConfigs
  • feat(agent): add relevant: Option<bool> to AutomationResult and AutomationResults
  • feat(agent): extend system prompt and extraction with relevance gate instructions
  • feat(spider): add restore_wildcard_budget() for budget refund
  • feat(spider): drain relevance credits in crawl loop dequeue

v2.44.13

05 Feb 19:07

Choose a tag to compare

What's New

  • Spider Cloud integration (spider_cloud feature) — optional proxy rotation, anti-bot bypass, and intelligent fallback via spider.cloud
    • Modes: Proxy, Api, Unblocker, Fallback, Smart
    • Smart mode auto-detects Cloudflare challenges, CAPTCHAs, and bot protection then retries via /unblocker
  • S3 skills loading (skills_s3 feature) — load agent skills from S3-compatible storage (AWS, MinIO, R2)
  • CLI: --spider-cloud-key and --spider-cloud-mode flags

Crates

  • spider v2.44.13
  • spider_agent v2.44.13
  • spider_cli v2.44.13
  • spider_utils v2.44.13
  • spider_worker v2.44.13

spider v2.43.20

03 Feb 14:37

Choose a tag to compare

Spider v2.43.20

Changes

  • fix(spider): Fix doctest and update chromey for adblock compatibility
  • fix(search): Use reqwest::Client directly for cache feature compatibility
  • chore(spider): Update spider_agent dependency to 0.4

spider_agent Integration

The agent feature now uses spider_agent v0.4.0, which includes:

  • Smart caching with size-aware LRU eviction
  • High-performance chain execution with parallel step support
  • Batch processing for multiple items
  • Prefetch management for predictive page loading
  • Smart model routing based on task complexity

Full Changelog

v2.43.19...v2.43.20

spider_agent v0.4.0

03 Feb 14:34

Choose a tag to compare

Spider Agent v0.4.0

Performance Optimizations

This release adds several performance optimizations for automation workflows:

Smart Caching

  • SmartCache: Size-aware LRU cache with automatic cleanup
    • Bounded memory usage with configurable limits
    • TTL-based expiration
    • Automatic cleanup on memory pressure
    • Statistics tracking (hits, misses, evictions)

High-Performance Execution

  • ChainExecutor: Parallel step execution for automation chains

    • Analyzes dependencies for optimal parallelization
    • Response caching with TTL
    • Configurable concurrency limits
    • Step timeout support
  • BatchExecutor: Efficient batch processing

    • Process multiple items with configurable batch sizes
    • Parallel execution within batches
    • Index-aware processing option
  • PrefetchManager: Predictive page loading

    • Prefetch URLs in the background
    • Automatic cache management
    • Concurrent prefetch limits

Smart Model Routing

  • ModelRouter: Intelligent model selection based on task complexity
    • Task analysis for complexity scoring
    • User-configurable model policies
    • Cost tier constraints (Low/Medium/High)
    • Latency-aware routing

Other Changes

  • Added MessageContent helper methods: as_text(), full_text(), is_text(), has_images()
  • Default ModelPolicy now allows High tier routing
  • Fixed compilation warnings

Full Changelog

spider_agent-v0.3.0...spider_agent-v0.4.0

v2.43.18 - Web Search Integration

02 Feb 23:25

Choose a tag to compare

Features

Web Search Integration

Add web search capabilities to Spider's RemoteMultimodalEngine with support for multiple search providers.

Supported Providers

  • Serper (search_serper) - Google SERP API
  • Brave (search_brave) - Privacy-focused search
  • Bing (search_bing) - Microsoft Bing Web Search
  • Tavily (search_tavily) - AI-optimized search

New Methods

  • search() - Search the web and return structured results
  • search_and_extract() - Search + fetch pages + LLM extraction
  • research() - Search + extract + synthesize findings into summary

Setup

Cargo.toml

[dependencies]
spider = { version = "2.43.18", features = ["search_serper"] }

Configuration

use spider::configuration::{SearchConfig, SearchProviderType};
use spider::features::automation::RemoteMultimodalEngine;

let mut engine = RemoteMultimodalEngine::new(api_url, model, None);
engine.with_search_config(Some(
    SearchConfig::new(SearchProviderType::Serper, "your-api-key")
        // Optional: custom API endpoint
        .with_api_url("https://custom.api.com/search")
));

// Simple search
let results = engine.search("rust web crawler", None, None).await?;

// Search + extract
let data = engine.search_and_extract(
    "best rust frameworks",
    "Extract name and description",
    None,
    None,
).await?;

// Research with synthesis
use spider::features::automation::ResearchOptions;
let research = engine.research(
    "How do async runtimes work?",
    ResearchOptions::new().with_max_pages(5).with_synthesis(),
    None,
).await?;
println!("Summary: {}", research.summary.unwrap());

Custom API Endpoints

All providers support custom API URLs for self-hosted or alternative endpoints:

SearchConfig::new(SearchProviderType::Brave, "api-key")
    .with_api_url("https://my-brave-proxy.example.com/search")

Full Changelog

v2.43.17...v2.43.18

v2.43.13 - Advanced Agentic Automation

02 Feb 19:29

Choose a tag to compare

🤖 Advanced Agentic Automation Features

This release adds comprehensive agentic automation capabilities to spider, making it a powerful tool for autonomous web interactions.

Phase 1: Simplified Agentic APIs

  • act(page, instruction) - Execute single actions with natural language
  • observe(page) - Analyze page state and get structured observations
  • extract_page(page, prompt, schema) - Extract structured data from pages
  • AutomationMemory - In-memory state management for multi-round automation
  • run_with_memory() - Stateful automation with persistent context

Phase 2: Self-Healing & Discovery

  • SelectorCache - Self-healing selector cache with LRU eviction
  • act_cached(page, instruction, cache) - Actions with automatic selector caching
  • StructuredOutputConfig - Native JSON schema enforcement for reliable outputs
  • extract_structured(page, prompt, config) - Schema-validated data extraction
  • map(page, prompt) - AI-powered URL discovery and categorization
  • MapResult / DiscoveredUrl - Relevance-scored URL discovery

Phase 3: Autonomous Agent Execution

  • execute(page, config) - Full autonomous goal-oriented execution
  • agent(page, goal) - Simple goal execution with defaults
  • agent_extract(page, goal, prompt) - Goal execution with data extraction
  • chain(page, steps) - Sequential action composition with conditions
  • AgentConfig - Comprehensive agent configuration (max_steps, timeout, recovery, etc.)
  • RecoveryStrategy - Error handling strategies (Retry, Alternative, Skip, Abort)
  • ChainStep / ChainCondition - Conditional action execution
  • AgentEvent - Real-time progress tracking events
  • AgentResult / ChainResult - Detailed execution results with history

Example Usage

// Autonomous agent
let config = AgentConfig::new("Find and add the cheapest laptop to cart")
    .with_max_steps(30)
    .with_success_url("/cart")
    .with_extraction("Extract cart total");

let result = engine.execute(&page, config).await?;

// Action chaining
let steps = vec![
    ChainStep::new("click Login"),
    ChainStep::new("type email").when(ChainCondition::ElementExists("#email")),
    ChainStep::new("click Submit").then_extract("Extract any errors"),
];
let result = engine.chain(&page, steps).await?;

// Self-healing cache
let mut cache = SelectorCache::new();
engine.act_cached(&page, "click submit", &mut cache).await?;

Full Changelog

  • feat(automation): add Phase 3 agentic features - autonomous agent, action chaining, error recovery
  • feat(automation): add Phase 2 agentic features - selector cache, structured outputs, map API
  • feat(automation): add simplified agentic APIs - act(), observe(), extract()
  • feat(automation): add agentic memory for multi-round automation

v2.43.3

02 Feb 02:19

Choose a tag to compare

Bug Fix

  • fix(automation): Improve best_effort_parse_json_object parsing to handle LLM responses with reasoning text before JSON code blocks
    • Find ```json blocks anywhere in response (not just at boundaries)
    • Support JSON arrays in addition to objects
    • Better fallback parsing for various LLM response formats

Full Changelog: v2.43.2...v2.43.3

v2.43.2

02 Feb 00:55

Choose a tag to compare

New Feature: Extraction Schema Support

Add JSON Schema support for structured extraction in RemoteMultimodalEngine.

ExtractionSchema Struct

pub struct ExtractionSchema {
    pub name: String,           // Schema name (e.g., "products")
    pub description: Option<String>,  // What to extract
    pub schema: String,         // JSON Schema definition
    pub strict: bool,           // Enforce strict adherence
}

Example Usage

use spider::features::automation::{RemoteMultimodalConfigs, ExtractionSchema};

let schema = ExtractionSchema::new_with_description(
    "products",
    "Extract product information",
    r#"{
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": { "type": "string" },
                        "price": { "type": "number" }
                    },
                    "required": ["name", "price"]
                }
            }
        }
    }"#,
).with_strict(true);

let mm = RemoteMultimodalConfigs::new("http://localhost:11434/v1/chat/completions", "model")
    .with_extra_ai_data(true)
    .with_extraction_schema(Some(schema));

Full Changelog: v2.43.1...v2.43.2

v2.43.1

01 Feb 23:42

Choose a tag to compare

Bug Fix

  • fix(page): Add missing remote_multimodal_usage and extra_remote_multimodal_data fields to the decentralized Page struct for feature parity with the standard Page struct.

Full Changelog: v2.43.0...v2.43.1

v2.43.0

01 Feb 21:40

Choose a tag to compare

What's New

Token Usage Tracking for RemoteMultimodalEngine

The remote multimodal automation engine now tracks and returns token usage conforming to the OpenAI API format:

  • AutomationUsage struct with prompt_tokens, completion_tokens, total_tokens
  • Usage is accumulated across all inference rounds
  • Stored on Page.remote_multimodal_usage

Extraction Support

New extraction capabilities for RemoteMultimodalEngine, similar to the OpenAI integration:

  • extra_ai_data - Enable extraction mode
  • extraction_prompt - Custom extraction instructions
  • screenshot - Capture final screenshot

Extracted data is automatically stored on Page.extra_remote_multimodal_data as AutomationResults.

Example Usage

use spider::features::automation::RemoteMultimodalConfigs;

let mm = RemoteMultimodalConfigs::new(
    "http://localhost:11434/v1/chat/completions",
    "qwen2.5-vl",
)
.with_extra_ai_data(true)
.with_extraction_prompt(Some("Extract all product names and prices"))
.with_screenshot(true);

website.configuration.remote_multimodal = Some(Box::new(mm));

// After crawling, access on page:
for page in website.get_pages().await {
    if let Some(usage) = &page.remote_multimodal_usage {
        println!("Tokens: {:?}", usage);
    }
    if let Some(data) = &page.extra_remote_multimodal_data {
        for result in data {
            println!("Extracted: {:?}", result.content_output);
        }
    }
}

Full Changelog: v2.42.0...v2.43.0