Releases · spider-rs/spider

05 Feb 21:04

j-mendez

v2.45.20

65a32f1

v2.45.20 Latest

What's New

Relevance Gate for Remote Multimodal Crawling

Added a relevance_gate config that instructs the LLM to return a "relevant": true|false field in its JSON response. When a page is deemed irrelevant, its wildcard budget credit is refunded so the crawler discovers more relevant content.

New config fields:

relevance_gate: bool — enables the feature
relevance_prompt: Option<String> — optional custom relevance criteria

How it works:

When enabled, the system prompt instructs the LLM to include "relevant": true|false
If the model returns false, a budget credit is atomically accumulated
Credits are drained in the crawl loop to restore the wildcard budget
Default fallback is true (assume relevant) if the model omits the field
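
The credit flow described above can be sketched with a plain atomic counter. This is a minimal, standalone illustration, not the spider implementation: `RelevanceCredits`, `record`, `drain`, and `restore_budget` are hypothetical names chosen for the example, and the real counter lives on RemoteMultimodalConfigs.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for the shared credit counter (not the spider type).
#[derive(Default)]
pub struct RelevanceCredits {
    pending: AtomicU32,
}

impl RelevanceCredits {
    /// Record the model's verdict for one page.
    /// A missing field falls back to `true` (assume relevant).
    pub fn record(&self, relevant: Option<bool>) {
        if !relevant.unwrap_or(true) {
            self.pending.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Take every pending credit in one atomic step, leaving zero behind.
    pub fn drain(&self) -> u32 {
        self.pending.swap(0, Ordering::Relaxed)
    }
}

/// Hypothetical crawl-loop step: refund drained credits to the remaining budget.
pub fn restore_budget(remaining: u32, credits: &RelevanceCredits) -> u32 {
    remaining.saturating_add(credits.drain())
}
```

Using `swap(0, ..)` for the drain means a credit added concurrently is either included in this drain or left for the next one, never lost.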

Example:

let cfgs = RemoteMultimodalConfigs::new(api_url, model)
    .with_relevance_gate(Some("Only pages about Rust programming".into()));

Full Changelog

feat(agent): add relevance_gate and relevance_prompt to RemoteMultimodalConfig
feat(agent): add atomic relevance_credits counter to RemoteMultimodalConfigs
feat(agent): add relevant: Option<bool> to AutomationResult and AutomationResults
feat(agent): extend system prompt and extraction with relevance gate instructions
feat(spider): add restore_wildcard_budget() for budget refund
feat(spider): drain relevance credits in crawl loop dequeue

05 Feb 19:07

j-mendez

v2.44.13

ea0d2dc

v2.44.13

What's New

Spider Cloud integration (spider_cloud feature) — optional proxy rotation, anti-bot bypass, and intelligent fallback via spider.cloud
- Modes: Proxy, Api, Unblocker, Fallback, Smart
- Smart mode auto-detects Cloudflare challenges, CAPTCHAs, and bot protection then retries via /unblocker
S3 skills loading (skills_s3 feature) — load agent skills from S3-compatible storage (AWS, MinIO, R2)
CLI: --spider-cloud-key and --spider-cloud-mode flags
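
As a rough illustration of the detect-then-retry idea behind Smart mode, the sketch below checks a response for signs of blocking and only then calls a fallback fetcher. Everything here is an assumption made for the example: the `Fetched` type, the status codes, and the body markers are illustrative heuristics, not the checks spider_cloud actually performs.

```rust
/// Hypothetical response shape for this sketch.
pub struct Fetched {
    pub status: u16,
    pub body: String,
}

/// Heuristic only: the status codes and markers are illustrative assumptions.
pub fn looks_blocked(resp: &Fetched) -> bool {
    let body = resp.body.to_lowercase();
    let has_marker = ["captcha", "just a moment", "access denied"]
        .iter()
        .any(|m| body.contains(*m));
    matches!(resp.status, 403 | 429 | 503) || has_marker
}

/// Try the direct fetch first; retry through `unblock` only when blocked.
pub fn fetch_with_fallback<D, U>(url: &str, direct: D, unblock: U) -> Fetched
where
    D: Fn(&str) -> Fetched,
    U: Fn(&str) -> Fetched,
{
    let first = direct(url);
    if looks_blocked(&first) {
        unblock(url)
    } else {
        first
    }
}
```

Passing the fetchers as closures keeps the sketch free of any HTTP client dependency; in practice the second closure would be whatever performs the unblocker request.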

Crates

spider v2.44.13
spider_agent v2.44.13
spider_cli v2.44.13
spider_utils v2.44.13
spider_worker v2.44.13

03 Feb 14:37

j-mendez

v2.43.20

a7a009a

spider v2.43.20

Spider v2.43.20

Changes

fix(spider): Fix doctest and update chromey for adblock compatibility
fix(search): Use reqwest::Client directly for cache feature compatibility
chore(spider): Update spider_agent dependency to 0.4

spider_agent Integration

The agent feature now uses spider_agent v0.4.0, which includes:

Smart caching with size-aware LRU eviction
High-performance chain execution with parallel step support
Batch processing for multiple items
Prefetch management for predictive page loading
Smart model routing based on task complexity

Full Changelog

v2.43.19...v2.43.20

03 Feb 14:34

j-mendez

spider_agent-v0.4.0

f1f28cc

spider_agent v0.4.0

Spider Agent v0.4.0

Performance Optimizations

This release adds several performance optimizations for automation workflows:

Smart Caching

SmartCache: Size-aware LRU cache with automatic cleanup
- Bounded memory usage with configurable limits
- TTL-based expiration
- Automatic cleanup on memory pressure
- Statistics tracking (hits, misses, evictions)
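
To make "size-aware LRU with TTL" concrete, here is a small standard-library sketch of the technique. It is not the SmartCache source: `SizeAwareLru` and its methods are hypothetical, the size of an entry is simply its byte length, and memory-pressure cleanup is left out.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

struct Entry {
    value: Vec<u8>,
    inserted: Instant,
}

/// Hypothetical byte-bounded LRU cache with TTL expiry and basic statistics.
pub struct SizeAwareLru {
    max_bytes: usize,
    ttl: Duration,
    used_bytes: usize,
    entries: HashMap<String, Entry>,
    order: VecDeque<String>, // front = least recently used
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
}

impl SizeAwareLru {
    pub fn new(max_bytes: usize, ttl: Duration) -> Self {
        Self {
            max_bytes,
            ttl,
            used_bytes: 0,
            entries: HashMap::new(),
            order: VecDeque::new(),
            hits: 0,
            misses: 0,
            evictions: 0,
        }
    }

    /// Drop a key and release its bytes (no-op if absent).
    fn remove(&mut self, key: &str) {
        if let Some(e) = self.entries.remove(key) {
            self.used_bytes -= e.value.len();
        }
        self.order.retain(|k| k.as_str() != key);
    }

    /// Insert a value, evicting least recently used entries until it fits.
    pub fn insert(&mut self, key: &str, value: Vec<u8>) {
        self.remove(key);
        if value.len() > self.max_bytes {
            return; // larger than the whole cache: never stored
        }
        while self.used_bytes + value.len() > self.max_bytes {
            let Some(oldest) = self.order.pop_front() else { break };
            if let Some(e) = self.entries.remove(&oldest) {
                self.used_bytes -= e.value.len();
                self.evictions += 1;
            }
        }
        self.used_bytes += value.len();
        self.entries.insert(key.to_string(), Entry { value, inserted: Instant::now() });
        self.order.push_back(key.to_string());
    }

    /// Look up a key; expired entries count as misses and are removed.
    pub fn get(&mut self, key: &str) -> Option<&[u8]> {
        let expired = match self.entries.get(key) {
            Some(e) => e.inserted.elapsed() > self.ttl,
            None => {
                self.misses += 1;
                return None;
            }
        };
        if expired {
            self.remove(key);
            self.misses += 1;
            return None;
        }
        self.hits += 1;
        // Mark as most recently used.
        self.order.retain(|k| k.as_str() != key);
        self.order.push_back(key.to_string());
        self.entries.get(key).map(|e| e.value.as_slice())
    }
}
```

The `VecDeque` keeps the code short at the cost of O(n) reordering per access; a production cache would typically use a linked structure instead.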

High-Performance Execution

ChainExecutor: Parallel step execution for automation chains
- Analyzes dependencies for optimal parallelization
- Response caching with TTL
- Configurable concurrency limits
- Step timeout support
BatchExecutor: Efficient batch processing
- Process multiple items with configurable batch sizes
- Parallel execution within batches
- Index-aware processing option
PrefetchManager: Predictive page loading
- Prefetch URLs in the background
- Automatic cache management
- Concurrent prefetch limits
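
The batch behaviour listed above (configurable batch size, parallel execution within a batch, index-aware processing) can be approximated with scoped threads from the standard library. This is a sketch under those assumptions only; `process_in_batches` is a hypothetical helper and does not reflect the BatchExecutor API, which is async.

```rust
use std::thread;

/// Hypothetical helper: apply `f` to every item, `batch_size` items at a time.
/// Items inside one batch run on separate scoped threads; batches run in order.
/// `f` receives the item's global index, mirroring index-aware processing.
pub fn process_in_batches<T, R, F>(items: &[T], batch_size: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(usize, &T) -> R + Sync,
{
    let size = batch_size.max(1); // guard against a zero batch size
    let f = &f;
    let mut results = Vec::with_capacity(items.len());
    for (batch_idx, batch) in items.chunks(size).enumerate() {
        let base = batch_idx * size;
        let batch_results: Vec<R> = thread::scope(|s| {
            let handles: Vec<_> = batch
                .iter()
                .enumerate()
                .map(|(i, item)| s.spawn(move || f(base + i, item)))
                .collect();
            // Joining in spawn order keeps results aligned with the input.
            handles
                .into_iter()
                .map(|h| h.join().expect("worker panicked"))
                .collect()
        });
        results.extend(batch_results);
    }
    results
}
```

Because handles are joined in spawn order, the output order matches the input order even though items within a batch finish at different times.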

Smart Model Routing

ModelRouter: Intelligent model selection based on task complexity
- Task analysis for complexity scoring
- User-configurable model policies
- Cost tier constraints (Low/Medium/High)
- Latency-aware routing
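
One way to picture complexity-based routing with a cost-tier cap is the toy router below. The scoring rule, the keyword list, the thresholds, and the `Policy` type are all invented for illustration; the release notes do not describe how ModelRouter scores tasks, and latency-aware routing is not modelled here.

```rust
/// Cost tiers, ordered from cheapest to most expensive.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum CostTier {
    Low,
    Medium,
    High,
}

/// Hypothetical policy: the most expensive tier the caller allows.
pub struct Policy {
    pub max_tier: CostTier,
}

/// Toy complexity score in 0..=100 from prompt length and a few keywords.
pub fn complexity(task: &str) -> u32 {
    let lowered = task.to_lowercase();
    let length_score = (task.split_whitespace().count() as u32 * 2).min(60);
    let keyword_score = ["analyze", "compare", "multi-step", "reason"]
        .iter()
        .filter(|k| lowered.contains(**k))
        .count() as u32
        * 20;
    (length_score + keyword_score).min(100)
}

/// Pick the tier the score asks for, then clamp it to the policy ceiling.
pub fn route(task: &str, policy: &Policy) -> CostTier {
    let wanted = match complexity(task) {
        0..=29 => CostTier::Low,
        30..=59 => CostTier::Medium,
        _ => CostTier::High,
    };
    wanted.min(policy.max_tier)
}
```

Deriving `Ord` on the tier enum lets the policy ceiling be applied with a single `min`, so a restrictive policy always wins over a high score.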

Other Changes

Added MessageContent helper methods: as_text(), full_text(), is_text(), has_images()
Default ModelPolicy now allows High tier routing
Fixed compilation warnings

Full Changelog

spider_agent-v0.3.0...spider_agent-v0.4.0

02 Feb 23:25

j-mendez

v2.43.18

125268a

v2.43.18 - Web Search Integration

Features

Web Search Integration

Add web search capabilities to Spider's RemoteMultimodalEngine with support for multiple search providers.

Supported Providers

Serper (search_serper) - Google SERP API
Brave (search_brave) - Privacy-focused search
Bing (search_bing) - Microsoft Bing Web Search
Tavily (search_tavily) - AI-optimized search

New Methods

search() - Search the web and return structured results
search_and_extract() - Search + fetch pages + LLM extraction
research() - Search + extract + synthesize findings into summary

Setup

Cargo.toml

[dependencies]
spider = { version = "2.43.18", features = ["search_serper"] }

Configuration

use spider::configuration::{SearchConfig, SearchProviderType};
use spider::features::automation::RemoteMultimodalEngine;

let mut engine = RemoteMultimodalEngine::new(api_url, model, None);
engine.with_search_config(Some(
    SearchConfig::new(SearchProviderType::Serper, "your-api-key")
        // Optional: custom API endpoint
        .with_api_url("https://custom.api.com/search")
));

// Simple search
let results = engine.search("rust web crawler", None, None).await?;

// Search + extract
let data = engine.search_and_extract(
    "best rust frameworks",
    "Extract name and description",
    None,
    None,
).await?;

// Research with synthesis
use spider::features::automation::ResearchOptions;
let research = engine.research(
    "How do async runtimes work?",
    ResearchOptions::new().with_max_pages(5).with_synthesis(),
    None,
).await?;
// The summary may be absent, so avoid panicking on a missing value.
println!("Summary: {}", research.summary.unwrap_or_default());

Custom API Endpoints

All providers support custom API URLs for self-hosted or alternative endpoints:

SearchConfig::new(SearchProviderType::Brave, "api-key")
    .with_api_url("https://my-brave-proxy.example.com/search")

Full Changelog

v2.43.17...v2.43.18

02 Feb 19:29

j-mendez

v2.43.13

c8a8176

v2.43.13 - Advanced Agentic Automation

🤖 Advanced Agentic Automation Features

This release adds comprehensive agentic automation capabilities to spider, making it a powerful tool for autonomous web interactions.

Phase 1: Simplified Agentic APIs

act(page, instruction) - Execute single actions with natural language
observe(page) - Analyze page state and get structured observations
extract_page(page, prompt, schema) - Extract structured data from pages
AutomationMemory - In-memory state management for multi-round automation
run_with_memory() - Stateful automation with persistent context

Phase 2: Self-Healing & Discovery

SelectorCache - Self-healing selector cache with LRU eviction
act_cached(page, instruction, cache) - Actions with automatic selector caching
StructuredOutputConfig - Native JSON schema enforcement for reliable outputs
extract_structured(page, prompt, config) - Schema-validated data extraction
map(page, prompt) - AI-powered URL discovery and categorization
MapResult / DiscoveredUrl - Relevance-scored URL discovery

Phase 3: Autonomous Agent Execution

execute(page, config) - Full autonomous goal-oriented execution
agent(page, goal) - Simple goal execution with defaults
agent_extract(page, goal, prompt) - Goal execution with data extraction
chain(page, steps) - Sequential action composition with conditions
AgentConfig - Comprehensive agent configuration (max_steps, timeout, recovery, etc.)
RecoveryStrategy - Error handling strategies (Retry, Alternative, Skip, Abort)
ChainStep / ChainCondition - Conditional action execution
AgentEvent - Real-time progress tracking events
AgentResult / ChainResult - Detailed execution results with history

Example Usage

// Autonomous agent
let config = AgentConfig::new("Find and add the cheapest laptop to cart")
    .with_max_steps(30)
    .with_success_url("/cart")
    .with_extraction("Extract cart total");

let result = engine.execute(&page, config).await?;

// Action chaining
let steps = vec![
    ChainStep::new("click Login"),
    ChainStep::new("type email").when(ChainCondition::ElementExists("#email")),
    ChainStep::new("click Submit").then_extract("Extract any errors"),
];
let result = engine.chain(&page, steps).await?;

// Self-healing cache
let mut cache = SelectorCache::new();
engine.act_cached(&page, "click submit", &mut cache).await?;

Full Changelog

feat(automation): add Phase 3 agentic features - autonomous agent, action chaining, error recovery
feat(automation): add Phase 2 agentic features - selector cache, structured outputs, map API
feat(automation): add simplified agentic APIs - act(), observe(), extract()
feat(automation): add agentic memory for multi-round automation

02 Feb 02:19

j-mendez

v2.43.3

44ce332

v2.43.3

Bug Fix

fix(automation): Improve best_effort_parse_json_object parsing to handle LLM responses with reasoning text before JSON code blocks
- Find ```json blocks anywhere in response (not just at boundaries)
- Support JSON arrays in addition to objects
- Better fallback parsing for various LLM response formats
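
The parsing strategy described in this fix can be sketched as a two-step search: prefer a fenced json block anywhere in the reply, then fall back to the outermost object or array delimiters. This is an independent illustration, not the body of best_effort_parse_json_object; `extract_json_slice` is a hypothetical name, and the function only locates a candidate slice without validating it as JSON.

```rust
/// Return the JSON-looking slice of an LLM reply, or `None` if nothing matches.
/// Hypothetical helper: it locates a candidate, it does not validate JSON.
pub fn extract_json_slice(reply: &str) -> Option<&str> {
    // Build the three-backtick fence marker at runtime.
    let fence = "`".repeat(3);
    let open = format!("{fence}json");

    // 1. Prefer a fenced json block anywhere in the reply,
    //    even when reasoning text comes before it.
    if let Some(start) = reply.find(&open) {
        let body = &reply[start + open.len()..];
        if let Some(end) = body.find(&fence) {
            return Some(body[..end].trim());
        }
    }

    // 2. Fall back to the outermost object or array delimiters.
    let start = reply.find(|c: char| c == '{' || c == '[')?;
    let close = if reply[start..].starts_with('{') { '}' } else { ']' };
    let end = reply.rfind(close)?;
    (end > start).then(|| &reply[start..=end])
}
```

The returned slice would then be handed to a real JSON parser, which remains the final check on whether the model produced valid output.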

Full Changelog: v2.43.2...v2.43.3

02 Feb 00:55

j-mendez

v2.43.2

4e0a9e8

v2.43.2

New Feature: Extraction Schema Support

Add JSON Schema support for structured extraction in RemoteMultimodalEngine.

`ExtractionSchema` Struct

pub struct ExtractionSchema {
    pub name: String,                // Schema name (e.g., "products")
    pub description: Option<String>, // What to extract
    pub schema: String,              // JSON Schema definition
    pub strict: bool,                // Enforce strict adherence
}

Example Usage

use spider::features::automation::{RemoteMultimodalConfigs, ExtractionSchema};

let schema = ExtractionSchema::new_with_description(
    "products",
    "Extract product information",
    r#"{
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": { "type": "string" },
                        "price": { "type": "number" }
                    },
                    "required": ["name", "price"]
                }
            }
        }
    }"#,
).with_strict(true);

let mm = RemoteMultimodalConfigs::new("http://localhost:11434/v1/chat/completions", "model")
    .with_extra_ai_data(true)
    .with_extraction_schema(Some(schema));

Full Changelog: v2.43.1...v2.43.2

01 Feb 23:42

j-mendez

v2.43.1

b78d3d2

v2.43.1

Bug Fix

fix(page): Add missing remote_multimodal_usage and extra_remote_multimodal_data fields to the decentralized Page struct for feature parity with the standard Page struct.

Full Changelog: v2.43.0...v2.43.1

01 Feb 21:40

j-mendez

v2.43.0

50f73ad

v2.43.0

What's New

Token Usage Tracking for RemoteMultimodalEngine

The remote multimodal automation engine now tracks and returns token usage conforming to the OpenAI API format:

AutomationUsage struct with prompt_tokens, completion_tokens, total_tokens
Usage is accumulated across all inference rounds
Stored on Page.remote_multimodal_usage
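
Accumulating usage across inference rounds amounts to summing the three counters per round. The sketch below shows that idea with a hypothetical `UsageSketch` type that mirrors the field names above; the integer width and the `accumulate` method are assumptions, not the AutomationUsage definition.

```rust
/// Hypothetical mirror of the three OpenAI-style usage counters.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct UsageSketch {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
}

impl UsageSketch {
    /// Add one inference round to the running totals without overflowing.
    pub fn accumulate(&mut self, round: UsageSketch) {
        self.prompt_tokens = self.prompt_tokens.saturating_add(round.prompt_tokens);
        self.completion_tokens = self.completion_tokens.saturating_add(round.completion_tokens);
        self.total_tokens = self.total_tokens.saturating_add(round.total_tokens);
    }
}
```

Starting from `UsageSketch::default()` and calling `accumulate` once per round yields the per-page totals.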

Extraction Support

New extraction capabilities for RemoteMultimodalEngine, similar to the OpenAI integration:

extra_ai_data - Enable extraction mode
extraction_prompt - Custom extraction instructions
screenshot - Capture final screenshot

Extracted data is automatically stored on Page.extra_remote_multimodal_data as AutomationResults.

Example Usage

use spider::features::automation::RemoteMultimodalConfigs;

let mm = RemoteMultimodalConfigs::new(
    "http://localhost:11434/v1/chat/completions",
    "qwen2.5-vl",
)
.with_extra_ai_data(true)
.with_extraction_prompt(Some("Extract all product names and prices"))
.with_screenshot(true);

website.configuration.remote_multimodal = Some(Box::new(mm));

// After crawling, access on page:
for page in website.get_pages().await {
    if let Some(usage) = &page.remote_multimodal_usage {
        println!("Tokens: {:?}", usage);
    }
    if let Some(data) = &page.extra_remote_multimodal_data {
        for result in data {
            println!("Extracted: {:?}", result.content_output);
        }
    }
}

Full Changelog: v2.42.0...v2.43.0
