Make web scraper timeout and delay configurable via environment variables #2045

@Vasilije1990

Problem

The web scraper module has hardcoded timeout and delay values that cannot be customized without modifying code. This limits flexibility for different use cases (e.g., slow networks, rate-limited APIs).

Affected Files

  1. cognee/tasks/web_scraper/config.py

    • timeout: float = 15.0 (hardcoded)
  2. cognee/tasks/web_scraper/default_url_crawler.py (lines 25-26)

    • max_crawl_delay: float = 10.0 (hardcoded)
    • timeout: float = 15.0 (hardcoded)

Proposed Solution

Make these configurable via environment variables with sensible defaults:

In config.py:

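import os

class WebScraperConfig:
    # Read once at import time; the existing defaults apply when a variable is unset.
    # float() raises ValueError for non-numeric input; range checks are not done here.
    timeout: float = float(os.getenv("WEB_SCRAPER_TIMEOUT", "15.0"))
    max_crawl_delay: float = float(os.getenv("WEB_SCRAPER_MAX_DELAY", "10.0"))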

Update .env.template:

# Web Scraper Configuration
WEB_SCRAPER_TIMEOUT=15.0
WEB_SCRAPER_MAX_DELAY=10.0

Acceptance Criteria

  • Add environment variable support for the timeout and max crawl delay values
  • Update .env.template with new variables
  • Keep existing defaults (15.0 and 10.0)
  • Add validation (must be positive floats)
  • Update documentation to mention new env vars
  • Test with custom values to verify they're respected
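The validation criterion above (positive floats) could be met with a small helper along these lines. This is only a sketch: `read_positive_float` is a hypothetical name, not an existing Cognee function, and treating an empty value as "unset" is an assumption rather than established project behavior.

```python
import math
import os


def read_positive_float(name: str, default: float) -> float:
    """Return env var `name` as a positive finite float, or `default` if unset.

    Hypothetical helper for illustration; not part of Cognee.
    """
    raw = os.getenv(name)
    # Assumption: an unset or blank variable falls back to the default.
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{name} must be a number, got {raw!r}") from None
    # Reject zero, negatives, NaN and infinity.
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"{name} must be a positive finite float, got {raw!r}")
    return value
```

With such a helper, the config attributes could be written as `timeout = read_positive_float("WEB_SCRAPER_TIMEOUT", 15.0)`, so a bad value fails loudly at startup instead of producing a zero or negative timeout later. The same function can be exercised in tests by setting and clearing the variables in `os.environ`.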

Benefits

  • Flexibility for different network conditions
  • No code changes needed to adjust timeouts
  • Easier to configure for different environments (dev/staging/prod)
  • Follows existing configuration patterns in Cognee

Similar Issues

Consider also making configurable:

  • cognee/tasks/memify/extract_usage_frequency.py - batch_size: int = 100
  • cognee/tasks/memify/get_triplet_datapoints.py - triplets_batch_size: int = 100
  • cognee/tasks/translation/config.py - min_text_length_for_detection: int = 10
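If the integer settings listed above are made configurable too, the batch sizes would need an integer variant of the same pattern. The sketch below is an assumption about how that might look: `read_positive_int` and the variable name `MEMIFY_BATCH_SIZE` are hypothetical and do not exist in Cognee.

```python
import os


def read_positive_int(name: str, default: int) -> int:
    """Return env var `name` as an integer >= 1, or `default` if unset.

    Hypothetical helper for illustration; not part of Cognee.
    """
    raw = os.getenv(name)
    # Assumption: an unset or blank variable falls back to the default.
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)  # rejects values such as "1.5" or "ten"
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if value < 1:
        raise ValueError(f"{name} must be at least 1, got {raw!r}")
    return value


# Hypothetical variable name; 100 matches the current hardcoded batch size.
batch_size = read_positive_int("MEMIFY_BATCH_SIZE", 100)
```

A minimum of 1 suits batch sizes; a setting such as `min_text_length_for_detection` may legitimately allow 0, so its lower bound would need a separate decision.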

Time Estimate

20-30 minutes

References

  • Check cognee/config/ for existing patterns
  • Review how other modules handle environment-based config

Metadata

Labels

  • 3 points (Created by Linear-GitHub Sync)
  • Medium priority (Created by Linear-GitHub Sync)
  • enhancement (New feature or request)
  • good first issue (Good for newcomers)
  • help wanted (Extra attention is needed)
