Implement Firecrawl caching to reduce API credit usage (#46) #68

Draft

mathurshrenya wants to merge 1 commit into promptdriven:main
Conversation
This commit implements comprehensive caching for Firecrawl web scraping to address issue promptdriven#46: Cache firecrawl results so it doesn't use up the API credit.

Features implemented:

- SQLite-based persistent caching with configurable TTL (see the sketch below)
- URL normalization for consistent cache keys
- Automatic cleanup and size management (see the cleanup sketch below)
- Dual-layer caching (client-side + Firecrawl's maxAge parameter; see the lookup sketch below)
- CLI commands for cache management (stats, clear, info, check)
- Environment variable configuration (see the configuration sketch below)
- Comprehensive test suite with 20+ test cases
- Complete documentation with usage examples

Files added:

- pdd/firecrawl_cache.py: Core caching functionality
- pdd/firecrawl_cache_cli.py: CLI commands for cache management
- tests/test_firecrawl_cache.py: Comprehensive test suite
- docs/firecrawl-caching.md: Complete documentation

Files modified:

- pdd/preprocess.py: Updated to use caching with a dual-layer approach
- pdd/cli.py: Added the firecrawl-cache command group

Configuration options:

- FIRECRAWL_CACHE_ENABLE (default: true)
- FIRECRAWL_CACHE_TTL_HOURS (default: 24)
- FIRECRAWL_CACHE_MAX_SIZE_MB (default: 100)
- FIRECRAWL_CACHE_MAX_ENTRIES (default: 1000)
- FIRECRAWL_CACHE_AUTO_CLEANUP (default: true)

CLI commands:

- pdd firecrawl-cache stats: View cache statistics
- pdd firecrawl-cache clear: Clear all cached entries
- pdd firecrawl-cache info: Show configuration
- pdd firecrawl-cache check --url <url>: Check a specific URL

Benefits:

- Significant reduction in API credit usage
- Faster response times for cached content
- Improved reliability with offline capability
- Transparent integration with existing <web> tags
- Comprehensive management through CLI tools
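The description doesn't inline the implementation, so here is a minimal sketch of what SQLite-backed persistent caching with a configurable TTL and URL normalization could look like. The names (`FirecrawlCache`, `normalize_url`, the `cache` table schema) are illustrative assumptions, not the actual contents of pdd/firecrawl_cache.py:

```python
import hashlib
import json
import sqlite3
import time
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Normalize a URL so trivially different spellings share one cache key."""
    parts = urlsplit(url.strip())
    # Lowercase scheme/host, drop the fragment, strip a trailing slash.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

class FirecrawlCache:
    def __init__(self, db_path: str = "firecrawl_cache.db", ttl_hours: float = 24):
        self.ttl_seconds = ttl_hours * 3600
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, url TEXT, content TEXT, created_at REAL)"
        )

    def _key(self, url: str) -> str:
        return hashlib.sha256(normalize_url(url).encode()).hexdigest()

    def get(self, url: str):
        row = self.conn.execute(
            "SELECT content, created_at FROM cache WHERE key = ?", (self._key(url),)
        ).fetchone()
        if row is None:
            return None
        content, created_at = row
        if time.time() - created_at > self.ttl_seconds:
            return None  # expired; the caller should re-scrape
        return json.loads(content)

    def set(self, url: str, content) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (self._key(url), normalize_url(url), json.dumps(content), time.time()),
        )
        self.conn.commit()
```

Hashing the normalized URL keeps keys fixed-width, and expired entries are treated as misses rather than deleted on read, leaving deletion to the cleanup pass.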
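For the automatic cleanup and size management, one plausible policy is to drop expired rows first and then trim the table to the newest entries. The eviction order shown is an assumption, and the `cleanup` helper reuses the table schema from the sketch above:

```python
import sqlite3
import time

def cleanup(conn: sqlite3.Connection, ttl_seconds: float, max_entries: int) -> None:
    """Drop expired rows, then trim the table to the newest max_entries rows."""
    now = time.time()
    conn.execute("DELETE FROM cache WHERE created_at < ?", (now - ttl_seconds,))
    conn.execute(
        "DELETE FROM cache WHERE key NOT IN ("
        "  SELECT key FROM cache ORDER BY created_at DESC LIMIT ?)",
        (max_entries,),
    )
    conn.commit()
```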
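The dual-layer behavior could combine the local cache with Firecrawl's maxAge parameter roughly as follows. The `scrape` callable is deliberately abstract because the exact firecrawl-py call signature varies by SDK version; treat this as a sketch of the flow, not the PR's code:

```python
import os

def fetch_page(cache, scrape, url: str):
    """Layer 1: local SQLite cache. Layer 2: Firecrawl's own cache via maxAge.

    `scrape` is assumed to be a callable (url, max_age_ms) -> content that
    wraps the actual Firecrawl API call.
    """
    cached = cache.get(url)
    if cached is not None:
        return cached  # layer 1 hit: no API credit spent
    ttl_hours = float(os.environ.get("FIRECRAWL_CACHE_TTL_HOURS", "24"))
    # Layer 2: pass a maxAge equal to the local TTL (in milliseconds) so
    # Firecrawl may serve its own cached copy instead of re-scraping.
    result = scrape(url, int(ttl_hours * 3600 * 1000))
    cache.set(url, result)
    return result
```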
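Reading the configuration options above from the environment might look like this; the defaults mirror the ones listed in the description, and `_env_bool` is a hypothetical helper:

```python
import os

def _env_bool(name: str, default: bool) -> bool:
    # Accept common truthy spellings: "1", "true", "yes".
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

CACHE_ENABLED = _env_bool("FIRECRAWL_CACHE_ENABLE", True)
CACHE_TTL_HOURS = float(os.environ.get("FIRECRAWL_CACHE_TTL_HOURS", "24"))
CACHE_MAX_SIZE_MB = float(os.environ.get("FIRECRAWL_CACHE_MAX_SIZE_MB", "100"))
CACHE_MAX_ENTRIES = int(os.environ.get("FIRECRAWL_CACHE_MAX_ENTRIES", "1000"))
CACHE_AUTO_CLEANUP = _env_bool("FIRECRAWL_CACHE_AUTO_CLEANUP", True)
```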
Contributor: target 10/17

Contributor: target 10/20

Contributor (Author): target 10/22