This tool crawls websites using a headless Chrome environment and extracts structured data from pages with custom JavaScript. It handles deep crawling, URL lists, and concurrency, making large-scale data collection far easier. Built for anyone who needs reliable, automated web scraping.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for a Web Scraper, you've just found your team. Let's Chat!
This project automates the process of navigating websites, executing JavaScript, and collecting structured information from any page it reaches. It solves the challenge of manually gathering data from large or complex sites, especially those that rely heavily on dynamic content. It's designed for developers, analysts, researchers, and anyone who needs fast, consistent web scraping.
- Uses a Chrome-based engine to render modern, dynamic sites accurately.
- Supports recursive crawling with flexible limits and patterns.
- Allows injecting custom JavaScript to extract exactly the data you need.
- Manages concurrency to speed up large scraping jobs.
- Handles URL lists, errors, and retries automatically.
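The custom-JavaScript extraction mentioned above can be pictured as a function evaluated against the rendered page. The sketch below is illustrative only: `extractProduct` and the stubbed `fakeDocument` are hypothetical names, not part of the tool's actual API, and a real run would receive the browser's live `document`.

```javascript
// Hypothetical extraction function, as it might be evaluated inside a
// rendered page. Selectors and shapes here are assumptions for illustration.
function extractProduct(document) {
  const priceEl = document.querySelector('.price');
  const stockEl = document.querySelector('.stock');
  return {
    price: priceEl ? priceEl.textContent.trim() : null,
    inStock: stockEl ? /in stock/i.test(stockEl.textContent) : false,
  };
}

// Minimal stub standing in for a rendered DOM, so the sketch runs anywhere.
const fakeDocument = {
  querySelector(selector) {
    const nodes = {
      '.price': { textContent: ' $29.99 ' },
      '.stock': { textContent: 'In Stock' },
    };
    return nodes[selector] || null;
  },
};

console.log(extractProduct(fakeDocument));
// → { price: '$29.99', inStock: true }
```

Returning plain objects keeps the result serializable, which is what lands in the `extractedData` field of the output.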
| Feature | Description |
|---|---|
| Chrome-based rendering | Loads modern, JavaScript-heavy sites accurately. |
| Recursive crawling | Automatically follows links based on rules you define. |
| URL list support | Accepts custom lists for targeted scraping tasks. |
| Custom JS extraction | Lets you write JavaScript to extract structured data. |
| Concurrency control | Balances speed and system stability during scraping. |
| Error handling | Retries, logs, and manages failed requests gracefully. |
| Field Name | Field Description |
|---|---|
| url | The page URL that was successfully scraped. |
| title | The page's visible title or heading. |
| html | The inner HTML captured for deeper analysis. |
| metadata | Key metadata such as description, keywords, and tags. |
| extractedData | Custom extraction results from user-defined JavaScript. |
```json
[
  {
    "url": "https://example.com/products/1",
    "title": "Sample Product",
    "html": "<div>...</div>",
    "metadata": { "description": "Product page", "keywords": ["sample", "product"] },
    "extractedData": { "price": "$29.99", "inStock": true }
  }
]
```
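Because the output is a plain JSON array with the fields listed above, post-processing it in Node.js is straightforward. A minimal sketch, using illustrative sample data in the documented record shape:

```javascript
// Sketch: filtering the scraper's JSON output. The record shape follows the
// field table above; the sample records themselves are illustrative.
function inStockUrls(records) {
  return records
    .filter((r) => r.extractedData && r.extractedData.inStock)
    .map((r) => r.url);
}

const sample = [
  {
    url: 'https://example.com/products/1',
    title: 'Sample Product',
    extractedData: { price: '$29.99', inStock: true },
  },
  {
    url: 'https://example.com/products/2',
    title: 'Other Product',
    extractedData: { price: '$9.99', inStock: false },
  },
];

console.log(inStockUrls(sample)); // → [ 'https://example.com/products/1' ]
```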
```
Web Scraper/
├── src/
│   ├── main.js
│   ├── crawler/
│   │   ├── chrome_runner.js
│   │   └── link_finder.js
│   ├── extractors/
│   │   ├── custom_js_executor.js
│   │   └── data_parser.js
│   ├── utils/
│   │   ├── logger.js
│   │   └── request_queue.js
│   └── config/
│       └── settings.example.json
├── data/
│   ├── urls.sample.txt
│   └── output.sample.json
├── package.json
└── README.md
```
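The `config/` directory holds the settings file. The keys below are hypothetical, sketched from the features described above (crawl depth, domain restrictions, URL patterns, concurrency, retries); check `settings.example.json` in the repository for the actual schema.

```json
{
  "startUrls": ["https://example.com"],
  "maxDepth": 2,
  "allowedDomains": ["example.com"],
  "urlPatterns": ["/products/.*"],
  "concurrency": 4,
  "retries": 2
}
```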
- Researchers use it to collect structured information from academic or public sites, so they can build datasets quickly.
- Agencies use it to gather pricing and product information, so they can stay competitive.
- Developers use it to automate repetitive data-gathering tasks, so they can focus on core logic instead of manual scraping.
- Analysts use it to capture trends across large sets of pages, so they can generate reports with confidence.
- QA teams use it to check content consistency across site sections, so they can spot issues faster.
Does this scraper work on dynamic sites? Yes. It uses a Chrome environment capable of rendering JavaScript-heavy pages just like a real browser.
Can I control how deep the scraper crawls? Absolutely. You can set crawl depth, restrict domains, or apply patterns to fine-tune crawling behavior.
Is custom JavaScript required? Not always. Default extraction covers common fields, but custom JS gives you full control when needed.
How does it handle large URL lists? It processes them efficiently with concurrency controls and retry logic to keep the workflow stable.
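The concurrency-with-retries behavior described in the FAQ can be sketched as a small worker pool. This is a minimal illustration of the pattern, not the tool's internal implementation; `processUrls`, `handler`, and the option names are all hypothetical.

```javascript
// Minimal worker-pool sketch: N workers drain a shared URL queue, and each
// URL gets up to `retries` extra attempts before being recorded as an error.
// All names here are illustrative, not the scraper's actual API.
async function processUrls(urls, handler, { concurrency = 4, retries = 2 } = {}) {
  const queue = [...urls];
  const results = [];

  async function attempt(url, remaining) {
    try {
      return await handler(url);
    } catch (err) {
      if (remaining > 0) return attempt(url, remaining - 1);
      return { url, error: String(err) }; // give up after final retry
    }
  }

  async function worker() {
    // shift() is synchronous, so workers never pull the same URL twice
    while (queue.length > 0) {
      const url = queue.shift();
      results.push(await attempt(url, retries));
    }
  }

  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```

A transient failure on one URL is retried without blocking the other workers, which is what keeps long runs over large URL lists stable.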
Primary Metric: Handles an average of 40–60 pages per minute with Chrome-based rendering, depending on page complexity.
Reliability Metric: Maintains a 95%+ successful extraction rate across large crawling batches.
Efficiency Metric: Optimizes CPU usage by balancing browser sessions and concurrency for stable long-running operations.
Quality Metric: Delivers high data completeness with consistent extraction even on dynamic, script-heavy websites.
