A powerful web scraping tool built with Playwright that crawls websites using BFS or DFS algorithms, captures screenshots, and extracts content.
- Features
- Installation
- Usage
- CLI Commands
- API Documentation
- Configuration
- Output Structure
- Examples
- Development
- Contributing
- License
## Features

- **Intelligent Crawling**: Choose between BFS (Breadth-First Search) or DFS (Depth-First Search) algorithms (sketched below)
- **Full Page Screenshots**: Automatically captures full-page screenshots of each visited page
- **Content Extraction**: Extracts metadata, headings, paragraphs, and text content
- **Domain-Scoped**: Only crawls internal links within the same domain
- **Interactive CLI**: User-friendly command-line interface with input validation
- **Organized Storage**: Saves screenshots and content in a structured directory format
- **Duplicate Prevention**: Tracks visited URLs to avoid redundant scraping
- **SEO Metadata**: Extracts Open Graph, Twitter Cards, and other meta tags
- **Timeout Handling**: Built-in timeout management for unresponsive pages
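For readers unfamiliar with the two strategies: a breadth-first crawl is essentially a queue plus a visited set, which is also where the duplicate prevention comes from. The sketch below is a generic illustration under those assumptions (the `bfsCrawl` name and `getLinks` callback are hypothetical), not this package's actual implementation.

```typescript
// Generic BFS over site paths with a visited set -- an illustration of the
// crawling strategy described above, not the package's internal code.
async function bfsCrawl(
  startPath: string,
  getLinks: (path: string) => Promise<string[]>, // hypothetical link source, e.g. backed by Playwright
  maxDepth = 0 // 0 = unlimited, matching the package's depth convention
): Promise<string[]> {
  const visited = new Set<string>([startPath]); // duplicate prevention
  const queue: Array<{ path: string; depth: number }> = [{ path: startPath, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0) {
    const { path, depth } = queue.shift()!;
    order.push(path); // a real crawler would screenshot and extract content here
    if (maxDepth !== 0 && depth >= maxDepth) continue;
    for (const link of await getLinks(path)) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ path: link, depth: depth + 1 });
      }
    }
  }
  return order;
}
```

A DFS variant uses the same structure with a stack (`pop()` instead of `shift()`).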
## Installation

### Global Installation

```bash
npm install -g @harshvz/crawler
```

Note: The Chromium browser will be downloaded automatically during installation (approximately 300 MB). It is required for the web scraping functionality.

### Local Installation

```bash
npm install @harshvz/crawler
```

Note: The postinstall script will automatically download the Chromium browser.

If the automatic installation fails, you can install the browser manually:

```bash
npx playwright install chromium
```

### Install from Source

```bash
git clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run build
npm install -g .
```

## Usage

Simply run the command and follow the prompts:
```bash
# Primary command (recommended)
crawler

# Alternative (for backward compatibility)
scraper
```

You'll be prompted to enter:

- **URL**: The website URL to scrape (e.g., `https://example.com`)
- **Algorithm**: Choose between `bfs` or `dfs` (default: `bfs`)
- **Output Directory**: Custom save location (default: `~/knowledgeBase`)
## CLI Commands

```bash
# Show version
crawler --version
crawler -v

# Show help
crawler --help
crawler -h
```

Note: Both `crawler` and `scraper` commands work identically. We recommend using `crawler` for new projects.
The scraper can also be used programmatically as a library:

```typescript
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2); // depth limit of 2

// Using BFS
await scraper.bfsScrape('/');

// Using DFS
await scraper.dfsScrape('/');
```

If you are working from a clone of the repository, the following npm scripts are available:

```bash
# Run in development mode with auto-reload
npm run dev

# Build the project
npm run build

# Start the built version (uses the crawler command)
npm start
```

## API Documentation

### ScrapperServices

Main class for web scraping operations.
#### Constructor

```typescript
new ScrapperServices(website: string, depth?: number, customPath?: string)
```

Parameters:

- `website` (string): The base URL of the website to scrape
- `depth` (number, optional): Maximum depth to crawl (0 = unlimited; default: 0)
- `customPath` (string, optional): Custom output directory path (default: `~/knowledgeBase`)
#### bfsScrape(endpoint, results, visited)

Crawls the website using the Breadth-First Search algorithm.

Parameters:

- `endpoint` (string): Starting path (default: `"/"`)
- `results` (string[]): Array to collect visited endpoints
- `visited` (Record<string, boolean>): Object to track visited URLs
#### dfsScrape(endpoint, results, visited)

Crawls the website using the Depth-First Search algorithm.

Parameters:

- `endpoint` (string): Starting path (default: `"/"`)
- `results` (string[]): Array to collect visited endpoints
- `visited` (Record<string, boolean>): Object to track visited URLs
Additional helper methods:

- Generate a file path for storing screenshots
- Generate a file path for storing extracted content
- Extract all internal links from the current page (sketched below)
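For illustration, the domain-scoped link extraction can be approximated with Playwright directly. The `collectInternalLinks` helper below is hypothetical, a sketch of the idea rather than the package's actual method: it gathers every `href` on the page and keeps only paths whose hostname matches the base URL.

```typescript
import { Page } from 'playwright';

// Hypothetical helper: collect same-domain paths from the current page.
async function collectInternalLinks(page: Page, baseUrl: string): Promise<string[]> {
  // Grab every href attribute on the page
  const hrefs = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => a.getAttribute('href') ?? '')
  );

  const base = new URL(baseUrl);
  const internal = new Set<string>();
  for (const href of hrefs) {
    try {
      const resolved = new URL(href, base);
      // Keep only links on the same hostname (domain-scoped crawling)
      if (resolved.hostname === base.hostname) {
        internal.add(resolved.pathname);
      }
    } catch {
      // Ignore hrefs that cannot be parsed as URLs
    }
  }
  return [...internal];
}
```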
## Configuration

### Timeout

The default timeout for page navigation is 60 seconds. You can change it by setting the `timeout` property on a `ScrapperServices` instance:

```typescript
const scraper = new ScrapperServices('https://example.com');
scraper.timeout = 30000; // 30 seconds
```

## Output Structure

By default, all scraped data is stored in `~/knowledgeBase/`.
Each website gets its own folder based on its hostname.
```
~/knowledgeBase/
└── examplecom/
    ├── home.png        # Screenshot of homepage
    ├── home.md         # Extracted content from homepage
    ├── _about.png      # Screenshot of /about page
    ├── _about.md       # Extracted content from /about
    ├── _contact.png    # Screenshot of /contact page
    └── _contact.md     # Extracted content from /contact
```
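Judging from the example tree above, the folder name appears to be the hostname with the dots removed, and each endpoint becomes a file name with slashes replaced by underscores (the root path becomes `home`). A rough sketch of that mapping, with hypothetical helper names (the real package may handle edge cases differently):

```typescript
import * as os from 'os';
import * as path from 'path';

// Hypothetical helpers mirroring the naming seen in the example tree above.
function folderForSite(website: string): string {
  const hostname = new URL(website).hostname; // "example.com"
  return hostname.replace(/\./g, '');         // "examplecom"
}

function fileBaseForEndpoint(endpoint: string): string {
  if (endpoint === '/') return 'home';        // root page -> "home"
  return endpoint.replace(/\//g, '_');        // "/about" -> "_about"
}

// Where the screenshot of /about would land under the default output directory
const screenshotPath = path.join(
  os.homedir(),
  'knowledgeBase',
  folderForSite('https://example.com'),
  `${fileBaseForEndpoint('/about')}.png`
);
// -> ~/knowledgeBase/examplecom/_about.png
```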
Each `.md` file contains:

- JSON metadata (first line):
  - Page title
  - Meta description
  - Robots directives
  - Open Graph tags
  - Twitter Card tags
- Extracted text content (subsequent lines):
  - All text from `h1`-`h6`, `p`, and `span` elements
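Since the first line is JSON and the rest is plain text, downstream code can split each file once. The `readScrapedPage` helper below is a sketch under that assumption, not part of the package's API:

```typescript
import { readFileSync } from 'fs';

// Hypothetical reader for the generated .md files: the first line holds the
// JSON metadata, everything after it is the extracted page text.
function readScrapedPage(filePath: string): { metadata: unknown; text: string } {
  const raw = readFileSync(filePath, 'utf8');
  const newlineIndex = raw.indexOf('\n');
  const metadata = JSON.parse(newlineIndex === -1 ? raw : raw.slice(0, newlineIndex));
  const text = newlineIndex === -1 ? '' : raw.slice(newlineIndex + 1);
  return { metadata, text };
}

// Path is a placeholder for one of the generated files
const { metadata, text } = readScrapedPage('/path/to/knowledgeBase/examplecom/home.md');
console.log(metadata, text.length);
```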
## Examples

Scrape a documentation site with BFS:

```typescript
import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://docs.example.com');
await scraper.bfsScrape('/');
```

Depth-limited DFS crawl of a blog:

```typescript
const scraper = new ScrapperServices('https://blog.example.com', 2);
await scraper.dfsScrape('/');
// Only crawls 2 levels deep from the starting page
```

Collect the list of visited endpoints:

```typescript
const scraper = new ScrapperServices('https://example.com');
const results = [];
const visited = {};
await scraper.bfsScrape('/docs', results, visited);
console.log(`Scraped ${results.length} pages`);
```

Use a custom output directory with no depth limit:

```typescript
const scraper = new ScrapperServices(
  'https://example.com',
  0, // No depth limit
  '/custom/output/path' // Custom save location
);
await scraper.bfsScrape('/');
// Files will be saved to /custom/output/path instead of ~/knowledgeBase
```

## Development

### Prerequisites

- Node.js >= 16.x
- npm >= 7.x
### Setup

```bash
# Clone the repository
git clone https://github.com/harshvz/crawler.git

# Navigate to directory
cd crawler

# Install dependencies
npm install

# Run in development mode
npm run dev
```

### Project Structure

```
crawler/
├── src/
│   ├── index.ts                 # CLI entry point
│   └── Services/
│       └── ScrapperServices.ts  # Main scraping logic
├── dist/                        # Compiled JavaScript
├── package.json
├── tsconfig.json
└── README.md
```

### Building

```bash
npm run build
```

This compiles the TypeScript files to JavaScript in the `dist/` directory.
## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

ISC © Harshvz
## Acknowledgments

- Built with Playwright
- CLI powered by Inquirer.js

Made with ❤️ by harshvz
