πŸ•·οΈ @harshvz/crawler

A powerful web scraping tool built with Playwright that crawls websites using BFS or DFS algorithms, captures screenshots, and extracts content.

✨ Features

  • πŸ” Intelligent Crawling: Choose between BFS (Breadth-First Search) or DFS (Depth-First Search) algorithms
  • πŸ“Έ Full Page Screenshots: Automatically captures full-page screenshots of each visited page
  • πŸ“ Content Extraction: Extracts metadata, headings, paragraphs, and text content
  • 🎯 Domain-Scoped: Only crawls internal links within the same domain
  • πŸš€ Interactive CLI: User-friendly command-line interface with input validation
  • πŸ’Ύ Organized Storage: Saves screenshots and content in a structured directory format
  • πŸ”„ Duplicate Prevention: Tracks visited URLs to avoid redundant scraping
  • 🎨 SEO Metadata: Extracts Open Graph, Twitter Cards, and other meta tags
  • ⏱️ Timeout Handling: Built-in timeout management for unresponsive pages

πŸ“¦ Installation

As a Global CLI Tool

npm install -g @harshvz/crawler

Note: The Chromium browser will be downloaded automatically during installation (approximately 300 MB). It is required for web scraping.

As a Project Dependency

npm install @harshvz/crawler

Note: The postinstall script will automatically download the Chromium browser.

Manual Browser Installation (if needed)

If the automatic download fails, you can install the browser manually:

npx playwright install chromium

From Source

git clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run build
npm install -g .

πŸš€ Usage

CLI Mode (Interactive)

Simply run the command and follow the prompts:

# Primary command (recommended)
crawler

# Alternative (for backward compatibility)
scraper

You'll be prompted to enter:

  1. URL: The website URL to scrape (e.g., https://example.com)
  2. Algorithm: Choose between bfs or dfs (default: bfs)
  3. Output Directory: Custom save location (default: ~/knowledgeBase)

Command-Line Flags

# Show version
crawler --version
crawler -v

# Show help
crawler --help
crawler -h

Note: Both crawler and scraper commands work identically. We recommend using crawler for new projects.

Programmatic Usage

import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2); // depth limit of 2

// Using BFS
await scraper.bfsScrape('/');

// Using DFS
await scraper.dfsScrape('/');

πŸ› οΈ CLI Commands

Development

# Run in development mode with auto-reload
npm run dev

# Build the project
npm run build

# Start the built version (uses crawler command)
npm start

πŸ“š API Documentation

ScrapperServices

Main class for web scraping operations.

Constructor

new ScrapperServices(website: string, depth?: number, customPath?: string)

Parameters:

  • website (string): The base URL of the website to scrape
  • depth (number, optional): Maximum depth to crawl (0 = unlimited, default: 0)
  • customPath (string, optional): Custom output directory path (default: ~/knowledgeBase)
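
For reference, the three parameters fit together like this (the depth and path values below are illustrative):

import ScrapperServices from '@harshvz/crawler';

// Illustrative values: crawl up to 3 levels deep and write output to a
// custom directory instead of ~/knowledgeBase.
const scraper = new ScrapperServices(
    'https://example.com',  // website: base URL to crawl
    3,                      // depth: stop after 3 levels (0 = unlimited)
    '/tmp/crawl-output'     // customPath: overrides ~/knowledgeBase
);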

Methods

bfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>): Promise<void>

Crawls the website using the Breadth-First Search algorithm.

Parameters:

  • endpoint (string): Starting path (default: "/")
  • results (string[]): Array to collect visited endpoints
  • visited (Record<string, boolean>): Object to track visited URLs
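
Conceptually, BFS crawling is a queue of endpoints plus a visited map: each dequeued page is scraped and its internal links are enqueued for later. The outline below is illustrative only, not the package's actual implementation; fetchInternalLinks stands in for the real page-loading and link-extraction logic.

// Illustrative BFS outline; fetchInternalLinks is a hypothetical helper
// that returns same-domain paths found on the given endpoint.
async function bfsCrawl(
    start: string,
    maxDepth: number,
    fetchInternalLinks: (endpoint: string) => Promise<string[]>
): Promise<string[]> {
    const results: string[] = [];
    const visited: Record<string, boolean> = {};
    const queue: Array<{ endpoint: string; depth: number }> = [{ endpoint: start, depth: 0 }];

    while (queue.length > 0) {
        const { endpoint, depth } = queue.shift()!;
        if (visited[endpoint]) continue;   // duplicate prevention
        visited[endpoint] = true;
        results.push(endpoint);            // the real crawler also screenshots and extracts content here

        if (maxDepth !== 0 && depth >= maxDepth) continue;  // depth 0 = unlimited
        for (const link of await fetchInternalLinks(endpoint)) {
            if (!visited[link]) queue.push({ endpoint: link, depth: depth + 1 });
        }
    }
    return results;
}
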
dfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>): Promise<void>

Crawls the website using the Depth-First Search algorithm.

Parameters:

  • endpoint (string): Starting path (default: "/")
  • results (string[]): Array to collect visited endpoints
  • visited (Record<string, boolean>): Object to track visited URLs

buildFilePath(endpoint: string): string

Generates a file path for storing screenshots.

buildContentPath(endpoint: string): string

Generates a file path for storing extracted content.

getLinks(page: Page): Promise<string[]>

Extracts all internal links from the current page.
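
Domain-scoped link extraction with Playwright can be sketched as follows (illustrative only; the real getLinks may differ in details such as URL normalization):

import { Page } from 'playwright';

// Illustrative sketch: collect every <a href>, keep only links whose
// hostname matches the base URL, and return de-duplicated path names.
async function extractInternalLinks(page: Page, baseUrl: string): Promise<string[]> {
    const hrefs = await page.$$eval('a[href]', (anchors) =>
        anchors.map((a) => (a as HTMLAnchorElement).href)
    );
    const baseHost = new URL(baseUrl).hostname;
    const paths = hrefs
        .filter((href) => {
            try {
                return new URL(href).hostname === baseHost;  // drops external hosts, mailto:, etc.
            } catch {
                return false;  // malformed URL
            }
        })
        .map((href) => new URL(href).pathname);
    return [...new Set(paths)];
}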

βš™οΈ Configuration

Timeout

The default timeout for page navigation is 60 seconds. You can override it by setting the timeout property on a ScrapperServices instance:

const scraper = new ScrapperServices('https://example.com');
scraper.timeout = 30000; // 30 seconds

Storage Location

By default, all scraped data is stored in:

~/knowledgeBase/

Each website gets its own folder based on its hostname.
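
The mapping from URL to folder name can be pictured with a helper like the one below (illustrative only, matching the "examplecom" style shown in the tree that follows):

import { homedir } from 'os';
import { join } from 'path';

// Illustrative only: derive the per-site output folder from the hostname.
function siteFolder(website: string, customPath?: string): string {
    const host = new URL(website).hostname.replace(/[^a-z0-9]/gi, '');
    return join(customPath ?? join(homedir(), 'knowledgeBase'), host);
}

siteFolder('https://example.com');  // => <home>/knowledgeBase/examplecom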

πŸ“ Output Structure

~/knowledgeBase/
└── examplecom/
    β”œβ”€β”€ home.png                 # Screenshot of homepage
    β”œβ”€β”€ home.md                  # Extracted content from homepage
    β”œβ”€β”€ _about.png              # Screenshot of /about page
    β”œβ”€β”€ _about.md               # Extracted content from /about
    β”œβ”€β”€ _contact.png            # Screenshot of /contact page
    └── _contact.md             # Extracted content from /contact

Content File Format (.md)

Each .md file contains:

  1. JSON metadata (first line):
    • Page title
    • Meta description
    • Robots directives
    • Open Graph tags
    • Twitter Card tags
  2. Extracted text content (subsequent lines):
    • All text from h1-h6, p, and span elements
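
Downstream tooling can therefore split each file on the first newline, parse the metadata, and treat the rest as plain text. A minimal sketch (the file path is illustrative; exact metadata keys depend on what the page provided):

import { readFileSync } from 'fs';

// Illustrative: parse one output file. The first line is JSON metadata,
// everything after it is the extracted page text.
const raw = readFileSync('/home/user/knowledgeBase/examplecom/home.md', 'utf8');
const newlineIndex = raw.indexOf('\n');
const metadata = JSON.parse(raw.slice(0, newlineIndex));
const textContent = raw.slice(newlineIndex + 1).trim();

console.log(metadata, textContent.length);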

πŸ“– Examples

Example 1: Basic Usage

import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://docs.example.com');
await scraper.bfsScrape('/');

Example 2: Limited Depth Crawl

const scraper = new ScrapperServices('https://blog.example.com', 2);
await scraper.dfsScrape('/');
// Only crawls 2 levels deep from the starting page

Example 3: Custom Endpoint

const scraper = new ScrapperServices('https://example.com');
const results = [];
const visited = {};
await scraper.bfsScrape('/docs', results, visited);
console.log(`Scraped ${results.length} pages`);

Example 4: Custom Output Directory

const scraper = new ScrapperServices(
    'https://example.com',
    0,  // No depth limit
    '/custom/output/path'  // Custom save location
);
await scraper.bfsScrape('/');
// Files will be saved to /custom/output/path instead of ~/knowledgeBase

πŸ”§ Development

Prerequisites

  • Node.js >= 16.x
  • npm >= 7.x

Setup

# Clone the repository
git clone https://github.com/harshvz/crawler.git

# Navigate to directory
cd crawler

# Install dependencies
npm install

# Run in development mode
npm run dev

Project Structure

crawler/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ index.ts                    # CLI entry point
β”‚   └── Services/
β”‚       └── ScrapperServices.ts     # Main scraping logic
β”œβ”€β”€ dist/                           # Compiled JavaScript
β”œβ”€β”€ package.json
β”œβ”€β”€ tsconfig.json
└── README.md

Building

npm run build

This compiles TypeScript files to JavaScript in the dist/ directory.

🀝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

ISC Β© Harshvz

πŸ™ Acknowledgments


Made with ❀️ by harshvz
