πŸ•·οΈ @harshvz/crawler

A powerful web scraping tool built with Playwright that crawls websites using BFS or DFS algorithms, captures screenshots, and extracts content.

✨ Features

  • πŸ” Intelligent Crawling: Choose between BFS (Breadth-First Search) or DFS (Depth-First Search) algorithms
  • πŸ“Έ Full Page Screenshots: Automatically captures full-page screenshots of each visited page
  • πŸ“ Content Extraction: Extracts metadata, headings, paragraphs, and text content
  • 🎯 Domain-Scoped: Only crawls internal links within the same domain
  • πŸš€ Interactive CLI: User-friendly command-line interface with input validation
  • πŸ’Ύ Organized Storage: Saves screenshots and content in a structured directory format
  • πŸ”„ Duplicate Prevention: Tracks visited URLs to avoid redundant scraping
  • 🎨 SEO Metadata: Extracts Open Graph, Twitter Cards, and other meta tags
  • ⏱️ Timeout Handling: Built-in timeout management for unresponsive pages

πŸ“¦ Installation

As a Global CLI Tool

npm install -g @harshvz/crawler

Note: The Chromium browser will be downloaded automatically during installation (approximately 300 MB). It is required for web scraping.

As a Project Dependency

npm install @harshvz/crawler

Note: The postinstall script will automatically download the Chromium browser.

Manual Browser Installation (if needed)

If the automatic download fails, you can install the browser manually:

npx playwright install chromium

From Source

git clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run build
npm install -g .

πŸš€ Usage

CLI Mode (Interactive)

Simply run the command and follow the prompts:

# Primary command (recommended)
crawler

# Alternative (for backward compatibility)
scraper

You'll be prompted to enter:

  1. URL: The website URL to scrape (e.g., https://example.com)
  2. Algorithm: Choose between bfs or dfs (default: bfs)
  3. Output Directory: Custom save location (default: ~/knowledgeBase)

Command-Line Flags

# Show version
crawler --version
crawler -v

# Show help
crawler --help
crawler -h

Note: Both crawler and scraper commands work identically. We recommend using crawler for new projects.

Programmatic Usage

import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://example.com', 2); // depth limit of 2

// Using BFS
await scraper.bfsScrape('/');

// Using DFS
await scraper.dfsScrape('/');

πŸ› οΈ CLI Commands

Development

# Run in development mode with auto-reload
npm run dev

# Build the project
npm run build

# Start the built version (uses crawler command)
npm start

πŸ“š API Documentation

ScrapperServices

Main class for web scraping operations.

Constructor

new ScrapperServices(website: string, depth?: number, customPath?: string)

Parameters:

  • website (string): The base URL of the website to scrape
  • depth (number, optional): Maximum depth to crawl (0 = unlimited, default: 0)
  • customPath (string, optional): Custom output directory path (default: ~/knowledgeBase)
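
For reference, the three parameters fit together like this (the depth and path values below are illustrative):

import ScrapperServices from '@harshvz/crawler';

// Illustrative values: crawl up to 3 levels deep and write output to a
// custom directory instead of ~/knowledgeBase.
const scraper = new ScrapperServices(
    'https://example.com',  // website: base URL to crawl
    3,                      // depth: stop after 3 levels (0 = unlimited)
    '/tmp/crawl-output'     // customPath: overrides ~/knowledgeBase
);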

Methods

bfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>): Promise<void>

Crawls the website using the Breadth-First Search algorithm.

Parameters:

  • endpoint (string): Starting path (default: "/")
  • results (string[]): Array to collect visited endpoints
  • visited (Record<string, boolean>): Object to track visited URLs
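
Conceptually, BFS crawling is a queue of endpoints plus a visited map: each dequeued page is scraped and its internal links are enqueued for later. The outline below is illustrative only, not the package's actual implementation; fetchInternalLinks stands in for the real page-loading and link-extraction logic.

// Illustrative BFS outline; fetchInternalLinks is a hypothetical helper
// that returns same-domain paths found on the given endpoint.
async function bfsCrawl(
    start: string,
    maxDepth: number,
    fetchInternalLinks: (endpoint: string) => Promise<string[]>
): Promise<string[]> {
    const results: string[] = [];
    const visited: Record<string, boolean> = {};
    const queue: Array<{ endpoint: string; depth: number }> = [{ endpoint: start, depth: 0 }];

    while (queue.length > 0) {
        const { endpoint, depth } = queue.shift()!;
        if (visited[endpoint]) continue;   // duplicate prevention
        visited[endpoint] = true;
        results.push(endpoint);            // the real crawler also screenshots and extracts content here

        if (maxDepth !== 0 && depth >= maxDepth) continue;  // depth 0 = unlimited
        for (const link of await fetchInternalLinks(endpoint)) {
            if (!visited[link]) queue.push({ endpoint: link, depth: depth + 1 });
        }
    }
    return results;
}
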
dfsScrape(endpoint?: string, results?: string[], visited?: Record<string, boolean>): Promise<void>

Crawls the website using the Depth-First Search algorithm.

Parameters:

  • endpoint (string): Starting path (default: "/")
  • results (string[]): Array to collect visited endpoints
  • visited (Record<string, boolean>): Object to track visited URLs

buildFilePath(endpoint: string): string

Generates a file path for storing screenshots.

buildContentPath(endpoint: string): string

Generates a file path for storing extracted content.

getLinks(page: Page): Promise<string[]>

Extracts all internal links from the current page.
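
Domain-scoped link extraction with Playwright can be sketched as follows (illustrative only; the real getLinks may differ in details such as URL normalization):

import { Page } from 'playwright';

// Illustrative sketch: collect every <a href>, keep only links whose
// hostname matches the base URL, and return de-duplicated path names.
async function extractInternalLinks(page: Page, baseUrl: string): Promise<string[]> {
    const hrefs = await page.$$eval('a[href]', (anchors) =>
        anchors.map((a) => (a as HTMLAnchorElement).href)
    );
    const baseHost = new URL(baseUrl).hostname;
    const paths = hrefs
        .filter((href) => {
            try {
                return new URL(href).hostname === baseHost;  // drops external hosts, mailto:, etc.
            } catch {
                return false;  // malformed URL
            }
        })
        .map((href) => new URL(href).pathname);
    return [...new Set(paths)];
}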

βš™οΈ Configuration

Timeout

The default timeout for page navigation is 60 seconds. You can override it by setting the timeout property on a ScrapperServices instance:

const scraper = new ScrapperServices('https://example.com');
scraper.timeout = 30000; // 30 seconds

Storage Location

By default, all scraped data is stored in:

~/knowledgeBase/

Each website gets its own folder based on its hostname.
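
The mapping from URL to folder name can be pictured with a helper like the one below (illustrative only, matching the "examplecom" style shown in the tree that follows):

import { homedir } from 'os';
import { join } from 'path';

// Illustrative only: derive the per-site output folder from the hostname.
function siteFolder(website: string, customPath?: string): string {
    const host = new URL(website).hostname.replace(/[^a-z0-9]/gi, '');
    return join(customPath ?? join(homedir(), 'knowledgeBase'), host);
}

siteFolder('https://example.com');  // => <home>/knowledgeBase/examplecom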

πŸ“ Output Structure

~/knowledgeBase/
└── examplecom/
    β”œβ”€β”€ home.png                 # Screenshot of homepage
    β”œβ”€β”€ home.md                  # Extracted content from homepage
    β”œβ”€β”€ _about.png              # Screenshot of /about page
    β”œβ”€β”€ _about.md               # Extracted content from /about
    β”œβ”€β”€ _contact.png            # Screenshot of /contact page
    └── _contact.md             # Extracted content from /contact

Content File Format (.md)

Each .md file contains:

  1. JSON metadata (first line):
    • Page title
    • Meta description
    • Robots directives
    • Open Graph tags
    • Twitter Card tags
  2. Extracted text content (subsequent lines):
    • All text from h1-h6, p, and span elements
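
Downstream tooling can therefore split each file on the first newline, parse the metadata, and treat the rest as plain text. A minimal sketch (the file path is illustrative; exact metadata keys depend on what the page provided):

import { readFileSync } from 'fs';

// Illustrative: parse one output file. The first line is JSON metadata,
// everything after it is the extracted page text.
const raw = readFileSync('/home/user/knowledgeBase/examplecom/home.md', 'utf8');
const newlineIndex = raw.indexOf('\n');
const metadata = JSON.parse(raw.slice(0, newlineIndex));
const textContent = raw.slice(newlineIndex + 1).trim();

console.log(metadata, textContent.length);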

πŸ“– Examples

Example 1: Basic Usage

import ScrapperServices from '@harshvz/crawler';

const scraper = new ScrapperServices('https://docs.example.com');
await scraper.bfsScrape('/');

Example 2: Limited Depth Crawl

const scraper = new ScrapperServices('https://blog.example.com', 2);
await scraper.dfsScrape('/');
// Only crawls 2 levels deep from the starting page

Example 3: Custom Endpoint

const scraper = new ScrapperServices('https://example.com');
const results = [];
const visited = {};
await scraper.bfsScrape('/docs', results, visited);
console.log(`Scraped ${results.length} pages`);

Example 4: Custom Output Directory

const scraper = new ScrapperServices(
    'https://example.com',
    0,  // No depth limit
    '/custom/output/path'  // Custom save location
);
await scraper.bfsScrape('/');
// Files will be saved to /custom/output/path instead of ~/knowledgeBase

πŸ”§ Development

Prerequisites

  • Node.js >= 16.x
  • npm >= 7.x

Setup

# Clone the repository
git clone https://github.com/harshvz/crawler.git

# Navigate to directory
cd crawler

# Install dependencies
npm install

# Run in development mode
npm run dev

Project Structure

crawler/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ index.ts                    # CLI entry point
β”‚   └── Services/
β”‚       └── ScrapperServices.ts     # Main scraping logic
β”œβ”€β”€ dist/                           # Compiled JavaScript
β”œβ”€β”€ package.json
β”œβ”€β”€ tsconfig.json
└── README.md

Building

npm run build

This compiles TypeScript files to JavaScript in the dist/ directory.

🀝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

ISC Β© Harshvz

πŸ™ Acknowledgments


Made with ❀️ by harshvz
