
Vultr Website Scraper

A Node.js web scraper that crawls the Vultr website to extract clean, meaningful content and download PDF documents.

Features

  • Smart Content Extraction: Removes CSS, JavaScript, navigation menus, and other non-content elements
  • PDF Download: Automatically downloads PDF files found during crawling
  • Anti-Bot Protection Bypass: Uses Puppeteer to handle modern anti-bot protection
  • Configurable Crawling: Respects disallowed paths and avoids revisiting URLs
  • Clean Output: Produces well-formatted text content suitable for analysis
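To illustrate the extraction step, here is a naive sketch of stripping script, style, and navigation elements from HTML. This is illustrative only; the actual scripts use cheerio for robust parsing rather than regular expressions.

```javascript
// Naive illustration only: the real scrapers parse HTML with cheerio.
// This sketch drops <script>, <style>, and <nav> blocks, strips the
// remaining tags, and collapses whitespace into clean text.
function extractText(html) {
  return html
    .replace(/<(script|style|nav)\b[\s\S]*?<\/\1>/gi, ' ') // drop non-content blocks
    .replace(/<[^>]+>/g, ' ')                              // strip remaining tags
    .replace(/\s+/g, ' ')                                  // collapse whitespace
    .trim();
}

const html =
  '<html><head><style>p{color:red}</style></head>' +
  '<body><nav><a href="/">Home</a></nav><p>Hello <b>world</b></p>' +
  '<script>alert(1)</script></body></html>';
console.log(extractText(html)); // → "Hello world"
```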

Prerequisites

  • Node.js (version 14 or higher)
  • npm (comes with Node.js)

Installation

  1. Clone this repository:
git clone <your-repo-url>
cd Scrapper
  2. Install dependencies:
npm install

Usage

Puppeteer Version (Recommended)

The Puppeteer version can bypass most anti-bot protection:

node puppeteer-crawl-and-scrape.js

Simple Crawler Version

The original simplecrawler version, which may be blocked by anti-bot protection:

node crawl-and-scrape.js

Configuration

You can modify the following variables in the scripts:

  • rootUrl: The starting URL for crawling (default: 'https://www.vultr.com')
  • maxDepth: Maximum crawl depth (default: 3 for Puppeteer, 10 for simplecrawler)
  • disallowedPaths: Paths to skip during crawling
  • pdfDir: Directory to save downloaded PDFs
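Near the top of either script, these settings might look like the following sketch. The values shown are the documented defaults; the entries in `disallowedPaths` are hypothetical examples, and the exact variable placement may differ from the actual scripts.

```javascript
// Crawl settings (defaults as documented above; adjust before running).
const rootUrl = 'https://www.vultr.com';       // starting URL for the crawl
const maxDepth = 3;                            // 3 for Puppeteer, 10 for simplecrawler
const disallowedPaths = ['/login', '/cart'];   // hypothetical examples of paths to skip
const pdfDir = 'pdfs';                         // directory for downloaded PDFs
```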

Output

  • Text Content: Clean, extracted text saved to vultr_website_content.txt
  • PDF Files: Downloaded PDFs saved to the pdfs/ directory
  • Console Logs: Real-time progress and status updates

Project Structure

Scrapper/
├── crawl-and-scrape.js          # Original simplecrawler version
├── puppeteer-crawl-and-scrape.js # Puppeteer version (recommended)
├── package.json                 # Dependencies and project info
├── README.md                    # This file
├── vultr_website_content.txt    # Output file (generated)
└── pdfs/                        # PDF downloads directory (generated)

Dependencies

  • puppeteer: Browser automation for bypassing anti-bot protection
  • simplecrawler: HTTP crawler (original version)
  • cheerio: HTML parsing and manipulation
  • axios: HTTP client for PDF downloads

Notes

  • The Puppeteer version is recommended as it can handle modern anti-bot protection
  • Be respectful of the target website's robots.txt and terms of service
  • Consider adding delays between requests for large-scale crawling
  • The scraper is configured to avoid certain paths that typically contain non-content
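One minimal way to add such delays is a promise-based sleep between page visits. This is a sketch, not code from the scripts; the `crawlPolitely` helper and the delay value are illustrative.

```javascript
// Sketch of polite crawling: wait a fixed delay between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlPolitely(urls, delayMs = 1000) {
  const visited = [];
  for (const url of urls) {
    visited.push(url);    // placeholder for the real fetch/scrape step
    await sleep(delayMs); // pause before the next request
  }
  return visited;
}
```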

License

[Add your license here]

Contributing

[Add contribution guidelines if desired]
