
Vultr Website Scraper

A Node.js web scraper that crawls the Vultr website to extract clean, meaningful content and download PDF documents.

Features

  • Smart Content Extraction: Removes CSS, JavaScript, navigation menus, and other non-content elements
  • PDF Download: Automatically downloads PDF files found during crawling
  • Anti-Bot Protection Bypass: Uses Puppeteer to handle modern anti-bot protection
  • Configurable Crawling: Respects disallowed paths and avoids revisiting URLs
  • Clean Output: Produces well-formatted text content suitable for analysis
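To illustrate the extraction step, here is a naive sketch of stripping script, style, and navigation elements from HTML. This is illustrative only; the actual scripts use cheerio for robust parsing rather than regular expressions.

```javascript
// Naive illustration only: the real scrapers parse HTML with cheerio.
// This sketch drops <script>, <style>, and <nav> blocks, strips the
// remaining tags, and collapses whitespace into clean text.
function extractText(html) {
  return html
    .replace(/<(script|style|nav)\b[\s\S]*?<\/\1>/gi, ' ') // drop non-content blocks
    .replace(/<[^>]+>/g, ' ')                              // strip remaining tags
    .replace(/\s+/g, ' ')                                  // collapse whitespace
    .trim();
}

const html =
  '<html><head><style>p{color:red}</style></head>' +
  '<body><nav><a href="/">Home</a></nav><p>Hello <b>world</b></p>' +
  '<script>alert(1)</script></body></html>';
console.log(extractText(html)); // → "Hello world"
```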

Prerequisites

  • Node.js (version 14 or higher)
  • npm (comes with Node.js)

Installation

  1. Clone this repository:
git clone <your-repo-url>
cd Scrapper
  2. Install dependencies:
npm install

Usage

Puppeteer Version (Recommended)

The Puppeteer version can bypass most anti-bot protection:

node puppeteer-crawl-and-scrape.js

Simple Crawler Version

The original simplecrawler version, which may be blocked by anti-bot protection:

node crawl-and-scrape.js

Configuration

You can modify the following variables in the scripts:

  • rootUrl: The starting URL for crawling (default: 'https://www.vultr.com')
  • maxDepth: Maximum crawl depth (default: 3 for Puppeteer, 10 for simplecrawler)
  • disallowedPaths: Paths to skip during crawling
  • pdfDir: Directory to save downloaded PDFs
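Near the top of either script, these settings might look like the following sketch. The values shown are the documented defaults; the entries in `disallowedPaths` are hypothetical examples, and the exact variable placement may differ from the actual scripts.

```javascript
// Crawl settings (defaults as documented above; adjust before running).
const rootUrl = 'https://www.vultr.com';       // starting URL for the crawl
const maxDepth = 3;                            // 3 for Puppeteer, 10 for simplecrawler
const disallowedPaths = ['/login', '/cart'];   // hypothetical examples of paths to skip
const pdfDir = 'pdfs';                         // directory for downloaded PDFs
```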

Output

  • Text Content: Clean, extracted text saved to vultr_website_content.txt
  • PDF Files: Downloaded PDFs saved to the pdfs/ directory
  • Console Logs: Real-time progress and status updates

Project Structure

Scrapper/
├── crawl-and-scrape.js          # Original simplecrawler version
├── puppeteer-crawl-and-scrape.js # Puppeteer version (recommended)
├── package.json                 # Dependencies and project info
├── README.md                    # This file
├── vultr_website_content.txt    # Output file (generated)
└── pdfs/                        # PDF downloads directory (generated)

Dependencies

  • puppeteer: Browser automation for bypassing anti-bot protection
  • simplecrawler: HTTP crawler (original version)
  • cheerio: HTML parsing and manipulation
  • axios: HTTP client for PDF downloads

Notes

  • The Puppeteer version is recommended as it can handle modern anti-bot protection
  • Be respectful of the target website's robots.txt and terms of service
  • Consider adding delays between requests for large-scale crawling
  • The scraper is configured to avoid certain paths that typically contain non-content
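One minimal way to add such delays is a promise-based sleep between page visits. This is a sketch, not code from the scripts; the `crawlPolitely` helper and the delay value are illustrative.

```javascript
// Sketch of polite crawling: wait a fixed delay between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlPolitely(urls, delayMs = 1000) {
  const visited = [];
  for (const url of urls) {
    visited.push(url);    // placeholder for the real fetch/scrape step
    await sleep(delayMs); // pause before the next request
  }
  return visited;
}
```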

License

[Add your license here]

Contributing

[Add contribution guidelines if desired]
