# Scrapper

A Node.js web scraper that crawls the Vultr website to extract clean, meaningful content and download PDF documents.

## Features
- Smart Content Extraction: Removes CSS, JavaScript, navigation menus, and other non-content elements
- PDF Download: Automatically downloads PDF files found during crawling
- Anti-Bot Protection Bypass: Uses Puppeteer to handle modern anti-bot protection
- Configurable Crawling: Respects disallowed paths and avoids revisiting URLs
- Clean Output: Produces well-formatted text content suitable for analysis
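The "Smart Content Extraction" feature above can be roughly illustrated with plain regular expressions. This is only a dependency-free sketch; the actual scripts use cheerio for robust HTML parsing.

```javascript
// Rough illustration of stripping non-content elements (script, style, nav)
// before extracting text. Not the scripts' real code, which uses cheerio.
function stripNonContent(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '') // drop JavaScript
    .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop CSS
    .replace(/<nav[\s\S]*?<\/nav>/gi, '')       // drop navigation menus
    .replace(/<[^>]+>/g, ' ')                   // strip remaining tags
    .replace(/\s+/g, ' ')                       // collapse whitespace
    .trim();
}
```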
## Prerequisites

- Node.js (version 14 or higher)
- npm (comes with Node.js)
## Installation

1. Clone this repository:

   ```bash
   git clone <your-repo-url>
   cd Scrapper
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

## Usage

Run the Puppeteer version, which can bypass most anti-bot protection:

```bash
node puppeteer-crawl-and-scrape.js
```

Or run the original simplecrawler version (may be blocked by anti-bot protection):

```bash
node crawl-and-scrape.js
```

## Configuration

You can modify the following variables in the scripts:
- `rootUrl`: The starting URL for crawling (default: `'https://www.vultr.com'`)
- `maxDepth`: Maximum crawl depth (default: `3` for Puppeteer, `10` for simplecrawler)
- `disallowedPaths`: Paths to skip during crawling
- `pdfDir`: Directory to save downloaded PDFs
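As a sketch of how these settings interact with the crawl loop (the variable names match the ones above; the example disallowed paths and the `shouldCrawl` helper are hypothetical, not taken from the scripts):

```javascript
// Configuration values named above; defaults per this README.
const rootUrl = 'https://www.vultr.com';
const maxDepth = 3; // Puppeteer version default
const disallowedPaths = ['/login', '/cart']; // hypothetical example paths
const pdfDir = 'pdfs';

// Hypothetical helper: decide whether a URL should be crawled,
// respecting disallowed paths and avoiding revisits.
const visited = new Set();
function shouldCrawl(url) {
  if (visited.has(url)) return false; // avoid revisiting URLs
  const { pathname } = new URL(url);
  if (disallowedPaths.some((p) => pathname.startsWith(p))) return false;
  visited.add(url);
  return true;
}
```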
## Output

- Text Content: Clean, extracted text saved to `vultr_website_content.txt`
- PDF Files: Downloaded PDFs saved to the `pdfs/` directory
- Console Logs: Real-time progress and status updates
## Project Structure

```
Scrapper/
├── crawl-and-scrape.js           # Original simplecrawler version
├── puppeteer-crawl-and-scrape.js # Puppeteer version (recommended)
├── package.json                  # Dependencies and project info
├── README.md                     # This file
├── vultr_website_content.txt     # Output file (generated)
└── pdfs/                         # PDF downloads directory (generated)
```
## Dependencies

- `puppeteer`: Browser automation for bypassing anti-bot protection
- `simplecrawler`: HTTP crawler (original version)
- `cheerio`: HTML parsing and manipulation
- `axios`: HTTP client for PDF downloads
## Notes

- The Puppeteer version is recommended because it handles modern anti-bot protection
- Be respectful of the target website's robots.txt and terms of service
- Consider adding delays between requests for large-scale crawling
- The scraper is configured to skip paths that typically contain little meaningful content
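Adding a delay between requests, as suggested above, can be as simple as awaiting a promisified timeout between page visits. A sketch (the `crawlPolitely` wrapper is illustrative, not code from the scripts):

```javascript
// Politeness delay: resolve after `ms` milliseconds.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical crawl loop that waits between requests.
async function crawlPolitely(urls, visit, ms) {
  for (const url of urls) {
    await visit(url); // fetch/scrape the page
    await delay(ms);  // wait before the next request
  }
}
```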
## License

[Add your license here]
## Contributing

[Add contribution guidelines if desired]