An automated tool for scraping and classifying hCaptcha challenges using Patchright.

sexfrance/hcaptcha-scraper

hCaptcha Challenger

An automated tool for scraping and classifying hCaptcha challenges using Playwright/Patchright.

💬 Discord · 📜 ChangeLog · ⚠️ Report Bug · 💡 Request Feature


⚙️ Installation

  • Requires: Python 3.8+
  • Create a virtual environment:
    python -m venv venv
  • Activate the environment:
    • Windows: venv\Scripts\activate
    • macOS/Linux: source venv/bin/activate
  • Install dependencies:
    pip install -r requirements.txt
    playwright install chromium

🔥 Features

  • Automated Scraping: Uses Patchright to interact with hCaptcha demos and capture challenge images.
  • Smart Classification: Heuristic-based classification of challenge types (Single Select, Multi Select, Drag & Drop).
  • Multi-threaded: Supports multiple workers for high-speed data collection.
  • Proxy Support: Robust proxy handling with support for various formats (IP:Port, User:Pass@IP:Port, etc.).
  • Dataset Management: Automated organization of captured images into structured folders based on challenge type and prompt.
  • Utility Scripts: Built-in scripts for dataset statistics and post-capture image organization.
  • Configurable: Easily adjust settings through input/config.toml.
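The proxy formats listed above can be normalized into one structure before use. The following is a minimal sketch, not the project's actual parser: the function name `parse_proxy`, the default `http` scheme, and the treatment of `#` lines as comments are all assumptions made for illustration. The returned dictionary uses the `server`/`username`/`password` keys that Playwright's `proxy` launch option accepts.

```python
from typing import Dict, Optional


def parse_proxy(line: str, scheme: str = "http") -> Optional[Dict[str, str]]:
    """Parse one line of proxies.txt into a Playwright-style proxy dict.

    Handles 'IP:Port' and 'User:Pass@IP:Port', with an optional
    'scheme://' prefix. Returns None for blank or unrecognized lines.
    """
    line = line.strip()
    # Assumption: blank lines and '#' comment lines are skipped.
    if not line or line.startswith("#"):
        return None
    # An explicit scheme in the line overrides the default.
    if "://" in line:
        scheme, line = line.split("://", 1)
    # Everything before the last '@' is credentials (may be empty).
    creds, _, hostport = line.rpartition("@")
    host, sep, port = hostport.rpartition(":")
    if not sep or not host or not port.isdigit():
        return None
    proxy = {"server": f"{scheme}://{host}:{port}"}
    if creds:
        user, _, password = creds.partition(":")
        proxy["username"] = user
        proxy["password"] = password
    return proxy
```

Lines that do not match either format yield `None`, so a caller can filter a whole file with a list comprehension and discard unusable entries.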

📁 Directory Structure

  • main.py: The core automation engine that runs the scraper.
  • input/:
    • config.toml: Main configuration file.
    • proxies.txt: File containing proxies to use (one per line).
  • scripts/:
    • classify_images.py: Re-scans the output folder to organize images using OCR and prompt heuristics.
    • stats.py: Provides a detailed report on the current dataset (types, questions, image counts).
  • output/: Created automatically; captured and classified images are stored here.
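To give a feel for the kind of report a dataset statistics script produces, here is a small sketch that counts images per subfolder of an output directory. It is not the code in scripts/stats.py. The function name `count_images`, the set of image extensions, and the assumption that images sit in nested folders (for example one level per challenge type and one per prompt) are illustrative only.

```python
from collections import Counter
from pathlib import Path

# Assumption: captured images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images(root) -> Counter:
    """Count image files per folder, keyed by path relative to root."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # Use forward slashes so keys look the same on every OS.
            counts[path.parent.relative_to(root).as_posix()] += 1
    return counts


if __name__ == "__main__":
    # Print the largest folders first.
    for folder, n in count_images("output").most_common():
        print(f"{n:6d}  {folder}")
```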

📝 Usage

  1. Configuration: Edit input/config.toml to set the proxy mode, thread count, log verbosity, and the challenge types to skip:

    [dev]
    Proxyless = true   # Set to false to use proxies from input/proxies.txt
    Debug = false      # Enable for detailed execution logs
    Threads = 5        # Number of concurrent workers
    minimal = true     # Only show core log messages
    ignore_types = []  # List of challenge types to skip
  2. Run the Scraper:

    python main.py
  3. Get Dataset Stats:

    python scripts/stats.py
  4. Re-classify Images:

    python scripts/classify_images.py --folder output/ --move
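For readers who want to check a config file before a long run, the sketch below loads the `[dev]` table and validates `Threads`. It is an illustration, not how main.py reads its configuration: the function name `load_dev_config` is hypothetical, and the fallback values simply mirror the example above (whether the tool itself applies defaults is not stated). `tomllib` ships with Python 3.11+; on older interpreters the `tomli` package provides the same interface.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ImportError:  # older interpreters: pip install tomli
    import tomli as tomllib

# Assumption: fallback values copied from the example config above.
DEFAULTS = {
    "Proxyless": True,
    "Debug": False,
    "Threads": 5,
    "minimal": True,
    "ignore_types": [],
}


def load_dev_config(text: str) -> dict:
    """Parse TOML text and return the [dev] table merged over DEFAULTS."""
    data = tomllib.loads(text)
    dev = {**DEFAULTS, **data.get("dev", {})}
    threads = dev["Threads"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        raise ValueError("Threads must be a positive integer")
    return dev


if __name__ == "__main__":
    with open("input/config.toml", encoding="utf-8") as fh:
        print(load_dev_config(fh.read()))
```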

❗ Disclaimers

  • This project is for educational purposes only.
  • The author is not responsible for any misuse.
  • Ensure your use complies with the terms of service of any sites accessed.

📜 ChangeLog

v1.0.0 ⋮ 12/30/2024
+ Initial release of the comprehensive hCaptcha Challenger
+ Integrated multi-threading and proxy support
+ Added dataset management scripts