An automated tool for scraping and classifying hCaptcha challenges using Playwright/Patchright.
- Requires: Python 3.8+
- Create a virtual environment:
    python -m venv venv
- Activate the environment:
  - Windows:
      venv\Scripts\activate
  - macOS/Linux:
      source venv/bin/activate
- Install dependencies:
    pip install -r requirements.txt
    playwright install chromium
- Automated Scraping: Uses Patchright to interact with hCaptcha demos and capture challenge images.
- Smart Classification: Heuristic-based classification of challenge types (Single Select, Multi Select, Drag & Drop).
- Multi-threaded: Supports multiple workers for high-speed data collection.
- Proxy Support: Robust proxy handling with support for various formats (IP:Port, User:Pass@IP:Port, etc.).
- Dataset Management: Automated organization of captured images into structured folders based on challenge type and prompt.
- Utility Scripts: Built-in scripts for dataset statistics and post-capture image organization.
- Configurable: Easily adjust settings through input/config.toml.
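The proxy formats mentioned in the feature list can be illustrated with a small parser. This is only a sketch, not the project's actual code: `parse_proxy` is a hypothetical helper, it handles just the two formats named above (IP:Port and User:Pass@IP:Port), and it assumes the result is meant for the `proxy` option of Playwright's `launch()` call, which accepts `server`, `username`, and `password` keys.

```python
from typing import Dict, Optional


def parse_proxy(line: str, scheme: str = "http") -> Optional[Dict[str, str]]:
    """Turn one proxies.txt line into a Playwright-style proxy dict.

    Hypothetical helper: handles only "IP:Port" and "User:Pass@IP:Port"
    and returns None for blank lines or anything else.
    """
    line = line.strip()
    if not line:
        return None

    # Split optional credentials from the host part at the last "@",
    # so a password that itself contains "@" still parses.
    creds, at_sign, hostport = line.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    if not colon or not host or not port.isdigit():
        return None

    proxy = {"server": f"{scheme}://{host}:{port}"}
    if at_sign:
        user, colon, password = creds.partition(":")
        if not colon or not user:
            return None
        proxy["username"] = user
        proxy["password"] = password
    return proxy
```

Lines that return None can simply be skipped when reading input/proxies.txt. Other formats the scraper may accept are not covered here.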
- main.py: The core automation engine that runs the scraper.
- input/:
  - config.toml: Main configuration file.
  - proxies.txt: File containing proxies to use (one per line).
- scripts/:
  - classify_images.py: Re-scans the output folder to organize images using OCR and prompt heuristics.
  - stats.py: Provides a detailed report on the current dataset (types, questions, image counts).
- output/: Automatically created directory where captured and classified images are stored.
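As a rough illustration of how images could be grouped under output/ by challenge type and prompt, and then counted for a report, consider the sketch below. The folder naming scheme, the image extensions, and the helper names (`slugify`, `image_dir`, `count_images`) are all assumptions made for this example; the real layout produced by main.py and the report printed by scripts/stats.py may differ.

```python
import re
from pathlib import Path
from typing import Dict

# Assumption: captured images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def slugify(text: str, max_len: int = 80) -> str:
    """Reduce free text to a filesystem-safe folder name."""
    slug = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    return slug[:max_len].rstrip("_") or "unknown"


def image_dir(root: Path, challenge_type: str, prompt: str) -> Path:
    """Build <root>/<challenge type>/<prompt> (hypothetical layout)."""
    return root / slugify(challenge_type) / slugify(prompt)


def count_images(root: Path) -> Dict[str, int]:
    """Count image files per folder, keyed by path relative to root."""
    counts: Dict[str, int] = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            key = path.parent.relative_to(root).as_posix()
            counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, `image_dir(Path("output"), "Drag & Drop", "Drag the piece to its place!")` resolves to `output/drag_drop/drag_the_piece_to_its_place`.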
- Configuration: Edit input/config.toml to set your desired thread count, logging level, and ignore lists:
    [dev]
    Proxyless = true     # Set to false to use proxies from input/proxies.txt
    Debug = false        # Enable for detailed execution logs
    Threads = 5          # Number of concurrent workers
    minimal = true       # Only show core log messages
    ignore_types = []    # List of challenge types to skip
- Run the Scraper:
    python main.py
- Get Dataset Stats:
    python scripts/stats.py
- Re-classify Images:
    python scripts/classify_images.py --folder output/ --move
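To show how the `[dev]` settings above might be read from Python, here is a minimal sketch. It is not taken from main.py: `load_dev_settings` is a hypothetical helper, the fallback defaults simply mirror the example values shown in the configuration step, and on Python versions before 3.11 it assumes the third-party `tomli` package is installed, since `tomllib` only joined the standard library in 3.11.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: backport installed on 3.8 to 3.10


def load_dev_settings(toml_text: str) -> dict:
    """Parse the [dev] table and fill in defaults for missing keys."""
    # Defaults mirror the sample config; built per call so the
    # ignore_types list is never shared between results.
    settings = {
        "Proxyless": True,
        "Debug": False,
        "Threads": 5,
        "minimal": True,
        "ignore_types": [],
    }
    settings.update(tomllib.loads(toml_text).get("dev", {}))

    threads = settings["Threads"]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        raise ValueError("Threads must be a positive integer")
    return settings
```

With a real file this would be called as `load_dev_settings(Path("input/config.toml").read_text(encoding="utf-8"))`, with `Path` imported from `pathlib`.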
- This project is for educational purposes only.
- The author is not responsible for any misuse.
- Ensure your use complies with the terms of service of any sites accessed.
v1.0.0 ⋮ 12/30/2024
+ Initial release of the comprehensive hCaptcha Challenger
+ Integrated multi-threading and proxy support
+ Added dataset management scripts