GitHub - sexfrance/hcaptcha-scraper: An automated tool for scraping and classifying hCaptcha challenges using Patchright.

hCaptcha Challenger

An automated tool for scraping and classifying hCaptcha challenges using Playwright/Patchright.

💬 Discord · 📜 ChangeLog · ⚠️ Report Bug · 💡 Request Feature

⚙️ Installation

Requires: Python 3.8+
Create a virtual environment:
```
python -m venv venv
```
Activate the environment:
- Windows: venv\Scripts\activate
- macOS/Linux: source venv/bin/activate

Install dependencies:

```
pip install -r requirements.txt
playwright install chromium
```

🔥 Features

Automated Scraping: Uses Patchright to interact with hCaptcha demos and capture challenge images.
Smart Classification: Heuristic-based classification of challenge types (Single Select, Multi Select, Drag & Drop).
Multi-threaded: Supports multiple workers for high-speed data collection.
Proxy Support: Robust proxy handling with support for various formats (IP:Port, User:Pass@IP:Port, etc.).
Dataset Management: Automated organization of captured images into structured folders based on challenge type and prompt.
Utility Scripts: Built-in scripts for dataset statistics and post-capture image organization.
Configurable: Easily adjust settings through input/config.toml.
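
The proxy formats named above (IP:Port and User:Pass@IP:Port) can be normalized with a few lines of string handling. The sketch below is illustrative only and is not taken from main.py: `parse_proxy` is a hypothetical helper, the default `http` scheme is an assumption, and the returned dictionary uses the `server` / `username` / `password` keys that Playwright's `proxy` launch option accepts.

```python
from typing import Dict, Optional


def parse_proxy(line: str, scheme: str = "http") -> Optional[Dict[str, str]]:
    """Convert one line of proxies.txt into a Playwright-style proxy dict.

    Hypothetical helper: handles IP:Port and User:Pass@IP:Port only.
    Returns None for blank lines so callers can skip them.
    """
    line = line.strip()
    if not line:
        return None

    # Honor an explicit scheme such as "http://" if the line carries one.
    if "://" in line:
        scheme, line = line.split("://", 1)

    # Everything before the last "@" is credentials (empty if there is no "@").
    creds, _, hostport = line.rpartition("@")
    host, _, port = hostport.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"unrecognized proxy format: {line!r}")

    proxy = {"server": f"{scheme}://{host}:{port}"}
    if creds:
        user, _, password = creds.partition(":")
        proxy["username"] = user
        proxy["password"] = password
    return proxy
```

A caller could read input/proxies.txt line by line, drop the `None` results, and hand one dictionary to each worker.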

📁 Directory Structure

main.py: The core automation engine that runs the scraper.
input/:
- config.toml: Main configuration file.
- proxies.txt: File containing proxies to use (one per line).
scripts/:
- classify_images.py: Re-scans the output folder to organize images using OCR and prompt heuristics.
- stats.py: Provides a detailed report on the current dataset (types, questions, image counts).
output/: Automatic directory where captured and classified images are stored.
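
To give a sense of what a dataset report involves, here is a minimal sketch that counts images per challenge type and prompt. It is not the code in scripts/stats.py. It assumes a layout of `output/<challenge type>/<prompt>/<image>`, which the description above suggests but does not spell out, and the `count_images` name and the list of image suffixes are likewise assumptions.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; adjust to whatever the scraper actually writes.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images(root) -> Counter:
    """Count image files per (challenge type, prompt) folder pair.

    Assumes root/<type>/<prompt>/<image>; files that sit higher in the
    tree than that are ignored.
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        parts = path.relative_to(root).parts
        if len(parts) >= 3:
            counts[(parts[0], parts[1])] += 1
    return counts


if __name__ == "__main__":
    for (challenge_type, prompt), n in sorted(count_images("output").items()):
        print(f"{challenge_type:<16} {prompt:<40} {n}")
```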

📝 Usage

Configuration: Edit input/config.toml to set your desired thread count, logging level, and ignore lists:

```
[dev]
Proxyless = true   # Set to false to use proxies from input/proxies.txt
Debug = false      # Enable for detailed execution logs
Threads = 5        # Number of concurrent workers
minimal = true     # Only show core log messages
ignore_types = []  # List of challenge types to skip
```
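
For readers who want to inspect or validate these settings from their own scripts, the following sketch reads the `[dev]` table shown above. It is an illustration, not the loader used by main.py: `load_dev_settings`, the fallback defaults, and the validation rule are assumptions. `tomllib` ships with Python 3.11+; on 3.8 to 3.10 the third-party `tomli` package provides the same `loads` function.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport for Python 3.8-3.10 (pip install tomli)

# Same keys as the example configuration, inlined so the sketch is self-contained.
EXAMPLE = """
[dev]
Proxyless = true
Debug = false
Threads = 5
minimal = true
ignore_types = []
"""


def load_dev_settings(text: str) -> dict:
    """Parse the [dev] table and apply assumed defaults for missing keys."""
    dev = tomllib.loads(text).get("dev", {})

    threads = dev.get("Threads", 1)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        raise ValueError("Threads must be a positive integer")

    return {
        "proxyless": bool(dev.get("Proxyless", True)),
        "debug": bool(dev.get("Debug", False)),
        "threads": threads,
        "minimal": bool(dev.get("minimal", True)),
        "ignore_types": list(dev.get("ignore_types", [])),
    }


settings = load_dev_settings(EXAMPLE)
```

To read the real file instead of the inline string, pass `Path("input/config.toml").read_text()` to `load_dev_settings`.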

Run the Scraper:
```
python main.py
```
Get Dataset Stats:
```
python scripts/stats.py
```

Re-classify Images:

```
python scripts/classify_images.py --folder output/ --move
```

❗ Disclaimers

This project is for educational purposes only.
The author is not responsible for any misuse.
Ensure your use complies with the terms of service of any sites accessed.

📜 ChangeLog

v1.0.0 ⋮ 12/30/2024
+ Initial release of the comprehensive hCaptcha Challenger
+ Integrated multi-threading and proxy support
+ Added dataset management scripts
