A Python tool to download and recover archived websites from the Wayback Machine (web.archive.org). This tool crawls an archived snapshot, downloads all content (HTML, images, CSS, JavaScript), and rewrites URLs to create a fully functional static copy that works offline.
Features:

- Downloads complete website snapshots from Wayback Machine
- Automatically discovers and follows all internal links
- Downloads all assets: HTML, CSS, JavaScript, images
- Rewrites URLs for local browsing (no internet connection needed)
- Removes Wayback Machine toolbar and scripts
- Preserves original site structure
- Smart rate limiting with configurable delays (default: 1 second between requests)
- Automatic retry logic with exponential backoff for rate limit errors (HTTP 429)
- Connection pooling for improved performance
- Custom User-Agent to identify the tool
Requirements:

- Python 3.8 or higher
- requests library
- beautifulsoup4 library
Installation:

- Clone or download this repository
- Install dependencies:

Using pip:

```bash
pip install requests beautifulsoup4
```

Or using uv (recommended):

```bash
uv sync
```

Or install directly with uv:

```bash
uv pip install requests beautifulsoup4
```

Run the downloader with:

```bash
python wayback_downloader.py "WAYBACK_URL"
```

Where `WAYBACK_URL` is the full Wayback Machine URL, including the timestamp.

To find a snapshot URL:
- Go to https://web.archive.org
- Enter the URL of the website you want to recover
- Browse the calendar to find a snapshot
- Click on a timestamp to view the archived page
- Copy the full URL from the browser address bar
The URL should look like:

```
https://web.archive.org/web/20150315000000/example.com
```
Download a website to the default directory:
```bash
python wayback_downloader.py "https://web.archive.org/web/20150315000000/example.com"
```

Specify a custom output directory:

```bash
python wayback_downloader.py "https://web.archive.org/web/20150315000000/example.com" -o my_recovered_site
```

Limit the download to the first 50 pages (useful for testing):

```bash
python wayback_downloader.py "https://web.archive.org/web/20150315000000/example.com" --max-pages 50
```

Adjust the delay between requests (for faster or slower downloads):

```bash
python wayback_downloader.py "https://web.archive.org/web/20150315000000/example.com" --delay 2.0
```

Command-line arguments:

- `wayback_url`: (Required) Full Wayback Machine URL with timestamp
- `-o, --output`: Output directory (default: `downloaded_site`)
- `--max-pages`: Maximum number of pages to download (default: unlimited)
- `--delay`: Delay in seconds between requests (default: 1.0; recommended: 1.0-2.0)
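For reference, the sketch below shows how such an interface could be wired up with argparse. It mirrors the options listed above but is illustrative only, not necessarily the tool's exact implementation.

```python
import argparse

# Illustrative sketch of the command-line interface described above; argument
# names mirror the documented options, but this is not the tool's actual code.
parser = argparse.ArgumentParser(
    description="Download an archived website snapshot from the Wayback Machine"
)
parser.add_argument("wayback_url", help="Full Wayback Machine URL with timestamp")
parser.add_argument("-o", "--output", default="downloaded_site", help="Output directory")
parser.add_argument("--max-pages", type=int, default=None, help="Maximum number of pages to download")
parser.add_argument("--delay", type=float, default=1.0, help="Delay in seconds between requests")

# Example invocation (normally the arguments come from the command line):
args = parser.parse_args([
    "https://web.archive.org/web/20150315000000/example.com",
    "--max-pages", "50",
])
print(args.wayback_url, args.output, args.max_pages, args.delay)
```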
The tool includes built-in rate limiting to be respectful to the Internet Archive's servers:
- Default delay: 1 second between requests (recommended)
- Automatic retry: HTTP 429 (rate limit) responses trigger exponential backoff
- Max retries: 5 attempts with increasing delays (2s, 4s, 8s, 16s, 32s)
- User-Agent: Identifies requests as coming from this tool
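The retry behavior can be pictured with the minimal sketch below, assuming a `requests.Session`. The function name `fetch_with_backoff` and its details are illustrative assumptions, not the tool's actual internals.

```python
import time
from typing import Optional

import requests


def fetch_with_backoff(session: requests.Session, url: str,
                       delay: float = 1.0, max_retries: int = 5) -> Optional[requests.Response]:
    """Illustrative sketch: GET a URL with a base delay and exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        time.sleep(delay)  # base delay applied before every request
        response = session.get(url, timeout=30)
        if response.status_code != 429:
            return response
        wait = 2 ** (attempt + 1)  # 2s, 4s, 8s, 16s, 32s
        print(f"Rate limited (429), retrying in {wait}s...")
        time.sleep(wait)
    return None  # give up after max_retries rate-limited attempts
```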
Important: The Internet Archive enforces rate limits to manage server load:
- Exceeding ~60 requests/minute may trigger HTTP 429 responses
- Persistent violations can result in temporary IP blocks (1+ hours)
- Use `--delay 1.0` or higher for large downloads to avoid issues

If you need faster downloads, you can reduce the delay (e.g., `--delay 0.5`), but monitor for rate limit errors. If you see "Rate limited (429)" messages, the tool will automatically slow down.
How it works:

- URL Parsing: Extracts the timestamp and original domain from the Wayback Machine URL
- Crawling: Starts from the initial URL and follows all internal links
- Downloading: Downloads HTML pages and all referenced resources (images, CSS, JS)
- URL Rewriting: Converts all links to relative paths for local browsing
- Cleanup: Removes Wayback Machine toolbar and scripts
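As an illustration of the first step, here is a minimal sketch of extracting the timestamp and original URL from a snapshot URL. The regex and the function name are illustrative assumptions, not the tool's actual code.

```python
import re

# Illustrative sketch of step 1 (URL parsing); not the tool's actual implementation.
WAYBACK_RE = re.compile(r"https?://web\.archive\.org/web/(\d{4,14})[a-z_]*/(.+)", re.IGNORECASE)


def parse_wayback_url(wayback_url):
    """Return (timestamp, original_url) from a Wayback Machine snapshot URL."""
    match = WAYBACK_RE.match(wayback_url)
    if not match:
        raise ValueError("Expected https://web.archive.org/web/TIMESTAMP/ORIGINAL_URL")
    timestamp, original = match.groups()
    if not original.startswith(("http://", "https://")):
        original = "http://" + original  # older snapshots often omit the scheme
    return timestamp, original


print(parse_wayback_url("https://web.archive.org/web/20150315000000/example.com"))
# -> ('20150315000000', 'http://example.com')
```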
The downloaded site will be saved in the specified output directory with the same structure as the original site. To view your recovered site:
- Navigate to the output directory
- Open `index.html` in your web browser
- Browse the site normally - all links will work locally
Tips:

- Start with `--max-pages 10` to test before downloading entire sites
- Large sites may take a long time - the default 1-second delay between requests ensures respectful usage
- Not all archived pages may be available - the tool will skip missing resources
- Check your output directory periodically to monitor progress
- If you encounter rate limiting (429 errors), the tool will automatically retry with exponential backoff
- For very large sites, consider increasing the delay to 2.0 seconds: `--delay 2.0`
Troubleshooting:

Missing dependencies error:
```bash
# Using pip
pip install requests beautifulsoup4

# Or using uv
uv sync
```

Invalid Wayback URL error: Make sure your URL includes the timestamp and follows this format:

```
https://web.archive.org/web/TIMESTAMP/ORIGINAL_URL
```
Rate limit errors (HTTP 429): The tool automatically handles rate limiting with exponential backoff. If you see repeated rate limit messages:
- Increase the delay: `--delay 2.0` or higher
- The tool will automatically retry up to 5 times with increasing delays
- Persistent rate limiting may indicate an IP block (wait 1+ hours)
Slow downloads:
This is normal and intentional. The tool uses a 1-second delay between requests to be respectful to archive.org. You can monitor progress in the terminal output. For faster downloads, reduce the delay at your own risk: `--delay 0.5`
Limitations:

- Only downloads content available in the Wayback Machine
- Some dynamic features may not work (JavaScript-heavy sites, AJAX)
- External resources (from other domains) are not downloaded
- Very large sites may take considerable time to download
The project includes a comprehensive test suite. To run the tests:
- Install development dependencies:

```bash
pip install pytest pytest-mock
```

Or with uv (recommended):

```bash
uv sync --dev
```

- Run the test suite:

```bash
pytest
```

Or with uv:

```bash
uv run pytest
```

Run with verbose output:

```bash
uv run pytest -v
```

Run with coverage report (requires pytest-cov):

```bash
uv run pytest --cov=wayback_downloader --cov-report=html
```

The test suite covers:
- URL parsing and validation
- File type detection
- Internal vs external URL detection
- URL to filepath conversion
- Rate limiting and retry logic
- HTTP 429 handling with exponential backoff
- Session management and headers
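As a sketch of what the URL-parsing tests might look like, consider the example below. The imported helper name `parse_wayback_url` and the ValueError behavior are hypothetical stand-ins for whatever the module actually exposes.

```python
import pytest

# Hypothetical import: the helper name is assumed for illustration and may not
# match the module's actual API.
from wayback_downloader import parse_wayback_url


def test_parse_valid_wayback_url():
    timestamp, original = parse_wayback_url(
        "https://web.archive.org/web/20150315000000/example.com"
    )
    assert timestamp == "20150315000000"
    assert "example.com" in original


def test_parse_invalid_url_raises():
    with pytest.raises(ValueError):
        parse_wayback_url("https://example.com/not-an-archive-url")
```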
The project uses Ruff for linting and formatting - a fast, modern Python linter and formatter written in Rust.
Run the linter:
```bash
uv run ruff check .
```

Auto-fix linting issues:

```bash
uv run ruff check --fix .
```

Format code:

```bash
uv run ruff format .
```

Check if code is formatted:

```bash
uv run ruff format --check .
```

Run both linting and formatting:

```bash
uv run ruff check . && uv run ruff format .
```

This project is licensed under the MIT License - see the LICENSE.txt file for details.
Please be respectful of the Internet Archive's resources and use reasonable rate limiting when downloading archived content.