A Python tool to download and recover archived websites from the Wayback Machine (web.archive.org). This tool crawls an archived snapshot, downloads all content (HTML, images, CSS, JavaScript), and rewrites URLs to create a fully functional static copy that works offline.
Features:

- Downloads complete website snapshots from Wayback Machine
- Automatically discovers and follows all internal links
- Downloads all assets: HTML, CSS, JavaScript, images
- Rewrites URLs for local browsing (no internet connection needed)
- Removes Wayback Machine toolbar and scripts
- Preserves original site structure
- Smart rate limiting with configurable delays (default: 1 second between requests)
- Automatic retry logic with exponential backoff for rate limit errors (HTTP 429)
- Connection pooling for improved performance
- Custom User-Agent to identify the tool
Requirements:

- Python 3.8 or higher
- requests library
- beautifulsoup4 library
Installation:

- Clone or download this repository
- Install dependencies:

Using pip:

```bash
pip install requests beautifulsoup4
```

Or using uv (recommended):

```bash
uv sync
```

Or install directly with uv:

```bash
uv pip install requests beautifulsoup4
```

Run the downloader with:

```bash
python wayback_downloader.py "WAYBACK_URL"
```

Where `WAYBACK_URL` is the full Wayback Machine URL, including the timestamp.

To find a snapshot URL:
- Go to https://web.archive.org
- Enter the URL of the website you want to recover
- Browse the calendar to find a snapshot
- Click on a timestamp to view the archived page
- Copy the full URL from the browser address bar
The URL should look like:

```
https://web.archive.org/web/20150315000000/example.com
```
Download a website to the default directory:
```bash
python wayback_downloader.py "https://web.archive.org/web/20150315000000/example.com"
```

Specify a custom output directory:

```bash
python wayback_downloader.py "https://web.archive.org/web/20150315000000/example.com" -o my_recovered_site
```

Limit the download to the first 50 pages (useful for testing):

```bash
python wayback_downloader.py "https://web.archive.org/web/20150315000000/example.com" --max-pages 50
```

Adjust the delay between requests (for faster or slower downloads):

```bash
python wayback_downloader.py "https://web.archive.org/web/20150315000000/example.com" --delay 2.0
```

Command-line arguments:

- `wayback_url`: (Required) Full Wayback Machine URL with timestamp
- `-o, --output`: Output directory (default: `downloaded_site`)
- `--max-pages`: Maximum number of pages to download (default: unlimited)
- `--delay`: Delay in seconds between requests (default: 1.0; recommended: 1.0-2.0)
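For reference, the sketch below shows how such an interface could be wired up with argparse. It mirrors the options listed above but is illustrative only, not necessarily the tool's exact implementation.

```python
import argparse

# Illustrative sketch of the command-line interface described above; argument
# names mirror the documented options, but this is not the tool's actual code.
parser = argparse.ArgumentParser(
    description="Download an archived website snapshot from the Wayback Machine"
)
parser.add_argument("wayback_url", help="Full Wayback Machine URL with timestamp")
parser.add_argument("-o", "--output", default="downloaded_site", help="Output directory")
parser.add_argument("--max-pages", type=int, default=None, help="Maximum number of pages to download")
parser.add_argument("--delay", type=float, default=1.0, help="Delay in seconds between requests")

# Example invocation (normally the arguments come from the command line):
args = parser.parse_args([
    "https://web.archive.org/web/20150315000000/example.com",
    "--max-pages", "50",
])
print(args.wayback_url, args.output, args.max_pages, args.delay)
```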
The tool includes built-in rate limiting to be respectful to the Internet Archive's servers:
- Default delay: 1 second between requests (recommended)
- Automatic retry: HTTP 429 (rate limit) responses trigger exponential backoff
- Max retries: 5 attempts with increasing delays (2s, 4s, 8s, 16s, 32s)
- User-Agent: Identifies requests as coming from this tool
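The retry behavior can be pictured with the minimal sketch below, assuming a `requests.Session`. The function name `fetch_with_backoff` and its details are illustrative assumptions, not the tool's actual internals.

```python
import time
from typing import Optional

import requests


def fetch_with_backoff(session: requests.Session, url: str,
                       delay: float = 1.0, max_retries: int = 5) -> Optional[requests.Response]:
    """Illustrative sketch: GET a URL with a base delay and exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        time.sleep(delay)  # base delay applied before every request
        response = session.get(url, timeout=30)
        if response.status_code != 429:
            return response
        wait = 2 ** (attempt + 1)  # 2s, 4s, 8s, 16s, 32s
        print(f"Rate limited (429), retrying in {wait}s...")
        time.sleep(wait)
    return None  # give up after max_retries rate-limited attempts
```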
Important: The Internet Archive enforces rate limits to manage server load:
- Exceeding ~60 requests/minute may trigger HTTP 429 responses
- Persistent violations can result in temporary IP blocks (1+ hours)
- Use `--delay 1.0` or higher for large downloads to avoid issues

If you need faster downloads, you can reduce the delay (e.g., `--delay 0.5`), but monitor for rate limit errors. If you see "Rate limited (429)" messages, the tool will automatically slow down.
How it works:

- URL Parsing: Extracts the timestamp and original domain from the Wayback Machine URL
- Crawling: Starts from the initial URL and follows all internal links
- Downloading: Downloads HTML pages and all referenced resources (images, CSS, JS)
- URL Rewriting: Converts all links to relative paths for local browsing
- Cleanup: Removes Wayback Machine toolbar and scripts
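As an illustration of the first step, here is a minimal sketch of extracting the timestamp and original URL from a snapshot URL. The regex and the function name are illustrative assumptions, not the tool's actual code.

```python
import re

# Illustrative sketch of step 1 (URL parsing); not the tool's actual implementation.
WAYBACK_RE = re.compile(r"https?://web\.archive\.org/web/(\d{4,14})[a-z_]*/(.+)", re.IGNORECASE)


def parse_wayback_url(wayback_url):
    """Return (timestamp, original_url) from a Wayback Machine snapshot URL."""
    match = WAYBACK_RE.match(wayback_url)
    if not match:
        raise ValueError("Expected https://web.archive.org/web/TIMESTAMP/ORIGINAL_URL")
    timestamp, original = match.groups()
    if not original.startswith(("http://", "https://")):
        original = "http://" + original  # older snapshots often omit the scheme
    return timestamp, original


print(parse_wayback_url("https://web.archive.org/web/20150315000000/example.com"))
# -> ('20150315000000', 'http://example.com')
```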
The downloaded site will be saved in the specified output directory with the same structure as the original site. To view your recovered site:
- Navigate to the output directory
- Open `index.html` in your web browser
- Browse the site normally - all links will work locally
Tips:

- Start with `--max-pages 10` to test before downloading entire sites
- Large sites may take a long time - the default 1-second delay between requests ensures respectful usage
- Not all archived pages may be available - the tool will skip missing resources
- Check your output directory periodically to monitor progress
- If you encounter rate limiting (429 errors), the tool will automatically retry with exponential backoff
- For very large sites, consider increasing the delay to 2.0 seconds: `--delay 2.0`
Troubleshooting:

Missing dependencies error:
```bash
# Using pip
pip install requests beautifulsoup4

# Or using uv
uv sync
```

Invalid Wayback URL error: Make sure your URL includes the timestamp and follows this format:

```
https://web.archive.org/web/TIMESTAMP/ORIGINAL_URL
```
Rate limit errors (HTTP 429): The tool automatically handles rate limiting with exponential backoff. If you see repeated rate limit messages:
- Increase the delay: `--delay 2.0` or higher
- The tool will automatically retry up to 5 times with increasing delays
- Persistent rate limiting may indicate an IP block (wait 1+ hours)
Slow downloads:
This is normal and intentional. The tool uses a 1-second delay between requests to be respectful to archive.org. You can monitor progress in the terminal output. For faster downloads, reduce the delay at your own risk: `--delay 0.5`
Limitations:

- Only downloads content available in the Wayback Machine
- Some dynamic features may not work (JavaScript-heavy sites, AJAX)
- External resources (from other domains) are not downloaded
- Very large sites may take considerable time to download
The project includes a comprehensive test suite. To run the tests:
- Install development dependencies:

```bash
pip install pytest pytest-mock
```

Or with uv (recommended):

```bash
uv sync --dev
```

- Run the test suite:

```bash
pytest
```

Or with uv:

```bash
uv run pytest
```

Run with verbose output:

```bash
uv run pytest -v
```

Run with coverage report (requires pytest-cov):

```bash
uv run pytest --cov=wayback_downloader --cov-report=html
```

The test suite covers:
- URL parsing and validation
- File type detection
- Internal vs external URL detection
- URL to filepath conversion
- Rate limiting and retry logic
- HTTP 429 handling with exponential backoff
- Session management and headers
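As a sketch of what the URL-parsing tests might look like, consider the example below. The imported helper name `parse_wayback_url` and the ValueError behavior are hypothetical stand-ins for whatever the module actually exposes.

```python
import pytest

# Hypothetical import: the helper name is assumed for illustration and may not
# match the module's actual API.
from wayback_downloader import parse_wayback_url


def test_parse_valid_wayback_url():
    timestamp, original = parse_wayback_url(
        "https://web.archive.org/web/20150315000000/example.com"
    )
    assert timestamp == "20150315000000"
    assert "example.com" in original


def test_parse_invalid_url_raises():
    with pytest.raises(ValueError):
        parse_wayback_url("https://example.com/not-an-archive-url")
```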
The project uses Ruff for linting and formatting - a fast, modern Python linter and formatter written in Rust.
Run the linter:
```bash
uv run ruff check .
```

Auto-fix linting issues:

```bash
uv run ruff check --fix .
```

Format code:

```bash
uv run ruff format .
```

Check if code is formatted:

```bash
uv run ruff format --check .
```

Run both linting and formatting:

```bash
uv run ruff check . && uv run ruff format .
```

This project is licensed under the MIT License - see the LICENSE.txt file for details.
Please be respectful of the Internet Archive's resources and use reasonable rate limiting when downloading archived content.