grey-land/warc-browser

warc-browser

A CLI toolkit for working with web archives.

warc-browser uses the Chrome DevTools Protocol to automate compatible web browsers. It captures all content for a given web page (HTML, CSS, JS, images, videos, PDFs, ...) and stores the results in a .warc file. It grew out of the need to archive web pages quickly and in a scriptable manner.

Installation

make build
./warc-browser --help

Usage

Archive a URL with the browser running in headless mode.

warc-browser --output-dir /tmp/archives browser --headless archive --url http://example.com

Attach to a running browser, list the available tabs, then capture a specific tab.

# Start Chromium with remote debugging enabled
# (the URL is quoted so the shell does not treat "?" as a glob character)
chromium --remote-debugging-port=9222 --url 'https://duckduckgo.com/?q=web+archive'
# List the tabs of the running browser
warc-browser browser -a
# Archive the first tab (index 0)
warc-browser browser -a archive -t 0

Start a web server with a simple UI for browsing the collected archives.

warc-browser ui

Then open localhost:8080 in your browser.


Software used

  1. github.com/go-rod/rod, a web automation framework, for driving the browser
  2. github.com/nlnwa/gowarc for composing WARC records
  3. github.com/webrecorder/replayweb.page for visualizing records in the web UI

Test coverage: 61.2% of statements
