Web archival

related webarchiveinfralinkrot
[A] * why? webarchive
- TODO motivation: If I do it, I would be able to search on all pages I ever visited webarchivesearchmemex
- STRT [B] Archiving-URLs - Gwern.net webarchive
  - STRT [2018-11-05] just backup everything you can find in promnesia? webarchivepromnesia
[B] * archivebox webarchivearchivebox
- STRT [B] ok, first instapaper run webarchivearchivebox
- TODO [B] issues webarchivearchivebox
- STRT [B] [2020-08-05] Release Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index · pirate/ArchiveBox webarchivearchivebox
- TODO [B] trying out the new one webarchivearchivebox
- TODO [C] ok, I think I just want to take promnesia and run it against all non-browser sources webarchivearchiveboxpromnesia
- TODO [C] bookmark Archiver https://pirate.github.io/bookmark-archiver webarchivearchivebox
- TODO [C] I guess some sites (with comments) – useful to update regularly, but most are okay with one snapshot? webarchivearchivebox
- TODO [C] status command is kinda similar to my old blame script? (might be on a branch) webarchivearchivebox
- TODO [C] only save mp3 for youtube videos? I guess it should be selective… or maybe depend on number of views webarchivearchivebox
- TODO [D] wonder if my exporters could be useful for archivebox webarchivearchiveboxorgerpromnesia
- [D] [2019-04-16] pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… webarchivearchivebox
- [D] [2019-04-16] pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… webarchivearchivebox
- [D] [2019-04-16] [pirate/ArchiveBox] Bugfixes, new data integrity and invariant checks, remove title prefetching webarchivearchivebox
- TODO [D] backup config? webarchivearchivebox
STRT [B] prioritise never bookmarked over bookmarked with errors webarchive
- TODO commit it?? webarchive
TODO [C] some links are pretty crazy… maybe prune huge pages manually and ignore webarchive
[D] [2018-10-03] kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager webarchive
- [2018-10-05] wonder how is it different from my bookmark archiver? webarchive
TODO [C] https://github.com/webrecorder/webrecorder webarchive
[D] Tweet from @gwern webarchivelinkrot
[C] [2019-12-20] Web Archiving Community · pirate/ArchiveBox Wiki webarchivelinkrot
[C] [2019-12-11] Verizon/Yahoo Blocking Attempts to Archive Yahoo Groups – Deletion: Dec. 14 | Hacker News webarchive
[B] [2020-05-28] site-deaths - IndieWeb webarchivelinkrot
[C] [2019-04-19] Bret Victor on Twitter: "60% of my fav links from 10 yrs ago are 404. I wonder if Library of Congress expects 60% of their collection to go up in smoke every decade." webarchivelinkrot
[2019-06-13] Time Travel: Find Mementos in Internet Archive, Archive-It, British Library, archive.today, GitHub and many more! http://timetravel.mementoweb.org/ webarchivesearchlinkrot
DONE [A] [2019-12-22] This Page is Designed to Last webarchivelinkrot
TODO [D] [2019-07-08] Fund: On-Demand Web Archiving of Annotated Pages – Hypothesis https://web.hypothes.is/blog/fund-on-demand-web-archiving-of-annotated-pages/ webarchivelinkrot
[D] [2020-03-06] Archiving URLs | Hacker News https://news.ycombinator.com/item?id=6504331 webarchive
[C] [2021-02-25] Wikipedia:Database download - Wikipedia webarchivewikipedia
TODO [C] ugh. image preservation is a mess… webarchivewikipedia
[2021-02-25] Wikipedia:Database download - Wikipedia webarchive
STRT [C] [2021-02-25] Main Page - Kiwix webarchivepreppingwikipedia
[C] [2021-02-25] jeharu comments on The full English Wikipedia on Kiwix now weighs 79Gb instead of 94Gb thanks to improvements in image compression webarchivekiwixprepping
[2021-02-25] Wikipedia:Database download - Wikipedia webarchive
TODO [C] would be nice to maybe tag urls… e.g. which source they are coming from webarchivearchivebox
TODO [B] def archive things I post (e.g. referenced in my own tweets/comments etc) webarchiveselfarchivebox
TODO [B] could also check archive.is api? webarchive
TODO [C] hmm, a bit confused about archive.is – how reliable is it? is it backed up somewhere? perhaps still should save stuff from there locally… webarchive
TODO [B] pdfs on the other hand are a bit of higher priority? webarchive
TODO [C] [2021-03-26] AdGuardHome/whotracksme.json at master · AdguardTeam/AdGuardHome webarchivearchivebox
[D] [2019-05-25] motivation: If a Pulitzer-finalist 34-part series of investigative journalism can vanish from the web, anything can. /r/DataHoarder webarchivedatahoarding

¶related webarchiveinfralinkrot

¶[A] * why? webarchive

¶TODO motivation: If I do it, I would be able to search on all pages I ever visited webarchivesearchmemex

CREATED: [2019-04-19]

¶STRT [B] Archiving-URLs - Gwern.net webarchive

CREATED: [2018-06-21]

The most ambitious & total approach to local caching is to set up a proxy to do your browsing through, and record literally all your web traffic; for example, using Live Archiving Proxy (LAP) or WarcProxy which will save as WARC files every page you visit through it. (Zachary Vance explains how to set up a local HTTPS certificate to MITM your HTTPS browsing as well.)

One may be reluctant to go this far, and prefer something lighter-weight, such as periodically extracting a list of visited URLs from one’s web browser and then attempting to archive them.
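
A minimal sketch of the lighter-weight option, assuming the browser is Firefox and that you work on a copy of its `places.sqlite` history database (the function name `visited_urls` is made up for illustration, and the `moz_places` table with `url` and `last_visit_date` columns is the part of the schema this relies on):

```python
import sqlite3
from pathlib import Path


def visited_urls(places_db: Path) -> list[str]:
    """Return visited http(s) URLs from a copy of places.sqlite, oldest visit first."""
    # Open read-only through a URI so the history database is never modified.
    uri = places_db.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT url FROM moz_places "
            "WHERE last_visit_date IS NOT NULL "
            "ORDER BY last_visit_date"
        ).fetchall()
    finally:
        conn.close()
    # Skip about:, file: and similar entries, which cannot be archived from the web.
    return [url for (url,) in rows if url.startswith(("http://", "https://"))]
```

The resulting list can be written one URL per line and handed to whatever archiver is in use.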

¶STRT [2018-11-05] just backup everything you can find in promnesia? webarchivepromnesia

¶[B] * archivebox webarchivearchivebox

The tool I'm currently using, very decent https://github.com/ArchiveBox/ArchiveBox#readme

¶STRT [B] ok, first instapaper run webarchivearchivebox

CREATED: [2020-08-11]

[√] 2020-08-11 01:33:33 Update of 252 pages complete (146.68 min)
    - 0 links skipped
    - 228 links updated
    - 24 links had errors
...
535M	./1597100812.87
609M	./1597100812.31
757M	./1597100812.221
1.1G	./1597100812.173
8.5G	.

Ok, and second run the next day said it's already added all of them to index. Nice!
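
Rough per-page averages from the log above, assuming the whole 8.5G reported by `du` comes from this first run and treating `G` as GiB:

```python
pages = 252        # "Update of 252 pages complete"
minutes = 146.68   # wall time from the same log line
total_gib = 8.5    # "8.5G ." from du, treated as GiB (assumption)

seconds_per_page = minutes * 60 / pages
mib_per_page = total_gib * 1024 / pages

print(f"{seconds_per_page:.1f} s/page, {mib_per_page:.1f} MiB/page")
# prints: 34.9 s/page, 34.5 MiB/page
```

These are averages only; the `du` lines show that a few snapshots (535M to 1.1G each) account for a large share of the total.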

¶TODO [B] issues webarchivearchivebox

CREATED: [2020-08-11]

¶TODO hmm wonder how did it manage to do user mapping??? is 1000 just some default docker thing? webarchivearchivebox

¶TODO suggest to use `run --rm` webarchivearchivebox

¶TODO crap, timestamps, not shas are used… again?? webarchivearchivebox
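
What I have in mind instead of timestamps, as a sketch: a key derived from the URL itself, so the same URL always lands in the same place no matter when it was added. `snapshot_key` is a hypothetical helper, not something ArchiveBox provides, and truncating to 16 hex characters is an arbitrary choice.

```python
import hashlib


def snapshot_key(url: str, length: int = 16) -> str:
    """Stable identifier for a URL: hashing the same URL always gives the same key."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # A truncated hex digest is shorter to read; collisions become more likely as length shrinks.
    return digest[:length]
```

Note that this only helps if URLs are normalised first, otherwise trivially different spellings of the same page get different keys.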

¶TODO ok, need to multithread.. webarchivearchivebox

¶TODO add command – set maximum limit for data transferred? webarchivearchivebox

¶TODO prune command – I think I had some scripts already… webarchivearchivebox

¶TODO index web interface – might be useful to have size? for detecting largest offenders webarchivearchivebox

¶TODO index web interface – would be nice to mark sites that errored? Not sure what's the actionable outcome of that though webarchivearchivebox

¶TODO this issue https://github.com/pirate/ArchiveBox/issues/412 webarchivearchivebox

run archivebox init

run some export webarchivearchivebox
run another export (potentially overlapping?, but with new urls) webarchivearchivebox
it seems to fail… webarchivearchivebox

¶[2020-08-11] ok, need to start without the pdf, screenshot etc… takes too long webarchivearchivebox

also make sure it's possible to add pdfs as an afterthought?

¶STRT [B] [2020-08-05] Release Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index · pirate/ArchiveBox webarchivearchivebox

Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index

¶TODO [B] trying out the new one webarchivearchivebox

CREATED: [2020-10-25]

how does it retrieve images?
singlefile vs wget – not sure?? singlefile is nice though
mercury??? apparently not documented yet, but same as readability?
readability is pretty neat – also contains images (as base64)
warc??
hmm, DOM is probably HTML??

¶TODO [2020-10-25] would be nice to have parallel execution or something.. webarchivearchivebox

¶STRT [B] [2020-10-25] hmm, if archiving is interrupted, how to carry on? apparently 'archivebox update'? webarchivearchivebox

[2020-10-25] ok, it fetches new data on config change when running update? that's nice webarchivearchivebox

¶TODO [2020-10-25] media – could def download later/in parallel.. webarchivearchivebox

¶TODO [C] ok, I think I just want to take promnesia and run it against all non-browser sources webarchivearchiveboxpromnesia

CREATED: [2020-08-11]

would be nice to mark different sources as well if possible?

¶TODO [C] bookmark Archiver https://pirate.github.io/bookmark-archiver webarchivearchivebox

CREATED: [2018-07-24]

¶DONE maybe just feed promnesia database to it?? webarchivearchivebox

DONE I guess need promnesia provider. is it like my.links? webarchivearchiveboxhpi
TODO move run script somewhere else; add ability to put output dir somewhere else webarchivearchivebox

¶right, so just archive redoes the index? Should run it against wereyouhere I suppose… webarchivearchivebox

¶TODO [C] commit my changes to archiver, maybe even add the scripts? webarchivearchivebox

¶TODO figure out 404 etc webarchivearchivebox

¶[2019-04-06] should run it after I normalise all the wereyouhere links? webarchivearchivebox

I guess filter out all suspicious ones, containing special characters?
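
A possible filter, as a sketch. The assumption here is that "suspicious" means: not http(s), no host, or containing any character outside the set RFC 3986 allows in a URI (so raw spaces, angle brackets, newlines and unencoded non-ASCII all fail). `looks_clean` is a made-up name.

```python
import re
from urllib.parse import urlsplit

# Unreserved and reserved characters from RFC 3986, plus '%' for percent-encoding.
_ALLOWED = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def looks_clean(url: str) -> bool:
    """True if the URL is http(s), has a host, and contains only RFC 3986 characters."""
    if not _ALLOWED.fullmatch(url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. an unbalanced IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

This is deliberately strict: it would also reject legitimate links pasted with unencoded non-ASCII paths, so the rejects are worth eyeballing rather than dropping blindly.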

¶[2019-04-16] ok, he's working on django backend where we can use hashes https://github.com/pirate/ArchiveBox/issues/74 webarchivearchivebox

¶TODO [C] I guess some sites (with comments) – useful to update regularly, but most are okay with one snapshot? webarchivearchivebox

CREATED: [2020-10-25]

¶TODO [C] status command is kinda similar to my old blame script? (might be on a branch) webarchivearchivebox

CREATED: [2020-10-26]

¶TODO [C] only save mp3 for youtube videos? I guess it should be selective… or maybe depend on number of views webarchivearchivebox

CREATED: [2020-10-26]

¶TODO [D] wonder if my exporters could be useful for archivebox webarchivearchiveboxorgerpromnesia

CREATED: [2019-09-22]

¶[D] [2019-04-16] pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… webarchivearchivebox

https://github.com/pirate/ArchiveBox/

Storage Requirements
Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5gb per 1000 articles, but your mileage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting FETCH_MEDIA=False to skip audio & video files.

¶[D] [2019-04-16] pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… webarchivearchivebox

https://github.com/pirate/ArchiveBox/

Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs).

¶[D] [2019-04-16] [pirate/ArchiveBox] Bugfixes, new data integrity and invariant checks, remove title prefetching webarchivearchivebox

re-save index after archiving completes to update titles and urls
remove title prefetching in favor of new FETCH_TITLE archive method

¶TODO [D] backup config? webarchivearchivebox

CREATED: [2020-10-26]

¶STRT [B] prioritise never bookmarked over bookmarked with errors webarchive

CREATED: [2018-11-10]

¶TODO commit it?? webarchive

¶TODO [C] some links are pretty crazy… maybe prune huge pages manually and ignore webarchive

CREATED: [2018-11-15]

e.g. wget -N -E -np -x -H -k -K -S --restrict-file-names=unix -p --user-agent="Bookmark Archiver" --no-check-certificate https://charlie-charlie.ru/breakfast
– about 150M
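
To find candidates for manual pruning, something like this could list the biggest snapshots. It is a sketch that assumes the archive root holds one subdirectory per snapshot (like the timestamped directories in the `du` output from the instapaper run); `dir_size` and `largest_snapshots` are made-up helper names.

```python
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of all regular files under path."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def largest_snapshots(archive_root: Path, top: int = 10) -> list[tuple[int, str]]:
    """Return (size_in_bytes, directory_name) for the largest snapshot directories."""
    sizes = [(dir_size(d), d.name) for d in archive_root.iterdir() if d.is_dir()]
    # Largest first; ties fall back to reverse name order.
    return sorted(sizes, reverse=True)[:top]
```

Files sitting directly in the archive root are ignored, since only directories count as snapshots here.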

¶[D] [2018-10-03] kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager webarchive

https://github.com/kanishka-linux/reminiscence

¶[2018-10-05] wonder how is it different from my bookmark archiver? webarchive

¶TODO [C] https://github.com/webrecorder/webrecorder webarchive

CREATED: [2020-05-04]

¶[D] Tweet from @gwern webarchivelinkrot

CREATED: [2020-02-27]

https://twitter.com/gwern/status/1233112807253716992

@gwern: @karlicoss @thomas536 Not documented in there yet is my latest archiving tool: https://t.co/If2Ypw1T1M https://t.co/NLh23nrkrh Currently costs 20GB for 7,677 PDFs & self-contained single-file HTML mirrors.

¶[C] [2019-12-20] Web Archiving Community · pirate/ArchiveBox Wiki webarchivelinkrot

https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community

¶[C] [2019-12-11] Verizon/Yahoo Blocking Attempts to Archive Yahoo Groups – Deletion: Dec. 14 | Hacker News webarchive

https://news.ycombinator.com/item?id=21737696

Even if we still had the Library of Alexandria, it may have shed zero light on the actual lives of citizens. Archiving content on the internet means capturing thousands of individual level perspectives and experiences. We don't know what will end up being important to historians 50 or 100 years from now. I would bet there are dozens if not hundreds of historians that would give anything for a record of their favorite time period that contains even a fraction of the amount of content today's archive efforts are storing.

¶[B] [2020-05-28] site-deaths - IndieWeb webarchivelinkrot

¶[C] [2019-04-19] Bret Victor on Twitter: "60% of my fav links from 10 yrs ago are 404. I wonder if Library of Congress expects 60% of their collection to go up in smoke every decade." webarchivelinkrot

https://twitter.com/worrydream/status/478087637031325697

¶[2019-06-13] Time Travel: Find Mementos in Internet Archive, Archive-It, British Library, archive.today, GitHub and many more! http://timetravel.mementoweb.org/ webarchivesearchlinkrot

¶DONE [A] [2019-12-22] This Page is Designed to Last webarchivelinkrot

https://jeffhuang.com/designed_to_last/

¶TODO [D] [2019-07-08] Fund: On-Demand Web Archiving of Annotated Pages – Hypothesis https://web.hypothes.is/blog/fund-on-demand-web-archiving-of-annotated-pages/ webarchivelinkrot

¶[D] [2020-03-06] Archiving URLs | Hacker News https://news.ycombinator.com/item?id=6504331 webarchive

¶[C] [2021-02-25] Wikipedia:Database download - Wikipedia webarchivewikipedia

pages-articles-multistream.xml.bz2 – Current revisions only, no talk or user pages; this is probably what you want, and is approximately 18 GB compressed (expands to over 78 GB when decompressed).

¶TODO [C] ugh. image preservation is a mess… webarchivewikipedia

CREATED: [2021-02-25]

¶[2021-02-25] Wikipedia:Database download - Wikipedia webarchive

pages-articles.xml.bz2 and pages-articles-multistream.xml.bz2 both contain the same xml contents. So if you unpack either, you get the same data. But with multistream, it is possible to get an article from the archive without unpacking the whole thing.
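
A sketch of what "without unpacking the whole thing" looks like in practice: the multistream file is a concatenation of independent bz2 streams, so you can seek to the start of one and decompress only that. The byte offset is assumed to come from the companion index file (as far as I can tell its lines are `offset:page_id:title`); `read_stream` is a made-up name.

```python
import bz2


def read_stream(dump_path: str, offset: int) -> bytes:
    """Decompress the single bz2 stream that starts at `offset`, ignoring the rest of the file."""
    decomp = bz2.BZ2Decompressor()
    out = []
    with open(dump_path, "rb") as f:
        f.seek(offset)
        # Stop as soon as this stream ends; later streams are never read in full.
        while not decomp.eof:
            chunk = f.read(256 * 1024)
            if not chunk:
                break  # truncated file: return what was decoded so far
            out.append(decomp.decompress(chunk))
    return b"".join(out)
```

The result is an XML fragment containing a batch of pages, so the wanted article still has to be picked out of it by title or id.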

¶STRT [C] [2021-02-25] Main Page - Kiwix webarchivepreppingwikipedia

¶[C] [2021-02-25] jeharu comments on The full English Wikipedia on Kiwix now weighs 79Gb instead of 94Gb thanks to improvements in image compression webarchivekiwixprepping

jeharu: no support yet for incremental updating, right? bummer.

The_other_kiwix_guy: We've started working on a prototype but that'll take time and a lot more money than we have. Would not expect anything before another 2-3 years.

hm okay sad.. guess I can do a backup per year or smth for now

¶[2021-02-25] Wikipedia:Database download - Wikipedia webarchive

The only downside to multistream is that it is marginally larger

¶TODO [C] would be nice to maybe tag urls… e.g. which source they are coming from webarchivearchivebox

CREATED: [2021-03-26]

or just have a special source for manual notes/exobrainy stuff and another one for the rest?
https://github.com/ArchiveBox/ArchiveBox/issues/660

¶TODO [B] def archive things I post (e.g. referenced in my own tweets/comments etc) webarchiveselfarchivebox

CREATED: [2021-03-26]

¶TODO [B] could also check archive.is api? webarchive

CREATED: [2021-03-21]

e.g. it archives medium-like stuff? https://archive.is/20181031123930/https://howwegettonext.com/exploring-the-future-without-cyberpunks-neon-and-noir-8e23562819e3

¶TODO [C] hmm, a bit confused about archive.is – how reliable is it? is it backed up somewhere? perhaps still should save stuff from there locally… webarchive

CREATED: [2021-03-21]

¶TODO [B] pdfs on the other hand are a bit of higher priority? webarchive

CREATED: [2021-03-21]

¶TODO [C] [2021-03-26] AdGuardHome/whotracksme.json at master · AdguardTeam/AdGuardHome webarchivearchivebox

could use this to prune?

¶[D] [2019-05-25] motivation: If a Pulitzer-finalist 34-part series of investigative journalism can vanish from the web, anything can. /r/DataHoarder webarchivedatahoarding
