Web archival

related: webarchive infra linkrot

[A] why? webarchive

TODO motivation: if I do this, I'd be able to search across all the pages I've ever visited webarchive search memex

CREATED: [2019-04-19]

STRT [B] Archiving-URLs - Gwern.net webarchive

CREATED: [2018-06-21]
The most ambitious & total approach to local caching is to set up a proxy to do your browsing through, and record literally all your web traffic; for example, using Live Archiving Proxy (LAP) or WarcProxy which will save as WARC files every page you visit through it. (Zachary Vance explains how to set up a local HTTPS certificate to MITM your HTTPS browsing as well.)
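
For flavour, a minimal sketch of the proxy approach using the Python warcprox tool (a sibling of the proxies above; flag spellings are from memory, so verify against warcprox --help):

  pip install warcprox
  # record everything passing through the proxy into ./warcs,
  # using a generated CA certificate so HTTPS can be MITMed too
  warcprox -p 8000 -c ./warcprox-ca.pem -d ./warcs
  # then point the browser's proxy settings at localhost:8000
  # and trust warcprox-ca.pem in the browser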

One may be reluctant to go this far, and prefer something lighter-weight, such as periodically extracting a list of visited URLs from one’s web browser and then attempting to archive them.
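
A minimal sketch of that lighter-weight approach for Firefox (the profile path is a placeholder; the database is copied first because Firefox keeps it locked while running):

  # grab a copy of the history database
  cp ~/.mozilla/firefox/XXXXXXXX.default/places.sqlite /tmp/places.sqlite
  # dump every visited url and hand the list to the archiver
  sqlite3 /tmp/places.sqlite 'SELECT url FROM moz_places WHERE visit_count > 0;' \
      | archivebox add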

STRT [2018-11-05] just back up everything you can find in promnesia? webarchive promnesia

[B] archivebox webarchive archivebox

The tool I'm currently using; very decent: https://github.com/ArchiveBox/ArchiveBox#readme

STRT [B] ok, first instapaper run webarchive archivebox

CREATED: [2020-08-11]
[√] 2020-08-11 01:33:33 Update of 252 pages complete (146.68 min)
    - 0 links skipped
    - 228 links updated
    - 24 links had errors

disk usage per snapshot dir (du):

    535M  ./1597100812.87
    609M  ./1597100812.31
    757M  ./1597100812.221
    1.1G  ./1597100812.173
    8.5G  .

Ok, and a second run the next day said it had already added all of them to the index. Nice!

TODO [B] issues webarchive archivebox

CREATED: [2020-08-11]

TODO hmm, wonder how it managed to do the user mapping??? is 1000 just some default docker thing? webarchive archivebox

TODO suggest using run --rm webarchive archivebox
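
On both of the above: 1000 is simply the default UID of the first user on most Linux distros, and the official image happens to use it internally too, which is why file ownership works out by accident. A sketch combining --rm with explicit user mapping (image name and /data mount as in the official docker instructions, worth double-checking):

  # --rm throws the container away afterwards;
  # --user makes archived files owned by the host user regardless of image defaults
  docker run --rm --user "$(id -u):$(id -g)" \
      -v "$PWD:/data" archivebox/archivebox add 'https://example.com'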

TODO crap, timestamps, not SHAs, are used… again?? webarchive archivebox

TODO ok, need to multithread… webarchive archivebox

TODO add command – set maximum limit for data transferred? webarchive archivebox

TODO prune command – I think I had some scripts already… webarchive archivebox

TODO index web interface – might be useful to have size? for detecting largest offenders webarchive archivebox
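
In the meantime the sizes are easy to get from the shell (assuming the default layout with one timestamped directory per snapshot under archive/):

  # biggest snapshot directories first
  du -sh archive/* | sort -rh | head -20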

TODO index web interface – would be nice to mark sites that errored? Not sure what the actionable outcome of that is, though webarchive archivebox

TODO this issue https://github.com/pirate/ArchiveBox/issues/412 webarchive archivebox

  • run archivebox init
  • run some export
  • run another export (potentially overlapping, but with new urls)
  • it seems to fail…

[2020-08-11] ok, need to start without the pdf, screenshot etc… takes too long webarchive archivebox

also make sure it's possible to add pdfs as an afterthought?
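
A sketch of doing exactly that with the new CLI (option names as in the 0.4-era config; verify with archivebox config --help):

  # first pass: skip the slow extractors
  archivebox config --set SAVE_PDF=False
  archivebox config --set SAVE_SCREENSHOT=False
  archivebox add < urls.txt
  # later: turn them back on and backfill the existing snapshots
  archivebox config --set SAVE_PDF=True
  archivebox config --set SAVE_SCREENSHOT=True
  archivebox update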

STRT [B] [2020-08-05] Release Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index · pirate/ArchiveBox webarchive archivebox

Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index

TODO [B] trying out the new one webarchive archivebox

CREATED: [2020-10-25]
  • how does it retrieve images?
  • singlefile vs wget – not sure?? singlefile is nice though
  • mercury??? apparently not documented yet, but same as readability?
  • readability is pretty neat – also contains images (as base64)
  • warc??
  • hmm, DOM is probably HTML??

TODO [2020-10-25] would be nice to have parallel execution or something… webarchive archivebox

STRT [B] [2020-10-25] hmm, if archiving is interrupted, how to carry on? apparently 'archivebox update'? webarchive archivebox

  • [2020-10-25] ok, it fetches new data on config change when running update? that's nice

TODO [2020-10-25] media – could def download later/in parallel… webarchive archivebox

TODO [C] ok, I think I just want to take promnesia and run it against all non-browser sources webarchive archivebox promnesia

CREATED: [2020-08-11]

would be nice to mark different sources as well if possible?

TODO [C] Bookmark Archiver https://pirate.github.io/bookmark-archiver webarchive archivebox

CREATED: [2018-07-24]

DONE maybe just feed promnesia database to it?? webarchive archivebox

  • DONE I guess need promnesia provider. is it like my.links? webarchive archivebox hpi
  • TODO move run script somewhere else; add ability to put output dir somewhere else webarchive archivebox

right, so archive just redoes the index? Should run it against wereyouhere I suppose… webarchive archivebox
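
A sketch of the feeding part, assuming promnesia's cache db keeps a visits table with an orig_url column (db path and schema vary between versions, so inspect first):

  # pull every distinct url promnesia knows about and add them to the archive
  sqlite3 ~/.local/share/promnesia/promnesia.sqlite \
      'SELECT DISTINCT orig_url FROM visits;' | archivebox add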

TODO [C] commit my changes to archiver, maybe even add the scripts? webarchive archivebox

TODO figure out 404s etc webarchive archivebox

[2019-04-06] should run it after I normalise all the wereyouhere links? webarchive archivebox

I guess filter out all suspicious ones, containing special characters?
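
Something like this would do as a first pass at that filtering (the character class is just a guess at what counts as suspicious):

  # keep only urls without quotes, spaces, or other shell-hostile characters
  grep -vE '[][ <>"{}|^`]' urls.txt > urls.clean.txt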

[2019-04-16] ok, he's working on a django backend where we can use hashes https://github.com/pirate/ArchiveBox/issues/74 webarchive archivebox

TODO [C] I guess some sites (with comments) – useful to update regularly, but most are okay with one snapshot? webarchive archivebox

CREATED: [2020-10-25]

TODO [C] status command is kinda similar to my old blame script? (might be on a branch) webarchive archivebox

CREATED: [2020-10-26]

TODO [C] only save mp3 for youtube videos? I guess it should be selective… or maybe depend on the number of views webarchive archivebox

CREATED: [2020-10-26]
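
For reference, the downloader ArchiveBox shells out to for media can do audio-only by itself (wiring these flags through ArchiveBox's config is a separate question; the url is a placeholder):

  # fetch only the audio track and transcode it to mp3
  youtube-dl -x --audio-format mp3 'https://www.youtube.com/watch?v=PLACEHOLDER'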

TODO [D] wonder if my exporters could be useful for archivebox webarchive archivebox orger promnesia

CREATED: [2019-09-22]

[D] [2019-04-16] pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… webarchive archivebox

Storage Requirements
Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5GB per 1000 articles, but your mileage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting FETCH_MEDIA=False to skip audio & video files.
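
On the ZFS suggestion at the end: enabling compression on the dataset holding the archive is a one-liner (pool/dataset name here is made up):

  # lz4 is cheap enough to leave enabled everywhere
  zfs set compression=lz4 tank/archivebox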

[D] [2019-04-16] pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… webarchive archivebox

Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs).

[D] [2019-04-16] [pirate/ArchiveBox] Bugfixes, new data integrity and invariant checks, remove title prefetching webarchive archivebox

re-save index after archiving completes to update titles and urls
remove title prefetching in favor of new FETCH_TITLE archive method

TODO [D] backup config? webarchive archivebox

CREATED: [2020-10-26]

STRT [B] prioritise never-bookmarked over bookmarked-with-errors webarchive

CREATED: [2018-11-10]

TODO commit it?? webarchive

TODO [C] some links are pretty crazy… maybe prune huge pages manually and ignore them webarchive

CREATED: [2018-11-15]

e.g. this one is about 150M:

    wget -N -E -np -x -H -k -K -S --restrict-file-names=unix -p \
        --user-agent='Bookmark Archiver' --no-check-certificate \
        'https://charlie-charlie.ru/breakfast'
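
Related to the "maximum limit for data transferred" TODO above: wget can cap a recursive retrieval by itself (note that --quota is ignored for single-file downloads):

  # stop the crawl once roughly 200 MB have been fetched
  wget --quota=200m -p -k 'https://charlie-charlie.ru/breakfast'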

[D] [2018-10-03] kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager webarchive

[2018-10-05] wonder how it's different from my bookmark archiver? webarchive

TODO [C] https://github.com/webrecorder/webrecorder webarchive

CREATED: [2020-05-04]

[D] Tweet from @gwern webarchive linkrot

CREATED: [2020-02-27]

@gwern: @karlicoss @thomas536 Not documented in there yet is my latest archiving tool: https://t.co/If2Ypw1T1M https://t.co/NLh23nrkrh Currently costs 20GB for 7,677 PDFs & self-contained single-file HTML mirrors.

[C] [2019-12-20] Web Archiving Community · pirate/ArchiveBox Wiki webarchive linkrot

[C] [2019-12-11] Verizon/Yahoo Blocking Attempts to Archive Yahoo Groups – Deletion: Dec. 14 | Hacker News webarchive

Even if we still had the Library of Alexandria, it may have shed zero light on the actual lives of citizens. Archiving content on the internet means capturing thousands of individual-level perspectives and experiences. We don't know what will end up being important to historians 50 or 100 years from now. I would bet there are dozens if not hundreds of historians that would give anything for a record of their favorite time period that contains even a fraction of the amount of content today's archive efforts are storing.

[B] [2020-05-28] site-deaths - IndieWeb webarchive linkrot

[C] [2019-04-19] Bret Victor on Twitter: "60% of my fav links from 10 yrs ago are 404. I wonder if Library of Congress expects 60% of their collection to go up in smoke every decade." webarchive linkrot

[2019-06-13] Time Travel: Find Mementos in Internet Archive, Archive-It, British Library, archive.today, GitHub and many more! http://timetravel.mementoweb.org/ webarchive search linkrot

DONE [A] [2019-12-22] This Page is Designed to Last webarchive linkrot

TODO [D] [2019-07-08] Fund: On-Demand Web Archiving of Annotated Pages – Hypothesis https://web.hypothes.is/blog/fund-on-demand-web-archiving-of-annotated-pages/ webarchive linkrot

[D] [2020-03-06] Archiving URLs | Hacker News https://news.ycombinator.com/item?id=6504331 webarchive

[C] [2021-02-25] Wikipedia:Database download - Wikipedia webarchive wikipedia

pages-articles-multistream.xml.bz2 – Current revisions only, no talk or user pages; this is probably what you want, and is approximately 18 GB compressed (expands to over 78 GB when decompressed).

TODO [C] ugh. image preservation is a mess… webarchive wikipedia

CREATED: [2021-02-25]

[2021-02-25] Wikipedia:Database download - Wikipedia webarchive

pages-articles.xml.bz2 and pages-articles-multistream.xml.bz2 both contain the same xml contents. So if you unpack either, you get the same data. But with multistream, it is possible to get an article from the archive without unpacking the whole thing.
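
What that enables in practice, sketched below (filename, offset, and article are placeholders; the companion -index file lists offset:page_id:title for each ~100-page stream):

  # find the byte offset of the stream containing a given article
  bzgrep ':Albedo$' enwiki-latest-pages-articles-multistream-index.txt.bz2
  # -> e.g. 123456789:39579:Albedo
  # decompress just that stream: each stream is an independent bz2 member
  dd if=enwiki-latest-pages-articles-multistream.xml.bz2 \
      bs=4M iflag=skip_bytes skip=123456789 | bzcat | head -100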

STRT [C] [2021-02-25] Main Page - Kiwix webarchive prepping wikipedia

[C] [2021-02-25] jeharu comments on The full English Wikipedia on Kiwix now weighs 79Gb instead of 94Gb thanks to improvements in image compression webarchive kiwix prepping

jeharu: no support yet for incremental updating, right? bummer.

The_other_kiwix_guy: We've started working on a prototype but that'll take time and a lot more money than we have. Would not expect anything before another 2-3 years.

hm okay, sad… guess I can do a backup per year or smth for now

[2021-02-25] Wikipedia:Database download - Wikipedia webarchive

The only downside to multistream is that it is marginally larger

TODO [C] would be nice to maybe tag urls… e.g. which source they come from webarchive archivebox

CREATED: [2021-03-26]

or just have a special source for manual notes/exobrainy stuff and another one for the rest?

TODO [B] def archive things I post (e.g. referenced in my own tweets/comments etc) webarchive self archivebox

CREATED: [2021-03-26]

TODO [B] could also check archive.is api? webarchive
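
A hedged sketch of the remote-archiving calls (the Wayback Machine's /save/ endpoint is public and well-known; archive.is/archive.today submission is less documented, so that line is a guess at its submit form and may need an anti-bot token):

  # archive.today: submit a url for archiving
  curl -s -d 'url=https://example.com' 'https://archive.ph/submit/'
  # the Wayback Machine equivalent, for comparison
  curl -s 'https://web.archive.org/save/https://example.com'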

TODO [C] hmm, a bit confused about archive.is – how reliable is it? is it backed up anywhere? perhaps I should still save stuff from there locally… webarchive

CREATED: [2021-03-21]

TODO [B] pdfs, on the other hand, are a bit higher priority? webarchive

CREATED: [2021-03-21]

TODO [C] [2021-03-26] AdGuardHome/whotracksme.json at master · AdguardTeam/AdGuardHome webarchive archivebox

could use this to prune?
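
i.e. drop known tracker/ad domains from the url list before archiving; a rough sketch, guessing that the json exposes tracker domains under some top-level key (inspect the actual structure first):

  # extract the tracker domain list (key name is a guess)
  jq -r '.trackerDomains | keys[]' whotracksme.json > trackers.txt
  # discard urls whose host matches a tracker domain
  grep -vFf trackers.txt urls.txt > urls.pruned.txt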

[D] [2019-05-25] motivation: If a Pulitzer-finalist 34-part series of investigative journalism can vanish from the web, anything can. /r/DataHoarder webarchive datahoarding
