Web archival

related: webarchive infra linkrot

[A] why? webarchive

TODO motivation: if I do this, I'd be able to search across all the pages I've ever visited webarchive search memex

CREATED: [2019-04-19]

STRT [B] Archiving-URLs - Gwern.net webarchive

CREATED: [2018-06-21]
The most ambitious & total approach to local caching is to set up a proxy to do your browsing through, and record literally all your web traffic; for example, using Live Archiving Proxy (LAP) or WarcProxy which will save as WARC files every page you visit through it. (Zachary Vance explains how to set up a local HTTPS certificate to MITM your HTTPS browsing as well.)
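
For flavour, a minimal sketch of the proxy approach using the Python warcprox tool (a sibling of the proxies above; flag spellings are from memory, so verify against warcprox --help):

  pip install warcprox
  # record everything passing through the proxy into ./warcs,
  # using a generated CA certificate so HTTPS can be MITMed too
  warcprox -p 8000 -c ./warcprox-ca.pem -d ./warcs
  # then point the browser's proxy settings at localhost:8000
  # and trust warcprox-ca.pem in the browser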

One may be reluctant to go this far, and prefer something lighter-weight, such as periodically extracting a list of visited URLs from one’s web browser and then attempting to archive them.
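
A minimal sketch of that lighter-weight approach for Firefox (the profile path is a placeholder; the database is copied first because Firefox keeps it locked while running):

  # grab a copy of the history database
  cp ~/.mozilla/firefox/XXXXXXXX.default/places.sqlite /tmp/places.sqlite
  # dump every visited url and hand the list to the archiver
  sqlite3 /tmp/places.sqlite 'SELECT url FROM moz_places WHERE visit_count > 0;' \
      | archivebox add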

STRT [2018-11-05] just back up everything you can find in promnesia? webarchive promnesia

[B] archivebox webarchive archivebox

The tool I'm currently using; very decent: https://github.com/ArchiveBox/ArchiveBox#readme

STRT [B] ok, first instapaper run webarchive archivebox

CREATED: [2020-08-11]
[√] 2020-08-11 01:33:33 Update of 252 pages complete (146.68 min)
    - 0 links skipped
    - 228 links updated
    - 24 links had errors

disk usage per snapshot dir (du):

    535M  ./1597100812.87
    609M  ./1597100812.31
    757M  ./1597100812.221
    1.1G  ./1597100812.173
    8.5G  .

Ok, and a second run the next day said it had already added all of them to the index. Nice!

TODO [B] issues webarchive archivebox

CREATED: [2020-08-11]

TODO hmm, wonder how it managed to do the user mapping??? is 1000 just some default docker thing? webarchive archivebox

TODO suggest using run --rm webarchive archivebox
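
On both of the above: 1000 is simply the default UID of the first user on most Linux distros, and the official image happens to use it internally too, which is why file ownership works out by accident. A sketch combining --rm with explicit user mapping (image name and /data mount as in the official docker instructions, worth double-checking):

  # --rm throws the container away afterwards;
  # --user makes archived files owned by the host user regardless of image defaults
  docker run --rm --user "$(id -u):$(id -g)" \
      -v "$PWD:/data" archivebox/archivebox add 'https://example.com'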

TODO crap, timestamps, not SHAs, are used… again?? webarchive archivebox

TODO ok, need to multithread… webarchive archivebox

TODO add command – set maximum limit for data transferred? webarchive archivebox

TODO prune command – I think I had some scripts already… webarchive archivebox

TODO index web interface – might be useful to have size? for detecting largest offenders webarchive archivebox
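
In the meantime the sizes are easy to get from the shell (assuming the default layout with one timestamped directory per snapshot under archive/):

  # biggest snapshot directories first
  du -sh archive/* | sort -rh | head -20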

TODO index web interface – would be nice to mark sites that errored? Not sure what the actionable outcome of that is, though webarchive archivebox

TODO this issue https://github.com/pirate/ArchiveBox/issues/412 webarchive archivebox

  • run archivebox init
  • run some export
  • run another export (potentially overlapping, but with new urls)
  • it seems to fail…

[2020-08-11] ok, need to start without the pdf, screenshot etc… takes too long webarchive archivebox

also make sure it's possible to add pdfs as an afterthought?
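
A sketch of doing exactly that with the new CLI (option names as in the 0.4-era config; verify with archivebox config --help):

  # first pass: skip the slow extractors
  archivebox config --set SAVE_PDF=False
  archivebox config --set SAVE_SCREENSHOT=False
  archivebox add < urls.txt
  # later: turn them back on and backfill the existing snapshots
  archivebox config --set SAVE_PDF=True
  archivebox config --set SAVE_SCREENSHOT=True
  archivebox update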

STRT [B] [2020-08-05] Release Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index · pirate/ArchiveBox webarchive archivebox

Major new ArchiveBox version, with a brand new CLI, UI, and SQLite index

TODO [B] trying out the new one webarchive archivebox

CREATED: [2020-10-25]
  • how does it retrieve images?
  • singlefile vs wget – not sure?? singlefile is nice though
  • mercury??? apparently not documented yet, but same as readability?
  • readability is pretty neat – also contains images (as base64)
  • warc??
  • hmm, DOM is probably HTML??

TODO [2020-10-25] would be nice to have parallel execution or something… webarchive archivebox

STRT [B] [2020-10-25] hmm, if archiving is interrupted, how to carry on? apparently 'archivebox update'? webarchive archivebox

  • [2020-10-25] ok, it fetches new data on config change when running update? that's nice

TODO [2020-10-25] media – could def download later/in parallel… webarchive archivebox

TODO [C] ok, I think I just want to take promnesia and run it against all non-browser sources webarchive archivebox promnesia

CREATED: [2020-08-11]

would be nice to mark different sources as well if possible?

TODO [C] Bookmark Archiver https://pirate.github.io/bookmark-archiver webarchive archivebox

CREATED: [2018-07-24]

DONE maybe just feed promnesia database to it?? webarchive archivebox

  • DONE I guess need promnesia provider. is it like my.links? webarchive archivebox hpi
  • TODO move run script somewhere else; add ability to put output dir somewhere else webarchive archivebox

right, so archive just redoes the index? Should run it against wereyouhere I suppose… webarchive archivebox
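
A sketch of the feeding part, assuming promnesia's cache db keeps a visits table with an orig_url column (db path and schema vary between versions, so inspect first):

  # pull every distinct url promnesia knows about and add them to the archive
  sqlite3 ~/.local/share/promnesia/promnesia.sqlite \
      'SELECT DISTINCT orig_url FROM visits;' | archivebox add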

TODO [C] commit my changes to archiver, maybe even add the scripts? webarchive archivebox

TODO figure out 404s etc webarchive archivebox

[2019-04-06] should run it after I normalise all the wereyouhere links? webarchive archivebox

I guess filter out all suspicious ones, containing special characters?
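
Something like this would do as a first pass at that filtering (the character class is just a guess at what counts as suspicious):

  # keep only urls without quotes, spaces, or other shell-hostile characters
  grep -vE '[][ <>"{}|^`]' urls.txt > urls.clean.txt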

[2019-04-16] ok, he's working on a django backend where we can use hashes https://github.com/pirate/ArchiveBox/issues/74 webarchive archivebox

TODO [C] I guess some sites (with comments) – useful to update regularly, but most are okay with one snapshot? webarchive archivebox

CREATED: [2020-10-25]

TODO [C] status command is kinda similar to my old blame script? (might be on a branch) webarchive archivebox

CREATED: [2020-10-26]

TODO [C] only save mp3 for youtube videos? I guess it should be selective… or maybe depend on the number of views webarchive archivebox

CREATED: [2020-10-26]
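
For reference, the downloader ArchiveBox shells out to for media can do audio-only by itself (wiring these flags through ArchiveBox's config is a separate question; the url is a placeholder):

  # fetch only the audio track and transcode it to mp3
  youtube-dl -x --audio-format mp3 'https://www.youtube.com/watch?v=PLACEHOLDER'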

TODO [D] wonder if my exporters could be useful for archivebox webarchive archivebox orger promnesia

CREATED: [2019-09-22]

[D] [2019-04-16] pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… webarchive archivebox

Storage Requirements
Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5GB per 1000 articles, but your mileage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting FETCH_MEDIA=False to skip audio & video files.
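
On the ZFS suggestion at the end: enabling compression on the dataset holding the archive is a one-liner (pool/dataset name here is made up):

  # lz4 is cheap enough to leave enabled everywhere
  zfs set compression=lz4 tank/archivebox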

[D] [2019-04-16] pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more… webarchive archivebox

Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs).

[D] [2019-04-16] [pirate/ArchiveBox] Bugfixes, new data integrity and invariant checks, remove title prefetching webarchive archivebox

re-save index after archiving completes to update titles and urls
remove title prefetching in favor of new FETCH_TITLE archive method

TODO [D] backup config? webarchive archivebox

CREATED: [2020-10-26]

STRT [B] prioritise never-bookmarked over bookmarked-with-errors webarchive

CREATED: [2018-11-10]

TODO commit it?? webarchive

TODO [C] some links are pretty crazy… maybe prune huge pages manually and ignore them webarchive

CREATED: [2018-11-15]

e.g. this one is about 150M:

    wget -N -E -np -x -H -k -K -S --restrict-file-names=unix -p \
        --user-agent='Bookmark Archiver' --no-check-certificate \
        'https://charlie-charlie.ru/breakfast'
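
Related to the "maximum limit for data transferred" TODO above: wget can cap a recursive retrieval by itself (note that --quota is ignored for single-file downloads):

  # stop the crawl once roughly 200 MB have been fetched
  wget --quota=200m -p -k 'https://charlie-charlie.ru/breakfast'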

[D] [2018-10-03] kanishka-linux/reminiscence: Self-Hosted Bookmark And Archive Manager webarchive

[2018-10-05] wonder how it's different from my bookmark archiver? webarchive

TODO [C] https://github.com/webrecorder/webrecorder webarchive

CREATED: [2020-05-04]

[D] Tweet from @gwern webarchive linkrot

CREATED: [2020-02-27]

@gwern: @karlicoss @thomas536 Not documented in there yet is my latest archiving tool: https://t.co/If2Ypw1T1M https://t.co/NLh23nrkrh Currently costs 20GB for 7,677 PDFs & self-contained single-file HTML mirrors.

[C] [2019-12-20] Web Archiving Community · pirate/ArchiveBox Wiki webarchive linkrot

[C] [2019-12-11] Verizon/Yahoo Blocking Attempts to Archive Yahoo Groups – Deletion: Dec. 14 | Hacker News webarchive

Even if we still had the Library of Alexandria, it may have shed zero light on the actual lives of citizens. Archiving content on the internet means capturing thousands of individual-level perspectives and experiences. We don't know what will end up being important to historians 50 or 100 years from now. I would bet there are dozens if not hundreds of historians that would give anything for a record of their favorite time period that contains even a fraction of the amount of content today's archive efforts are storing.

[B] [2020-05-28] site-deaths - IndieWeb webarchive linkrot

[C] [2019-04-19] Bret Victor on Twitter: "60% of my fav links from 10 yrs ago are 404. I wonder if Library of Congress expects 60% of their collection to go up in smoke every decade." webarchive linkrot

[2019-06-13] Time Travel: Find Mementos in Internet Archive, Archive-It, British Library, archive.today, GitHub and many more! http://timetravel.mementoweb.org/ webarchive search linkrot

DONE [A] [2019-12-22] This Page is Designed to Last webarchive linkrot

TODO [D] [2019-07-08] Fund: On-Demand Web Archiving of Annotated Pages – Hypothesis https://web.hypothes.is/blog/fund-on-demand-web-archiving-of-annotated-pages/ webarchive linkrot

[D] [2020-03-06] Archiving URLs | Hacker News https://news.ycombinator.com/item?id=6504331 webarchive

[C] [2021-02-25] Wikipedia:Database download - Wikipedia webarchive wikipedia

pages-articles-multistream.xml.bz2 – Current revisions only, no talk or user pages; this is probably what you want, and is approximately 18 GB compressed (expands to over 78 GB when decompressed).

TODO [C] ugh. image preservation is a mess… webarchive wikipedia

CREATED: [2021-02-25]

[2021-02-25] Wikipedia:Database download - Wikipedia webarchive

pages-articles.xml.bz2 and pages-articles-multistream.xml.bz2 both contain the same xml contents. So if you unpack either, you get the same data. But with multistream, it is possible to get an article from the archive without unpacking the whole thing.
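
What that enables in practice, sketched below (filename, offset, and article are placeholders; the companion -index file lists offset:page_id:title for each ~100-page stream):

  # find the byte offset of the stream containing a given article
  bzgrep ':Albedo$' enwiki-latest-pages-articles-multistream-index.txt.bz2
  # -> e.g. 123456789:39579:Albedo
  # decompress just that stream: each stream is an independent bz2 member
  dd if=enwiki-latest-pages-articles-multistream.xml.bz2 \
      bs=4M iflag=skip_bytes skip=123456789 | bzcat | head -100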

STRT [C] [2021-02-25] Main Page - Kiwix webarchive prepping wikipedia

[C] [2021-02-25] jeharu comments on The full English Wikipedia on Kiwix now weighs 79Gb instead of 94Gb thanks to improvements in image compression webarchive kiwix prepping

jeharu: no support yet for incremental updating, right? bummer.

The_other_kiwix_guy: We've started working on a prototype but that'll take time and a lot more money than we have. Would not expect anything before another 2-3 years.

hm okay, sad… guess I can do a backup per year or smth for now

[2021-02-25] Wikipedia:Database download - Wikipedia webarchive

The only downside to multistream is that it is marginally larger

TODO [C] would be nice to maybe tag urls… e.g. which source they come from webarchive archivebox

CREATED: [2021-03-26]

or just have a special source for manual notes/exobrainy stuff and another one for the rest?

TODO [B] def archive things I post (e.g. referenced in my own tweets/comments etc) webarchive self archivebox

CREATED: [2021-03-26]

TODO [B] could also check archive.is api? webarchive
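
A hedged sketch of the remote-archiving calls (the Wayback Machine's /save/ endpoint is public and well-known; archive.is/archive.today submission is less documented, so that line is a guess at its submit form and may need an anti-bot token):

  # archive.today: submit a url for archiving
  curl -s -d 'url=https://example.com' 'https://archive.ph/submit/'
  # the Wayback Machine equivalent, for comparison
  curl -s 'https://web.archive.org/save/https://example.com'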

TODO [C] hmm, a bit confused about archive.is – how reliable is it? is it backed up anywhere? perhaps I should still save stuff from there locally… webarchive

CREATED: [2021-03-21]

TODO [B] pdfs, on the other hand, are a bit higher priority? webarchive

CREATED: [2021-03-21]

TODO [C] [2021-03-26] AdGuardHome/whotracksme.json at master · AdguardTeam/AdGuardHome webarchive archivebox

could use this to prune?
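
i.e. drop known tracker/ad domains from the url list before archiving; a rough sketch, guessing that the json exposes tracker domains under some top-level key (inspect the actual structure first):

  # extract the tracker domain list (key name is a guess)
  jq -r '.trackerDomains | keys[]' whotracksme.json > trackers.txt
  # discard urls whose host matches a tracker domain
  grep -vFf trackers.txt urls.txt > urls.pruned.txt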

[D] [2019-05-25] motivation: If a Pulitzer-finalist 34-part series of investigative journalism can vanish from the web, anything can. /r/DataHoarder webarchive datahoarding
