⭐ Backup cleanser

'Bleanser' stands for 'backup cleanser'.

The idea is to figure out which backups are 'redundant' and remove them to

- save on disk space
- save on data access time (see "data access layer")

This is most relevant to incremental/synthetic style data exports.

It's not necessarily hard to implement for something specific, but the challenge is to do it in a data source agnostic way,
or at least with as little effort as possible.

This is possible, for example, for JSON: if today's export is a superset of yesterday's export, you can safely remove the old one. This works surprisingly well as is for many data sources.
For a few I've got slight adjustments that normalise the exports before comparing, by removing fields that change often but aren't very important. For example, Reddit upvotes/downvotes always jump, so I just exclude them from the comparison.
It's similar to extracting the useful fields, but instead it filters out the useless ones. That makes it safer in case new fields are added by the backend: I'd rather keep extra data than potentially lose useful information.
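
A minimal sketch of the superset check with volatile fields filtered out. The field names in `VOLATILE` and the sample items are assumptions for illustration, not the actual Reddit schema or the real implementation.

```python
import json

# Hypothetical list of fields that jump around and are excluded from comparison.
VOLATILE = {"ups", "downs", "score"}

def normalise(item: dict) -> str:
    """Drop volatile fields and serialise with a stable key order."""
    kept = {k: v for k, v in item.items() if k not in VOLATILE}
    return json.dumps(kept, sort_keys=True)

def is_redundant(old: list, new: list) -> bool:
    """True if every normalised item of the old export also appears in the new one."""
    return {normalise(i) for i in old} <= {normalise(i) for i in new}

yesterday = [{"id": "a1", "title": "hello", "ups": 10}]
today = [
    {"id": "a1", "title": "hello", "ups": 12},  # same item, only the votes moved
    {"id": "b2", "title": "world", "ups": 1},
]
print(is_redundant(yesterday, today))
```

Because fields are filtered by a deny-list, a field the backend adds later still takes part in the comparison instead of being silently dropped.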

related bleanserexportsbackupinfra
[A] * motivation bleanser
- TODO [B] related: hmm. they serve sort of the same purpose??? bleanserbackupchecker
- TODO [D] reddit processing takes quite a bit.. but I guess bleanser will optimize it bleanserhpireddit
- STRT [C] fdupes tool is kinda similar bleanser
[A] * ideas bleanser
- TODO [A] pattern of handling unknown data sources bleansertoblog
- TODO [D] try to guess fields order instead of arbitrary sort? bleanser
- TODO [B] always keep one file per month or something? try to autodetect date bleanser
- TODO [B] safety: run tox first? to protect from crashes bleansersetup
- TODO [B] 'extract' query bleanser
- TODO [B] keep .bleanser file? with a log of all actions bleanser
- TODO [C] implement 'extract' mode later… after writing to blog definitely bleanser
- TODO [C] would be nice if it was possible not to run cleanup step at all if original files were the same… would make it a nice optimization bleanser
- TODO [C] would be nice to support diffs within lines… e.g. if dict ended up with some extra attributes? bleanser
[B] * communication/docs bleanser
- TODO [D] multiway is a bit more speculative bleansertoblog
- TODO [B] kinds of snapshots bleansertoblog
- TODO [B] lastfm is a good one to describe multiway approach? some renames/data glitches etc bleansertoblog
- TODO [C] write about multiprocessing? bleanser
- TODO [C] readme: gotcha about group boundaries not being removed (and having empty diff) bleanser
- TODO [C] for properly impressive demo should prob run in single threaded mode? bleanser
- TODO [C] foursquare is a good motivation – lots of random changing crap even without the changes of underlying data? bleanser
- TODO [C] measure processing times before and after bleanser? bleansertoblog
- TODO [C] instead of twoway and multiway rename to cumulative and synthetic? bleanser
- TODO [C] maybe instead of delete_dominated use keep_dominated? bleanser
- TODO [C] document what's happening in which case… with a literate test bleanser
- TODO [C] github events via triples would be a good example bleanser
[B] * specific data sources bleanser
- TODO [B] add for takeouts… I even had some script to compare it somewhere bleansertakeout
- STRT [C] github-events – prune via triplet approach? bleanser
- TODO [D] not sure, maybe ignore comment/link karma? it results in lots of differences… bleanserreddit
- TODO [D] lastfm: sometimes might be flaky with dates bleanser
- TODO [C] settingsv2 -> phoneSetsTimestamps column – might be important to handle.. bleanserbluemaestro
- [C] [2021-02-28] Allows Safari history file to be imported to Promnesia by gms8994 · Pull Request #207 · karlicoss/promnesia bleanser
[C] * bugs bleanser
- TODO [D] moving old files – not sure what to do about empty dirs? bleanser
- TODO [B] def limit tmp space for the container… otherwise potentially might eat all of it bleanser
- TODO [B] performance: probs need to unlink archived file after unpacked() helper. this is probs the reason for disk space leak? bleanser
- TODO [C] performance: warn about being disk/tmp intense? bleanser
TODO [D] [2021-01-11] move description to github bleanser
----------------------------------- bleanser
TODO [B] maybe 'dynamic' optimizer for bleanser? and later can use it to actually delete stuff bleanserhpi
- [2021-03-02] I guess HPI could import it as a dependency.. bleanserhpi
CNCL [B] [2021-04-07] performance: Memory Filesystem — PyFilesystem 2.4.13 documentation bleanser
TODO [B] completely format agnostic comparison is unsafe if it's doing some sorting/reordering? bleanser
TODO [B] simple cleanup: for safety, need to treat files as essentially blackboxes (so only compare exact contents by default). Only after maybe explicitly allowing newlines should it use diff bleanser
TODO [C] [2021-03-02] Search results · PyPI bleanser
[C] [2021-02-27] trailofbits/graphtage: A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV. bleanser
TODO [C] proper end2end test — could run against firefox? reinstalled at about 202006, could track by file size changes bleanser
TODO [C] json: sorting stuff might definitely make it more confusing when there is just one volatile attribute that has two values bleanser
[C] [2021-12-30] performance: tried using ramdisk, but seems that the performance is exactly the same? bleanser
TODO [C] to check, implement a script that plots backup 'frequency'? so if there are too many somewhere likely normalisation is broken bleanser
WAIT [C] would be nice to make idempotent… but tricky when it's multiple threads present… bleanser
TODO [D] old code for 'extract' bit bleanserpinboard
TODO [D] hmm, for attributes that can change back and forth in json, sorted strategy isn't the best… ugh bleanser
DONE [A] sqlite: hmm… not sure about cascades… probably need to disable somehow? bleanser
DONE [B] json: could artificially map jsons to line-based format (with full path to the entity?) bleanser
DONE [B] json: some lists are actually lists of different/heterogenous items (e.g. rescuetime), some are homogenous/merely tag-like bleanser

¶related bleanserexportsbackupinfra

¶[A] * motivation bleanser

¶TODO [B] related: hmm. they serve sort of the same purpose??? bleanserbackupchecker

CREATED: [2021-02-11]

¶TODO [D] reddit processing takes quite a bit.. but I guess bleanser will optimize it bleanserhpireddit

CREATED: [2019-05-01]

¶STRT [C] `fdupes` tool is kinda similar bleanser

CREATED: [2022-01-01]

fdupes /data/exports/hypothesis/ -q
hm ok so it dumps
dunno if it's still better to rely on my own implementation… would be easier to test etc probably
maybe double check fdupes output after parsing
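
For comparison, a rough sketch of what an own implementation of the exact-duplicate case could look like: group files by content hash. This only covers byte-identical files (the part fdupes handles), not supersets; `duplicate_groups` is a made-up name.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def duplicate_groups(root: Path) -> list:
    """Return groups of files under root that have identical contents."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; fine for a sketch, not for huge exports.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```

Something like this is easy to cover with tests, and its result could also be cross-checked against parsed fdupes output.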

¶[A] * ideas bleanser

¶TODO [A] pattern of handling unknown data sources bleansertoblog

CREATED: [2020-12-08]

lower bound
specify data (fields/files etc) to preserve
if you only do that you might miss new useful data/schema changes like renames etc

. ideally they meet here
.. warn if we ended up here, i.e. dropping is not converging with picking. but keep the data

if you only do that you end up with too much garbage
specify data (fields/files etc) to drop
upper bound
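
The two bounds above could be sketched like this: a keep-list as the lower bound, a drop-list as the upper bound, and a warning for anything that falls between them. `KEEP`, `DROP` and the field names are hypothetical.

```python
import warnings

# Hypothetical field lists for some data source.
KEEP = {"id", "title", "created"}  # lower bound: fields we know we need
DROP = {"ups", "downs"}            # upper bound: fields known to be noise

def clean(record: dict) -> dict:
    """Drop known noise; warn about unclassified fields but keep them."""
    unknown = set(record) - KEEP - DROP
    if unknown:
        # The bounds did not meet here: report it, but do not lose the data.
        warnings.warn(f"unclassified fields kept: {sorted(unknown)}")
    return {k: v for k, v in record.items() if k not in DROP}
```

When the warning never fires, the keep-list and the drop-list together describe the whole schema, which is the "ideally they meet here" case.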

¶TODO [D] try to guess fields order instead of arbitrary sort? bleanser

CREATED: [2022-01-03]

<toplevel> ::: {"album": "", "artist": "john lennon", "date": "1281296836", "name": "jealous guy"}
<toplevel> ::: {"album": "", "artist": "john lennon", "date": "1281296516", "name": "instant karma"}

¶TODO [B] always keep one file per month or something? try to autodetect date bleanser

CREATED: [2021-12-30]
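
A possible sketch, assuming the date can be guessed from a `YYYYMMDD` or `YYYY-MM-DD` run somewhere in the file name. The regex and the choice to keep the earliest file of each month are both arbitrary assumptions.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumption: the file name contains something like 20211230 or 2021-12-30.
DATE_RE = re.compile(r"(\d{4})-?(\d{2})-?(\d{2})")

def guess_date(path: Path):
    """Return the date found in the file name, or None if nothing plausible is there."""
    m = DATE_RE.search(path.name)
    if m is None:
        return None
    try:
        return datetime(int(m[1]), int(m[2]), int(m[3]))
    except ValueError:  # digits matched, but they do not form a valid date
        return None

def monthly_keepers(paths) -> set:
    """Pick the earliest dated file for each (year, month); undated files are ignored."""
    first = {}
    for p in paths:
        d = guess_date(p)
        if d is None:
            continue
        key = (d.year, d.month)
        if key not in first or d < first[key][0]:
            first[key] = (d, p)
    return {p for _, p in first.values()}
```

Files returned by `monthly_keepers` would then be excluded from deletion regardless of what the comparison says.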

¶TODO [B] safety: run tox first? to protect from crashes bleansersetup

CREATED: [2021-04-11]

¶TODO [B] 'extract' query bleanser

CREATED: [2021-04-09]

might be useful as a sanity check? to ensure stuff isn't deleted by accident? (like foreign key triggers)
e.g.

run extract query first to get a snapshot of data
run cleanup query
run extract query again to ensure the data we care about is still there?
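
The three steps could look roughly like this with the standard `sqlite3` module. The table names and queries are invented for the example, and in practice this should run against a copy of the database, since the cleanup has already happened by the time the check fails.

```python
import sqlite3

def checked_cleanup(conn: sqlite3.Connection, extract_sql: str, cleanup_sql: str) -> None:
    """Run cleanup and fail loudly if the extracted data changed."""
    before = conn.execute(extract_sql).fetchall()  # snapshot of the data we care about
    conn.executescript(cleanup_sql)                # remove the volatile bits
    after = conn.execute(extract_sql).fetchall()   # same query again
    if before != after:
        raise RuntimeError("cleanup changed the data we care about")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits(url TEXT, ts INTEGER);
    CREATE TABLE cache(key TEXT, value TEXT);
    INSERT INTO visits VALUES ('https://example.com', 1);
    INSERT INTO cache VALUES ('k', 'volatile');
""")
extract = "SELECT url, ts FROM visits ORDER BY ts, url"
checked_cleanup(conn, extract, "DELETE FROM cache;")
```

The `ORDER BY` in the extract query matters: without it the two snapshots could differ only in row order.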

¶TODO [B] keep .bleanser file? with a log of all actions bleanser

CREATED: [2021-04-14]

and make it possible to override its path if the user doesn't want it in the same dir

¶TODO [C] implement 'extract' mode later… after writing to blog definitely bleanser

CREATED: [2021-04-11]

¶TODO [C] would be nice if it was possible not to run cleanup step at all if original files were the same… would make it a nice optimization bleanser

CREATED: [2022-01-02]

e.g. useful for cleaning up files without unpacking
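
A sketch of that fast path: compare the raw files byte for byte first, and only fall back to the expensive step when they differ. `normalise` is a stand-in for whatever unpacking/cleanup would normally run, not an existing helper.

```python
import filecmp
from pathlib import Path

def redundant(old: Path, new: Path, normalise) -> bool:
    """True if old adds nothing over new; skips normalisation for identical files."""
    if filecmp.cmp(old, new, shallow=False):  # shallow=False forces a content comparison
        return True
    # Slow path: unpack/clean both files and compare the results.
    return normalise(old) == normalise(new)
```

For identical archives this means nothing is unpacked at all.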

¶TODO [C] would be nice to support diffs within lines… e.g. if dict ended up with some extra attributes? bleanser

CREATED: [2022-01-04]

on the other hand, it might mean some legit change… e.g. post was edited and extra content added

¶[B] * communication/docs bleanser

¶TODO [D] multiway is a bit more speculative bleansertoblog

CREATED: [2021-04-07]

¶TODO [B] kinds of snapshots bleansertoblog

CREATED: [2021-04-05]

append only (e.g. foursquare, hypothesis)
rolling (e.g. rescuetime, github, reddit)

either way you can think of it as a set of strings
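
Treating each snapshot as a set of strings, the two kinds could be checked roughly as below. This assumes 'two-way' means comparing a snapshot with the next one, and 'multiway' means checking whether a snapshot is covered by its neighbours together; the function names are made up.

```python
def twoway_redundant(older: set, newer: set) -> bool:
    """Append-only case: the older snapshot is fully contained in the newer one."""
    return older <= newer

def multiway_redundant(before: set, middle: set, after: set) -> bool:
    """Rolling case: the middle snapshot is covered by its two neighbours together."""
    return middle <= (before | after)

append_only = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
rolling = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}]

print(twoway_redundant(append_only[0], append_only[1]))
print(twoway_redundant(rolling[0], rolling[1]))
print(multiway_redundant(*rolling))
```

With rolling data the two-way check never fires, because every snapshot holds something its successor has already dropped, so only the neighbour-based check can prune anything.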

¶TODO [B] lastfm is a good one to describe multiway approach? some renames/data glitches etc bleansertoblog

CREATED: [2022-01-03]

unclear why glitches are happening – could be the backup tool, could be their api

¶TODO [C] write about multiprocessing? bleanser

CREATED: [2021-04-05]

¶TODO [C] readme: gotcha about group boundaries not being removed (and having empty diff) bleanser

CREATED: [2021-04-10]

¶TODO [C] for properly impressive demo should prob run in single threaded mode? bleanser

CREATED: [2021-04-10]

¶TODO [C] foursquare is a good motivation – lots of random changing crap even without the changes of underlying data? bleanser

CREATED: [2021-04-16]

¶TODO [C] measure processing times before and after bleanser? bleansertoblog

CREATED: [2021-12-30]

¶TODO [C] instead of twoway and multiway rename to cumulative and synthetic? bleanser

CREATED: [2022-01-03]

¶TODO [C] maybe instead of delete_dominated use keep_dominated? bleanser

CREATED: [2022-01-02]

¶TODO [C] document what's happening in which case… with a literate test bleanser

CREATED: [2021-04-10]

e.g. 'all files are same'
only added data
rolling data (some fake datetime stuff with 30d retention)
error in cleaner script

¶TODO [C] github events via triples would be a good example bleanser

CREATED: [2021-02-21]

¶[B] * specific data sources bleanser

¶TODO [B] add for takeouts… I even had some script to compare it somewhere bleansertakeout

CREATED: [2021-04-14]

¶STRT [C] github-events – prune via triplet approach? bleanser

CREATED: [2020-09-05]

¶TODO [D] not sure, maybe ignore comment/link karma? it results in lots of differences… bleanserreddit

CREATED: [2019-09-29]

¶TODO [D] lastfm: sometimes might be flaky with dates bleanser

CREATED: [2022-01-03]

2999 <toplevel> ::: {"album": "best of bowie", "artist": "david bowie", "date": "1527013173", "name": "space oddity - 1999 |    3029 <toplevel> ::: {"album": "best of bowie", "artist": "david bowie", "date": "1527013173", "name": "space oddity - 1999
3000 <toplevel> ::: {"album": "believe ep", "artist": "rob hes", "date": "1527012721", "name": "we rise"}                  |         ---------------------------------------------------------------------------------------------------------------------
3001 <toplevel> ::: {"album": "sviib", "artist": "school of seven bells", "date": "1527012721", "name": "open your eyes"}  |    3030 <toplevel> ::: {"album": "sviib", "artist": "school of seven bells", "date": "1527012721", "name": "open your eyes"}
     ----------------------------------------------------------------------------------------------------------------------|    3031 <toplevel> ::: {"album": "believe ep", "artist": "rob hes", "date": "1527012721", "name": "we rise"}

¶TODO [C] settingsv2 -> phoneSetsTimestamps column – might be important to handle.. bleanserbluemaestro

CREATED: [2022-01-10]

¶[C] [2021-02-28] Allows Safari history file to be imported to Promnesia by gms8994 · Pull Request #207 · karlicoss/promnesia bleanser

So your process with browser history files is to create a new sqlite version incrementally for each period T, and then have promnesia import them individually? I'm trying to figure out how I should "best" be storing the files locally; right now, I just have multiple machines scp the respective history files to a single location so that they can then be covered by index, but some of the history files (Safari in particular) are 40M and take a couple of minutes to process...

¶[C] * bugs bleanser

¶TODO [D] moving old files – not sure what to do about empty dirs? bleanser

CREATED: [2021-04-09]

maybe keep all dirs that were there before – and only remove new empty dirs?
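
One way this could work: remember which directories exist before moving anything, then afterwards remove only the empty ones that are not in that set. Both function names are made up for the sketch.

```python
import os
from pathlib import Path

def existing_dirs(root: Path) -> set:
    """Snapshot of all directories under root (including root itself)."""
    return {Path(d) for d, _, _ in os.walk(root)}

def remove_new_empty_dirs(root: Path, keep: set) -> None:
    """Remove empty directories that were not present in the snapshot."""
    # Walk bottom-up so nested empty dirs go before their parents are checked.
    for d, _, _ in os.walk(root, topdown=False):
        p = Path(d)
        if p not in keep and not any(p.iterdir()):
            p.rmdir()
```

Directories that were already empty before the run are left alone, so the tool never changes a layout it did not create.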

¶TODO [B] def limit tmp space for the container… otherwise potentially might eat all of it bleanser

CREATED: [2022-01-10]

¶TODO [B] performance: probs need to unlink archived file after unpacked() helper. this is probs the reason for disk space leak? bleanser

CREATED: [2022-01-08]

¶TODO [C] performance: warn about being disk/tmp intense? bleanser

CREATED: [2021-04-07]

¶TODO [D] [2021-01-11] move description to github bleanser

¶----------------------------------- bleanser

¶TODO [B] maybe 'dynamic' optimizer for bleanser? and later can use it to actually delete stuff bleanserhpi

CREATED: [2021-02-24]

¶[2021-03-02] I guess HPI could import it as a dependency.. bleanserhpi

¶CNCL [B] [2021-04-07] performance: Memory Filesystem — PyFilesystem 2.4.13 documentation bleanser

could use for processing… maybe via option

¶TODO [B] completely format agnostic comparison is unsafe if it's doing some sorting/reordering? bleanser

CREATED: [2022-01-02]

need to come up with some example..

¶TODO [B] simple cleanup: for safety, need to treat files as essentially blackboxes (so only compare exact contents by default). Only after maybe explicitly allowing newlines should it use diff bleanser

CREATED: [2022-01-05]

otherwise might end up deleting some useful entries by accident… i.e. same problem as I had with rescuetime

¶TODO [C] [2021-03-02] Search results · PyPI bleanser

could name like this…

¶[C] [2021-02-27] trailofbits/graphtage: A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV. bleanser

[2022-01-12] probs only useful to examine manually… unlikely to actually reuse it in bleanser

¶TODO [C] proper end2end test — could run against firefox? reinstalled at about 202006, could track by file size changes bleanser

CREATED: [2021-04-08]

¶TODO [C] json: sorting stuff might definitely make it more confusing when there is just one volatile attribute that has two values bleanser

CREATED: [2021-04-16]

e.g. on foursquare with isMayor: true – hmmm

¶[C] [2021-12-30] performance: tried using ramdisk, but seems that the performance is exactly the same? bleanser

¶TODO [C] to check, implement a script that plots backup 'frequency'? so if there are too many somewhere likely normalisation is broken bleanser

CREATED: [2022-01-02]

¶WAIT [C] would be nice to make idempotent… but tricky when it's multiple threads present… bleanser

CREATED: [2022-01-02]

¶TODO [D] old code for 'extract' bit bleanserpinboard

CREATED: [2021-04-14]

return pipe(
    '.tags  |= .',
    '.posts |= map({href, description, time, tags})',  # TODO maybe just delete hash?
    '.notes |= {notes: .notes | map({id, title, updated_at}), count}',  # TODO hmm, it keeps length but not content?? odd.
)

¶TODO [D] hmm, for attributes that can change back and forth in json, sorted strategy isn't the best… ugh bleanser

CREATED: [2021-04-08]

¶DONE [A] sqlite: hmm… not sure about cascades… probably need to disable somehow? bleanser

CREATED: [2021-04-08]
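
For reference, SQLite only applies `ON DELETE CASCADE` when foreign key enforcement is switched on for the connection, which is controlled by `PRAGMA foreign_keys`. A small demonstration follows; the schema is invented, and this is not necessarily how it was solved here.

```python
import sqlite3

SCHEMA = """
    CREATE TABLE parent(id INTEGER PRIMARY KEY);
    CREATE TABLE child(
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent(id) ON DELETE CASCADE
    );
    INSERT INTO parent VALUES (1);
    INSERT INTO child VALUES (10, 1);
"""

def children_left(foreign_keys: bool) -> int:
    """Delete all parents and report how many child rows survive."""
    conn = sqlite3.connect(":memory:")
    # Must be set outside a transaction; it is a per-connection setting.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if foreign_keys else 'OFF'}")
    conn.executescript(SCHEMA)
    conn.execute("DELETE FROM parent")
    remaining = conn.execute("SELECT count(*) FROM child").fetchone()[0]
    conn.close()
    return remaining

print(children_left(foreign_keys=True), children_left(foreign_keys=False))
```

With enforcement off, deleting rows from one table during cleanup leaves the referencing rows untouched. Triggers are a separate mechanism and are not affected by this pragma.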

¶DONE [B] json: could artificially map jsons to line-based format (with full path to the entity?) bleanser

CREATED: [2021-04-09]

that way might work more reliably… hmm

[2022-01-12] full path to the entity ended up being a bad idea (added a test to json processor for that)
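
A rough sketch of such a mapping, using only the top-level key as a prefix rather than the full path. The `key ::: json` layout mirrors the sample lines elsewhere in these notes, but the exact rules here are guesses.

```python
import json

def to_lines(data) -> list:
    """Map a JSON document to sorted lines, one per top-level list item."""
    if not isinstance(data, dict):
        data = {"<toplevel>": data}
    lines = []
    for key, value in data.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            # sort_keys gives a stable serialisation, so equal items produce equal lines
            lines.append(f"{key} ::: {json.dumps(item, sort_keys=True)}")
    return sorted(lines)

scrobbles = [
    {"name": "jealous guy", "artist": "john lennon", "album": "", "date": "1281296836"},
]
print("\n".join(to_lines(scrobbles)))
```

Once every export is a list of lines, snapshots can be compared with ordinary set or diff logic regardless of the original format.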

¶DONE [B] json: some lists are actually lists of different/heterogenous items (e.g. rescuetime), some are homogenous/merely tag-like bleanser

CREATED: [2021-06-26]

tag-like should be collapsed in the same line
maybe mark 'paths' to the ones that should be treated like separate items? should work for rescuetime I suppose…
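
Marking paths could be as simple as a set of keys whose lists hold independent records, with every other list kept on one line. `SEPARATE` and the key names are hypothetical.

```python
import json

# Hypothetical: keys whose lists contain independent records (one line each).
SEPARATE = {"rows"}

def lines_for(key: str, value) -> list:
    """Split marked lists into one line per item; collapse everything else."""
    if key in SEPARATE and isinstance(value, list):
        return [f"{key} ::: {json.dumps(v, sort_keys=True)}" for v in value]
    # Tag-like lists (and any other value) stay on a single line.
    return [f"{key} ::: {json.dumps(value, sort_keys=True)}"]

print(lines_for("tags", ["python", "backup"]))
print(lines_for("rows", [[1, "x"], [2, "y"]]))
```

Defaulting to a single line is the conservative choice: an unmarked list is then compared as a whole instead of item by item.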
