What data on myself I collect and why?

How I am using 50+ sources of my personal data

This is the list of personal data sources I use or am planning to use, with rough guides on how to get your hands on that data if you want it as well.

It's still incomplete and I'm going to update it regularly.

My goal is automating data collection to the maximum extent possible and making it work in the background, so one can set up pipelines once and hopefully never think about it again.

This is kind of a follow-up on my previous post on the sad state of personal data, and part of my personal way of getting around this sad state.

If you're terrified by the long list, you can jump straight to the "Data consumers" section to find out how I use it.

1 Why do you collect X? How do you use your data?

All things considered, I think it's a fair question! Why bother with all this infrastructure and hoard the data if you never use it?

In the next section, I will elaborate on each specific data source, but to start with I'll list the rationales that all of them share:

backup

It may feel unnecessary, but shit happens. What if your device dies, your account gets suspended for some reason, or the company goes bust?

lifelogging

Most data in digital form comes with timestamps, so it automatically, without any manual effort, becomes data for your timeline.

I want to remember more, be able to review my past and bring back and reflect on memories. Practicing lifelogging helps with that.

It feels very wrong that things can be forgotten and lost forever. It's understandable from the neuroscience point of view, i.e. the brain has limited capacity and it would be too distracting to remember everything all the time. That said, I want to have a choice whether to forget or remember events, and I'd like to be able to potentially access forgotten ones.

quantified self

Most collected digital data is somewhat quantitative and can be used to analyze your body or mind.

2 What do I collect/want to collect?

As I mentioned, most of the collected data serve as a means of backup/lifelogging/quantified self, so I won't mention them again in the 'Why' sections.

All my data collection pipelines are automatic unless mentioned otherwise.

Some scripts are still private so if you want to know more, let me know so I can prioritize sharing them.

Amazon

How: jbms/finance-dl

Why:

  • was planning to correlate them with monzo/HSBC transactions, but haven't got to it yet

Arbtt (desktop time tracker)

How: arbtt-capture

Why:

  • haven't used it yet, but it could be a rich source of lifelogging context

Bitbucket (repositories)

How: samkuehn/bitbucket-backup

Why:

  • proved especially useful considering Atlassian is going to wipe mercurial repositories

    I've got lots of private mercurial repositories with university homework and other early projects, and it's sad to think of people who will lose theirs during this wipe.

Bluemaestro (environment sensor)

How: sensor syncs with phone app via Bluetooth, /data/data/com.bluemaestro.tempo_utility/databases/ is regularly copied to grab the data.

Why:

  • temperature during sleep data for the dashboard
  • lifelogging: capturing weather conditions information

    E.g. I can potentially see temperature/humidity readings along with my photos from hiking or skiing.

Blood

How: via thriva, data imported manually into an org-mode table (I don't do it frequently, so automated scraping wasn't worth it)

Also tracked glucose and ketones (with freestyle libre) for a few days out of curiosity, also didn't bother automating it.

Why:

  • contributes to the dashboard, could be a good way of establishing your baselines

Browser history (Firefox/Chrome)

How: custom scripts, copying the underlying sqlite databases directly, running on my computers and phone.
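As a rough illustration of the "copy the sqlite database" approach (a sketch, not the actual script), here is how visits could be read from a copy of Firefox's places.sqlite. The table and column names are those of the Firefox history schema as far as known here; the function name `read_firefox_visits` is made up for this example, and Chrome uses a different schema.

```python
import shutil
import sqlite3
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def read_firefox_visits(places_db: Path):
    """Yield (visit time, url, title) from a copy of Firefox's places.sqlite."""
    with tempfile.TemporaryDirectory() as tmp:
        # Work on a copy: the live database is usually locked by the running browser.
        # Caveat: very recent visits may still sit in the separate -wal file,
        # which this simple copy ignores.
        snapshot = Path(tmp) / "places.sqlite"
        shutil.copy(places_db, snapshot)
        conn = sqlite3.connect(snapshot)
        try:
            rows = conn.execute(
                """
                SELECT v.visit_date, p.url, p.title
                FROM moz_historyvisits AS v
                JOIN moz_places AS p ON p.id = v.place_id
                ORDER BY v.visit_date
                """
            ).fetchall()
        finally:
            conn.close()
    for visit_date, url, title in rows:
        # visit_date is stored as microseconds since the Unix epoch.
        when = datetime.fromtimestamp(visit_date / 1_000_000, tz=timezone.utc)
        yield when, url, title
```

Running something like this from cron and keeping the dated copies gives you history that outlives the browser's own expiry.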

Why:

Emfit QS (sleep tracker)

Emfit QS is kind of a medical grade sleep tracker. It's more expensive than wristband ones (e.g. fitbit, jawbone) but also more reliable and gives more data.

How: emfitexport.

Why:

Endomondo

How: Endomondo collects GPS data, and HR data (via Wahoo Tickr X strap). Then, karlicoss/endoexport.

Why:

Facebook

How: manual archive export.

I barely use Facebook, so I don't even bother doing it regularly.

Feedbin

How: via API

Why:

Feedly

How: via API

Why:

Fitbit

How: manual CSV export, as I only used it for a few weeks. Then the sync stopped working and I had to return it. However, it seems possible via API.

Why:

Foursquare/Swarm

How: via API

Github (repositories)

How: github-backup

Why:

  • capable of exporting starred repositories as well, so if the authors delete them I will still have them

Github (events)

How: manually requested archive (once), after that automatic karlicoss/ghexport

Why:

Gmail

How: imap-backup, Google Takeout

Why:

Goodreads

Google takeout

How: semi-automatic.

  • only manual step: enable scheduled exports (you can schedule 6 per year at a time), and choose to keep it on Google Drive in export settings
  • mount your Google Drive (e.g. via google-drive-ocamlfuse)
  • keep a script that checks mounted Google Drive for fresh takeout and moves it somewhere safe
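The last step could be sketched roughly as below. This is a minimal example, not the actual script: the function name `collect_takeouts` is hypothetical, and it assumes the archives show up in the mounted folder with names starting with `takeout-` and a `.zip` or `.tgz` extension.

```python
import shutil
from pathlib import Path


def collect_takeouts(drive_dir: Path, archive_dir: Path) -> list:
    """Move Takeout archives from a mounted Drive folder into archive_dir."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # Assumption: archives are named like takeout-<timestamp>-001.zip (or .tgz).
    for src in sorted(drive_dir.glob("takeout-*")):
        if src.suffix not in {".zip", ".tgz"}:
            continue
        dst = archive_dir / src.name
        if dst.exists():
            continue  # already collected on a previous run
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

Because files that already exist in the archive are skipped, the function is safe to run repeatedly from a scheduler.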

Why:

  • Google collects lots of data, which you could put to some good use. However, old data is getting wiped, so it's important to export Takeout regularly.
  • better browsing history
  • (potentially) search history for promnesia
  • search in youtube watch history
  • location data for lifelogging and the dashboard (activity)

TODO Hackernews

How: haven't got to it yet. It's going to require:

  • extracting upvotes/saved items via web scraping since Hackernews doesn't offer an API for that. Hopefully, there is an existing library for that.
  • I'm also using Materialistic app that has its own 'saved' posts and doesn't synchronize with Hackernews.

    Exporting them is going to require copying the database directly from the app private storage.

Why: same reasons as Reddit.

HSBC bank

How: manual exports of monthly PDFs with transactions. They don't really offer an API, so unless you want to web scrape and deal with 2FA, that seems to be the best you can do.

Why:

Instapaper

How: karlicoss/instapexport

Why:

Jawbone

How: via API. Jawbone is dead now, so if you haven't exported it already, likely your data is lost forever.

Why:

Kindle

How: manually exported MyClippings.txt from Kindle. Potentially can be automated similarly to Kobo.
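Once the file is copied off the device, it is fairly easy to parse. The sketch below assumes the commonly seen layout: a title line, a metadata line, a blank line, the highlighted text, then a line of ten `=` characters as a separator. The exact layout can differ between firmware versions, and the names `Clipping` and `parse_clippings` are made up for this example.

```python
from typing import Iterator, NamedTuple


class Clipping(NamedTuple):
    book: str  # title line, usually "Title (Author)"
    meta: str  # e.g. "- Your Highlight on page ... | Added on ..."
    text: str  # the highlighted passage or note


def parse_clippings(raw: str) -> Iterator[Clipping]:
    """Split the contents of a Kindle clippings file into individual entries."""
    for chunk in raw.split("=========="):
        # Entries are sometimes prefixed with a byte order mark.
        lines = [ln.strip("\ufeff \r") for ln in chunk.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 2:
            continue  # trailing separator or an empty entry
        book, meta, *body = lines
        yield Clipping(book, meta, "\n".join(body))
```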

Why:

Kobo reader

How: almost automatic via karlicoss/kobuddy. Manual step: having to connect your reader via USB now and then.

Why:

Monzo bank

How: karlicoss/monzoexport

Why:

  • automatic personal finance, fed into hledger
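To give an idea of what "fed into hledger" can look like, here is a minimal sketch that renders one transaction as a journal entry. It assumes each exported transaction is a dict with `created`, `description`, `amount` (in minor units, negative for spending) and `currency` keys; that is an assumption about the export format, and the function name and account names are made up for illustration.

```python
def to_hledger(tx: dict, account: str = "assets:monzo") -> str:
    """Render one transaction dict as an hledger journal entry."""
    # Assumption: "created" is an ISO 8601 timestamp, so the first 10 chars are the date.
    date = tx["created"][:10]
    # Assumption: amounts are in minor units (e.g. pence), negative for spending.
    amount = tx["amount"] / 100
    return (
        f"{date} {tx['description']}\n"
        f"    {account}    {amount:.2f} {tx['currency']}\n"
        # The second posting has no amount, so hledger infers the balancing value.
        f"    expenses:unknown\n"
    )
```

Categorizing `expenses:unknown` into something more useful can then be done with rules on the description, either in code or on the hledger side.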

Nomie

How: regular copies of /data/data/io.nomie.pro/files/_pouch_events and /data/data/io.nomie.pro/files/_pouch_trackers

Why:

  • could be a great tool for detailed lifelogging if you're into it

Nutrition

I tracked almost all nutrition data for stuff I ingested over the course of a year.

How: I found most existing apps/projects clumsy and unsatisfactory, so I developed my own system. Not even a proper app, something simpler, basically a domain-specific language in Python to track it.

Tracking process was simply editing a python file and adding entries like:

# file: food_2017.py
july_09 = F(
  [  # lunch
       spinach * bag,
       tuna_spring_water * can,       # can size for this tuna is 120g
       beans_broad_wt    * can * 0.5, # half can. can size for broad beans is 200g
       onion_red_tsc     * gr(115)  , # grams, explicit
       cheese_salad_tsc  * 100,       # grams, implicit as it makes sense for cheese
       lime, # 1 fruit, implicit
  ],
  [
     # dinner...
  ],
  tea_black * 10,     # cups, implicit
  wine_red * ml * 150, # ml, explicit
)

july_10 = ... # more logs

Comments added for clarity of course, so it'd be more compact normally.

Then some code was used for processing, calculating, visualizing, etc.
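For the curious, a notation like the one above can be built with plain operator overloading. The following is only a sketch of the idea, not the actual code: the class names, the `sizes` mapping and the calorie numbers in the usage note are placeholders, and it tracks just calories for brevity.

```python
from dataclasses import dataclass, field


class Unit:
    """A marker such as `can` or `bag`; its weight depends on the food."""

    def __init__(self, name: str):
        self.name = name


can, bag = Unit("can"), Unit("bag")


def gr(amount: float) -> float:
    return float(amount)  # explicit grams, same as a bare number


@dataclass
class Portion:
    food: "Food"
    grams: float

    def __mul__(self, factor: float) -> "Portion":
        # e.g. `beans * can * 0.5` scales an existing portion
        return Portion(self.food, self.grams * factor)

    @property
    def kcal(self) -> float:
        return self.food.kcal_per_100g * self.grams / 100


@dataclass
class Food:
    name: str
    kcal_per_100g: float
    sizes: dict = field(default_factory=dict)  # grams per unit, e.g. {"can": 120}

    def __mul__(self, amount) -> Portion:
        if isinstance(amount, Unit):
            return Portion(self, self.sizes[amount.name])
        return Portion(self, float(amount))  # bare numbers are grams


def total_kcal(items) -> float:
    return sum(item.kcal for item in items)
```

With a definition like `tuna = Food("tuna", kcal_per_100g=110, sizes={"can": 120})` (placeholder numbers), an entry such as `tuna * can * 0.5` evaluates to a 60 g portion, and a meal is just a list of portions passed to `total_kcal`.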

Having a real programming language instead of an app let me make it very flexible and expressive, e.g.:

  • I could define composite dishes as Python objects, and then easily reuse them.

    E.g. if I made four servings of soup on 10.08.2018, ate one immediately and froze the other three, I would define something like soup_20180810 = [...], and then simply reuse soup_20180810 when I eat it again. (The date was easy to find out, as I label food when I put it in the freezer anyway.)

  • I could make many things implicit, making it pretty expressive without spending time on unnecessary typing
  • I rarely had to enter nutrient composition manually: I just pasted the product link from the supermarket website and had an automatic script parse the nutrient information
  • For micronutrients (that usually aren't listed on labels) I used the USDA sqlite database

The hard thing was actually not entering, but rather not having nutrition information if you're eating out. That year I was mostly cooking my own food, so tracking was fairly easy.

Also, I was more interested in lower bounds (e.g. "do I consume at least the recommended amount of micronutrients?"), so not logging some food now and then was fine for me.

Why:

  • I mostly wanted to learn about food composition and how it relates to my diet, and I did

    That logging motivated me to learn about different foods and try them out while keeping dishes balanced. I cooked so many different things, made my diet way more varied and became less picky.

    I stopped because cooking did take some time, and I realized that as long as I vary my food and try to eat everything now and then, I hit all the recommended amounts of micronutrients. It's kind of an obvious thing that everyone recommends, but hearing it as common wisdom is one thing, and coming to the same conclusion from your own data is completely different.

  • nutritional information contributes to dashboard

Photos

How: no extra effort required if you sync/organize your photos and videos now and then.

Why:

  • obvious source of lifelogging, in addition comes with GPS data

PDF annotations

As in, native PDF annotations.

How: nothing needs to be done, PDFs are local to your computer. You do need some tools to crawl your filesystem and extract the annotations.

Why:

  • experience of using your PDF annotations (e.g. searching) is extremely poor

    I'm improving this by using orger.

Plaintext notes

Mostly this refers to org-mode files, which I use for notekeeping and logging.

How: nothing needs to be done, they are local.

Why:

Pocket

How: karlicoss/pockexport

Why:

Reddit

How: karlicoss/rexport

Why:

Remember the Milk

How: ical export from the API.

Why:

  • better search

    I stopped using RTM in favor of org-mode, but I can still easily find my old tasks and notes.

Rescuetime

How: karlicoss/rescuexport

Why:

  • richer contexts for lifelogging

Shell history

How: many shells support keeping timestamps alongside your commands in history.

E.g. "Remember all your bash history forever".
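For bash specifically, when HISTTIMEFORMAT is set, the history file gets a `#<unix timestamp>` line before each command. A minimal sketch of reading such a file back (the function name `parse_bash_history` is made up for this example):

```python
from datetime import datetime, timezone
from typing import Iterator, Optional, Tuple


def parse_bash_history(text: str) -> Iterator[Tuple[Optional[datetime], str]]:
    """Yield (timestamp, command line) pairs from the contents of a bash history file."""
    stamp = None
    for line in text.splitlines():
        if line.startswith("#") and line[1:].isdigit():
            # bash writes "#<unix epoch>" before each command when HISTTIMEFORMAT is set
            stamp = datetime.fromtimestamp(int(line[1:]), tz=timezone.utc)
        elif line:
            # Lines of a multi-line command share the preceding timestamp;
            # commands logged before timestamps were enabled get None.
            yield stamp, line
```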

Why:

  • potentially can be useful for detailed lifelogging

Sleep

Apart from automatic collection of HR data, etc., I collect some extra stats like:

  • whether I woke up on my own or after alarm
  • whether I still feel sleepy shortly after waking up
  • whether I had dreams (and I log dreams if I did)
  • I log every time I feel sleepy throughout the day

How: org-mode, via org-capture into table. Alternatively, you could use a spreadsheet for that as well.
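Getting the data back out of an org-mode table does not take much. Below is a minimal standard-library sketch (a dedicated org parser would be more robust); the function name `parse_org_table` is hypothetical, and it assumes the first row of the table is the header.

```python
def parse_org_table(text: str) -> list:
    """Turn the first org-mode table in text into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the table has ended
            continue
        if line.startswith("|-"):
            continue  # horizontal rule such as |---+---|
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]
```

Each row comes back as a dict of strings, so dates and yes/no flags still need converting before they go anywhere near a plot.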

Why:

  • I think it's important to find connections between subjective feelings and objective stats like amount of exercise, sleep HR, etc., so I'm trying to find correlations using my dashboard
  • dreams are quite a fun part of lifelogging
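One simple way to look for such connections (not necessarily what the dashboard does) is a plain Pearson correlation between two daily series, e.g. a subjective sleepiness score against average sleep HR. A minimal sketch without third-party dependencies:

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A coefficient near 0 only rules out a linear relationship, and with a few weeks of data even a strong value can be noise, so it is a starting point for questions rather than an answer.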

Sms/calls

How: SMS Backup & Restore app, automatic exports.

Spotify

How: export script, using plamere/spotipy

Why:

  • potentially can be useful for better search in music listening history
  • can be used for custom recommendation algorithms

Taplog

(not using it anymore, in favor of org-mode)

How: regular copying of /data/data/com.waterbear.taglog/databases/Buttons Database

Why:

  • a quick way of single tap logging (e.g. weight/sleep/exercise etc), contributes to the dashboard

Twitter

How: twitter archive (manually, once), after that regular automatic exports via API

Why:

VK.com

How: Totktonada/vk_messages_backup.

Sadly VK broke their API so the script stopped working. I'm barely using VK now anyway so not motivated enough to work around it.

Why:

Weight

How: manually, used Nomie and Taplog, but now just using org-mode and extracting data with orgparse. Could be potentially automated via wireless scales, but not much of a priority for me.

Why:

TODO Whatsapp

Barely using it so haven't bothered yet.

How: Whatsapp doesn't offer an API, so it's potentially going to require grabbing the sqlite database from the Android app (/data/data/com.whatsapp/databases/msgstore.db)

Why:

23andme

How: manual raw data export from the 23andme website. I hope your genome doesn't change often enough to bother with automatic exports!

Why:

  • was planning to set up some sort of automatic search for new genome insights using open source analysis tools

    Haven't really had time to think about it yet, and it feels like a hard project out of my realm of competence.

3 Data consumers

orger

orger is a tool and set of modules for accessing data via org-mode. It allows searching and overviewing, and in addition, I'm using it for creating tasks straight from native app interfaces (e.g. Reddit/Telegram) and spaced repetition via org-drill.

I write about it in detail here and here.

promnesia

promnesia is a browser extension I'm working on to escape silos by unifying annotations and browsing history from different data sources.

I've been using it for more than a year now and working on final touches to properly release it for other people.

dashboard

As a big fan of quantified self, I'm working on a personal health, sleep and exercise dashboard, built from various data sources.

I'm working on making it public, you can see some screenshots here.

timeline

Timeline is a project I'm working on.

I want to see all my digital history, search in it, filter, easily jump to a specific point in time and see the context when it happened. That way it works as a sort of external memory.

Ideally, it would look similar to Andrew Louis's Memex, or might even reuse his interface if he open sources it. I highly recommend watching his talk for inspiration.
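At its core, a timeline like this boils down to merging timestamped events from many sources into one ordered stream and slicing it by time. A minimal sketch of that idea follows; the `Event` shape and the function names are hypothetical, not the project's actual data model.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    when: datetime
    source: str   # e.g. "browser", "shell", "photos"
    summary: str


def merge_timelines(*sources: Iterable[Event]) -> Iterator[Event]:
    """Lazily merge per-source event streams, each already sorted by time."""
    return heapq.merge(*sources, key=lambda e: e.when)


def around(events: Iterable[Event], start: datetime, end: datetime) -> list:
    """Return the events inside [start, end], i.e. the context of a moment."""
    return [e for e in events if start <= e.when <= end]
```

Since `heapq.merge` is lazy, each source can be a generator reading straight from its export, and nothing has to be loaded into memory up front.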

my. python package

This Python package is kind of my personal API to access all collected data.

I'm in the process of writing about it here.

4 --

Happy to answer any questions on my approach and help you with liberating your data.

In the next post (writing still in progress) I'm going to elaborate on design decisions behind my data export and access infrastructure.

Updates:

  • [2020-01-14]: added 'Nutrition', 'Shell history' and 'Sleep' sections

Discussion: