The sad state of personal data and infrastructure
TLDR: in this post, I'm going to be exploring missed opportunities for engaging and interacting with your personal data and digital trace, and speculating on why it is that way and how to make it easier.
It might seem like a long rant, but I promise you I am not the kind of person who whines and vents just for the sake of it!
In this particular post, I'm just building up motivation and engaging you, and I do have some workarounds and suggestions. This article got long enough as it is, and I didn't want to mix discussions on motivation (this one) with my take on more technical details and implementation (which will follow).
Table of Contents
- 1. Intro: your data is trapped
- 2. Why does it bother me?
- search and information access
- journaling and history
- consuming digital content
- health and body maintenance
- personal finance
- why I can't do anything when I'm offline or have a wonky connection?
- tools for thinking and learning
- mediocre interfaces
- communication and collaboration
- 3. Your data is vanishing
- 4. What do I want?
- 5. So what's the problem?
- 6. How to make it easier: data mirror
- 7. What do I do?
- 8. Related links
¶1 Intro: your data is trapped
Note: for clarity, I will use 'service' to refer to anything holding your data and manipulating it, whether it's a website, phone app or a device (i.e. not necessarily something having an online presence).
On one hand, in 2019 things are pretty great. For almost anything you wish to do on your computer or phone, you can find several apps, platforms and ecosystems that will handle your task in one way or another.
On the other hand, typically, once the service has your data it's siloed and trapped. You are completely at the mercy of the service's developers and management.
Within the same ecosystem (e.g. Google/Apple/Microsoft) you might get some integrations and interactions if the company spares them. Apart from these, integrations are virtually non-existent.
We have so much data, yet it just sits there doing nothing.
Now and then some startup pops up that connects together a couple of APIs for a fee. I don't want to pick on startups, but typically it's something trivial like displaying calories consumed from your food tracker app on the same plot as calories burnt from your fitness tracker. Trivial is okay, and I do acknowledge it's way harder to implement than it looks (I even explore why later). The sad reality is that as a user, you're lucky if you use the right kind of fitness tracker that the service supports, and you agree with their analysis methodology. Otherwise, sorry!
There are also services like IFTTT which offer pretty primitive integrations and also require cooperation from all parties.
Often UIs have inconveniences (or just plain suck). They are often fine for an average user (aka KPIs) but leave a number of dissatisfied users, who are often the power users.
In essence, services fully control the way they present information to you.
Sure, it's a free market, just switch to another/better service, right? Switching to new and unfamiliar tools is cognitively hard enough as it is, but what's even worse is that in most cases you have to leave behind all your old data. You're lucky if you can do some sort of data import/export and if it works properly.
Personal data is in a sad state these days. Let me elaborate.
¶2 Why does it bother me?
To be fair, I don't understand how it does not bother you!
To start with, allow me to quote myself here:
I consume lots of digital content (books, articles, Reddit, Youtube, Twitter, etc.) and most of it I find somewhat useful and insightful. I want to use that knowledge later, act and build on it. But there's an obstacle: the human brain.
It would be cool to be capable of always remembering and instantly recalling information you've interacted with, metadata and your thoughts on it. Until we get augmented though, there are two options: the first is just to suck it up and live with it. You might have guessed this is not an option I'm comfortable with. The second option is compensating for your sloppy meaty memory by having the information you've read at hand and a quick way of searching over it.
- convenience of access, e.g.:
- to access highlights and notes on my Kobo ebook I need to actually reach my reader and tap through e-ink touch screen. Not much fun!
- if you want to search over annotations in your PDF collections… well good luck, I'm just not aware of such a tool. It's actually way worse: many PDF viewers wouldn't even let you search in highlights within the file you're currently viewing.
- there is no easy way to quickly access all of your twitter favorites, people suggest using hacks like autoscroll extension.
- searching data, e.g.:
- search function often just isn't available at all, e.g. on Instapaper, you can't restrict search to highlights. If it is available, it's almost never incremental.
- builtin browser search (Ctrl-F) sucks for the most part: it's not very easy to navigate as you don't get previews and you have to look through every match
- sometimes you vaguely recall reading about something or seeing a link, but don't remember where exactly. Was it on stackoverflow? Or in some github issue? Or in a conversation with friend?
- data ownership and liberation, e.g.
- what happens if data disappears or the service is down (temporarily/permanently) or banned by your government?
You may think you live in a civilized country and that would never affect you. Well, in 2018, Instapaper was unavailable in Europe for several months (!) due to missing the GDPR deadline.
- 99% of services don't have support for offline mode. This may be just a small inconvenience if you're on a train or something, but there is more to it. What if some sort of apocalypse happens and you lose all access to data? That depends on your paranoia level of course, and apocalypse is bad enough as it is, but my take on it is that at least I'd have my data :)
- if you delete a book on Kobo, not only can you no longer access its annotations, but they also seem to get wiped from the database.
As you can see, my main frustrations are around the lack of the very basic things that computers can do extremely well: data retrieval and search.
I'll carry on, just listing some examples. Let's see if any of them resonate with you:
¶search and information access
Why can't I search over all of my personal chat history with a friend, whether it's ICQ logs from 2005 or Whatsapp logs from 2019?
Why can't I have incremental search over my tweets? Or browser bookmarks? Or over everything I've ever typed/read on the Internet?
Why can't I search across watched youtube videos even though most of them have subtitles hence allowing for full text search?
Why can't my Google Home add shopping list items to Google Keep? Let alone other todo-list apps.
Instead, it puts them in a completely separate product, Shopping list. If any of these had an API, any programmer could write a script to synchronize them in a few hours.
Why can't I create a task in my todo list or calendar from a conversation on Facebook Messenger/Whatsapp/VK.com/Telegram?
Often, a friend recommends a book to you so you want to add it to your reading list. Or they ask you for something and you want to schedule a reminder.
Instead, these apps actively prevent me from using builtin Android share functions (because it means leaving the app presumably).
¶journaling and history
Why do I have to lose all my browser history if I decide to switch browsers?
Even when you switch between major ones like Chrome/Firefox. Let alone for less common alternatives.
Why can't I see all the places I traveled to on a single map and photos alongside?
I have location tracking and my photos have GPS and timestamps.
Why can't I see what my heart rate (i.e. excitement) and speed were side by side with the video I recorded on GoPro while skiing?
I've used HR tracking and location tracking, surely that's possible?
Why can't I easily transfer all my books and metadata if I decide to switch from Kindle to PocketBook or vice versa?
¶consuming digital content
Why can't I see stuff I highlighted on Instapaper as an overlay on top of web page?
Hypothes.is does it, so it's totally possible, right?
Why can't I have a single 'read it later' list, unifying all things saved on Reddit/Hackernews/Pocket?
Why can't I use my todo app instead of 'Watch later' playlist on youtube?
'Watch later' is fine for short videos that I can watch over dinner or on my commute. Longer videos like talks and lectures need proper time commitment hence prioritizing.
Why can't I 'follow' some user on Hackernews?
It's just a matter of regularly fetching new stories/comments by a person and showing new items, right?
Why can't I see if I've run across a Youtube video because my friend sent me a link months ago?
The links are there in the chat history, surely it's a trivial task to find it?
Why can't I have uniform music listening stats based on my Last.fm/iTunes/Bandcamp/Spotify/Youtube?
Why am I forced to use Spotify's music recommendation algorithm and don't have an option to try something else?
Why can't I easily see what were the books/music/art recommended by my friends or some specific Twitter/Reddit/Hackernews users?
Why doesn't my otherwise perfect Hackernews Android app share saved posts/comments with the website?
¶health and body maintenance
Why can't I tell if I was more sedentary than usual during the past week and whether I need to compensate by doing a bit more exercise?
I have all my location data (hence step data), so what's the issue?
Why can't I see what's the impact of aerobic exercise on my resting HR?
I use HR tracker and sleep tracker, so all the necessary data is there.
Why can't I have a dashboard for all of my health: food, exercise and sleep to see baselines and trends?
Why do I need to rely on some startup to implement this and trust them with my data?
Why can't I see the impact of temperature or CO2 concentration in room on my sleep?
My sensors have Bluetooth and Android apps, why can't they interact with my sleep data?
Why can't I see how holidays (as in, not going to work) impact my stress levels?
It's trivial to infer workdays by using my location data.
Why can't I take my Headspace app data and see how/if meditation impacts my sleep?
Why can't I run a short snippet of code and check some random health advice on the Internet against my health data?
¶personal finance
Why am I forced to manually copy transactions from different banking apps into a spreadsheet?
Why can't I easily match my Amazon/Ebay orders with my bank transactions?
¶why I can't do anything when I'm offline or have a wonky connection?
Aka #offline. On one hand, it's less and less of an issue as the Internet gets more reliable. On the other, if you start relying on it too much, it becomes more and more of a single point of failure.
¶tools for thinking and learning
Why, when something like a 'mind palace' is literally possible with VR technology, don't we see any in use?
Why can't I easily convert select Instapaper highlights or new foreign words I encountered on my Kindle into Anki flashcards?
¶mediocre interfaces
Why do I have to suffer from poor management and design decisions in UI changes, even if the interface is not the main reason I'm using the product?
Why can't I leave priorities and notes on my saved Reddit/Hackernews items?
I've got too many saved things to read them linearly and I'll probably never read them all. I've also got other things to read and do in general, why can't I have a unified queue for consuming content?
Why can't I leave private notes on Deliveroo restaurants/dishes, so I'd remember what to order/not to order next time?
Why do people have to suffer from Google Inbox shutdown?
Not to undervalue Inbox developers, but fundamentally it's just a different interface. I'm sure there are plenty of engineers who would happily support it in their spare time if only they had access to the APIs.
¶communication and collaboration
Why can't I easily share my web or book highlights with a friend? Or just make highlights in select books public?
Why can't I easily find out another person's expertise without interrogating them, just by looking at what they read instead?
Why do I have to think about backups and actively invest time and effort in them?
What about regular people who have no idea how unreliable computers can be and might find out the hard way?
I think all of this is pretty sad. Note that I haven't mentioned any mad science fiction stuff like tapping directly into the brain (as much as I wish it was possible). All these things are totally doable with the technology we already possess.
I wonder what computing pioneers like Douglas Engelbart (e.g. see Augmenting Human Intellect) or Alan Kay thought/think about it and if they'd share my disappointment. So many years have passed since computing (and personal computers) became widespread, and we're still not quite there. And companies are actively promoting these silos.
Imagine if all of this was at your fingertips. If you didn't have to think about how and where to find information and could just access it and interact with it? If you could let computers handle the boring bits of your life and spend time on fun and creative things?
¶3 Your data is vanishing
Things I listed above are frustrating enough as they are. There is another aspect to this: your data is slipping away.
Privacy concerns are important and it's understandable when people are pissed about services keeping hold of their data instead of properly wiping it.
However, oftentimes the opposite is the case and you find that your data is gone or very hard to access:
- Google Takeout data, that is, all your browser activity, Youtube watch history, etc., are only kept by Google for a few years
If you only began exporting it today, chances are you've already lost some of your history.
- Chrome browser deletes history older than 90 days
- Firefox browser expires history based on some magic algorithm
- Reddit API limits your requests to 1000 results only
- Twitter API would only give you 3200 latest tweets
You can get the rest of your tweets via manual export, but then you'll have to integrate two different ways of accessing data.
- Monzo API only allows fetching all of your transactions within 5 minutes of authentication.
I understand that it's a security measure, but my frustration still stands.
The problems above are sort of technical and in theory, can be solved by some engineering. There is another side to vanishing data:
- information is generally rotting away from the Internet
comments/posts/tweets you've interacted with get deleted by their authors
While people have the right to delete their data from the Internet, arguably it doesn't extend to derived content like comments or thoughts that you had on it.
And a bit more:
Jawbone UP has gone bust
In July 2017 Jawbone announced it would liquidate its assets. Since the app is still available for at least some phones (Android) and the servers seem to be running, it is unclear who has access to collected personal data.
Sweet. In addition, the API doesn't work anymore either, so if you haven't been exporting data, it's basically gone.
- 'My GitHub account has been restricted due to US sanctions as I live in Crimea'
This one is particularly bad.
If you consider your digital trace part of yourself, this is completely unacceptable. But sadly it's happening all the time. You can't rely on third parties to keep it safe.
¶4 What do I want?
I want all these inconveniences somehow solved, but I live in the real world and it's not gonna magically happen. So let me be more specific: I argue that one major reason the tools and integrations I want don't exist is that people don't have easy uniform access to their data in the first place.
"Easy" is used here in two senses:
easy for humans to look at and browse through
This bit is hard in practice as (typically) the more machine friendly something is, the less human friendly it is.
easy for programmers to manipulate, analyze and interact with
Let's concentrate on this part for now. If this is solved, it automatically enables programmers to develop human-friendly tools.
So how would 'easy access to data' look in an ideal world? Let me present my speculations, and I would be happy to hear your opinions on them!
I want an API that I can query and get any of my personal data. Ideally, it wouldn't really matter where the data is and it could be a web API.
Realistically, as of today, the easiest way to quickly access your data and more importantly, play with it, is when it's already on your filesystem.
As you've probably noticed, it's almost never the case that you have your personal data locally at hand. You need to spend extra effort to achieve this.
¶5 So what's the problem?
Hopefully we can agree that the current situation isn't so great. But I am a software engineer. And chances are that if you're reading this, you're a programmer as well. Surely we can deal with that and implement it ourselves, right?
Kind of, but it's really hard to retrieve data created by you.
Recommended soundtrack for rest of the section: The World's Smallest Violin, playing for us software engineers.
At first glance it doesn't look like a big deal. It's just data, right? Every programmer should be capable of getting it from the API, right?
This is until you realize you're probably using at least ten different services, and they all have different purposes, with various kinds of data, endpoints and restrictions.
Even if you have the capacity and are willing to do it, it's still damn hard.
You're gonna have to deal with the following problems:
¶authorization
That's where it all starts, and it's a mess.
- easiest scenario: the service lets you generate an API token from its settings and you can just use it. Example: pinboard
- typical scenario: you need to do the whole OAuth thing.
That involves creating a client app, getting client id, dealing with scopes and redirect urls, etc. Pretty tedious, and you certainly can't expect a nonprogrammer to be able to follow these steps.
Examples: almost every service with an API out there: Twitter/Instapaper/Pocket/Github/etc.
worst case scenario: the service doesn't even offer a public API. That also has different grades of horrible:
best worst: the service uses a private API and you can spy on the token the web app is using in browser dev tools.
Not too bad, but a bit dubious.
Example: Pocket API doesn't give away your highlights unless you mess with it.
typical worst: no private API, so you need to scrape the data. Sometimes you can grab the cookies from browser dev tools and use them to access your data.
Scraping is orders of magnitude flakier, involves nasty parsing and is obviously fragile. Some services might even actively prevent you from doing so by banning unusual user agents.
worst worst: you need to scrape the data and cookies don't work or expire often.
Basically means you need to use your username/password. Bonus points if there is 2-factor auth involved.
Potentially means you're going to store your password somewhere which is way less secure than using a token.
Example: Google Takeout exports are not only asynchronous, but also don't have an API, so you have to log in in order to export.
All the 'worst' scenarios are extremely flaky and basically impossible for nonprogrammers to use.
¶pagination
Whether you're using an API or not, typically you'll have to retrieve multiple chunks of data and merge them afterwards.
In principle, it's not hard to implement on a one-off basis, but it's unclear how to do it in some universal way because there is no common standard.
Pages might be addressed by page numbers and counts, offsets from start/end of data, before or after with respect to ids or timestamps, etc.
It's quite error prone: content might change under your feet, and if the API developers or you are not careful, you might end up with missing data or even some logical corruption.
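To make the merging step a bit more concrete, here is a minimal sketch. It assumes a hypothetical `fetch_page(cursor)` function that returns a chunk of items plus the cursor for the next chunk; the function, the cursor scheme and the `id` field are all made up for illustration, and a real service will differ.

```python
from typing import Callable, Dict, List, Optional, Tuple

Item = Dict[str, object]
# Hypothetical fetcher: takes a cursor (None for the first page) and
# returns (items, next_cursor); next_cursor is None on the last page.
Fetcher = Callable[[Optional[str]], Tuple[List[Item], Optional[str]]]


def fetch_all(fetch_page: Fetcher, max_pages: int = 1000) -> List[Item]:
    """Walk every page and merge the chunks, deduplicating by 'id'."""
    merged: Dict[object, Item] = {}
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard cap so a buggy cursor can't loop forever
        items, cursor = fetch_page(cursor)
        for item in items:
            # Content may shift between requests, so the same item can show
            # up on two pages; the latest copy wins.
            merged[item["id"]] = item
        if cursor is None:
            break
    return list(merged.values())
```

Deduplicating by id only covers items that show up twice; items that get skipped because content shifted the other way would need a second pass or an overlap between pages.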
¶consistency
If you simply start fetching JSON and writing it to disk, you'd very quickly end up with a corrupt file on the first network failure. You've gotta be really careful and ensure atomic writing and updating.
Even if you work around the atomicity issues, chances are you won't be able to guarantee atomic snapshotting as you're fetching your data within multiple requests, and the data is changing as you retrieve it.
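For the on-disk half of the problem, one common approach is to write to a temporary file and rename it over the target. A minimal sketch, assuming the export fits in memory as a single JSON document (the function name is mine, not from any particular tool):

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def atomic_write_json(path: Path, data: Any) -> None:
    """Write JSON so readers see either the old file or the new one, never half."""
    path = Path(path)
    # The temp file lives in the same directory: os.replace is only atomic
    # within a single filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file; the previous export stays untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This only protects the file itself. It does nothing about the snapshotting issue: if the data changed on the server between requests, you'd atomically write an inconsistent snapshot.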
¶rate limiting
No one likes their API hammered, fair enough. However, rate limits often vary from API endpoint to endpoint and are inherently tedious to get right.
If you're not using the API, you might get banned by DDoS prevention (e.g. Cloudflare) if you're not careful.
Overall, painful and not fun to implement.
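To give a feel for the boilerplate involved, here's a rough sketch of retrying with exponential backoff. `RateLimited` is a hypothetical exception that your own fetching code would raise (say, on an HTTP 429 response); it's not part of any library, and real services may want you to honour their own retry hints instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a fetcher raises when the service says 'slow down'."""


def with_backoff(
    fetch: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), doubling the wait after each rate-limit error."""
    for attempt in range(retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of attempts, let the caller decide
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be at least 1")
```

The `sleep` parameter is only there so the waiting can be swapped out in tests. And this is for a single endpoint; with per-endpoint limits you'd need to tune it separately for each one, which is exactly the tedious part.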
¶error handling
Authorization, network, serializing, parsing, storing, synchronizing. These are among the most common error sources (as in, actual unrecoverable errors, not necessarily bugs) in software engineering. Generally, getting it right is required for reliably retrieving your data.
In addition, you want to be somewhat semi-defensive, and this is the hardest kind of error handling:
- you want to progress slowly but surely
- you want to make sure it only fails in completely unrecoverable scenarios, otherwise it's going to require constant tending
- and you want to somehow let the user know of problems/suspicious data
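One possible way to get all three properties at once is to return errors as values alongside the good items instead of raising on the first bad record. This is just a sketch of the idea with made-up names, not a description of any specific tool:

```python
from typing import Callable, Dict, Iterable, Iterator, TypeVar, Union

T = TypeVar("T")
Raw = Dict[str, object]


def parse_all(
    raw_items: Iterable[Raw], parse: Callable[[Raw], T]
) -> Iterator[Union[T, Exception]]:
    """Yield parsed items, and yield (not raise) an error for each bad one."""
    for raw in raw_items:
        try:
            yield parse(raw)
        except Exception as e:
            # Keep going: one malformed record should not abort the export,
            # but it should not vanish silently either.
            yield RuntimeError(f"failed to parse {raw!r}: {e!r}")
```

The caller then decides what to do with the errors: log them, show them in a report, or fail loudly if there are too many. Truly unrecoverable problems (e.g. the export file is missing altogether) would still propagate as normal exceptions.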
¶documentation and discovery
If you want all your data, you have to look carefully through all the documentation and make sure you've got it all covered.
If the service adds some new endpoints, you might never find out.
¶parsing
For the most part not an issue, but some websites do not offer an API so you've got no choice but to scrape and parse HTML.
Notorious example: some Hackernews (!) endpoints like 'favorites' are not exposed via API.
¶abstract representation
Having raw export data (e.g. sqlite database/json file/etc) is nice, but to actually use it you need an abstract representation. You basically have to reinvent whatever the service developer does on the backend already.
- unclear which data types to choose: nullable/non-nullable, string or integer for ids, float or integer for amounts
- timestamps: figuring out whether it was seconds or milliseconds, UTC or local timezone; and zillions of string formats which you need to parse (I had to do it so often that I even memorized the weird argument order in
- which situations are valid, e.g. can id be used as a dictionary key, can you assume that they are increasing, etc.
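To illustrate the timestamp guesswork, here's a small sketch for the seconds-vs-milliseconds case. The threshold is a heuristic I picked for the example, not any kind of standard:

```python
from datetime import datetime, timezone


def parse_epoch(value: float) -> datetime:
    """Turn an epoch number into an aware UTC datetime.

    Heuristic (an assumption, not a standard): values above 1e11 are treated
    as milliseconds, since 1e11 seconds would be thousands of years from now.
    """
    if abs(value) > 1e11:
        value = value / 1000
    return datetime.fromtimestamp(value, tz=timezone.utc)
```

The flip side of the heuristic is that millisecond timestamps from before early 1973 would be misread as seconds. That's fine for most personal data, but it's exactly the kind of assumption you have to rediscover for every service when the export format doesn't document it.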
¶no access to data
Sometimes you have no way to access your data at all:
- you are offline: nuff said
app data on your phone
Very few apps support data exports; even fewer support it in an automatic and regular way. Normally, internally, apps keep their data in sqlite databases which is even more convenient than plaintext/csv export.
However, there are caveats: e.g. on Android, app data is in the /data/data/ directory, which by default isn't accessible unless you've rooted the phone.
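As for why sqlite is convenient: once you do have the database file on your computer (how you got it off the phone is out of scope here), poking around takes a few lines. A minimal sketch, where the path you pass in is whatever file you managed to copy:

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """List table names in a sqlite file without modifying it."""
    # mode=ro opens the database read-only, so exploring can't corrupt
    # the only copy of your data.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

From there it's ordinary SQL. The catch is that the schema is undocumented and may change with any app update.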
Now, remember when I said it was tedious for programmers? Constant boilerplate, shitty APIs (you're lucky if the service offers one at all), latency, flakiness, having to code defensively, etc.
Now think about ordinary people who have no clue what 'API' is. They deserve to use their data too.
¶6 How to make it easier: data mirror
The way I see it, ideally the service you're using provides you with:
a data mirror app
Best case scenario is if the service is local-first in the first place. However, this may be a long way ahead and there are certain technical difficulties associated with such designs.
I'm suggesting a data mirror app which merely runs in background on the client side and continuously/regularly sucks in and synchronizes backend data to the latest state.
Ideally this would be exactly the same state the backend uses, although in practice it would be hard from efficiency considerations (e.g. it's faster for the backend to keep data in the same database instead of separate databases for each user).
It shouldn't be too resource demanding for the backend, e.g. data sync via push notifications basically already does that, but in an even less efficient way.
Data mirror app should dump data into an open machine-friendly format like json/sqlite database/etc.
- authorization: however tedious it is to implement, it can be handled by the service's developers.
They can make it as secure as necessary (e.g. 2FA/etc), and it's okay as long as you have to log onto it only once.
- pagination/consistency/rate limiting: non-problems, considering it's easier for the service's developers to correctly implement incremental data fetching
- error handling: also the developers' responsibility. They would be better aware of which situations are programming bugs and which have to be handled carefully
- documentation and discovery: hopefully developers are better suited to keep their internal representations and exports consistent (even incentivised as it allows for less code to be written)
- backups: will still have to be done by external means, but the task is massively simplified: you just need to point your backup tool at your data storage
minimalistic data bindings in some reasonable programming language that represent all of this data.
Hopefully, the specific language doesn't matter, as it's a simple task to map data from one programming language to another.
- parsing: developers know better how to get it right; in addition the code can potentially be shared with the backend
- abstract representation: would massively lower the barrier for integrating and interacting with data
- offline: if you have all data locally you've got efficient access without latency and need for extra error handling
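To show what I mean by 'minimalistic data bindings', here's a sketch for an imaginary highlights service. The `Highlight` class and all field names are invented for illustration; a real service would ship bindings that match its own schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator, Optional


@dataclass(frozen=True)
class Highlight:
    # Field names are invented for illustration; a real service
    # would define its own schema.
    id: str
    url: str
    text: str
    created: datetime
    note: Optional[str] = None


def load_highlights(export_json: str) -> Iterator[Highlight]:
    """Map raw export records onto typed objects."""
    for raw in json.loads(export_json):
        yield Highlight(
            id=str(raw["id"]),
            url=raw["url"],
            text=raw["text"],
            # assumes the export stores epoch seconds in UTC
            created=datetime.fromtimestamp(raw["created"], tz=timezone.utc),
            note=raw.get("note"),
        )
```

That's all there is to it: decisions like 'ids are strings' and 'timestamps are UTC' get made once, by people who actually know the answer, and everyone building on top gets them for free.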
That's perhaps a naive and oversimplified view. But to be honest, we're so far away from that that even some small steps towards it would be real progress.
These suggestions would decouple data from the UI and let the community develop better tools for consuming and working with it.
this might be hard to support for everyone
On the other hand, service developers would have more control on data access patterns, so in a way it might work better.
It would definitely be more efficient than third parties writing kludgy tools to export and backup data.
In addition, for some services and scenarios, it would give better data locality and lower latencies.
'average' users often are not motivated enough to demand such things
In particular, not everyone has, or is willing to set up, the necessary infrastructure to run all these things.
However, if implemented properly, there is absolutely nothing preventing running a data mirror on your laptop or even phone. It really doesn't require much CPU or bandwidth if you support incremental updates.
services have little motivation to promote this, silos benefit them
Having a monopoly on the client interface (e.g. web UI) keeps users on your platform even if you suck.
If anyone can implement a better interface, there would be little opportunity for stuff like ads, and the only way for the service to make money would be to collect a fee for data collection and hosting (which I personally would be happy to pay).
Hopefully all of these issues would be solved by distributed/federated services, but we're pretty far from it.
E.g. imagine you liked someone's post on Facebook, it got mirrored locally, and then the author removed the post.
What's the right thing to do for the data mirror app? Should it erase just the post you liked from your data mirror? Should it keep the fact that you liked it at all?
You may disagree with the way such a policy is imposed by the service, hence implement additional logic to keep more data, and at that point it seems like a matter for legal debate.
If you want to access data from multiple devices, you either have to run multiple mirrors, which would be a bit of a hassle, or use some continuous sync service like Dropbox or Syncthing.
That however might not be so atomic, depending on the way data is kept on the disk, since files might be pulled in random or lexicographic order, depending on sync configuration.
protecting the data
Even if you don't trust your average startup to secure your data, it might be even less safe on an average user's disk.
it's assumed that these tools/integrations are open source and running on computers you own.
Realistically, closed source tools do exist and it's understandable when people want money for their efforts.
From a user's perspective not everyone wants the hassle of running things locally either and many people are happy with online services for the most part.
¶7 What do I do?
Of course, I'm not expecting someone to come and implement all of this for me. I could start some sort of movement to demand it from services and platforms, but I hardly see myself as a good fit for that role.
Instead I've put effort into exporting, integrating and utilizing my data on my own according to the suggestions I formulated. Putting this in writing helped me motivate and summarize many technical and infrastructural decisions.
I'll be describing my setup in more detail in future posts, however here are some bits and pieces:
¶regular data exports
This corresponds to the 'data mirror' bit.
I exported/scraped/reverse engineered pretty much my entire digital trace and figured out automation and infrastructure which works for me.
I've shared some of my personal export scripts and tools.
I also have some helper scripts to keep individual exporter's code as clean as possible while ensuring exports are reliable.
As I mentioned, I'll share all of this later in a separate post.
¶python package to access data
Each data exporter comes with minimal bindings that merely map json/sqlite export into simple datatypes and data classes.
That way anyone who wishes to use data can kick off some reasonable representation, which is not overfitted to my specific needs.
Higher level querying and access, specific to myself is implemented in my. package (note that this post is still in draft stage).
my. package allows me to query my data from anywhere, enabling me to use familiar data processing, analysis and visualization tools, and various integrations.
As a nice byproduct I've also finally figured out a reliable and elegant way to deal with error handling in Python.
¶how do I use the data?
Finally, some tools and scripts I've implemented to make possible the interactions that I want:
- A personal search engine for quick incremental search of my data and digital trace
- orger: tool to convert data into org-mode views for instant and offline search and overview
- grasp, browser extension to clip links straight into my org-mode notes
- promnesia, a browser extension to escape silos by unifying annotations and browsing history from different data sources (still somewhat WIP and needs final touches, but planning to release soon)
- personal health, sleep and exercise dashboard, built from various data sources. I'm in the process of making it public, you can see some screenshots here
I wrote about how each specific data source I export contributes to my personal infrastructure here.
¶8 Related links
I'd be interested to know your opinion or questions, whether on my motivation, or particularities of my suggestions or implementation.
Let me know if you can think of any other data integrations you are missing and perhaps we can think of something together!