Orger: plaintext reflection of your digital self

Mirror your personal data as org-mode for instant access and search

TLDR: I'll write about orger, a tool I'm using to convert my personal data into easily readable and searchable org-mode views. I'll present some examples and use cases that will hopefully be helpful to you even if you are not sold on using my tool.

There is also a second part where I explain how it can be used to process Reddit, create quick tasks from Telegram messages and help with spaced repetition. If you're impatient, you can jump straight to a demo.

1 Intro

I consume lots of digital content (books, articles, Reddit, Youtube, Twitter, etc.) and most of it I find somewhat useful and insightful. I want to use that knowledge later, act and build on it. But there's an obstacle: the human brain.

It would be cool to be capable of always remembering and instantly recalling information you've interacted with, metadata and your thoughts on it. Until we get augmented though, there are two options: the first is just to suck it up and live with it. You might have guessed this is not an option I'm comfortable with.

The second option is compensating for your sloppy meaty memory and having information you've read at hand and a quick way of searching over it.

That sounds simple enough but as with many simple things, in practice you run into obstacles. I'll give some I've personally been overcoming as examples:

  • convenience of access, e.g.:
    • to access highlights and notes on my Kobo ebook I need to actually reach my reader and tap through the e-ink touch screen. Not much fun!
    • if you want to search over annotations in your PDF collections… well good luck, I'm just not aware of such a tool. It's actually way worse: many PDF viewers wouldn't even let you search in highlights within the file you're currently viewing.
    • there is no easy way to quickly access all of your twitter favorites; people suggest using hacks like an autoscroll extension.
  • searching data, e.g.:
    • search function often just isn't available at all, e.g. on Instapaper, you can't restrict search to highlights. If it is available, it's almost never incremental.
    • builtin browser search (Ctrl-F) sucks for the most part: it's not very easy to navigate as you don't get previews and you have to look through every match
    • sometimes you vaguely recall reading about something or seeing a link, but don't remember where exactly. Was it on stackoverflow? Or in some github issue? Or in a conversation with a friend?
  • data ownership and liberation, e.g.
    • what happens if the data disappears, or the service is down (temporarily/permanently) or banned by your government?

      You may think you live in a civilized country and that would never affect you. Well, in 2018, Instapaper was unavailable in Europe for several months (!) due to missing the GDPR deadline.

    • 99% of services don't have support for offline mode. This may be just a small inconvenience if you're on a train or something, but there is more to it. What if some sort of apocalypse happens and you lose all access to data? That depends on your paranoia level of course, and apocalypse is bad enough as it is, but my take on it is that at least I'd have my data :)
    • if you delete a book on Kobo, not only can you no longer access its annotations, but they also seem to get wiped from the database.

Thinking about that and tinkering helped me understand what I want: some sort of search engine over my personal data, with a uniform and always available way of accessing it.

So, I present to you a system that I've developed and that solves all my problems™: orger.

2 What Orger does

It's really so trivial that it's almost stupid. Orger provides a simple Python API to render any data as an Org-mode file. It's easier to give an example:

from orger import StaticView
from orger.inorganic import node, link
from orger.common import dt_heading

import my.github_data

class Github(StaticView):
  def get_items(self):
    for event in my.github_data.get_events():
      yield node(dt_heading(event.dt, event.summary))

Github.main()

That ten-line program results in a file Github.org:

# AUTOGENERATED BY /code/orger/github.py

* [2016-10-30 Sun 10:29] opened PR Add __enter__ and __exit__ to Pool stub
* [2016-11-10 Thu 09:29] opened PR Update gradle to 2.14.1 and gradle plugin to 2.1.1
* [2016-11-16 Wed 20:20] commented on issue Linker error makes it impossible to use a stack-provided ghc
* [2016-12-30 Fri 11:57] commented on issue Fix performance in the rare case of hashCode evaluating to zero 
* [2019-09-21 Sat 16:51] commented on issue Tags containing letters outside of a-zA-Z
....

Even with event summaries only it can already be very useful to search over. What you can potentially do really depends on your imagination and needs! You can also add:

  • links
  • tags
  • timestamps
  • properties
  • child nodes

See 'Examples' section for more.
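To give a feeling for how those extras map onto Org-mode text, here's a minimal standard-library sketch. It is not Orger's or inorganic's actual code: SketchNode, org_link and render are hypothetical names made up for illustration, and a real library has to handle more escaping than this does.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in for an outline entry; not Orger's real class."""
    heading: str
    tags: List[str] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)
    created: Optional[datetime] = None
    children: List["SketchNode"] = field(default_factory=list)


def org_link(url: str, title: str) -> str:
    # Org-mode link syntax: [[target][description]]
    return f"[[{url}][{title}]]"


def render(n: SketchNode, level: int = 1) -> str:
    # a heading must fit on one line, so collapse newlines and whitespace runs
    heading = " ".join(n.heading.split())
    if n.created is not None:
        # inactive timestamp, e.g. [2016-10-30 Sun 10:29]
        heading = n.created.strftime("[%Y-%m-%d %a %H:%M]") + " " + heading
    if n.tags:
        heading += " :" + ":".join(n.tags) + ":"
    lines = ["*" * level + " " + heading]
    if n.properties:
        # property drawer goes right below the heading
        lines.append(":PROPERTIES:")
        lines.extend(f":{k}: {v}" for k, v in n.properties.items())
        lines.append(":END:")
    # children become entries one level deeper
    lines.extend(render(c, level + 1) for c in n.children)
    return "\n".join(lines)


if __name__ == "__main__":
    pr = SketchNode(
        heading="opened PR " + org_link("https://example.com/pr", "Add __enter__ and __exit__"),
        tags=["github"],
        created=datetime(2016, 10, 30, 10, 29),
        children=[SketchNode(heading="a child entry")],
    )
    print(render(pr))
```

The point is only that every item on the list above is plain text with a little bit of syntax, which is what makes the resulting files greppable.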

So as you can see, orger itself really isn't a sophisticated tool, at least until you spend time trying to reimplement the same thing. As always, the devil is in the details (look at that cheeky my.github_data import), which I'll explain further.

3 Demo: displaying Pocket data via Orger

I've documented one of the modules, pocket_demo, so you can get a sense of what using Orger is like.

#!/usr/bin/env python3
"""
Demo Orger adapter for Pocket data. For documentation purposes, so please modify pocket.py if you want to contribute.
"""

"""
First we define some abstractions for Pocket entities (articles and highlights).

While it's not that necessary and for one script you can get away with using json directly,
 it does help to separate parsing and rendering, allows you to reuse parsing for other projects
 and generally makes everything clean.

Also see https://github.com/karlicoss/HPI package for some inspiration.
"""


from datetime import datetime
from pathlib import Path
from typing import NamedTuple, Sequence, Any

class Highlight(NamedTuple):
    """
    Abstract representation of Pocket highlight
    """
    json: Any

    @property
    def text(self) -> str:
        return self.json['quote']

    @property
    def created(self) -> datetime:
        return datetime.strptime(self.json['created_at'], '%Y-%m-%d %H:%M:%S')


class Article(NamedTuple):
    """
    Abstract representation of Pocket saved page
    """
    json: Any

    @property
    def url(self) -> str:
        return self.json['given_url']

    @property
    def title(self) -> str:
        return self.json['given_title']

    @property
    def pocket_link(self) -> str:
        return 'https://app.getpocket.com/read/' + self.json['item_id']

    @property
    def added(self) -> datetime:
        return datetime.fromtimestamp(int(self.json['time_added']))

    @property
    def highlights(self) -> Sequence[Highlight]:
        raw = self.json.get('annotations', [])
        return list(map(Highlight, raw))

    # TODO add tags?


def get_articles(json_path: Path) -> Sequence[Article]:
    """
    Parses Pocket export produced by https://github.com/karlicoss/pockexport
    """
    import json
    raw = json.loads(json_path.read_text())['list']
    return list(map(Article, raw.values()))

"""
Ok, now we can get to implementing the adapter.
"""
from orger import Mirror
"""
Mirror means it's meant to be read-only view onto data (as opposed to Queue).
"""
from orger.inorganic import node, link
from orger.common import dt_heading


class Pocket(Mirror):
    def get_items(self):
        """
        get_items returns a sequence/iterator of nodes
        see orger.inorganic.OrgNode to find out about attributes you can use
        """
        export_file = self.cmdline_args.file # see setup_parser
        for a in get_articles(export_file):
            yield node(
                heading=dt_heading(
                    a.added,
                    # 'pocket' permalink is pretty convenient to jump straight into Pocket app
                    link(title='pocket', url=a.pocket_link)  + ' · ' + link(title=a.title, url=a.url),
                ),
                children=[node( # comments are displayed as org-mode child entries
                    heading=dt_heading(hl.created, hl.text)
                ) for hl in a.highlights]
            )


def setup_parser(p):
    """
    Optional hooks for extra arguments if you need them in your adapter
    """
    p.add_argument('--file', type=Path, help='JSON file from API export', required=True)


if __name__ == '__main__':
    """
    Usage example: ./pocket.py --file /backups/pocket/last-backup.json --to /data/orger/pocket.org
    """
    Pocket.main(setup_parser=setup_parser)

"""
Example pocket.org output:

# AUTOGENERATED BY /L/zzz_syncthing/coding/orger/pocket.py

* [2018-07-09 Mon 10:56] [[https://app.getpocket.com/read/1949330650][pocket]] · [[https://www.gwern.net/Complexity-vs-AI][Complexity no Bar to AI - Gwern.net]]
** [2019-09-22 Sun 03:36] iving up determinism and using randomized algorithms which are faster but may not return an answer or a correct answer1
** [2019-06-22 Sat 16:48] The apparent barrier of a complex problem can be bypassed by (in
* [2016-10-21 Fri 14:42] [[https://app.getpocket.com/read/1407671000][pocket]] · [[https://johncarlosbaez.wordpress.com/2016/09/09/struggles-with-the-continuum-part-2/][Struggles with the Continuum (Part 2) | Azimuth]]
* [2016-05-31 Tue 18:25] [[https://app.getpocket.com/read/1042711293][pocket]] · [[http://www.scottaaronson.com/blog/?p=2464][Bell inequality violation finally done right]]
* [2016-05-31 Tue 18:24] [[https://app.getpocket.com/read/1188624587][pocket]] · [[https://packetzoom.com/blog/how-to-test-your-app-in-different-network-conditions.html][How to test your app in different network conditions -]]
* [2016-05-31 Tue 18:24] [[https://app.getpocket.com/read/1191143185][pocket]] · [[http://www.schibsted.pl/2016/02/hood-okhttps-cache/][What's under the hood of the OkHttp's cache?]]
* [2016-03-15 Tue 17:27] [[https://app.getpocket.com/read/1187239791][pocket]] · [[http://joeduffyblog.com/2016/02/07/the-error-model/][Joe Duffy - The Error Model]]
** [2019-09-25 Wed 18:20] A bug is a kind of error the programmer didn’t expect. Inputs weren’t validated correctly, logic was written wrong, or any host of problems have arisen.
** [2019-09-25 Wed 18:19] First, throwing an exception is usually ridiculously expensive. This is almost always due to the gathering of a stack trace.
** [2019-09-25 Wed 18:20] In other words, an exception, as with error codes, is just a different kind of return value!
"""
The output, as rendered:

[2018-07-09 Mon 10:56] pocket · Complexity no Bar to AI - Gwern.net

[2016-03-15 Tue 17:27] pocket · Joe Duffy - The Error Model

[2019-09-25 Wed 18:20] A bug is a kind of error the programmer didn’t expect. Inputs weren’t validated correctly, logic was written wrong, or any host of problems have arisen.

[2019-09-25 Wed 18:19] First, throwing an exception is usually ridiculously expensive. This is almost always due to the gathering of a stack trace.

[2019-09-25 Wed 18:20] In other words, an exception, as with error codes, is just a different kind of return value!

As you can see, it's very easy to see and search in all of your highlights. Clicking on 'pocket' will jump straight to the Pocket web app to the article you were reading.

4 More examples

I'm using more than ten different Orger modules, most of which I've moved into the repository. Here I'll describe some featured views I'm generating.

To give you a heads up, if you read the code, you'll see a bunch of imports like from my.hypothesis import .... I find it easier to move all data handling into a separate my package that deals with parsing and converting input data (typically, some JSON). That makes everything less messy, separates data and rendering and lets me reuse abstract models in other tools. It also lets me access my data from any Python code, which makes it way easier to use and interact with.

Some of these are still private so if you're interested in something not present in the github repo, please don't be shy and open an issue, so I can prioritize.

Hopefully the code is readable enough and will give you some inspiration. If you find something confusing, or you write your own module and want to contribute, please feel free to open an issue/PR!

instapaper

Instapaper doesn't have search over annotations, so I implemented my own!

hypothesis

Hypothesis does have search, but it's still way quicker for me to invoke search in Emacs (takes literally less than a second) than to do that in a web browser.

kobo

Generates views for all highlights and comments along with book titles from my Kobo database export.

pinboard

Searches over my Pinboard bookmarks.

pdfs

Crawls my filesystem for PDF files and collects all highlights and comments in a single view.

twitter

It's got two modes:

  • First mode generates a view of everything I've ever tweeted, so I can search over it.
  • Second mode generates a view of all older tweets from previous years posted on the same day. I find it quite fascinating to read through it and observe how I've been changing over the years.
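The second mode boils down to a filter on month and day. Here's a rough sketch of that idea, assuming a hypothetical Tweet model with a created timestamp (the real module works on my own data export, so the names here are made up):

```python
from datetime import date, datetime
from typing import Iterable, List, NamedTuple


class Tweet(NamedTuple):
    """Hypothetical minimal tweet model, for illustration only."""
    created: datetime
    text: str


def on_this_day(tweets: Iterable[Tweet], today: date) -> List[Tweet]:
    # keep tweets from earlier years that share today's month and day
    matches = [
        t for t in tweets
        if (t.created.month, t.created.day) == (today.month, today.day)
        and t.created.year < today.year
    ]
    # oldest first, so the view reads chronologically
    return sorted(matches, key=lambda t: t.created)


if __name__ == "__main__":
    tweets = [
        Tweet(datetime(2017, 9, 21, 8, 0), "two years ago"),
        Tweet(datetime(2015, 9, 21, 12, 0), "four years ago"),
        Tweet(datetime(2016, 9, 22, 9, 0), "different day"),
    ]
    for t in on_this_day(tweets, date(2019, 9, 21)):
        print(t.created.year, t.text)
```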

rtm2org

I stopped using Remember The Milk a while ago, but there are still some tasks and notes I've left behind, which I'm slowly moving to org-mode or canceling over time.

telegram2org

Lets me create todo tasks from Telegram messages in a couple of taps (you can't use the share function on them in Android).

I write about it in the second part.

reddit2org

Displays and lets me search my Reddit saved posts/comments.

I write about it in the second part.

Roam Research

Mirrors Roam Research database, allowing for quick search and navigation within Emacs.

I write about it in detail here; there is also a video demo.

5 It seems trivial?

Does that really deserve a post?

Well yeah it really does seem simple… until you try to do it.

  • emitting Org-mode

    While it's plaintext, and generating simple outlines is trivial, with more sophisticated inputs there is some nasty business of escaping and sanitizing that has to be dealt with. I didn't manage to find any Python libraries capable of emitting Org-mode. The only project I knew of was PyOrgMode, but the author abandoned it.

    When it comes to generating 10+ views from different data sources, you really want to make sure it's as little effort and minimal boilerplate as it can possibly be.

    That's how the inorganic library was born.

  • accessing data sources and exposing it through Python interfaces

    This is probably where most of the effort was spent. All sorts of stupid APIs, tedious parsing, you can imagine.

    I write about it in detail in "Human Programming Interface".

  • keeping track of already processed items for Interactive views

    Because there is no feedback from org-mode files back to data sources, you want to keep track of items already added in the file, otherwise you're gonna have duplicates.

    It's not rocket science of course, but it is quite tedious. There is some additional logic that checks for lock files, makes sure writes are atomic, etc. You really don't want to implement it more than once. I figured it was worth extracting this 'pattern' in a separate python module.
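The bookkeeping from the last point can be sketched roughly like this. It is only an illustration under assumptions, not Orger's actual implementation: the JSON state file, append_new and atomic_write are hypothetical names, and the lock file checks are left out.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def load_seen(state_path: Path) -> Set[str]:
    # first run: no state file yet, so nothing has been processed
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text()))


def atomic_write(path: Path, text: str) -> None:
    # write to a temporary file in the same directory, then rename it over the
    # target; os.replace is atomic when both are on the same filesystem
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # don't leave a half-written temporary file behind
        raise


def append_new(org_path: Path, state_path: Path,
               items: Iterable[Tuple[str, str]]) -> List[str]:
    """items are (unique id, heading) pairs; returns the ids that were added."""
    seen = load_seen(state_path)
    fresh = []
    for key, heading in items:
        if key in seen:
            continue  # already in the file, skip to avoid a duplicate
        seen.add(key)
        fresh.append((key, heading))
    if not fresh:
        return []
    existing = org_path.read_text() if org_path.exists() else ""
    atomic_write(org_path, existing + "".join(f"* {h}\n" for _, h in fresh))
    # the state is saved only after the org file was written successfully
    atomic_write(state_path, json.dumps(sorted(seen)))
    return [key for key, _ in fresh]
```

Rewriting the whole file through a rename (rather than appending in place) means a reader never sees a half-written file, at the cost of copying the existing contents on every update.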

6 What makes Orger good?

  • it solves my problems!

    I won't go long into Org-mode propaganda, there are people that do it better than me out there, but for me it's good because it's a decent balance between ease of use and ease of augmenting.

    • it's easy to do unstructured (i.e. grep) or structured (i.e. tag search in emacs) search on any of your devices be it desktop or phone
    • you can open it anywhere you can open a text file
    • tasks are as easy to create as any other Org outline, so it can integrate with your todo list and agenda (see more in the second part).
  • it doesn't require Emacs

    If you're not willing to go full on Emacs, you can still benefit from this setup by using a plaintext viewer and search tool of your choice.

  • written in Python. I don't claim at all that Python is the best programming language, but it's the one I'm most productive in, as are many other people.

    Also, the fact that it's a real programming language rather than some YAML config means you can do anything and are not restricted by some stupid DSL.

  • it's extremely easy to add new views — a matter of 10-20 lines of code.
  • agnostic to what you feed in it – it could be offline data from your regular backups, or it could be fresh API data. Again, it's a real programming language, you can do literally anything.

7 Using Orger views

Apart from, obviously, opening an Org-mode file in your favorite text editor, one major strength of this system is being able to effortlessly search over the views.

I'm writing extensively about my information search setup here. In summary:

  • on my desktop I'm just using spacemacs or cloudmacs from web browser
  • in Emacs, I'm usually just using helm-ag with ripgrep
  • sometimes helm-swoop is very convenient
  • org-tags-view or helm-org-ql for structured Org-mode search
  • I've got hotkeys set up that invoke Emacs window with search prompt in a blink

On my phone I'm using:

You can also set up some proper indexing daemon like recoll.
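If you don't have any of these tools at hand, the unstructured search is simple enough to approximate in plain Python. This is just a sketch of the idea and not part of Orger; search_org is a hypothetical helper, and ripgrep will be much faster on large collections.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def search_org(root: Path, query: str) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for each line containing query, ignoring case."""
    needle = query.casefold()
    for path in sorted(root.rglob("*.org")):
        # errors="replace" keeps the search going if a file has stray bytes
        with path.open(encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if needle in line.casefold():
                    yield path, lineno, line.rstrip("\n")


if __name__ == "__main__":
    # throwaway directory standing in for the real directory with Orger views
    root = Path(tempfile.mkdtemp())
    (root / "github.org").write_text(
        "* [2019-09-21 Sat 16:51] commented on issue Tags containing letters outside of a-zA-Z\n"
    )
    for path, lineno, line in search_org(root, "tags containing"):
        print(f"{path.name}:{lineno}: {line}")
```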

Typical use patterns

I'll just give some of my use cases:

  • While running tests for orgparse I started randomly getting AssertionError: Cannot find component 'A@3' for 'orgparse.A@3.

    I recall that I had the same issue a few months ago but don't quite remember what the fix was. I press F1, which invokes helm-ag for me, and type 'cannot find component'. I instantly find a github issue I opened in github.org and figure out what I need to do to work around the problem.

  • While discussing special relativity with a friend, I recall watching some intuitive rationale for Maxwell's equations, but don't quite recall which video it was.

    I press F1, type 'Special relativity' and instantly get a few results, in particular this awesome Veritasium video in youtube.org, which I was looking for.

  • Recommending books

    I often struggle to recall the details why I liked a particular book, especially fiction. Having all annotations in my kobo.org file lets me quickly look up and skim through highlighted bits, so I can freshen up my memory.

8 Potential improvements

TODO more frequent, ideally realtime updates to views

If the API doesn't provide a push-based interface (and most of them don't), ultimately it's a question of polling it carefully to avoid rate limiting penalties.
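"Polling carefully" could look something like the sketch below: a fixed interval between requests, doubled (up to a cap) whenever the service complains. None of this exists in Orger; RateLimited, poll and their parameters are hypothetical names, and a real API would signal rate limiting in its own way (e.g. an HTTP 429 response).

```python
import time
from typing import Callable, Iterator, List, Optional


class RateLimited(Exception):
    """Hypothetical signal that the API asked us to slow down."""


def poll(fetch: Callable[[], List],
         interval: float = 60.0,
         max_interval: float = 3600.0,
         sleep: Callable[[float], None] = time.sleep,
         rounds: Optional[int] = None) -> Iterator[List]:
    """Call fetch repeatedly, yielding its results and backing off when rate limited."""
    delay = interval
    done = 0
    while rounds is None or done < rounds:
        try:
            items = fetch()
        except RateLimited:
            # back off exponentially, but never wait longer than max_interval
            delay = min(delay * 2, max_interval)
        else:
            # success: go back to the normal polling interval
            delay = interval
            yield items
        done += 1
        sleep(delay)
```

Passing sleep and rounds as parameters is only there to make the loop easy to try out without actually waiting.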

TODO alternative export formats

There is nothing in this system that's really specific to Org-mode. For instance, there are markdown-based organizers out there and people could benefit from using Orger for them.

TODO two-way data flow

It would be cool to implement feedback from emacs, e.g. editing Github comment when you edit the corresponding Orger item. But it requires considerably more effort and would only work within emacs.

TODO potential for race condition

Unfortunately there is a little room for a race condition if Orger appends something while you're editing the file. Orger tries to detect emacs and vim swap/lock files, but if you're very unlucky or using a different setup, it's still possible. Hopefully your text editor warns you when the file has been overwritten while you were editing it (as emacs does, for example).

Also, I run Orger jobs at night (via cron) so it's quite unlikely to overlap with editing anything.
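For the curious, detecting an open editor is mostly a matter of looking for well-known file names next to the target. The sketch below is an assumption about how such a check can be done, not a copy of Orger's logic, and editor_lock_files is a hypothetical name. It relies on two conventions: Emacs creates a lock symlink named .#<file> while a buffer has unsaved changes, and Vim keeps a swap file named .<file>.swp by default.

```python
import tempfile
from pathlib import Path
from typing import List


def editor_lock_files(target: Path) -> List[Path]:
    """Return lock/swap files suggesting that target is open in an editor."""
    parent, name = target.parent, target.name
    candidates = [
        parent / f".#{name}",     # Emacs lock file (a symlink, usually dangling)
        parent / f".{name}.swp",  # Vim's default swap file name
    ]
    # is_symlink() catches dangling Emacs locks, for which exists() is False
    return [p for p in candidates if p.exists() or p.is_symlink()]


if __name__ == "__main__":
    folder = Path(tempfile.mkdtemp())
    target = folder / "pocket.org"
    target.write_text("")
    (folder / ".pocket.org.swp").write_text("")  # pretend Vim has the file open
    if editor_lock_files(target):
        print("looks like the file is being edited, skipping this run")
```

Other editors and non-default settings (e.g. a different Vim swap directory) would slip through, which is exactly the "different setup" caveat above.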

9 Similar projects

  • Memacs by Karl Voit.

    I only discovered it after I released Orger, so frankly I haven't had a chance to try it yet! It looks very similar in terms of goals, and it seems we could cooperate on the rendering parts at least.

    One (as I see it) big advantage of my setup is that data providers are abstracted away in my. package, which makes everything more modular and resilient. However Memacs seems to be flexible as well, so it can be used with e.g. my. package as well.

    If someone compares Memacs and Orger please let me know, I'd be happy to link it! It would also be more objective than comparison by myself!

10 ----

I'd be interested in hearing your thoughts or feature requests.

This post ended up longer than I expected, so in the next part I will talk about more use cases, in particular how I'm using Orger to process Reddit.

