cachew

⭐ Cachew

Cachew lets you cache function calls into an sqlite database on your disk in a matter of single decorator (similar to functools.lrucache).
The difference from functools.lrucache is that cached data is persisted between program runs, so next time you call your function, it will only be a matter of reading from the cache.
Cache is invalidated automatically if your function's arguments change, so you don't have to think about maintaining it.

https://github.com/karlicoss/cachew#what-is-cachew

Table of Contents

[A] * motivation cachew

TODO [C] why database cachew

CREATED: [2020-10-11]
  • compact
  • fast
  • easy iteration
  • possible to use as data interface by querying the db directly

why using separate columns at all

  • actually appart from potentially nicer queries I'm not sure. I wonder if more time is spent serializing as opposed to simply using json?

TODO [C] nice feature is that it's easy to compare changes… I can simply run sqldiff cachew

CREATED: [2020-10-09]

DONE [B] Sqlalchemy orm requires using their adapters (check that). Also check if they support nested stuff cachew

CREATED: [2019-07-16]

[2019-07-25] sqlalchemy core doesn't support binding? that dataset got their own ResultIter adapter cachew

[2019-07-25] https://docs.sqlalchemy.org/en/13/core/custom_types.html somewhat awkward and invasive for existing types cachew

[2019-07-30] performance: I haven't tested cachew

TODO [C] combines some of my favorite python features: type annotations, decorators, generators, context managers cachewpythontoblog

CREATED: [2019-05-01]

[D] [2019-07-19] Is it possible to use a namedtuple with SQLalchemy? : Python cachew

https://www.reddit.com/r/Python/comments/kz7vj/is_it_possible_to_use_a_namedtuple_with_sqlalchemy/

sqlalchemy needs to attach attributes onto a mapped class, which include one for each column mapped, as well as an attribute _sa_instance_state for bookkeeping. It also, as someone mentioned, needs to create a weak reference to the object. So a SQLAlchemy-mapped class needs to be a "new style class", extending from object.
If you look at how namedtuple works, it has quite a bit of trickery going on, not the least of which is that it sets __slots__ to (). Then there's some super-nasty frame mutation stuff after that to somehow make it pickleable.
That said it's straightforward to make your own class that is SQLAlchemy compatible and acts more or less the same as a namedtuple. Just define __iter__, __len__, __repr__, other methods as appropriate. Plus it will be pickleable without injecting variables into stack frames.

[A] * prior art/similar projects cachew

STRT [B] cattrs cachew

CREATED: [2020-05-04]
When you're handed unstructured data (by your network, file system, database...), cattrs helps to convert this data into structured data.
When you have to convert your structured data into data types other libraries can handle, cattrs turns your classes and enumerations into dictionaries, integers and strings.

[2020-05-16] ok, not sure if it supports NT/dataclasses? Hypothesis tests are nice though cachew

TODO [C] [2020-05-26] freelancer/pavlova: A python deserialisation library built on top of dataclasses cachewcache

[2020-07-13] register types in the same way? https://github.com/freelancer/pavlova#adding-custom-types cachewcache

[C] [2020-03-11] unmade/apiwrappers: Build API clients that work both with regular and async code cachewaxol

https://github.com/unmade/apiwrappers

Modern - decode JSON with no effort using dataclasses and type annotations

hmm kind of similar to what I'm doing/want for axol?

[B] * potential features cachew

STRT [A] thinking about incremenal caching cachew

CREATED: [2020-07-25]
  • caching diffs

    • reasonable perf boost
    • relatively easy? just ignore 'emitted'?
    • automatically works for changed prefix (bleanser)

    ? requires changes to cachew key handling

    • might be still slowish
  • explicitly querying for cached prefix
    • best performance
    • fairly easy
    • almost no changes to cachew
    • requires restructuting code in a specific way, mcachew thing might be harder
    • gonna be tricky if the prefix can cahnge (bleanser) (although if we can probe for a cached key, it can work?)
  • single cachew decorator (not sure if possible?)
    • best performance
    • pretty simple
    • still requires 'hack' in the caller for detecting if something was cached or not
    • if prefix changes won't work
    • requires cachew changes? some sort of global function context? pretty unclear how to implement
      • with cached(f) as cf:
        return cf(someargs)
      • ok, maybe by default cachew is 'recursive'?
        when we enter the function, we memorize the argument that needs to be cached, but we don't lock the database yet?
        so, let's consider
        @cachew
        def factorials(n: int) -> int:
        last = 1
        for prev in factorials(n - 1):
        yield prev
        yield prev * n
        say, we've run factorials(3) before, the cache has [1, 2, 6]
        factorials(5), cachew memorizes {factorials: 5}, it's the one being computed, goes inside the function
        factorials(4) – not in the cache. so it goes inside and tries evaluating factorials(3)
        factorials(3) – in the cache, cachew opens the db ans starts emitting?
        factorials(4) shouldn't be writing becase it's not the one being computed
        factorials(5) on the other hand should start writing
        kind of a problem however is that it reads and writes at the same time.. I guess that could work with transactions?

TODO [C] preserve traceback? cachew

CREATED: [2020-10-18]

TODO [B] cache is gonna be expired several times a day anyway judging by bleansed backups… so I kind of need to do incremental anyway cachewbleanserhpireddit

CREATED: [2020-06-21]

TODO [B] maybe instead of key equality, use key comparison? assume that if the key is bigger, in includes all the data for smaller keys cachew

CREATED: [2020-07-14]

TODO [B] could cache as a Protocol.. and then reconstruct back a dataclass? odd but could work? cachew

CREATED: [2020-10-13]

TODO [C] not sure how to compute dependencies automatically? cachew

CREATED: [2019-07-25]

TODO [C] should be like Logger. global default + instances for more customization cachew

CREATED: [2020-05-16]

TODO [C] keep data along with hash in the same table? cachew

CREATED: [2020-01-05]

feels a bit more atomic…

TODO [C] create database, continuously updated by an iterable? could be useful for logs cachew

CREATED: [2020-01-14]

STRT [C] for upgradeable storage – I guess it should be a special function, first argument is an iterable that will be populated from the cache regardless. then it's up to the caller to determine what to process? cachewpromnesia

CREATED: [2020-07-24]

TODO [C] try using with classmethods? https://hynek.me/articles/decorators/#tldr cachew

CREATED: [2020-01-06]

TODO [C] for persisting, I guess it makes sense to use namedtuples, not just json, e.g. custom sql queries might actually use structure cachew

CREATED: [2020-01-02]

TODO [C] Support anon tuples? As long as they are typed… cachew

CREATED: [2019-08-05]

[2019-08-14] tried to implement tuples support… but it's just too freaking hacky… cachew

def test_typing_tuple(tmp_path):
    tdir = Path(tmp_path)

    @cachew(tdir / 'cache')
    def get_data() -> Iterator[Tuple[str, int]]:
        yield ('first' , 1)
        yield ('second', 2)

    assert list(get_data())[-1][0] == 'second'
    assert list(get_data())[-1][1] == 2

TODO [C] [2020-01-13] Shit! If merging is implemented recursivelyz like Fibonacci, cachew could support properly incremental exports? cachew

STRT [B] use appdirs cachew

CREATED: [2021-02-14]

TODO [C] Hide tail call optimization in cachew?? Hmmm cachew

CREATED: [2020-12-19]

TODO [B] env variable to turn it off?? cachew

CREATED: [2021-03-07]

TODO [C] make it depend on the git hash? I guess global override would be nice cachew

CREATED: [2020-10-19]

[B] * publicity cachewpublish

[C] [2020-04-09] Pyfiddle cachewpublishdemo

DONE [D] [2020-01-09] PyCoder’s Weekly on Twitter: "cachew: Persistent Cache/Serialization Powered by Type Hints https://t.co/x587YrhtLE" / Twitter cachewpublish

TODO [B] Link to hpi draft and exports draft cachewpublishhpiexports

CREATED: [2020-01-06]

STRT [C] could post on HN and lobsters as well cachewpublish

CREATED: [2019-11-04]

TODO [C] Perhaps merging bluemaesro databases could be a good example? cachewpublishbluemaestro

CREATED: [2019-08-05]

TODO [C] could demonstrate this? cachewpublish

CREATED: [2020-01-06]

perhaps with more processing difference would be even more striking…

$ time with_my python3 -c 'from my.bluemaestro import get_dataframe; print(get_dataframe())'
USING CACHEW!!!
                     temp
dt
2018-07-15 02:57:00  24.3
2018-07-15 02:58:00  24.3
2018-07-15 02:59:00  24.3
2018-07-15 03:00:00  24.3
2018-07-15 03:01:00  24.3
...                   ...
2019-07-27 10:42:00  23.8
2019-07-27 10:43:00  23.8
2019-07-27 10:44:00  23.8
2019-07-27 10:45:00  23.8
2019-07-27 10:46:00  23.8

[549054 rows x 1 columns]
with_my python3 -c   3.32s user 0.36s system 111% cpu 3.296 total
$ time with_my python3 -c 'from my.bluemaestro import get_dataframe; print(get_dataframe())'
                     temp
dt
2018-07-15 02:57:00  24.3
2018-07-15 02:58:00  24.3
2018-07-15 02:59:00  24.3
2018-07-15 03:00:00  24.3
2018-07-15 03:01:00  24.3
...                   ...
2019-07-27 10:42:00  23.8
2019-07-27 10:43:00  23.8
2019-07-27 10:44:00  23.8
2019-07-27 10:45:00  23.8
2019-07-27 10:46:00  23.8

[549054 rows x 1 columns]
with_my python3 -c   16.03s user 0.37s system 102% cpu 16.019 total

STRT [C] temperature during sleep analysis cachewpublish

CREATED: [2019-08-04]

TODO [C] Might be actually worth a separate post; using it in promnesia and axol as well cachewpublishtoblog

CREATED: [2020-01-12]

[C] * readme/docs cachew

[B] [2020-10-09] karlicoss/cachew: Transparent and persistent cache/serialization powered by type hints cachew

During reading cache all that happens is reading rows from sqlite and mapping them onto your target datatype, so the only overhead would be from reading sqlite, which is quite fast.

ugh, grammar is a bit odd

TODO [C] add autocomplete docs? cachewliterate

CREATED: [2019-08-05]

STRT [C] Come up with a decent example.. cachew

CREATED: [2020-01-05]

Maybe even dal is fine if I illustrate it by integration test?

[2020-01-08] pdf annotations could be a really good one. MASSIVE difference cachew

DONE Use ipynb for docs? cachewipythonliterate

CREATED: [2019-08-15]

[2019-08-18] pretty nice actually! cachewipythonliterate

DONE generate readme from unit tests? cachewliterate

CREATED: [2019-08-11]

TODO [B] [2020-11-14] karlicoss/cachew: Transparent and persistent cache/serialization powered by type hints cachew

add a super simple, trivial example. just with some dictionaries maybe?

STRT [D] example could be merging of highlights from different sources, e.g. kobo and kindle cachewtoblog

CREATED: [2019-04-21]

TODO [C] post example log? cachew

CREATED: [2019-07-27]

[2019-07-30] err, log of what? cachew

STRT [C] fuck, if I want people to use it, I'm gonna need some documentation… cachew

CREATED: [2019-07-30]

[C] * performance & profiling cachewperformance

generally it's fast enough or at least 'much faster', not that it's super high priority…

TODO [D] some old experiment on speeding up cachewperformance

CREATED: [2019-08-11]
from sqlalchemy.interfaces import PoolListener # type: ignore
# TODO ugh. not much faster...
class MyListener(PoolListener):
    def connect(self, dbapi_con, con_record):
        pass
        # eh. doesn't seem to help much..
        # dbapi_con.execute('PRAGMA journal_mode=MEMORY')
        # dbapi_con.execute('PRAGMA synchronous=OFF')
# self.db = sqlalchemy.create_engine(f'sqlite:///{db_path}', listeners=[MyListener()])

[C] [2019-07-25] profiling cachewperformance

testdbcachemany

de8b67cd0896e0b7512d276a5bb0fc9784ea9a49

100K: about  3.0 seconds
500K: about 15.5 seconds
  1M: about 29.4 seconds

after updating to nice binders

100K: about  3.2 seconds
500K: about 15.6 seconds
  1M: about 31.5 seconds

[2019-07-30] I haven't bothered much with profiling and optimizing since for now the benefits of using this are clear cachewperformance

TODO [D] some old comments cachewperformance

CREATED: [2019-07-27]
logger.debug('inserting...')
from sqlalchemy.sql import text # type: ignore
from sqlalchemy.sql import text
nulls = ', '.join("(NULL)" for _ in bound)
st = text("""INSERT INTO 'table' VALUES """ + nulls)
engine.execute(st)
shit. so manual operation is quite a bit faster??
but we still want serialization :(
ok, inserting gives noticeable lag
thiere must be some obvious way to speed this up...
pylint: disable=no-value-for-parameter
logger.debug('inserted...')

[D] * bugs/stability cachew

generally bugs not a big problem since the cache is temporary & optional, worst case can delete or disable
although need to make sure there are not data consistency issues… maybe expire cache on calendar?

TODO [A] rename 'table' to 'data'? to avoid quoting issues cachew

CREATED: [2021-03-09]

TODO [B] warn when it's running under tests? not sure cachew

CREATED: [2020-08-22]

STRT [C] hmm, on first initialisation in case of error it shouldn't initialise cache.. cachewpromnesia

CREATED: [2019-08-11]

[2021-03-20] ok, it behaves correctly so not such bit issue… cachewpromnesia

TODO [C] add a test for schema change cachew

CREATED: [2020-10-15]

TODO [C] hmm, with overlays, module gets a weird prefix.. cachewhpi

CREATED: [2020-10-09]

TODO [C] seems that it may delete whole directories?? cachew

CREATED: [2020-07-31]

TODO [C] hmm… can generate shitty names? cachew

CREATED: [2021-02-21]
~/.cache/my/my.core.core_config:test_cachew.\<locals\>.cf

TODO [B] I guess binder for namedtuples is kinda a separate thing as it could be used separately for 'pickling' cachew

CREATED: [2019-08-03]

STRT [B] ok, thinking about default paths cachew

CREATED: [2020-07-31]

Ok, so in 99% of cases it's enough to use the default directory, this will make everything much easier.

  • [ ] sometimes you'd want to share a cache between computers? Make sure it works over a symlink?

It's annoying to have cache settings in every data provider and in 99% the default is fine.
So have a global HPI cache setting.

  • [ ] unclear how to propagate the cache directory down to providers (e.g. reddit. ugh)

But would be nice to be able to customize the cache in advance.
Maybe, set the attribute to the function? Seems good enough?

  • [X] possibly need to create the parent dir automatically?
  • [ ] /var/tmp/cachew is better as the default? surfives through
  • [ ] if the path is relative, to it relatively to base dir, not cwd

[D] related cachewpythonhpi

TODO [A] [2021-03-08] calpaterson/pyappcache: A library for application-level caching cachew

Pyappcache is a library to make it easier to use application-level caching in Python.

    Allows putting arbitrary Python objects into the cache
    Uses PEP484 type hints to help you typecheck cache return values
    Supports Memcache, Redis and SQLite

TODO [2019-10-01] github commits integrate well with my. package. also could demonstrate cachew? cachew

STRT [B] [2019-10-01] demonstrate cachew on pdfs? cachewhpi

Jump to search, settings & sitemap