karlicoss/HPI

Add core.query/serialize?

purarue opened this issue · 15 comments

Not sure if this is something you'd be interested in having in the core here/in utils -- it's the leftover helper modules I created while maintaining my fork.

As I was writing HPI_API I added a core.serialize and core.query to my HPI fork as well.

It is quite magical (it just resolves the function name from a string), but it lets me do simple queries pretty easily, and play around with pipelines in the shell without having to worry about how to interop with Python or dump something from the REPL.

https://github.com/seanbreckenridge/HPI/blob/master/my/utils/query.py
https://github.com/seanbreckenridge/HPI/blob/master/my/utils/serialize.py

and then a script that exposes that info:

https://github.com/seanbreckenridge/HPI/blob/master/scripts/hpi_query
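The "resolve the function name from a string" trick mentioned above could be sketched roughly like this (a hypothetical sketch, not the actual my/utils/query.py implementation):

```python
import importlib
from typing import Any, Callable

def resolve_function(qualname: str) -> Callable[..., Any]:
    """Resolve a dotted path like 'my.mpv.history' into a callable.

    The last component is the function name; everything before it
    is the module to import.
    """
    module_name, _, func_name = qualname.rpartition(".")
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError(f"{qualname} is not callable")
    return func
```

The resolved callable can then be invoked and its results serialized to JSON for the shell pipeline.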

As an example, 5 songs I listened to recently:

$ hpi_query my.mpv history | jq -r '.[] | .path' | grep -i 'music' | head -n 5
/home/sean/Music/Radiohead/1994 - The Bends/11 - Sulk.mp3
/home/sean/Music/Radiohead/1994 - The Bends/02 - The Bends.mp3
/home/sean/Music/Nujabes/Nujabes - Metaphorical Music (2003) [V0]/10. Next View.mp3
/home/sean/Music/Earth, Wind & Fire/Earth Wind And Fire - Greatest Hits - [MP3-V0]/16 - After The Love Has Gone.mp3
/home/sean/Music/Darren Korb/Darren Korb - Transistor Original Soundtrack[2013] (V0)/14 - Darren Korb - Apex Beat.mp3

I also use this in my menu bar, to print how many calories I've eaten / how much water I've drunk today:

[screenshot: menu bar widget]

Like:

#!/bin/bash
# how much water I've had today

((BLOCK_BUTTON == 3)) && notify-send "$("${REPOS}/HPI/scripts/water-recent")"

HPI_QUERY="${REPOS}/HPI/scripts/hpi_query"
{
	"${HPI_QUERY}" --days 1 'my.food' 'water' | jq -r '.[] | .glasses'
	echo 0
} | datamash sum 1

I've even had some fun creating graphs like this in the terminal:

hpi_query my.food food | jq -r '.[] | "\(.on)\t\(.calories)"' | datamash groupby 1 sum 2 | sort | termgraph | head

2020/09/26: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1380.00
2020/09/27: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1150.00
2020/09/28: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1155.00
2020/09/29: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2200.00
2020/09/30: ▇▇▇▇▇▇▇▇▇▇▇ 870.00
2020/10/01: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1070.00
2020/10/02: ▇▇▇▇▇▇ 505.00
2020/10/03: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 995.00
2020/10/04: ▇▇▇▇▇▇▇▇ 640.00

Could probably remove the click/simplejson dependencies; they were just thrown in there because it was fast and I always have those installed anyway.

Maybe query could be a subcommand on the hpi script, instead of installing it as a separate script?

Yeah, looks awesome! Also been pondering about something like this!
Could also have something like --fzf mode integration for even more interactivity.

Some simple terminal plots would also be useful for sanity checks (maybe not in doctor, but a separate mode like hpi stat or something like that). Will dig up my notes later on libraries/tools I bookmarked.

simplejson

IIRC it's just a backport of the builtin json module? I.e. json.dump/loads has a cls argument that can be used to pass custom codecs... but I might be wrong.
But yeah, generally I think it would be good to figure out the best (robust & flexible) way to serialize/deserialize stuff into JSON -- it could be helpful for many things.

click

I think it actually makes sense to adopt click? It's pretty mature, and in my understanding it's a bit more 'composable' than argparse? But maybe separately/later, so it stays consistent for now.


I'll take a closer look tomorrow!

IIRC it's just backported json (builtin) module

It has a couple of extra options/flags that make it nice, especially its handling of namedtuples/dataclasses, and it lets you pass a function instead of having to write a custom JSON encoder like the stdlib requires.

But it's probably better if we just wrote a custom JSON encoder using the builtin json module, as it brings in fewer unknowns, and gives us more fine-tuning ability with special cases for namedtuples/dataclasses or pandas frames.
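A minimal sketch of what such a stdlib-only encoder could look like (illustrative only, not the eventual HPI implementation):

```python
import dataclasses
import json
from datetime import datetime
from typing import Any

class HPIEncoder(json.JSONEncoder):
    """Stdlib-only encoder handling dataclasses, datetimes and sets.

    NOTE: namedtuples can't be intercepted in default() -- json treats
    them as plain tuples before default() is ever called -- so they'd
    need to be converted (e.g. via _asdict()) before dumping.
    """
    def default(self, o: Any) -> Any:
        if dataclasses.is_dataclass(o):
            return dataclasses.asdict(o)
        if isinstance(o, datetime):
            return o.isoformat()
        if isinstance(o, (set, frozenset)):
            return sorted(o)
        return super().default(o)  # raises TypeError for unknown types
```

Used as `json.dumps(obj, cls=HPIEncoder)`.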

click

I've used it for years now; I think it's good. The only downside is that since you use decorators most of the time, doing anything especially complicated requires you to use global vars or write your own decorators.

This script is simple enough that I can just port it to argparse with an epilog, but it's something to consider.

I'll work on a PR sometime later this week

But it's probably better if we just wrote a custom JSON encoder using the builtin json module, as it brings in fewer unknowns, and gives us more fine-tuning ability with special cases for namedtuples/dataclasses or pandas frames.

Yeah, I guess the reason why I was thinking of library is because it seems to be reinvented all over again, e.g.

HPI/my/core/common.py

Lines 534 to 554 in 02a9fb5

def asdict(thing) -> Json:
    # todo primitive?
    # todo exception?
    if isinstance(thing, dict):
        return thing
    import dataclasses as D
    if D.is_dataclass(thing):
        return D.asdict(thing)
    # must be a NT otherwise?
    # todo add a proper check.. ()
    return thing._asdict()

# todo not sure about naming
def to_jsons(it) -> Iterable[Json]:
    from .error import error_to_json  # prevent circular import
    for r in it:
        if isinstance(r, Exception):
            yield error_to_json(r)
        else:
            yield asdict(r)

Or in cachew I also have some code that essentially serializes dataclasses (although it's a bit different I guess since it also 'flattens' them out onto the database). But I guess can always find something better & switch later, probably more important to have usecases/tests so we can figure out what we want from it and avoid regressions.

Oh btw, there is also guess_datetime:

HPI/my/core/common.py

Lines 499 to 509 in 0585cc4

# experimental, not sure about it..
def guess_datetime(x: Any) -> Optional[datetime]:
    # todo hmm implement without exception..
    try:
        d = asdict(x)
    except:
        return None
    for k, v in d.items():
        if isinstance(v, datetime):
            return v
    return None
Kinda similar to datefunc? Although yours looks more elaborate..

seems to be reinvented all over again

Yeah, I've done the same in HPI_API, and my autotui lib and then here again.

I think simplejson also just does the _asdict check, but it has a couple of extra flags and probably handles it more nicely than another reimplementation would.

If core.serialize replaces all these other serialization/helper functions, it could probably also be used in HPI_API.

I'll probably try implementing a custom JSON encoder to see how much more work it is, else use simplejson.

I did quite a bit of research back when I was making the autotui library, since that's essentially a JSON encoder/decoder which attaches/prompts for types.

ujson is good for speed, but it often messes up unicode chars and doesn't handle dataclasses/custom types out of the box.

simplejson sits somewhere a bit above the regular json module, handling namedtuples/dataclasses and giving you a nice interface to extend.

One I have yet to try but looks pretty good is orjson, which describes itself as: fast, correct Python JSON library supporting dataclasses, datetimes, and numpy

orjson supports CPython 3.6, 3.7, 3.8, 3.9, and 3.10. It distributes x86_64/amd64 and aarch64/armv8 wheels for Linux and macOS. It distributes x86_64/amd64 wheels for Windows. orjson does not support PyPy

Is this a problem for you? I haven't come across many people who aren't using CPython, so I think it's fine if orjson is used as the default -- it means less of the custom serialization code has to be in HPI. If anyone is using numpy, I believe they're under similar restrictions (i.e. CPython).

Unlike simplejson (which has been around for 15 years and seems to be mostly in maintenance mode), orjson is still in active development and they keep adding more types, which may be nice to have. I'd still be fine with using simplejson, if you'd prefer a pure-python solution.

Could be done by printing a warning like logzero does, and defaulting back to a basic json.dumps if it's not installed.
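That optional-dependency pattern could look something like this (a sketch; the warning text and function name are made up):

```python
import json
import warnings
from typing import Any

try:
    import orjson  # type: ignore

    def dumps(obj: Any) -> str:
        # orjson returns bytes, so decode for a consistent str interface
        return orjson.dumps(obj).decode("utf-8")

except ModuleNotFoundError:
    warnings.warn("orjson not installed, falling back to stdlib json "
                  "(some types may not serialize)")

    def dumps(obj: Any) -> str:
        return json.dumps(obj, default=str)  # crude fallback: stringify unknowns
```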

Currently, using orjson, I'm pretty much done, and it's quite compact:

seanbreckenridge@0593c69

Regarding

Yeah, I guess the reason why I was thinking of library is because it seems to be reinvented all over again

I think the reason is that it's not often super simple to convert an arbitrary object: it requires checks for recursive calls and dealing with unknowns, you have to manually list out all the primitives and container types, and deal with state while traversing an object -- which is pretty much rewriting a JSON parser. So you implement what you think the common case is, and it works almost all the time.

I think it's possible to have a function which checks a bunch of primitives; it can just become pretty complicated pretty fast. Even libraries that do this -- orjson and simplejson -- don't support complex numbers or frozenset, even though they're builtins. The line always seems to be drawn right at what you see as reasonable/where you stop implementing.

Left a note:

# note: it would be nice to combine the 'my.core.common.asdict' and _orjson_default to some function
# that takes a complex python object and returns JSON-compatible fields, while still
# being a dictionary.
# a workaround is to encode with dumps below and then json.loads it immediately
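The workaround mentioned in the note -- encoding and then immediately json.loads-ing -- could look roughly like this (shown with stdlib json to stay self-contained; with orjson it would be `json.loads(orjson.dumps(obj, default=...))` instead, and the `_default` handler here is a stand-in):

```python
import dataclasses
import json
from typing import Any

def to_json_dict(obj: Any) -> Any:
    """Round-trip through the encoder to get plain JSON-compatible
    Python values (dicts/lists/strs/numbers) instead of custom types."""
    def _default(o: Any) -> Any:
        if dataclasses.is_dataclass(o):
            return dataclasses.asdict(o)
        return str(o)  # crude last resort, for illustration only
    return json.loads(json.dumps(obj, default=_default))
```

This trades an extra parse for not having to reimplement the encoder's traversal logic.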

guess_datetime

The reason datefunc exists is so that we're able to sort iterators by date.

The idea for datefunc was to reduce overhead by checking which attribute to use on just the first item in the generator -- it then 'returns a function which, when called with this object, returns the date'. That can then be passed to sorted, instead of having to find the datetime value on every object in the iterator.

I've since realized that wouldn't work for generators which yield mixed types, so the options are to:

  1. Not support mixed-type lists, failing when the datetime isn't on the same attribute for every object
  2. use something closer to guess_datetime, which requires you to search over every item to find the datetime
  3. relatively more complicated -- maintain a global dictionary mapping:
{class: function which, when called on an instance of this type, returns the date-like object}

That would reduce the searching for a date-like object to just once per type.

The third is probably more efficient, but I'm leaning towards implementing a combination of the first and second:

First, try what I did originally: check the first item and assume the rest of the generator follows a similar schema. If there's an error while doing that, restart and approach it more like guess_datetime, manually searching for the DateLike field on each object.
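That hybrid fast-path/slow-path approach could be sketched like this (hypothetical helper names; the real datefunc implementation may differ):

```python
from datetime import datetime
from typing import Any, Iterable, List, Optional

def _find_dt_attr(obj: Any) -> Optional[str]:
    """Slow path: search an object's fields for a datetime value."""
    d = obj._asdict() if hasattr(obj, "_asdict") else vars(obj)
    for k, v in d.items():
        if isinstance(v, datetime):
            return k
    return None

def _get_dt(obj: Any) -> datetime:
    attr = _find_dt_attr(obj)
    if attr is None:
        raise ValueError(f"no datetime field found on {obj!r}")
    return getattr(obj, attr)

def sorted_by_datetime(it: Iterable[Any]) -> List[Any]:
    items = list(it)  # materialize so a failed fast path can be retried
    if not items:
        return []
    # fast path: assume every item keeps its datetime on the same attribute
    attr = _find_dt_attr(items[0])
    if attr is not None:
        try:
            return sorted(items, key=lambda o: getattr(o, attr))
        except (AttributeError, TypeError):
            pass  # mixed schemas -- fall back to per-item search
    # slow path: search each item individually, guess_datetime-style
    return sorted(items, key=_get_dt)
```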

This is also only considering sorting by date -- eventually we may want to be able to specify a key and sort by that instead. Results sorted by datetime just seem like a useful thing to be able to query, especially when trying to do queries from the CLI/extract some useful info from your data.

I think I should be able to use my.core.common.asdict in datefunc, instead of the dir hack I did earlier.

Nice, orjson looks good!

Is this a problem for you? I haven't come across many people who aren't using CPython, so I think it's fine if orjson is used as the default -- it means less of the custom serialization code has to be in HPI. If anyone is using numpy, I believe they're under similar restrictions (i.e. CPython).

Yeah, me neither. And yeah, PyPy seems like a fairly esoteric requirement -- worst case it's always possible to fall back onto something PyPy-compatible (like you did with json in your commit).

I think the reason is because its not often super simple to convert any object, it requires checks for recursive calls and dealing with unknowns

Yeah true, it's always domain dependent as well. I guess in our case recursive objects are fairly rare (I can't come up with any?) -- so mainly I meant handling ADT-like types.

I guess for now we're only concerned with serializing? Good to keep deserializing in mind too, although it's even trickier.
Seems that there is [JSONDecoder.object_hook](https://docs.python.org/3/library/json.html#json.JSONDecoder.object_hook), which might work to some extent.

Maybe in principle modules could also provide 'extra' type bindings (so extra default/object_hook) if they do some complicated types? That way would be possible to keep core simple.

so mainly I meant handling ADT-like types

Yeah, this is true, but sometimes there could be an ADT with an attribute which is itself a NamedTuple/dataclass, and then there has to be additional code to handle that.

Maybe in principle modules could also provide 'extra' type bindings

Perhaps, but then you'd have to import additional modules to check whether the hook exists. I had two other ideas (could potentially implement both; they're easy to do):

Edit: oh, yeah, it seems that's essentially what you just said -- I couldn't make out what 'so extra default/object_hook' meant before I wrote out my own explanation

  1. Just like _asdict with namedtuples, any NT/dataclass that's defined in HPI could optionally implement a _serialize function, which returns a serialized version of the data with any complex types removed/handled. That attribute could be checked for in the _orjson_default function.
  2. The user can optionally pass an additional default function to dumps, which is used in addition to _orjson_default. Something like:
def _orjson_default(obj: Any, default: Optional[Callable[[Any], Any]] = None) -> Any:
    # ... other types handled above ...
    if hasattr(obj, '_serialize') and callable(obj._serialize):
        return obj._serialize()
    if default is not None:
        return default(obj)  # this function has to raise a TypeError if it can't serialize
    raise TypeError(f"could not serialize {obj!r}")

The idea for datefunc was to reduce overhead by just checking what attribute to use for the first item in the generator

Oh nice! Also something I thought about, but didn't get to do.
I think mixed types are useful, at the very least because of Exception (error handling). But also, for some data providers you're merging multiple different Python types which are the same in terms of duck typing (though they might have different attributes, so different sets of fields).

If theres an error while doing that, restart

I guess we need to be careful here because of iterables. It's possible to use itertools.tee, but that means consuming more memory (not an issue in most cases, but still). Or alternatively we'd need to actually call the iterable 'provider' again to get a fresh one?
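One way to inspect the first item without tee, without materializing everything, and without re-calling the provider is to pull it off and re-attach it (a sketch):

```python
import itertools
from typing import Any, Iterable, Iterator, Tuple

def peek(it: Iterable[Any]) -> Tuple[Any, Iterator[Any]]:
    """Pull the first item off an iterable, then return it along with
    an iterator equivalent to the original (first item re-attached)."""
    iterator = iter(it)
    first = next(iterator)  # raises StopIteration if empty
    return first, itertools.chain([first], iterator)
```

Unlike tee, this only buffers the single peeked item, though it still doesn't help if the fast path fails halfway through and the whole stream needs a second pass.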

relatively more complicated -- maintain a global dictionary which

Yeah, I think it's the most robust, even if somewhat complicated? One thing to keep in mind is that at this point the actual types might be 'erased' (if we're processing the 'json objects'), but in that case maybe we can use 'key sets' as a proxy? (not sure about the performance though).

But maybe a hybrid approach you're suggesting is good to start with. Perhaps it could accept a 'hint' -- so if someone really cares about the performance for a particular usecase they could provide it.

This is also just only considering sorting by date -- eventually we may want to be able to specify a key and sort by that instead. results sorted by datetime just seems like a useful thing to be able to query by, especially when trying to do queries from the CLI/extract some useful info from your data

Yep! I guess makes sense to prototype on datetimes, maybe possible to generalize later.

any NT/dataclass thats defined in HPI could optionally implement a _serialize

I guess classes are often 'forwarded' from the original modules (which act like 'data access layers'), so this would require setting attributes dynamically on those classes?
Alternatively, we could either define a hook in the module (e.g. it could return an extra dict of type -> serializer mappings), or just allow explicitly registering the hooks from within the module (i.e. from my.core.serialize import register_json_hook; register_json_hook(MyType, ...)). Not sure which is best?

classes are often 'forwarded' from the original modules

Ah right. Could also use a combination of all three of these approaches. The hook approach also seems fine; it would just require looping over the (I assume top-level) dict in the default function.

Defining a hook in the source module seems a bit too hacky/magical, as my.core.serialize may need importlib machinery then?

Will try to implement

would just require looping over the (I assume top-level) dict in the module in default function.

Yeah, something like this? You mean the top-level dict in my.core.serialize?
I guess in principle it would be nice to only have these custom types registered for the duration of the serializing call -- e.g. perhaps possible to achieve by decorating entries() or something like that, to minimize pollution of the global serializers namespace (and potential conflicts?).
But probably in most cases any HPI call only uses a single data provider, so hopefully it won't be an issue.

Defining a hook in the source module seems a bit too hacky/magical,

Yeah good point -- the upside would be keeping it a bit more declarative, but maybe too complicated for now. Hopefully in 90% cases it will be possible to get away with default serialization anyway.

You mean the top level dict in my.core.serialize

Yeah

I guess in principle would be nice to only have these custom types there over the course of the call of the serializing

Yeah, maybe contextlib works here? I haven't used with blocks much personally.

Probably won't implement register_json_hook yet, because most of the time it won't be needed (agreed, in 90% of cases it will be possible to get away with default serialization -- I've been using this for months, and HPI_API has a similar method). Also, maybe my.core.init will have some shared hook machinery with register_json_hook.

yeah. maybe contextlib works here? I haven't used with blocks much personally

Yeah, via contextlib -- with here would be a bit intrusive. So ideally you could decorate the function with a small hint so it doesn't change its implementation.

Also maybe my.core.init will have some shared hook machinery with the register_json_hook.

Yep, makes sense!

Switched HPI_API over to use my.core.serialize instead:

seanbreckenridge/HPI_API@8a4d77d

The only remaining task for this issue is to create a CLI for query, combining my.core.query.select and my.core.serialize.dumps (and probably creating a couple of helper functions in my.core.query to glue the two together).

I think I can implement that well enough in argparse for now; the switch to click can come later.