Add core.query/serialize?
purarue opened this issue · 15 comments
Not sure if this is something you'd be interested in having in the core here/in utils; these are leftover helper modules I created while maintaining my fork.
As I was writing HPI_API, I added a core.serialize and core.query to my HPI fork as well.
It is quite magical (it just resolves the function name from a string), but it lets me do some simple queries pretty easily and play around with pipelines in the shell, without having to worry about how to interop with Python or dump something from the REPL.
https://github.com/seanbreckenridge/HPI/blob/master/my/utils/query.py
https://github.com/seanbreckenridge/HPI/blob/master/my/utils/serialize.py
and then a script that exposes that info:
https://github.com/seanbreckenridge/HPI/blob/master/scripts/hpi_query
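The 'magic' is essentially just dynamic import plus attribute lookup -- roughly something like this (a sketch; the real logic lives in query.py above):

import importlib
from typing import Any, Callable, Iterator

def resolve(module_name: str, func_name: str) -> Callable[[], Iterator[Any]]:
    # e.g. resolve('my.mpv', 'history') returns the history() function
    module = importlib.import_module(module_name)
    return getattr(module, func_name)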
As an example, here are 5 songs I listened to recently:
$ hpi_query my.mpv history | jq -r '.[] | .path' | grep -i 'music' | head -n 5
/home/sean/Music/Radiohead/1994 - The Bends/11 - Sulk.mp3
/home/sean/Music/Radiohead/1994 - The Bends/02 - The Bends.mp3
/home/sean/Music/Nujabes/Nujabes - Metaphorical Music (2003) [V0]/10. Next View.mp3
/home/sean/Music/Earth, Wind & Fire/Earth Wind And Fire - Greatest Hits - [MP3-V0]/16 - After The Love Has Gone.mp3
/home/sean/Music/Darren Korb/Darren Korb - Transistor Original Soundtrack[2013] (V0)/14 - Darren Korb - Apex Beat.mp3
I also use this in my menu bar, to print how many calories I've eaten/how much water I've drunk today:
Like:
#!/bin/bash
# how much water I've had today
((BLOCK_BUTTON == 3)) && notify-send "$("${REPOS}/HPI/scripts/water-recent")"
HPI_QUERY="${REPOS}/HPI/scripts/hpi_query"
{
"${HPI_QUERY}" --days 1 'my.food' 'water' | jq -r '.[] | .glasses'
echo 0
} | datamash sum 1
I've even had some fun creating graphs like this in the terminal:
hpi_query my.food food | jq -r '.[] | "\(.on)\t\(.calories)"' | datamash groupby 1 sum 2 | sort | termgraph | head
2020/09/26: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1380.00
2020/09/27: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1150.00
2020/09/28: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1155.00
2020/09/29: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2200.00
2020/09/30: ▇▇▇▇▇▇▇▇▇▇▇ 870.00
2020/10/01: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1070.00
2020/10/02: ▇▇▇▇▇▇ 505.00
2020/10/03: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 995.00
2020/10/04: ▇▇▇▇▇▇▇▇ 640.00
Could probably remove the click/simplejson dependencies; they were just thrown in there because it was fast and I always have those installed anyway.
Maybe query could be a subcommand on the hpi script, instead of installing it as a separate script?
Yeah, looks awesome! I've also been pondering something like this!
Could also have something like an --fzf mode integration for even more interactivity.
Some simple terminal plots would also be useful for sanity checks (maybe not in doctor, but a separate mode like hpi stat or something like that). Will dig up my notes later on libraries/tools I bookmarked.
simplejson
IIRC it's just a backported json (builtin) module? I.e. json.dump/loads have a cls argument that can be used to pass custom codecs... but I might be wrong.
But yeah, generally I think it would be good to figure out the best (robust & flexible) way to serialize/deserialize stuff into JSON -- it could be helpful for many things.
click
I think actually maybe it makes sense to adopt click? It's pretty mature, and in my understanding it's a bit more 'composable' than argparse? But maybe separately/later, so it's consistent for now.
I'll take a closer look tomorrow!
IIRC it's just a backported json (builtin) module
It has a couple of extra options/flags that make it nice, especially its handling of namedtuples/dataclasses, and it lets you use a plain function instead of having to write a custom JSON encoder class like the stdlib's.
But it's probably better if we just write a custom JSON encoder using the builtin json module, as it brings in fewer unknowns and gives us more fine-tuning ability, with special cases for namedtuples/dataclasses or pandas frames.
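For instance, a minimal sketch of such an encoder using the stdlib cls mechanism (HPIEncoder is a hypothetical name):

import json
from dataclasses import asdict, is_dataclass
from datetime import date, datetime
from typing import Any

class HPIEncoder(json.JSONEncoder):
    def default(self, o: Any) -> Any:
        if isinstance(o, (datetime, date)):
            return o.isoformat()
        if is_dataclass(o) and not isinstance(o, type):
            return asdict(o)
        # note: namedtuples never reach default() -- as tuple subclasses
        # they're already encoded as JSON arrays by the base encoder
        return super().default(o)

print(json.dumps({'when': datetime(2020, 10, 4)}, cls=HPIEncoder))

The namedtuple caveat is exactly the kind of special case that would need extra handling (e.g. calling _asdict before encoding).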
click
I've used it for years now, I think it's good. The only downside is that since you use decorators most of the time, doing anything especially complicated requires you to use global vars/write your own decorators.
This script is simple enough that I can just port it to argparse with an epilog, but it's something to consider.
I'll work on a PR sometime later this week
But it's probably better if we just write a custom JSON encoder using the builtin json module, as it brings in fewer unknowns and gives us more fine-tuning ability, with special cases for namedtuples/dataclasses or pandas frames.
Yeah, I guess the reason why I was thinking of a library is because this seems to get reinvented all over again, e.g.
Lines 534 to 554 in 02a9fb5
Or in cachew
I also have some code that essentially serializes dataclasses (although it's a bit different I guess since it also 'flattens' them out onto the database). But I guess can always find something better & switch later, probably more important to have usecases/tests so we can figure out what we want from it and avoid regressions.
Oh btw, there is also guess_datetime:
Lines 499 to 509 in 0585cc4
Similar to your datefunc? Although yours looks more elaborate... seems to be reinvented all over again.
Yeah, I've done the same in HPI_API, and my autotui lib, and then here again.
I think simplejson also just does the _asdict check, but it has a couple of extra flags and probably handles it nicer than another reimplementation.
If core.serialize replaces all these other serialization/helper functions, it could probably also be used in HPI_API.
I'll probably try implementing a custom JSON encoder to see how much more work it is, else use simplejson.
I did quite a bit of research back when I was making the autotui library, since that's essentially a JSON encoder/decoder which attaches/prompts for types.
ujson is good for speed, but it often messes up unicode chars and doesn't handle dataclasses/custom types out of the box.
simplejson sits somewhere a bit above the regular json module, handling namedtuples/dataclasses and giving you a nice interface to extend.
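For illustration, the namedtuple handling is behind the namedtuple_as_object flag (on by default):

import simplejson
from collections import namedtuple

Point = namedtuple('Point', 'x y')
# encoded as a JSON object via _asdict, instead of a plain array
print(simplejson.dumps(Point(1, 2), namedtuple_as_object=True))  # {"x": 1, "y": 2}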
One I have yet to try but looks pretty good is orjson, which describes itself as: fast, correct Python JSON library supporting dataclasses, datetimes, and numpy
orjson supports CPython 3.6, 3.7, 3.8, 3.9, and 3.10. It distributes x86_64/amd64 and aarch64/armv8 wheels for Linux and macOS. It distributes x86_64/amd64 wheels for Windows. orjson does not support PyPy
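The out-of-the-box behaviour, for reference (a minimal sketch):

import orjson
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Visit:
    url: str
    dt: datetime

# orjson serializes dataclasses and datetimes natively, and returns bytes
print(orjson.dumps(Visit('https://example.com', datetime.now(timezone.utc))).decode())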
Is this a problem for you? I haven't come across many people who aren't using CPython, so I think it's fine if orjson is used as the default - it means less of the custom serialization code has to be in HPI. If anyone is using numpy, I believe they're under similar restrictions (i.e. CPython).
Unlike simplejson (which has been around for 15 years and seems to be mostly in maintenance mode), orjson is still in development and they keep adding more types, which may be nice to have. I'd still be fine with using simplejson if you'd prefer a pure-Python solution.
Could be done by sending a warning message like logzero does, and defaulting back to a basic json.dumps if it's not installed.
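Something along these lines, assuming dumps/_default are the (hypothetical) module-level names:

from typing import Any

def _default(obj: Any) -> Any:  # hypothetical shared fallback for unknown types
    raise TypeError(f"Could not serialize {type(obj)}")

try:
    import orjson  # optional dependency

    def dumps(obj: Any) -> str:
        return orjson.dumps(obj, default=_default).decode('utf-8')
except ModuleNotFoundError:
    import json
    import warnings

    warnings.warn("orjson not installed, falling back to the stdlib json module")

    def dumps(obj: Any) -> str:
        return json.dumps(obj, default=_default)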
Currently using orjson. I'm pretty much done, and it's quite compact:
Regarding
Yeah, I guess the reason why I was thinking of a library is because this seems to get reinvented all over again
I think the reason is that it's not often super simple to convert any object: it requires checks for recursive calls and dealing with unknowns, and you have to manually list out all the primitives and container types while dealing with state as you traverse an object, which is pretty much rewriting a JSON encoder -- so you implement what you think the common case is, and it works almost all the time.
I think it's possible to have that function which checks a bunch of primitives; it can just become pretty complicated pretty fast. Even libraries that do this - orjson and simplejson - don't support complex numbers or frozenset, even though they're builtins. The line always seems to be drawn right at whatever you see as reasonable/where you stop implementing.
Left a note:
# note: it would be nice to combine the 'my.core.common.asdict' and _orjson_default to some function
# that takes a complex python object and returns JSON-compatible fields, while still
# being a dictionary.
# a workaround is to encode with dumps below and then json.loads it immediately
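i.e. the workaround spelled out (a sketch; dumps is the orjson-backed serializer discussed above):

import json
from typing import Any

def to_jsonable(obj: Any) -> Any:
    # round-trip through the serializer to get plain dicts/lists back
    return json.loads(dumps(obj))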
guess_datetime
The reason datefunc exists is so that we're able to sort iterators by date.
The idea for datefunc was to reduce overhead by just checking what attribute to use for the first item in the generator -- it then 'Returns a function which when called with this object returns the date'. That can then be passed to sorted, instead of having to find the datetime value on every object in the iterator.
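A rough sketch of that idea (the attribute detection here is simplified compared to the actual query.py):

from datetime import date, datetime
from typing import Any, Callable

def datefunc(first: Any) -> Callable[[Any], Any]:
    # inspect a single item, find its date-like attribute, and return a
    # key function suitable for sorted(..., key=...)
    attr = next(
        a for a in dir(first)
        if isinstance(getattr(first, a, None), (datetime, date))
    )
    return lambda obj: getattr(obj, attr)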
I've since realized that this wouldn't work for generators which have mixed types, so the options are:
- Not support mixed-type lists, failing when the datetime isn't on the same attribute for every object
- Use something closer to guess_datetime, which requires searching over every item to find the datetime
- Relatively more complicated -- maintain a global dictionary which:
{class: function which when called on an instance of this type returns the date-like object}
That would reduce the searching for a date-like object to just once per type.
The third is probably more efficient, but I'm leaning towards implementing a combination of the first and second:
First, try to just do what I did originally: check the first item and assume the rest of the generator follows a similar schema. If there's an error while doing that, restart and approach it more like guess_datetime, manually searching for the DateLike field on each object.
This is also only considering sorting by date -- eventually we may want to be able to specify a key and sort by that instead. Results sorted by datetime just seem like a useful thing to be able to query by, especially when trying to do queries from the CLI/extract some useful info from your data.
I think I should be able to use my.core.common.asdict in datefunc, instead of the dir hack I did earlier.
Nice, orjson looks good!
Is this a problem for you? I haven't come across many people who aren't using CPython, so I think it's fine if orjson is used as the default - it means less of the custom serialization code has to be in HPI. If anyone is using numpy, I believe they're under similar restrictions (i.e. CPython)
Yeah, me neither. And yeah, PyPy seems like a fairly esoteric requirement -- worst case it's always possible to fall back onto something PyPy-compatible (like you did with json in your commit).
I think the reason is that it's not often super simple to convert any object: it requires checks for recursive calls and dealing with unknowns
Yeah true, it's always domain dependent as well. I guess in our case recursive objects are fairly rare (I can't come up with any?) -- so mainly I meant handling ADT-like types.
I guess for now we're only concerned with serializing? Good to keep deserializing in mind too, although it's even trickier.
Seems that there is [JSONDecoder.object_hook](https://docs.python.org/3/library/json.html#json.JSONDecoder) which might work to some extent.
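e.g. a heuristic sketch of reviving datetimes on the way back in:

import json
from datetime import datetime
from typing import Any, Dict

def revive(d: Dict[str, Any]) -> Dict[str, Any]:
    # object_hook is called with every decoded JSON object, so ISO-format
    # strings can be turned back into datetimes (a lossy heuristic)
    for k, v in d.items():
        if isinstance(v, str):
            try:
                d[k] = datetime.fromisoformat(v)
            except ValueError:
                pass
    return d

print(json.loads('{"dt": "2020-10-04T12:00:00"}', object_hook=revive))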
Maybe in principle modules could also provide 'extra' type bindings (so an extra default/object_hook) if they use some complicated types? That way it would be possible to keep the core simple.
so mainly I meant handling ADT-like types
Yeah, this is true, but sometimes there could be an ADT which has an attribute that is itself a NamedTuple/dataclass, and then there has to be additional code to handle that.
Maybe in principle modules could also provide 'extra' type bindings
Perhaps, but then you'd have to import additional modules to check whether the hook exists. I had two other ideas (could potentially implement both; they're easy to do):
Edit: oh, yeah, it seems that's essentially what you just said; I couldn't make out what 'so extra default/object_hook' meant before I wrote out my own explanation.
- Just like _asdict with namedtuples, any NT/dataclass that's defined in HPI could optionally implement a _serialize function, which returns a serialized version of the data, with any complex types removed/handled. That attribute could be checked for in the _orjson_default function.
- The user can optionally pass an additional default function to dumps, which is used in addition to _orjson_default. Something like:
from typing import Any, Callable, Optional

def _orjson_default(obj: Any, default: Optional[Callable[[Any], Any]] = None) -> Any:
    # ... other types handled above by default ...
    if hasattr(obj, '_serialize') and callable(obj._serialize):
        return obj._serialize()
    if default is not None:
        return default(obj)  # this function has to raise a TypeError if it can't serialize
    raise TypeError(f"Could not serialize object of type {type(obj)}")
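Wiring that into orjson would then look something like this (orjson calls default with a single argument, hence the partial):

import functools
import orjson
from typing import Any, Callable, Optional

def dumps(obj: Any, default: Optional[Callable[[Any], Any]] = None) -> str:
    return orjson.dumps(
        obj,
        default=functools.partial(_orjson_default, default=default),
    ).decode('utf-8')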
The idea for datefunc was to reduce overhead by just checking what attribute to use for the first item in the generator
Oh nice! Also something I thought about, but didn't get to do.
I think mixed types are useful at the very least because of Exception (error handling). But also, for some data providers you're merging multiple different Python types that are the same in terms of duck typing (i.e. they might even have different attributes, so different sets of fields).
If there's an error while doing that, restart
I guess we need to be careful here because of iterables. Possible to use itertools.tee, but it means consuming more memory (not an issue in most cases, but still). Or alternatively you'd need to actually call the iterable 'provider' again to get a fresh one?
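A sketch of the tee-based restart (fast_key stands in for the first-item strategy, _guess_datetime for the scan-everything fallback; both names hypothetical):

from datetime import date, datetime
from itertools import tee
from typing import Any, Callable, Iterable, List

def _guess_datetime(obj: Any) -> Any:
    # slow path: scan every attribute for a date-like value
    for a in dir(obj):
        v = getattr(obj, a, None)
        if isinstance(v, (datetime, date)):
            return v
    raise TypeError(f"no date-like attribute on {obj!r}")

def sort_by_date(it: Iterable[Any], fast_key: Callable[[Any], Any]) -> List[Any]:
    attempt, backup = tee(it)  # tee buffers consumed items, so we can restart on failure
    try:
        return sorted(attempt, key=fast_key)
    except (AttributeError, TypeError):
        return sorted(backup, key=_guess_datetime)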
relatively more complicated -- maintain a global dictionary which
Yeah, I think it's the most robust even if somewhat complicated? One thing to keep in mind is that at this point the actual types might be 'erased' (if we're processing the 'json objects'), but in that case maybe we can use 'key sets' as a proxy? (not sure about the performance though)
But maybe the hybrid approach you're suggesting is good to start with. Perhaps it could accept a 'hint' -- so if someone really cares about the performance for a particular usecase they could provide it.
This is also only considering sorting by date -- eventually we may want to be able to specify a key and sort by that instead. Results sorted by datetime just seem like a useful thing to be able to query by, especially when trying to do queries from the CLI/extract some useful info from your data
Yep! I guess makes sense to prototype on datetimes, maybe possible to generalize later.
any NT/dataclass that's defined in HPI could optionally implement a _serialize
I guess classes are often 'forwarded' from the original modules (like 'data access layers'), so this would require setting attributes dynamically on these classes?
Alternatively, could either define a hook in the module (e.g. it could return an extra dict of type -> serializer mappings)... or just allow explicitly registering the hooks from within the module (i.e. from my.core.serialize import register_json_hook; register_json_hook(MyType, ...)). Not sure which is best?
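A sketch of what the explicit registration could look like (register_json_hook doesn't exist yet; everything here is hypothetical):

from typing import Any, Callable, Dict, Type

_JSON_HOOKS: Dict[Type[Any], Callable[[Any], Any]] = {}

def register_json_hook(type_: Type[Any], hook: Callable[[Any], Any]) -> None:
    _JSON_HOOKS[type_] = hook

def _default(obj: Any) -> Any:
    # consulted by the serializer for types it doesn't know about
    hook = _JSON_HOOKS.get(type(obj))
    if hook is not None:
        return hook(obj)
    raise TypeError(f"Could not serialize {type(obj)}")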
classes are often 'forwarded' from the original modules
Ah right. Could also use a combination of all three of these approaches. The hook approach also seems fine; it would just require looping over the (I assume top-level) dict in the module in the default function.
Defining a hook in the source module seems a bit too hacky/magical, as my.core.serialize may need importlib machinery then?
Will try to implement
would just require looping over the (I assume top-level) dict in the module in the default function.
Yeah, something like this? You mean the top-level dict in my.core.serialize?
I guess in principle it would be nice to only have these custom types there for the duration of the serializing call, e.g. perhaps possible to achieve via decorating entries() or something like that, to minimize the pollution of the global serializers namespace (and potential conflicts?).
But probably in most cases any HPI call only uses a single data provider, so hopefully it won't be an issue.
Defining a hook in the source module seems a bit too hacky/magical,
Yeah, good point -- the upside would be keeping it a bit more declarative, but maybe too complicated for now. Hopefully in 90% of cases it will be possible to get away with default serialization anyway.
You mean the top level dict in my.core.serialize
Yeah
I guess in principle it would be nice to only have these custom types there for the duration of the serializing call
Yeah, maybe contextlib works here? I haven't used with blocks much personally.
Probably won't implement the register_json_hook yet, because most of the time it won't be needed (agreed, in 90% of cases it will be possible to get away with default serialization; I've been using this for months, as HPI_API has a similar method). Also, maybe my.core.init will have some shared hook machinery with the register_json_hook.
Yeah, maybe contextlib works here? I haven't used with blocks much personally.
Yeah, via contextlib -- with here would be a bit intrusive. So ideally you can decorate the function with a small hint so it doesn't change its implementation.
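A sketch of that, reusing the hypothetical _JSON_HOOKS registry from above -- since @contextmanager produces a ContextDecorator, the same helper works both as a with block and as the 'small hint' decorator:

from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, Type

_JSON_HOOKS: Dict[Type[Any], Callable[[Any], Any]] = {}

@contextmanager
def json_hooks(extra: Dict[Type[Any], Callable[[Any], Any]]) -> Iterator[None]:
    # register serializers only for the duration of the block/call
    _JSON_HOOKS.update(extra)
    try:
        yield
    finally:
        for t in extra:
            _JSON_HOOKS.pop(t, None)

# usage as a decorator on a module's entries() (SomeCustomType hypothetical):
# @json_hooks({SomeCustomType: lambda o: o.to_dict()})
# def entries(): ...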
Also, maybe my.core.init will have some shared hook machinery with the register_json_hook.
Yep, makes sense!
Switched HPI_API over to use my.core.serialize instead:
The only remaining task for this issue is to create a CLI for query, to combine my.core.query.select and my.core.serialize.dumps (and probably create a couple of helper functions in my.core.query to glue the two together).
I think I can implement that well enough in argparse for now, and the switch to click can happen later.