Discussion: extending base HPI/overlays/overrides
karlicoss opened this issue · 10 comments
Related issues: #12, #46; but I think worth a separate discussion.
From my experience, it's pretty hard to predict how other people want to use their data:
- you might miss some attributes they care about
- some people want to be more paranoid or more defensive (e.g. timezone handling/None safety/etc)
- they might want to do some extra filtering
- they might want to merge in extra data sources or suppress existing
The list is endless! So it would be nice if it was possible to easily override small bits of HPI modules/mix in other data, etc.
The main goals are:
- low effort: ideally it should be a matter of a few lines of code to override something.
- good interop: e.g. ability to keep with the upstream, use modules coming from separate repositories, etc.
- ideally mypy friendly. This kind of means 'not too dynamic and magical', which is ultimately a good thing even if you don't care about mypy.
Once again, I see Emacs as a good role model. Everything is really decentralized, you have some core library, you have certain patterns that everyone follows... but apart from that the modules are mostly independent.
Many people still use 'monolith' base configurations (e.g. Doom/Spacemacs), because it's kinda convenient, as long as you have a maintainer. Arguably this is what this repository is at the moment, although it's obviously not as popular as Emacs distributions.
Emacs fits these goals well:
- low effort: the simplest way to confugure something is to override a variable in your config (thanks to dynamic scope, it 'just works').
You can even literally override whole functions as a means of quickly getting the behaviour you want. - good interop: yes, unless the developer broke some APIs, usually you can safely update the upstream module.
How to achieve this within HPI:
For combining independent modules together (say, something like my/youtube.py
and my/vimeo.py
coming from different repositories), the easiest is to use:
- symlinks (at least if you have just a few files/directories to mixin)
- namespace packages (more on them later)
Now, the tricky case is when you want to partially override something.
The first option is: fork & apply your modifications on top. For example: https://github.com/seanbreckenridge/HPI
- effort: very straightforward
- interop: merging with the upstream a bit manual, but if you use atomic commits & interactive rebase/cherry pick, should be manageable
- at least not any more magical than the original repository
Not sure if there is much to discuss here, so straight to the second and a more flexible option.
Once again, we rely on namespace packages! I'll just explain on a couple of examples, since it's easier.
-
example: mixing in a data source
The main idea is that you can override
all.py
(also some discussion here), and remove/add extra data sources.
Sinceall.py
is tiny, it's not a big problem to just copy/paste it and apply your changes.Some existing modules implemented with this approach in mind:
- https://github.com/karlicoss/HPI/blob/master/my/body/sleep/main.py
- https://github.com/karlicoss/HPI/blob/master/my/rss/all.py
- https://github.com/karlicoss/HPI/blob/master/my/twitter/all.py
- https://github.com/karlicoss/HPI/blob/master/my/github/all.py
(I still haven't settled on the naming.
all
andmain
as the entry point kind of both make sense) -
example:
my.calendar.holidays
As you can guess, this module is responsible for flagging days as holidays, by exposing
is_holiday
function.
As a reasonable default, it's just using the user's country of residence and flags national holidays.
However, you might also want to mix in your work vacation, and this is harder to make uniform for everyone, and it's a good candidate for a custom user override:import my.orig.my.calendar.holidays as M from my.orig.my.calendar.holidays import * is_holiday_orig = M.is_holiday def is_holiday(d: DateIsh) -> bool: # if it's a public holiday, definitely a holiday? if is_holiday_orig(d): return True # then check private data of days off work if is_day_off_work(d): return True return False M.is_holiday = is_holiday
Thanks to namespace packages, when I import
my.calendar.holidays
it will hit my override first, monkey patch theis_holiday
function, and expose the rest intact due toimport *
.
For example,hpi doctor my.calendar.holiday
will run against the override, reusing thestats
function or any other original functions.My personal HPI override has more example code, and I'll gradually move some stuff from this repository there as well
(for example most things in my.body don't make much sense for other people).
Things I'm not sure about with this approach:
- To import the 'original' module and monkey patch it, you need some alternative way of referencing it.
-
for now, I'm using a symlink (
/code/hpi-overlay/src/my/orig -> /code/hpi/src/my
)This is simple enough, but maintaining the symlink manually, referencing the 'original' package through
my.orig
.. meh.
Also not sure what to do if there are multiple overrides, e.g. 'chain' (although this is probably a bit extreme). -
it's probably possible to do something hacky and dynamic. E.g. take
__path__
, remove the first entry (which would be the 'override'), and then useimportlib
to import the 'original' module.The downside is that it's gonna be unfriendly to mypy (and generally a bit too magical?).
-
another option is to have some sort of dynamic 'hook', which is imported before anything else.
In the hook code, you import the original module and monkey patch. Same downsides, a bit too dynamic and not mypy friendly, but possible.
-
Caveats I know of:
-
packages can't contain
__init__
, otherwise the whole namespace package thing doesn't work -
you need to be careful about the namespace package resolution order. It seems that the last installed package will be the last in the import order.
-
so you'd need to run
pip install -e /path/to/override
and thenpip install -e /path/to/original
(even if it's already installed). -
another option is to reorder stuff in
~/.local/lib/python3.x/site-packages/easy-install.pth
manually, but it's not very robust either (although at least it clearly shows the order)
hpi doctor my.module
displays some helpful info, but it's still easy to forget/mess it up by accident.$ hpi doctor my.calendar.holidays ✅ import order: ['/code/hpi-overlay/src/my', '/code/hpi/my']
-
-
import *
doesn't import functions that start from the underscore ('private').Possible to get around this dynamically, but would be nice to cooperate with mypy somehow..
Happy to hear suggestions and thoughts on that. Once there's been some discussion, I'll move this to doc/
, perhaps.
TODOS:
- also thought that it should e possible to reuse the configuration in
~/.config/my
as the 'default' overlay. In fact, treating it like a proper namespace package (at the moment it's a bit of dynamic hackery) might make everything even cleaner and simpler. - find some good tutorial on monkey patching and link? Wouldn't want to duplicate the efforts twice..
- add some examples of motivation for overrides, just for documentation purposes
- update docs here https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#addingmodifying-modules
@seanbreckenridge would be interested to hear your thoughts!
Agree that emacs is a good model to use here.
Since all.py is tiny, it's not a big problem to just copy/paste it and apply your changes
Definitely think adding all.py
for larger modules is good regardless, using each of the other files as sources gives modules more structure
This is simple enough, but maintaining the symlink manually, referencing the 'original' package through my.orig
Yeah, I agree this is a somewhat meh solution, though probably the easiest for the user and has the lower barrier to entry. It also doesn't require a lot of custom python code to implement.
Personally never been a huge fan of configuration based on symlinks (though I do know some people live by it, so technically its a fine solution)
it's probably possible to do something hacky and dynamic. E.g. take
__path__
, remove the first entry (which would be the 'override'), and then use importlib to import the 'original' module.
Feel like this causes similar issues to what the pre-pip DAL was, and now it'd be up to the user to debug this if they're trying to add code on top of HPI
another option is to have some sort of dynamic 'hook', which is imported before anything else.
Dont think you'd want to include an entire plugin manager, but I think this pattern is at least better than using importlib -- its more visible to a new user and 'overriding' a dynamic function which just has pass
makes more sense to me than inspecting the import path. It does add some boilerplate to every module, but that's just what this solution would require.
Despite my disdain for symlinks, I do think they're the easiest solution here. I'm trying to balance:
- How easy it is for someone to change something without forking
- At the beginning, you can just use individual symlinks for personal modules/directories, so moving on to an overlay which uses a symlink to the original repo seems like natural progression
- How much boilerplate a hook/function-based system would add to HPI. May make modules less readable and not as 'standalone'
I'm leaning towards the former, as it gives the user more flexibility.
At some point, even an overlay may not give you enough customizability over core components/functionality you may want to change. If that cant be fixed by a monkey patch, thats when you fork, I suppose.
Dont think you'd want to include an entire plugin manager
Yeah, at least certainly not reinvent from scratch!
I guess another important principle I keep in mind is that there shouldn't be any 'hard' requirements for base modules like special base classes/interfaces/etc -- should be as code agnostic and rely on common Python mechanisms as long as possible. That way it would be easy to plug in custom stuff with no hassle. So I'd avoid plugin systems that try to impose such structure, but would be cool to try something that's can simplify monkey patching (e.g. one example I like is patchy).
In that sense all of the approaches above: symlinks/overrides/dynamic imports work, so it's up to the 'downstream' user what to choose. But still would be nice to have some 'natural' way of doing this.
Oh, just recalled, another issue with symlinks is that it's not very pip install
friendly (when it's non editable). I guess it's not that big of a problem for an overlay, but might be annoying when HPI overlay is used as a dependency (or even during continuous integration). But also possible to work around by putting a symlinks directly in site-packages/hpi-overlay
as a 'last resort' measure..
Related: I've figured out how to use default my/config.py
both for the actual stubs, and as a hook to load the user config if it exists, and I quite like it, makes things even more simple 9551c89
If someone has an overlay along with their own config.py
stub, and they want to type check the overlay in isolation, then I guess they would have to go through the whole 'add symlink, import *, etc' routine for config.py
, otherwise it would shadow the default stubs. I guess not the end of the world (and either way having proper stubs for something is much better than nothing), but need to think about it.
First thing that comes in my mind is that in principle, it would be easier from many perspectives if each module had its own config.py
stub, e.g. my.reddit.config
only containing the relevant Reddit variables, or my.instapaper.config
. The downside is that it's kinda annoying when your actual config is scattered across many files.. So.. it's posssible to have the best of both worlds and hack together some dynamic magic that 'picks out' the relevant bits from your config.py
into atomic modules, e.g. the file:
config instapaper:
export_path = '...'
config reddit:
export_path = '...'
in runtime corresponds to the module my.config
. We could simply alias my.reddit.config
and my.instapaper.config
to my.config
, and then everything would work.
But need to think it through first, maybe can avoid dynamic magic here as well.
Just as an update on my current state with HPI/extending this.
Is to be expected with a fork that diverges, but its become more difficult to merge changes back in from here back into my fork.
At some point in the future, I will probably experiment with symlinks/overrides to see what issues I run into there.
i.e. going to try and patch/merge/combine my changes in some way onto a fresh fork of this repo
I will probably try to do symlinks first, as it seems to be the easiest to implement (let me know if you think otherwise), but still not sure on the long-term usability of that
As a tangential aside:
The more I develop my own modules, the more I feel it would be nicer if it could be completely decoupled.
Taking my discord module as a basic example:
I pretty much always write a package, that provides the functionality irrespective of HPI:
https://github.com/seanbreckenridge/discord_data
And then the corresponding 'HPI connection file' file, which handles grabbing the path to the data from the config, any possible caching, and helper/utility functions based on the domain, and calls the underlying function from the library.
https://github.com/seanbreckenridge/HPI/blob/master/my/discord.py
If how things are cached/discovered is all limited to my.core
, then 'HPI connection files' can import my.core...
instead of relative imports, and the files don't have to live in the directory structure (or at least they could be distributed/shared in separate repos, and its not all forks/symlinks). That sort of discovery does feel like the sort of problem a plugin manager solves, but I'm not fully advocating for one.
For the sake of it, to further explore the possibility; one could have a block in the config where you can point to (a path on your system/it's done in python (my.config) anyways so you can choose to resolve however you want) additional 'HPI connection files', which get mangled like my.config
to my.module_name
. It probably still runs into mypy issues, but I just thought I'd bring it up.
Yeah, sorry about it, I moved some things recently (hopefully in the direction of simplifying). But yeah, symlinks are probably the easiest way.
The more I develop my own modules, the more I feel it would be nicer if it could be completely decoupled.
Yes, I totally agree! (this is kind of what I mean about data access layer).
The main reason there is still some code which does data 'interpretation' in HPI is just laziness/general hassle of scattering things across repositories. But for the most 'common' ones like Reddit/Github/instapaper the HPI bits ('connection files') are fairly thin. Definitely could do that more for GDPR modules as well.
It's kind of a shame extracting in a separate repository has such overhead -- e.g. setup.py, ideally github actions (and all the corresponding configs). Maybe there could be some 'superepository' which imports all of them as submodules, and runs CI checks just for the ease of config maintenance.
If how things are cached/discovered is all limited to my.core, then 'HPI connection files' can import my.core... instead of relative imports
Yep, good point, in fact -- it would probably be better to start gradually using my.core
since eventually it will be in a separate package> (relative imports will still work, but would be kinda awkward)
one could have a block in the config where you can point to
Hmm. Maybe not a bad idea? Could be files or anything else that needs to be added to PYTHONPATH, basically. Yeah, mypy won't work with these, but I guess many people don't care about it anyway, so it could be nice to have this way of setting things up. Definitely easier than messing with editable installs for some people.
After trying and disliking how much had to be changed with a symlink based approach, have migrated towards a namespace-based approach:
https://github.com/seanbreckenridge/HPI
It does require me to manually edit the easy-install.pth
though... unsure if theres some way to add something to hpi doctor
to check for that
Created a naive implementation to manage my easy-install.pth file:
https://github.com/seanbreckenridge/reorder_editable
Isnt a perfect solution, but added a check in my.config
, so it should catch it whenever I happen to break the order of my easy-install.pth
and warn me/fix it if possible: https://github.com/seanbreckenridge/dotfiles/blob/a1a77c581de31bd55a6af3d11b8af588614a207e/.config/my/my/config/__init__.py#L16-L30