Discussion: extending base HPI/overlays/overrides

Question

Discussion: extending base HPI/overlays/overrides

karlicoss opened this issue 4 years ago · 10 comments

Related issues: #12, #46; but I think worth a separate discussion.

From my experience, it's pretty hard to predict how other people want to use their data:

you might miss some attributes they care about
some people want to be more paranoid or more defensive (e.g. timezone handling/None safety/etc)
they might want to do some extra filtering
they might want to merge in extra data sources or suppress existing

The list is endless! So it would be nice if it was possible to easily override small bits of HPI modules/mix in other data, etc.

The main goals are:

low effort: ideally it should be a matter of a few lines of code to override something.
good interop: e.g. ability to keep with the upstream, use modules coming from separate repositories, etc.
ideally mypy friendly. This kind of means 'not too dynamic and magical', which is ultimately a good thing even if you don't care about mypy.

Once again, I see Emacs as a good role model. Everything is really decentralized, you have some core library, you have certain patterns that everyone follows... but apart from that the modules are mostly independent.
Many people still use 'monolith' base configurations (e.g. Doom/Spacemacs), because it's kinda convenient, as long as you have a maintainer. Arguably this is what this repository is at the moment, although it's obviously not as popular as Emacs distributions.

Emacs fits these goals well:

low effort: the simplest way to confugure something is to override a variable in your config (thanks to dynamic scope, it 'just works').
You can even literally override whole functions as a means of quickly getting the behaviour you want.
good interop: yes, unless the developer broke some APIs, usually you can safely update the upstream module.

How to achieve this within HPI:

For combining independent modules together (say, something like my/youtube.py and my/vimeo.py coming from different repositories), the easiest is to use:

symlinks (at least if you have just a few files/directories to mixin)
namespace packages (more on them later)

Now, the tricky case is when you want to partially override something.
The first option is: fork & apply your modifications on top. For example: https://github.com/seanbreckenridge/HPI

effort: very straightforward
interop: merging with the upstream a bit manual, but if you use atomic commits & interactive rebase/cherry pick, should be manageable
at least not any more magical than the original repository

Not sure if there is much to discuss here, so straight to the second and a more flexible option.

Once again, we rely on namespace packages! I'll just explain on a couple of examples, since it's easier.

example: mixing in a data source

The main idea is that you can override all.py (also some discussion here), and remove/add extra data sources.
Since all.py is tiny, it's not a big problem to just copy/paste it and apply your changes.

Some existing modules implemented with this approach in mind:
(I still haven't settled on the naming. all and main as the entry point kind of both make sense)
example: my.calendar.holidays

As you can guess, this module is responsible for flagging days as holidays, by exposing is_holiday function.
As a reasonable default, it's just using the user's country of residence and flags national holidays.
However, you might also want to mix in your work vacation, and this is harder to make uniform for everyone, and it's a good candidate for a custom user override:
```
import my.orig.my.calendar.holidays as M
from   my.orig.my.calendar.holidays import *

is_holiday_orig = M.is_holiday
def is_holiday(d: DateIsh) -> bool:
    # if it's a public holiday, definitely a holiday?
    if is_holiday_orig(d):
        return True
    # then check private data of days off work
    if is_day_off_work(d):
        return True
    return False
M.is_holiday = is_holiday
```
Thanks to namespace packages, when I import my.calendar.holidays it will hit my override first, monkey patch the is_holiday function, and expose the rest intact due to import *.
For example, hpi doctor my.calendar.holiday will run against the override, reusing the stats function or any other original functions.

My personal HPI override has more example code, and I'll gradually move some stuff from this repository there as well
(for example most things in my.body don't make much sense for other people).

Things I'm not sure about with this approach:

To import the 'original' module and monkey patch it, you need some alternative way of referencing it.
- for now, I'm using a symlink (/code/hpi-overlay/src/my/orig -> /code/hpi/src/my)
  
  This is simple enough, but maintaining the symlink manually, referencing the 'original' package through my.orig .. meh.
  Also not sure what to do if there are multiple overrides, e.g. 'chain' (although this is probably a bit extreme).
- it's probably possible to do something hacky and dynamic. E.g. take __path__, remove the first entry (which would be the 'override'), and then use importlib to import the 'original' module.
  
  The downside is that it's gonna be unfriendly to mypy (and generally a bit too magical?).
- another option is to have some sort of dynamic 'hook', which is imported before anything else.
  
  In the hook code, you import the original module and monkey patch. Same downsides, a bit too dynamic and not mypy friendly, but possible.

Caveats I know of:

packages can't contain __init__, otherwise the whole namespace package thing doesn't work
you need to be careful about the namespace package resolution order. It seems that the last installed package will be the last in the import order.
- so you'd need to run pip install -e /path/to/override and then pip install -e /path/to/original (even if it's already installed).
- another option is to reorder stuff in ~/.local/lib/python3.x/site-packages/easy-install.pth manually, but it's not very robust either (although at least it clearly shows the order)
  hpi doctor my.module displays some helpful info, but it's still easy to forget/mess it up by accident.
```
 $ hpi doctor my.calendar.holidays  
 ✅ import order: ['/code/hpi-overlay/src/my', '/code/hpi/my']
```
import * doesn't import functions that start from the underscore ('private').

Possible to get around this dynamically, but would be nice to cooperate with mypy somehow..

Happy to hear suggestions and thoughts on that. Once there's been some discussion, I'll move this to doc/, perhaps.

TODOS:

also thought that it should e possible to reuse the configuration in ~/.config/my as the 'default' overlay. In fact, treating it like a proper namespace package (at the moment it's a bit of dynamic hackery) might make everything even cleaner and simpler.
find some good tutorial on monkey patching and link? Wouldn't want to duplicate the efforts twice..
add some examples of motivation for overrides, just for documentation purposes
update docs here https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#addingmodifying-modules

Answer 1 · 2020-10-26T00:21:47.000Z

@seanbreckenridge would be interested to hear your thoughts!

Answer 2 · 2020-10-26T01:09:52.000Z

Agree that emacs is a good model to use here.

Since all.py is tiny, it's not a big problem to just copy/paste it and apply your changes

Definitely think adding all.py for larger modules is good regardless, using each of the other files as sources gives modules more structure

This is simple enough, but maintaining the symlink manually, referencing the 'original' package through my.orig

Yeah, I agree this is a somewhat meh solution, though probably the easiest for the user and has the lower barrier to entry. It also doesn't require a lot of custom python code to implement.

Personally never been a huge fan of configuration based on symlinks (though I do know some people live by it, so technically its a fine solution)

it's probably possible to do something hacky and dynamic. E.g. take __path__, remove the first entry (which would be the 'override'), and then use importlib to import the 'original' module.

Feel like this causes similar issues to what the pre-pip DAL was, and now it'd be up to the user to debug this if they're trying to add code on top of HPI

another option is to have some sort of dynamic 'hook', which is imported before anything else.

Dont think you'd want to include an entire plugin manager, but I think this pattern is at least better than using importlib -- its more visible to a new user and 'overriding' a dynamic function which just has pass makes more sense to me than inspecting the import path. It does add some boilerplate to every module, but that's just what this solution would require.

Despite my disdain for symlinks, I do think they're the easiest solution here. I'm trying to balance:

How easy it is for someone to change something without forking
- At the beginning, you can just use individual symlinks for personal modules/directories, so moving on to an overlay which uses a symlink to the original repo seems like natural progression
How much boilerplate a hook/function-based system would add to HPI. May make modules less readable and not as 'standalone'

I'm leaning towards the former, as it gives the user more flexibility.

At some point, even an overlay may not give you enough customizability over core components/functionality you may want to change. If that cant be fixed by a monkey patch, thats when you fork, I suppose.

Answer 3 · 2020-10-26T20:34:24.000Z

Dont think you'd want to include an entire plugin manager

Yeah, at least certainly not reinvent from scratch!

I guess another important principle I keep in mind is that there shouldn't be any 'hard' requirements for base modules like special base classes/interfaces/etc -- should be as code agnostic and rely on common Python mechanisms as long as possible. That way it would be easy to plug in custom stuff with no hassle. So I'd avoid plugin systems that try to impose such structure, but would be cool to try something that's can simplify monkey patching (e.g. one example I like is patchy).

In that sense all of the approaches above: symlinks/overrides/dynamic imports work, so it's up to the 'downstream' user what to choose. But still would be nice to have some 'natural' way of doing this.

Oh, just recalled, another issue with symlinks is that it's not very pip install friendly (when it's non editable). I guess it's not that big of a problem for an overlay, but might be annoying when HPI overlay is used as a dependency (or even during continuous integration). But also possible to work around by putting a symlinks directly in site-packages/hpi-overlay as a 'last resort' measure..

Answer 4 · 2020-10-31T13:22:45.000Z

Related: I've figured out how to use default my/config.py both for the actual stubs, and as a hook to load the user config if it exists, and I quite like it, makes things even more simple 9551c89
If someone has an overlay along with their own config.py stub, and they want to type check the overlay in isolation, then I guess they would have to go through the whole 'add symlink, import *, etc' routine for config.py, otherwise it would shadow the default stubs. I guess not the end of the world (and either way having proper stubs for something is much better than nothing), but need to think about it.
First thing that comes in my mind is that in principle, it would be easier from many perspectives if each module had its own config.py stub, e.g. my.reddit.config only containing the relevant Reddit variables, or my.instapaper.config. The downside is that it's kinda annoying when your actual config is scattered across many files.. So.. it's posssible to have the best of both worlds and hack together some dynamic magic that 'picks out' the relevant bits from your config.py into atomic modules, e.g. the file:

config instapaper:
    export_path = '...'

config reddit:
    export_path = '...'

in runtime corresponds to the module my.config. We could simply alias my.reddit.config and my.instapaper.config to my.config, and then everything would work.
But need to think it through first, maybe can avoid dynamic magic here as well.

Answer 5 · 2021-03-09T20:26:52.000Z

Just as an update on my current state with HPI/extending this.

Is to be expected with a fork that diverges, but its become more difficult to merge changes back in from here back into my fork.

At some point in the future, I will probably experiment with symlinks/overrides to see what issues I run into there.

i.e. going to try and patch/merge/combine my changes in some way onto a fresh fork of this repo

I will probably try to do symlinks first, as it seems to be the easiest to implement (let me know if you think otherwise), but still not sure on the long-term usability of that

As a tangential aside:

The more I develop my own modules, the more I feel it would be nicer if it could be completely decoupled.

Taking my discord module as a basic example:

I pretty much always write a package, that provides the functionality irrespective of HPI:

https://github.com/seanbreckenridge/discord_data

And then the corresponding 'HPI connection file' file, which handles grabbing the path to the data from the config, any possible caching, and helper/utility functions based on the domain, and calls the underlying function from the library.

https://github.com/seanbreckenridge/HPI/blob/master/my/discord.py

If how things are cached/discovered is all limited to my.core, then 'HPI connection files' can import my.core... instead of relative imports, and the files don't have to live in the directory structure (or at least they could be distributed/shared in separate repos, and its not all forks/symlinks). That sort of discovery does feel like the sort of problem a plugin manager solves, but I'm not fully advocating for one.

For the sake of it, to further explore the possibility; one could have a block in the config where you can point to (a path on your system/it's done in python (my.config) anyways so you can choose to resolve however you want) additional 'HPI connection files', which get mangled like my.config to my.module_name. It probably still runs into mypy issues, but I just thought I'd bring it up.

Answer 6 · 2021-03-11T05:50:07.000Z

Yeah, sorry about it, I moved some things recently (hopefully in the direction of simplifying). But yeah, symlinks are probably the easiest way.

The more I develop my own modules, the more I feel it would be nicer if it could be completely decoupled.

Yes, I totally agree! (this is kind of what I mean about data access layer).
The main reason there is still some code which does data 'interpretation' in HPI is just laziness/general hassle of scattering things across repositories. But for the most 'common' ones like Reddit/Github/instapaper the HPI bits ('connection files') are fairly thin. Definitely could do that more for GDPR modules as well.
It's kind of a shame extracting in a separate repository has such overhead -- e.g. setup.py, ideally github actions (and all the corresponding configs). Maybe there could be some 'superepository' which imports all of them as submodules, and runs CI checks just for the ease of config maintenance.

If how things are cached/discovered is all limited to my.core, then 'HPI connection files' can import my.core... instead of relative imports

Yep, good point, in fact -- it would probably be better to start gradually using my.core since eventually it will be in a separate package> (relative imports will still work, but would be kinda awkward)

one could have a block in the config where you can point to

Hmm. Maybe not a bad idea? Could be files or anything else that needs to be added to PYTHONPATH, basically. Yeah, mypy won't work with these, but I guess many people don't care about it anyway, so it could be nice to have this way of setting things up. Definitely easier than messing with editable installs for some people.

Answer 7 · 2021-03-14T15:54:53.000Z

After trying and disliking how much had to be changed with a symlink based approach, have migrated towards a namespace-based approach:

https://github.com/seanbreckenridge/HPI

It does require me to manually edit the easy-install.pth though... unsure if theres some way to add something to hpi doctor to check for that

Answer 8 · 2021-04-11T06:25:03.000Z

Created a naive implementation to manage my easy-install.pth file:

https://github.com/seanbreckenridge/reorder_editable

Isnt a perfect solution, but added a check in my.config, so it should catch it whenever I happen to break the order of my easy-install.pth and warn me/fix it if possible: https://github.com/seanbreckenridge/dotfiles/blob/a1a77c581de31bd55a6af3d11b8af588614a207e/.config/my/my/config/__init__.py#L16-L30

Answer 9 · 2022-04-27T05:54:45.000Z

all.py has sort of become the standard here, see my overlay for an example

Related PR: #238

Answer 10 · 2022-06-01T22:31:05.000Z

Now we have a generic helper that is relatively easy to use when we switch more modules to all.py and want to deprecate __init__.py in a backwards compatible way 9461df6