holoviz/holonote

Proposal for new Annotator API

Closed this issue · 18 comments

hoxbro commented

This issue is an alternative to the change proposed in #9 and will only discuss the public API of the Annotator class, which will be the hardest to change after the official release of HoloNote.

Goals

The goals for improving the Annotator class should be:

  • Simple API: Users should only have to interact with a single object since coordinating multiple objects is awkward.
  • Separation of concerns: All annotation-related APIs should live on the Annotator, and the persistence layer (i.e., the Connector) should be easily swappable and have only one task, which is to persist the output of the annotator.
  • Minimize the number of objects: We do not want the user to set up a separate object only for the purpose of coordinating annotations across multiple dimensions. It’s awkward and adds no benefit.
  • Loose coupling: Annotators should not be tied to a specific element; what defines them is the dimensions being annotated and their types, and not a specific instance of an element. You should be able to annotate across partially shared axes.

Current approach

With the current approach, there are three short-comings

  • Separation of concerns between Annotator and Connector.
  • Multiple elements with a single Annotator are not well supported. This leads to a complex setup with multiple Annotators and a shared Connector.
  • Annotator instances can only handle one element.

I have, in issue #6 (commit: c729118), started the process of separating the functionality of Annotator and Connector. These changes are almost feature-complete with the main branch (commit: af0138d). The missing feature is that it is not possible to commit two different elements with different Annotators as a single commit. The lack of this feature is because the main branch is using the Connector as the negotiator between the different Annotators by having methods like add_annotation on it, which is what the PR is actively trying to remove.

Currently, this is how it is done on the main branch. The code is split into four steps, which will be referenced later.

# Step A) Imports, load data, and create elements - always the same
importcurve = hv.Curve(..., kdims="TIME")
image = hv.Image(..., kdims=["A", "B"])

# Step B) Setting up the connector and annotations
connector = SQLiteDB(table_name='Idecided')
curve_annotator = Annotator(curve, connector=connector, fields=['description'])
image_annotator = Annotator(image, connector=connector, fields=['description'])

# Step C) Setting an annotation
curve_annotator.set_range(np.datetime64('2005-02-13'), np.datetime64('2005-02-16'))
image_annotator.set_range(-0.25, 0.25, -0.1, 0.1)

# Step D) Adding an annotation and committing it
connector.add_annotation(description='Annotation 2')
connector.commit()

My pain with the current implementation is that it is very complex to set up (step B) as it needs one Connector and two Annotators with the instance of the connector passed into it. This also uses the connector for adding annotations (step D).

To handle the separation of concerns, issue #9 proposed a new abstraction layer, DataSource:

# Step B)
datasource = DataSource()
image_annotator = Annotator(image, datasource=datasource)
curve_annotator = Annotator(curve, datasource=datasource)

# Step D)
datasource.add_annotation()
datasource.commit()

This proposal is a patch solution based on existing code. The approach solves the separation of concern problem, but in doing so, it is even more complicated as you need to know about a third kind of object, which is also a worry described in the issue. It also does not reduce the number of objects.

The current implementation on the main branch also has a strong coupling between an Annotator instance and an Element instance. This seems more fundamental and is not addressed in #9. This is where a lot of the complexity in the setup stems from.

New proposal

This new proposal will build upon the work already made in #6, which separates the Annotator and Connector. The first thing would be to remove the strong coupling between an annotator and an element. In practice, this will be done by either having a class method extracting relevant information from an element (or, alternatively, the __init__ method should only extract it). By removing the single-element constraint, we can make the Annotator accept specs from multiple elements. This change will mean that step B would look like this:

annotator = Annotator.from_elements(curve, image, fields=["description"])
# or
spec = {"TIME": np.datetime64, "A": float, "B": float}
annotator = Annotator(spec, connector=connector, fields=["description"])

The spec dictionary shows the current implementation and could change to incorporate extra information if it is a point or range type.

Step C will change so it only accepts keyword arguments:

annotator.set_range(TIME=(np.datetime64('2005-02-13'), np.datetime64('2005-02-16')))
annotator.set_range(A=(-0.25, 0.25), B=(-0.1, 0.1))

These changes will also solve the problem currently seen in #6. As step D will just be:

annotator.add_annotation(description='Annotation 2')
annotator.commit()

With this change, we only need one object to orchestrate the annotations making for a simpler API. This also means the Annotator instance will have all available information and can communicate with the connector instance without a third-party object.

Applying annotations to elements should be easy and intuitive if you know HoloViews. Multiplying the annotator on an element should automatically detect the element's key dimension(s) and overlay the matching annotations correctly.

annotator * curve + annotator * image

For this proposal, we will need to rewrite the Annotator class to remove the strong coupling of the element and add the proposed features mentioned in this section.

I strongly support the above proposal. The four goals that Simon outlines above are crucial to making Annotators intuitive to use and allow us to structure the code in a composable way. An Annotator is defined by the dimensions you want to annotate and their dtypes, not by any specific element that shares those dimensions. By declaring this specification ahead of time and having a single annotator support multiple dimensions I believe annotators will be much easier to use and even complex multi-dimensional spaces can be annotated in an intuitive way.

Any proposal that relies on an additional object such as the Connector in the original implementation or the DataSource in the alternative proposal will imo be awkward to use and will not allow for proper separation of concerns.

This proposal will also make it possible to annotate multiple plots with shared dimensions, e.g. imagine you have a 1D plot of time and a 2D plot of time and depth. With multiple annotators there was no way to share the time dimension, while this proposal allows for this in a very convenient and natural way.

Overall I see the proposal as a generalization of what was already there that only has upsides and zero downsides. Given that I have spent only limited time considering the problem I might have missed something but every angle I've approached this from I find this approach much cleaner both from a usability and a developer/maintenance perspective.

The proposal:

  • Makes the API more approachable for users
  • Reduces complexity because now there's just one primary object (Annotator)
  • Adds flexibility and makes it easier to annotate across partially shared axes.

Here is psuedo code thinking through how the new proposal would look in practice:

annotations_df = pd.DataFrame({
    'time_start': [10, 30, 60],
    'time_end': [20, 40, 70],
    'description': ['thing1', 'thing2', 'thing1']
})

curve = hv.Curve((time, eeg_data), kdims=["Time", "Amplitude"])
image = hv.Image((time, eeg_data), kdims=["Time", "Channel"])

connector = SQLiteDB(table_name='something')
spec = {"TIME": np.datetime64, "Amplitude": float}
annotator = Annotator(spec, connector=connector, fields=["description"])

for index, row in annotations_df.iterrows():
    annotator.set_range(Time=(row['time_start'], row['time_end']))
    annotator.add_annotation(description=row['description'])

annotator.commit()

annotated_curve = annotator * curve
annotated_image = annotator * image

annotated_curve + annotated_image

It looks good and simple enough to me! I'm in strong support.

The main remaining aspects that I think might need to be considered in the future are how this implementation will lend itself to various UI and styling accommodations, but that's for a later time.

One thing that I'm not clear on yet both in this proposal and in the current implementation is how you specify whether the annotation along a specific dimension should be expressed as individual points (i.e. HLines/VLines), ranges (i.e. HSpans/VSpans) and/or 2D geometries.

It looks great to me! No notes. :-)

In Demetris example, how is this supposed to work?

annotated_curve = annotator * curve
annotated_image = annotator * image

Assuming ranges, the first one should display as VSpans (as it is 1D with a time dimension on the xaxis) and the second one should use rectangles (now it is a 2D element). I think this is the last point @philippjfr made and I do think this is important to resolve before continuing.

Also, everyone seems to think that Annotators are deeply tied with elements which is simply not true.

An Annotator is defined by the dimensions you want to annotate and their dtypes, not by any specific element that shares those dimensions.

This is already the case: there is absolutely minimal use of self.element (or element at all) in Annotators. Also, use of element at all is already optional and you can already use spec instead. While I think this API is worth considering (once my question above is resolved) a lot of what is being attacked about the current implementation is a straw man imho and the new proposal makes no improvement in this regard.

I would say the core proposal here is that an Annotator can be extended to accept the specs for more than one visual element at once. This does seem reasonable to me, but then I need to know how you actually visualize the annotations over the various elements. I would want to see what the proposed interface is here - if is good, then allowing higher dimension specs when initializing the Annotator seems like a reasonable proposal.

Rereading @hoxbro's initial comment, I think this is what jumps out to me as problematic (i.e. too uncontrolled and too magic for my taste):

Multiplying the annotator on an element should automatically detect the element's key dimension(s) and overlay the matching annotations correctly.

That is not the point I was making no, that part is solved by having the annotator infer the appropriate dimensions from the element inside the mul operation. My point was equally valid in the old implementation afaik, i.e. in the 1d case how do you tell it whether the annotation will be a range or a single point, and in the 2d case how do you specify if it'll be a rectangle or a polygon. This is really more of an aside though and can be solved easily enough.

I also don't see the strawman, it is true that a single element is tied to an annotator, whether it is used minimally or not that is still part of the design.

The parts that are important here are primarily about annotating of shared dimensions and coordinating multiple annotators being awkward and I don't see how these are strawmen in any way.

i.e. in the 1d case how do you tell it whether the annotation will be a range or a single point, and in the 2d case how do you specify if it'll be a rectangle or a polygon. This is really more of an aside though and can be solved easily enough.

This is already solved by setting the region_type.

I also don't see the strawman, it is true that a single element is tied to an annotator, whether it is used minimally or not that is still part of the design.

Ok, what is ambiguous here is a HoloViews element instance is not tied to the implementation. If we are clear we are talking about the specs of a single element, then yes, I agree.

The parts that are important here are primarily about annotating of shared dimensions and coordinating multiple annotators...

Agreed. But this is all dependent on some mechanism whereby annotators change their behavior based on the dimensionality of the thing they are overlaid upon. This is unlike anything we have done in HoloViews before and I'm not sure this is something I am comfortable with. For instance, how would I request just the annotations themselves without an element underneath?

This is unlike anything we have done in HoloViews before and I'm not sure this is something I am comfortable with. For instance, how would I request just the annotations themselves without an element underneath?

It's presumably just syntactic sugar over a method that accepts an element or a dimension spec that returns the appropriate overlay.

Are you suggesting overwriting __mul__ for Annotator?

I guess that would work if people are fine with even more semantics for operator overloading. Ideally users wouldn't think about it or notice what is going on but if the abstraction leaks it could cause more confusion.

Yes, but again that's more syntactic sugar than anything else and not at the core of the proposal, i.e. even if we decide that's too much magic this proposal makes sense to me.

The important part of the proposal is about specifying the full multi-dimensional annotation space so we can avoid coordinating multiple annotators. Do you have any specific concerns about moving in that direction? I can see specific issues that this addresses but I may also be missing new issues that arise due to that change.

The important part of the proposal is about specifying the full multi-dimensional annotation space so we can avoid coordinating multiple annotators. Do you have any specific concerns about moving in that direction?

The concern is how to map back from the high dimensional annotator object back to whatever lower dimension thing you are plotting on the screen. You say it is syntactic sugar but I consider this a key thing to understand for this proposal to work well in practice.

I admit I hadn't thought of implementing __mul__ for Annotators and it might actually be reasonable here. Because if you don't have this kind of syntactic sugar, you will need to specify which of the dimensions of the annotator are actually being used (and if you always had to do this, e.g. with a list of strings, it would be verbose and unwieldy).

The important part of the proposal is about specifying the full multi-dimensional annotation space so we can avoid coordinating multiple annotators.

This does seem like a good idea but only if the API for mapping back to the lower dimensional (sub)spaces is intuitive and easy to use. I think that if done well (e.g helpful error messages when there are name/dtype mismatches), this idea of __mul__ might do the trick (needed to avoid a more verbose explicit method call which would be ugly if that was all you could do).

I'm only spending time thinking about this aspect because the core idea (supporting higher dimensional annotators from the initial constructor) is a good one (and one I'm sure I considered with Simon at some point). I don't have any other objections to this proposal to raise right now.

As for downsides, one of the original ideas was that you could start with one plot to annotate and later add another joint plot to annotate together as a group. However, that plan didn't quite work out because you need a fixed schema by the time you commit your first annotation: you only had some flexibility in deciding what you were annotating up until that point.

In the end, that isn't much flexibility at all so you might as well declare the full dimensionality of what you will be annotating when you create the initial annotator (i.e. this proposal).

Indeed and in theory this proposal could have a mechanism to expand the spec later but also then would error when committing if the table has already been set up.

I will also say that I don't like the idea of from_elements:

annotator = Annotator.from_elements(curve, image, fields=["description"])

Most people should not think about the spec as many people will just have an element to annotate. This is also an awkward signature for the common case.

I would suggest:

Annotator(*spec_or_elements, fields=["description"])

i.e. *args where you either have a single dictionary spec, OR a list of (one or more) elements. I suppose you could freely mix any number of dictionaries and elements and do a merge, even if that wouldn't be the style I would recommend! That might be worth supporting though just because it is the fully general case (the argument would then be called specs_and_elements as it wouldn't be either/or).

You can still have a classmethod and use it internally (or explicitly if you really want to) but I don't think that should be the common (and simplest) entrypoint to the system.

The classmethod I would want is Annotator.spec_from_elements (gives you the spec dictionary from the supplied elements) which would be useful when debugging things or when figuring out the shape you want for your annotator data.

While it is true that region_type currently lets you choose the type of annotation, we do need to figure out how to declare this jointly in a larger, higher-dimension spec that spans visualizations.

@philippjfr If this is the point you were trying to get to, then I do agree that this needs more thought!

The current way joint specs are represented is something like this i.e. a list of dicts (each corresponds to an element):

[{'age':datetime, 'height':float }, {'x':int, 'y':int}]

And then the region types are essentially zipped with the spec dictionaries e.g.:

['Range', 'Point']

This is a bit awkward though so let's think of some alternatives!

For instance, to combine the necessary information in a single spec literal, it could be something like:

[('Range', {'age':datetime, 'height':float }), ('Point', {'x':int, 'y':int})]

Actually, the more HoloViews-ey way of doing things is probably to compare based on the Dimension instance, which does already support a type parameter. In that case, it would be:

spec = {'Range':[kdim1, kdim2], 'Point':[kdim1, kdim2]}

Where all the dimension instances have their type declared. i.e equivalent to (say) if the types are set:

spec = {'Range':curve.kdims, 'Point':image.kdims}

I do think this is the proper way to do things long term e.g. a dimension that is Kelvin that disallows negative values in the range, would be useful information that could let you disallow negative values in the SQL schema declaration (i.e. data validation). That said, this is not something we need to tackle right away: I just want to make sure our design/API will allow us to use this enhancement later.