stac-utils/pystac

[RFC] Revert deprecation of Label extension

fmigneault opened this issue · 10 comments

Revert: #1270
Relates to: #1249

There is currently a lot of activity regarding ML/AI Extensions in STAC.

Given that label is a core foundation of annotations (when combined with classification, raster and ml-aoi) to form datasets employed by ML, I strongly believe that deprecating it (literally 1 month before the recent work) was a mistake.

I would like to propose reverting the deprecation, so that the STAC Community can have a common reference to define labels, rather than custom extensions without guidance. It is not quite clear to me what the "significant issues with its current implementation" mentioned in #1249 are, but I would prefer working on resolving them rather than abandoning label, which seems to work really well. Conceptually, the STAC Items obtained make a lot of sense to me:

What was the reason for deprecating it in pystac anyway? The extension itself doesn't seem to be deprecated.

What was the reason for deprecating it in pystac anyway? The extension itself doesn't seem to be deprecated.

Explanation here, I believe we decided on it during the STAC sprint in 2023: #1249.

@gadomski

As mentioned in the original description, the only indication in #1249 is that it "has significant issues with its current implementation", and this in itself doesn't really tell us what the problems are.

Also, looking at https://stac-extensions.github.io/, there are many more extensions that are in "pilot", and they are still in pystac (see classification, grid, mgrs, pointcloud, timestamps, xarray).

"has significant issues with its current implementation", and this in itself doesn't really tell us what the problems are.

Yeah, I think that came out of an in-person conversation with @m-mohr (sorry for throwing you under the bus Matthias) ... I'm not against re-adding the label extension to pystac, but I'd be more excited about re-adding it if any existing issues/problems with the extension itself were resolved, to reduce the amount of future changes we'd need to make to the pystac implementation.

That all being said, I'd be fine reviewing a PR to re-add label to pystac and approving/merging if it looks good.

Indeed, there are some long-standing flaws with the label extension specification that should be solved. But as far as I understood @fmigneault and @rbavery, they are working on improving the whole ML STAC ecosystem, so I assume that would be part of their work.

Yeah, I think that came out of an in-person conversation with @m-mohr (sorry for throwing you under the bus Matthias) ...

I believe I was in a different sub-group (core spec, not Python ecosystem). :-P @gadomski

Correct, we are working on it! So far I've been primarily focused on the ML Model extension, and Francis has been supporting this and improving ML AOI. We've discussed tackling a review of the STAC Label extension next and providing updated examples. Others in the #ml-stac Slack channel have expressed interest in contributing, providing feedback, and building with the extensions. In particular, I think @earthpulse is using it in an image annotation tool and is looking to continue using it and the ML Model extension.

I'd be happy to be a co-owner with Francis for the STAC Label extension stac-extensions/label#16.

Correct! At @earthpulse we have worked on a labeling tool that is entirely based on this extension, so it is core for us and we are very interested in helping to keep the extension updated and functional. The tool, in case it is of interest to anyone: https://pypi.org/project/scaneo/

Replying to #1249 (comment) here so as not to resurrect an already-closed issue.

There is some fear in the community that this extension is deprecated, so it would be good to talk about this topic.

The extension itself (https://github.com/stac-extensions/label) is not deprecated, and can still be used via Python w/ direct dictionary access, e.g.:

item.stac_extensions.append("https://stac-extensions.github.io/label/v1.0.1/schema.json")
item.properties["label:properties"] = ["foo", "bar"]
...
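The direct dictionary access described above can be sketched without any pystac-specific helper. This is a minimal illustration only: the Item is represented as a plain dict (the same pattern applies to `item.stac_extensions` and `item.properties` on a pystac Item), the helper name `add_label_fields` and all field values are hypothetical, and the result is not validated against the label schema.

```python
# Schema URI taken from the snippet above.
LABEL_SCHEMA = "https://stac-extensions.github.io/label/v1.0.1/schema.json"


def add_label_fields(item: dict, properties: list, label_type: str, description: str) -> dict:
    """Declare the label extension on an Item dict and set a few label:* fields.

    Hypothetical helper for illustration; it modifies the dict in place.
    """
    extensions = item.setdefault("stac_extensions", [])
    if LABEL_SCHEMA not in extensions:  # avoid declaring the schema twice
        extensions.append(LABEL_SCHEMA)

    props = item.setdefault("properties", {})
    props["label:properties"] = properties
    props["label:type"] = label_type
    props["label:description"] = description
    return item


# Placeholder Item; a real Item also needs geometry, bbox, links and assets.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-labels",
    "properties": {"datetime": "2024-01-01T00:00:00Z"},
}
add_label_fields(item, ["foo", "bar"], "vector", "Example labels")
```

Calling the helper a second time leaves `stac_extensions` unchanged, which keeps the extension list free of duplicates.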

The pystac implementation of the label extension was deprecated during the extension update efforts at the latest STAC sprint (#1228). During that sprint, it was determined:

For all these reasons, the extension was deprecated, but the extension is still present in pystac and can be used. There are still quite a few issues in the pystac v2.0 milestone (https://github.com/stac-utils/pystac/milestone/28), and at present there's no one actively working on significant feature improvements to pystac that I know of, all to say that removal is probably far off in the future.

As for a path forward to un-deprecating label, my recommended course of action would be to:

From what I can see in issues linked in #1313 (comment), the only ambiguity is the wording of the GeoJSON requirement in Assets, which does not make sense when label:type = raster is specified, since a GeoTIFF, COG or equivalent data matrix is expected instead. That can be changed in the spec's README without impacting how it is implemented in pystac.
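To make the ambiguity concrete, here is a small sketch of the asset media types one would expect for each label:type under the reading described above. The mapping and the function name `asset_matches_label_type` are assumptions for illustration, not text from the label specification; the media type strings are the usual ones for GeoJSON, GeoTIFF and COG.

```python
# Assumed mapping from label:type to plausible asset media types.
# This reflects the reading in the comment above, not the spec's wording.
EXPECTED_MEDIA_TYPES = {
    "vector": {"application/geo+json"},
    "raster": {
        "image/tiff; application=geotiff",
        "image/tiff; application=geotiff; profile=cloud-optimized",
    },
}


def asset_matches_label_type(label_type: str, media_type: str) -> bool:
    """Return True if the asset media type is plausible for the label type."""
    return media_type in EXPECTED_MEDIA_TYPES.get(label_type, set())


vector_ok = asset_matches_label_type("vector", "application/geo+json")
raster_geojson_ok = asset_matches_label_type("raster", "application/geo+json")
```

Under this reading, requiring GeoJSON for a raster label asset (`raster_geojson_ok`) is exactly the combination that does not make sense.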

The biggest overlap (by name only really) between the extensions is label:classes and classification:classes. However, they are not actually equivalent. Property classification:classes explicitly refers to "Classes stored in raster or bands" while label:classes can refer to either rasters or geometries. When labels represent classifications as a whole over an area (i.e.: label:type = vector) rather than pixel-wise classes (i.e.: label:type = raster), this distinction is very useful and only possible via label. Furthermore, label:classes has the capability to list multiple value strings/numbers for a given class, while classification:classes enforces a single integer. The label:classes.classes array is useful for representing all possible values of a given class, which could be the integer pixel value, but also a class or category index (e.g.: the class and super-category IDs in the COCO dataset, which are not continuous indices), or even the literal strings of those classes/categories. It can be used to define more annotation groups as well, which can be extremely useful when a taxonomy of classes reaches a few hundred classes, or when combining multiple dataset sources that use different class name/index conventions. Depending on how the annotation was performed, all these combinations of integer/string representations could be valid.
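The difference in shape between the two fields can be sketched as follows. The class names and values are made up, and the two checker functions are simplified stand-ins written for this illustration, not the extensions' actual JSON schemas.

```python
# label:classes: each entry groups several string/number values under one name.
label_classes = [
    {"name": "CLASS", "classes": ["lake", "river", 17, 42]},
    {"name": "CATEGORY", "classes": ["water-body"]},
]

# classification:classes: each entry carries exactly one integer value.
classification_classes = [
    {"value": 1, "name": "lake"},
    {"value": 2, "name": "river"},
]


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def looks_like_label_class(entry: dict) -> bool:
    """Simplified check: 'classes' is a list of strings and/or numbers."""
    values = entry.get("classes")
    return isinstance(values, list) and all(
        isinstance(v, str) or _is_number(v) for v in values
    )


def looks_like_classification_class(entry: dict) -> bool:
    """Simplified check: 'value' is a single integer."""
    value = entry.get("value")
    return isinstance(value, int) and not isinstance(value, bool)


all_label_ok = all(looks_like_label_class(e) for e in label_classes)
all_classification_ok = all(
    looks_like_classification_class(e) for e in classification_classes
)
# A label-style entry does not fit the classification-style shape.
cross_ok = looks_like_classification_class(label_classes[0])
```

The mixed list `["lake", "river", 17, 42]` has no direct counterpart on the classification side, where each entry is tied to one integer.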

Another clear distinction is that label:properties can be used to provide multiple classification properties over a given AOI. For example, if a lake was annotated with label:properties: ["CLASS", "CATEGORY", "WATER_QUALITY"], it could indicate that it is CLASS: lake, CATEGORY: water-body and WATER_QUALITY: 0.95 at the same time. To my knowledge, there would be no way to do this with the classification extension besides using many Assets as pixel-wise masks for each combination of those respective labels and classes. That would be more complex to interpret, and much bigger from a data storage point of view than a single GeoJSON that combines everything.
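The lake example can be sketched with a single GeoJSON Feature. The geometry, the extra `annotator` property and the helper `extract_labels` are hypothetical and only serve to show how label:properties selects several label fields from one feature.

```python
# Value of label:properties on the STAC Item, as in the example above.
label_properties = ["CLASS", "CATEGORY", "WATER_QUALITY"]

# One GeoJSON Feature from the label asset; the geometry is a placeholder square.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
    },
    "properties": {
        "CLASS": "lake",
        "CATEGORY": "water-body",
        "WATER_QUALITY": 0.95,
        "annotator": "hypothetical-user",  # not listed in label:properties
    },
}


def extract_labels(feature: dict, label_properties: list) -> dict:
    """Keep only the feature properties named in label:properties."""
    props = feature.get("properties", {})
    return {key: props[key] for key in label_properties if key in props}


labels = extract_labels(feature, label_properties)
```

All three labels travel with one geometry, whereas a pixel-wise encoding would need a separate mask per property.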

The classification extension also focuses on the pixel values themselves, but not really on what they are used for. The label extension has an additional use, which is to describe the annotation campaign itself. Details can be provided about whether the labels were generated manually or by an automated process (combined with https://github.com/stac-extensions/processing and https://github.com/crim-ca/dlm-extension, that can help with data lineage of derived products), statistics about them, and intended tasks.

Personally, I do not think that having distinct extensions for label:type = vector or label:type = raster would help. If anything, that could bring more confusion since many fields would be duplicated (just like there is confusion about label:classes and classification:classes although they represent different concepts). I believe it is only a matter of making sure the examples correspond to the specification text (to avoid the ambiguity they create), just like stac-extensions/label#12 describes, and adding more examples with explanations of how to combine label with the raster, eo and classification extensions for specific use cases, just like https://github.com/stac-extensions/classification/blob/main/examples/item-classes-maxar.json does.

Sharing some screenshots from a partner's notebook that uses pystac.extensions.label and that performs validation successfully against the resulting STAC Item. From an implementation's point of view, everything seems to behave as intended. Therefore, I am still not sure where the maintenance issue is besides fixing the specification's README.

[Screenshot: unnamed]
[Screenshot: unnamed-1]

Thanks for outlining all the fixes for the extension issues I mentioned. If you can put that information over on the label extension itself (https://github.com/stac-extensions/label), that will help us clean up the actual extension; this issue is just for the Python implementation of that extension. I'm happy to help move along those issues over on the label extension repo, but as I'm not actively using the extension for my own work I don't think I would be an appropriate owner.

From an implementation's point of view, everything seems to behave as intended.

Sounds good. Yeah, once the label extension's issues and ownership are cleared up, it shouldn't be a problem to un-deprecate. There's still the matter of bringing the extension up to the latest best practices mentioned in #1228, which shouldn't be a large lift but would need someone to step up and do that.