[RFC] Revert deprecation of Label extension
fmigneault opened this issue · 10 comments
Revert: #1270
Relates to: #1249
There is currently a lot of activity regarding ML/AI Extensions in STAC.
- crim-ca/dlm-extension#2
- crim-ca/dlm-extension#7
- stac-extensions/ml-aoi#8
- All recent STAC Community meetings since December 2023
- The next STAC Community meeting (2024-03-11), where MLM-related work combining DLM, ML-Model and community feedback (stac-extensions/ml-model#13) will be presented.
- Various ongoing OGC initiatives (Disasters Pilot, Open Science Persistent Demonstrator, Climate-related pilots, etc.)
Given that `label` is a core foundation of annotations (when combined with `classification`, `raster` and `ml-aoi`) to form datasets employed by ML, I strongly believe that deprecating it (literally 1 month before recent work) was a mistake.
I would like to propose reverting the deprecation, so that the STAC Community can have a common reference to define labels, rather than custom extensions without guidance. It is not quite clear to me what the "significant issues with its current implementation" mentioned in #1249 are, but I would prefer working on resolving them rather than abandoning `label`, which seems to work really well. Conceptually, the STAC Items obtained make a lot of sense to me:
- examples: https://github.com/ai-extensions/stac-data-loader/tree/main/data/EuroSAT/stac/subset
- valid interaction in STAC Browser: https://hirondelle.crim.ca/stac-browser/collections/EuroSAT-subset-train/items/EuroSAT-subset-train-sample-59-class-SeaLake (see "Labels / ML" section properties)
What was the reason for deprecating it in pystac anyway? The extension itself doesn't seem to be deprecated.
> What was the reason for deprecating it in pystac anyway? The extension itself doesn't seem to be deprecated.
Explanation here, I believe we decided on it during the STAC sprint in 2023: #1249.
As mentioned in the original description, the only indication in #1249 is that it "has significant issues with its current implementation", and this in itself doesn't really tell us what are the problems.
Also, looking at https://stac-extensions.github.io/, there are many more extensions that are in "pilot", and they are still in `pystac` (see `classification`, `grid`, `mgrs`, `pointcloud`, `timestamps`, `xarray`).
> "has significant issues with its current implementation", and this in itself doesn't really tell us what are the problems.
Yeah, I think that came out of an in-person conversation with @m-mohr (sorry for throwing you under the bus Matthias) ... I'm not against re-adding the label extension to pystac, but I'd be more excited about re-adding it if any existing issues/problems with the extension itself were resolved first, to reduce the amount of future changes we'd need to make to the pystac implementation.
That all being said, I'd be fine reviewing a PR to re-add label to pystac and approving/merging if it looks good.
Indeed, there are some long-standing flaws with the label extension specification that should be solved. But as far as I understood @fmigneault and @rbavery, they are working on improving the whole ML STAC ecosystem, so I assume that would be part of their work.
> Yeah, I think that came out of an in-person conversation with @m-mohr (sorry for throwing you under the bus Matthias) ...
I believe I was in a different sub-group (core spec, not Python ecosystem). :-P @gadomski
Correct, we are working on it! So far I've been primarily focused on the ML Model extension, and Francis has been supporting this and improving ML AOI. We've discussed tackling a review of the STAC Label extension next and providing updated examples. Others in the #ml-stac slack channel have expressed interest in contributing, providing feedback, and building with the extensions. In particular I think @earthpulse is using it in an image annotation tool and is looking to continue using it and the ML Model extension.
I'd be happy to be a co-owner with Francis for the STAC Label extension stac-extensions/label#16.
Correct! At @earthpulse we have worked on a labeling tool that is entirely based on this extension, so it is core for us and we are very interested in helping to keep the extension updated and functional. The tool, in case it is of interest to anyone: https://pypi.org/project/scaneo/
Replying to #1249 (comment) here so as not to resurrect an already-closed issue.
There is some fear in the community that this extension is deprecated, so it would be good to talk about this topic.
The extension itself (https://github.com/stac-extensions/label) is not deprecated, and can still be used via Python w/ direct dictionary access, e.g.:
item.stac_extensions.append("https://stac-extensions.github.io/label/v1.0.1/schema.json")
item.properties["label:properties"] = ["foo", "bar"]
...
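For readers without an existing `item` at hand, the same direct-dictionary approach can be sketched on a plain Item dictionary. This is only an illustration and not part of pystac: `add_label_fields` is a hypothetical helper, the Item ID and field values are made up, and the choice of `label:description`, `label:type` and `label:properties` as the fields to set is based on my reading of the label v1.0.1 README.

```python
import json

LABEL_SCHEMA = "https://stac-extensions.github.io/label/v1.0.1/schema.json"


def add_label_fields(item, description, label_type, properties):
    """Hypothetical helper: declare the label extension and set its fields on an Item dict."""
    if label_type not in ("vector", "raster"):
        raise ValueError(f"unexpected label:type {label_type!r}")
    # Declare the extension once, without duplicating the schema URL.
    extensions = item.setdefault("stac_extensions", [])
    if LABEL_SCHEMA not in extensions:
        extensions.append(LABEL_SCHEMA)
    # Set the label fields directly on the Item properties.
    props = item.setdefault("properties", {})
    props["label:description"] = description
    props["label:type"] = label_type
    props["label:properties"] = properties
    return item


# Made-up minimal Item, for illustration only.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-label-item",
    "geometry": None,
    "properties": {"datetime": "2024-01-01T00:00:00Z"},
    "links": [],
    "assets": {},
}
add_label_fields(item, "Made-up example labels", "vector", ["foo", "bar"])
print(json.dumps(item["properties"], indent=2))
```

With pystac, the equivalent would be the two assignments shown above on `item.stac_extensions` and `item.properties`.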
The pystac implementation of the label extension was deprecated during the extension update efforts at the latest STAC sprint (#1228). During that sprint, it was determined:
- There would be non-trivial effort to bring the label extension up-to-date
- The extension itself (https://github.com/stac-extensions/label) has some fundamental issues, e.g. it tries to handle both rasters and vectors and some of its functionality overlaps with the classification extension (https://github.com/stac-extensions/classification/)
- The extension itself is unmaintained: stac-extensions/label#16
For all these reasons, the extension was deprecated, but the extension is still present in pystac and can be used. There are still quite a few issues in the pystac v2.0 milestone (https://github.com/stac-utils/pystac/milestone/28) and at present there's no one actively working on significant feature improvements to pystac that I know of, all to say that removal is probably far off in the future.
As for a path forward to un-deprecating label, my recommended course of action would be to:
- Resolve the significant issues w/ label, probably including
- Once those are resolved, update the pystac extension to conform to the new label extension release and un-deprecate
From what I can see in the issues linked in #1313 (comment), the only ambiguity is the wording of the GeoJSON requirement in Assets, which does not make sense when `label:type = raster` is specified, since a GeoTIFF, COG or equivalent data matrix is expected instead. That can be changed in the spec's README without impacting how it is implemented in `pystac`.
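To make the asset expectation concrete, here is a rough sketch of how the media type of a label asset could be checked against `label:type`. It is not taken from pystac or from the spec: `label_asset_problems` is a hypothetical helper, and both the use of a `labels` role to find label assets and the specific media type strings are assumptions based on common STAC usage.

```python
GEOJSON_TYPE = "application/geo+json"
TIFF_PREFIX = "image/tiff"  # prefix shared by the usual GeoTIFF and COG media types


def label_asset_problems(item):
    """Hypothetical check: list label assets whose media type does not fit label:type."""
    label_type = item.get("properties", {}).get("label:type")
    problems = []
    for key, asset in item.get("assets", {}).items():
        # Assumption: label assets are marked with a "labels" role.
        if "labels" not in asset.get("roles", []):
            continue
        media_type = asset.get("type", "")
        if label_type == "vector" and media_type != GEOJSON_TYPE:
            problems.append(f"{key}: expected GeoJSON for vector labels, got {media_type!r}")
        elif label_type == "raster" and not media_type.startswith(TIFF_PREFIX):
            problems.append(f"{key}: expected a raster format for raster labels, got {media_type!r}")
    return problems


# Made-up Items, reduced to the fields the check reads.
raster_item = {
    "properties": {"label:type": "raster"},
    "assets": {
        "labels": {
            "href": "./labels.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            "roles": ["labels"],
        }
    },
}
vector_item_with_tiff = {
    "properties": {"label:type": "vector"},
    "assets": {
        "labels": {
            "href": "./labels.tif",
            "type": "image/tiff; application=geotiff",
            "roles": ["labels"],
        }
    },
}

print(label_asset_problems(raster_item))
print(label_asset_problems(vector_item_with_tiff))
```

Under these assumptions, a raster label Item with a COG asset passes, while a vector label Item pointing at a GeoTIFF is flagged.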
The biggest overlap (by name only, really) between the extensions is `label:classes` and `classification:classes`. However, they are not actually equivalent. Property `classification:classes` explicitly refers to "Classes stored in raster or bands" while `label:classes` can refer to either rasters or geometries. When labels represent classifications as a whole over an area (i.e., `label:type = vector`) rather than pixel-wise classes (i.e., `label:type = raster`), this distinction is very useful and only possible via `label`. Furthermore, `label:classes` has the capability to list multiple value strings/numbers for a given class, while `classification:classes` enforces a single integer. The `label:classes.classes` array is useful for representing all possible values of a given class, which could be the integer pixel value, but also a class or category index (e.g., the class and super-category IDs in the COCO dataset, which are not continuous indices), or even the literal strings of those classes/categories. It can be used to define more annotation groups as well, which can be extremely useful when a taxonomy of classes gets into the range of a few hundred classes, or when combining multiple dataset sources that use different class name/index conventions. Depending on how the annotation was performed, all these combinations of integer/string representations could be valid.
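A small side-by-side sketch may help show the difference in shape. The values below are made up, `allowed_values` is a hypothetical helper, and the object layouts (`name`/`classes` for `label:classes`, `value`/`name` for `classification:classes`) reflect my understanding of the two READMEs rather than a validated schema.

```python
# label:classes: each entry names a property and lists every value it may take.
# Values can be strings or numbers, e.g. non-contiguous category IDs.
label_classes = [
    {"name": "CLASS", "classes": ["lake", "river", "forest"]},
    {"name": "CATEGORY_ID", "classes": [3, 17, 42]},  # made-up, non-contiguous IDs
]

# classification:classes: each entry maps one integer value to one class name.
classification_classes = [
    {"value": 0, "name": "lake"},
    {"value": 1, "name": "river"},
    {"value": 2, "name": "forest"},
]


def allowed_values(classes, property_name):
    """Hypothetical helper: return the values listed for one label property."""
    for entry in classes:
        if entry["name"] == property_name:
            return list(entry["classes"])
    raise KeyError(property_name)


print(allowed_values(label_classes, "CLASS"))
print(allowed_values(label_classes, "CATEGORY_ID"))
print([c["value"] for c in classification_classes])
```

The `label:classes` entries can carry several groups of values with different types, whereas each `classification:classes` entry carries exactly one integer.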
Another clear distinction is that `label:properties` can be used to provide multiple classification properties over a given AOI. For example, if a lake was annotated with `label:properties: ["CLASS", "CATEGORY", "WATER_QUALITY"]`, it could indicate that it is `CLASS: lake`, `CATEGORY: water-body` and `WATER_QUALITY: 0.95` at the same time. To my knowledge, there would be no way to do this with the `classification` extension besides using many Assets as pixel-wise masks for each combination of those respective labels and classes. That would be more complex to interpret, and much bigger from a data storage point of view, than a single GeoJSON that combines everything.
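As a rough illustration of the lake example, the sketch below pairs `label:properties` with a single GeoJSON feature that carries all three values. The geometry, the extra `annotator` property and the `extract_labels` helper are made up for the example.

```python
label_properties = ["CLASS", "CATEGORY", "WATER_QUALITY"]

# Made-up label asset content: one polygon annotated with several properties at once.
labels_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
            },
            "properties": {
                "CLASS": "lake",
                "CATEGORY": "water-body",
                "WATER_QUALITY": 0.95,
                "annotator": "someone",  # not listed in label:properties, so ignored below
            },
        }
    ],
}


def extract_labels(feature_collection, property_names):
    """Hypothetical helper: keep only the properties declared in label:properties."""
    return [
        {name: feature["properties"].get(name) for name in property_names}
        for feature in feature_collection["features"]
    ]


print(extract_labels(labels_geojson, label_properties))
```

One feature holds the class, the category and the quality score together, which is the combination that would otherwise need several raster masks.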
The `classification` extension also focuses on the pixel values themselves, but not really on what they are used for. The `label` extension has an additional use, which is to describe the annotation campaign itself. Details can be provided on whether the labels were generated manually or by an automated process (combined with https://github.com/stac-extensions/processing and https://github.com/crim-ca/dlm-extension, that can help with data lineage of derived products), along with statistics about them and their intended tasks.
Personally, I do not think that having distinct extensions for `label:type = vector` or `label:type = raster` would help. If anything, that could bring more confusion since many fields would be duplicated (just like there is confusion about `label:classes` and `classification:classes` although they represent different concepts). I believe it is only a matter of making sure the examples correspond to the specification text (to avoid the ambiguity they create), just like stac-extensions/label#12 describes, and adding more examples with explanations of how to combine `label` with the `raster`, `eo` and `classification` extensions for specific use cases, just like https://github.com/stac-extensions/classification/blob/main/examples/item-classes-maxar.json does.
Sharing some screenshots from a partner's notebook that uses `pystac.extensions.label` and that performs validation successfully against the resulting STAC Item. From an implementation's point of view, everything seems to behave as intended. Therefore, I am still not sure where the maintenance issue is besides fixing the specification's README.
Thanks for outlining all the fixes for the extension issues I mentioned. If you can put that information over on the label extension itself (https://github.com/stac-extensions/label), that will help us clean up the actual extension; this issue is just for the Python implementation of that extension. I'm happy to help move along those issues over on the label extension repo, but as I'm not actively using the extension for my own work I don't think I would be an appropriate owner.
> From an implementation's point of view, everything seems to behave as intended.
Sounds good. Yeah, once the label extension's issues and ownership are cleared up, then it shouldn't be a problem to un-deprecate. There's still the matter of bringing the extension up to the latest best practices mentioned in #1228, which shouldn't be a large lift but would need someone to step up and do that.