Data-level annotations
benjelloun opened this issue · 2 comments
Add a mechanism to Croissant to define data-level annotations. Annotations are a general mechanism to attach additional information to other pieces of data. We plan to use annotations for a number of use cases, including:
- statistics
- labels (textual or otherwise)
- provenance (including human annotator information)
- ...
Strawman proposal
Make annotation a first-class property, so that we can clearly represent the fact that some contents of a RecordSet are annotations. You can think of an annotation as a special kind of field that annotates its container.
Here is an example of what a field-level annotation looks like:
{"@type": "cr:RecordSet", "@id": "images",
"field": [
{ "@type": "cr:Field", "@id": "images/image", ... ,
"annotation": {
"@type": "cr:Field", "@id": "images/label",
"dataType": ["sc:Text", "cr:Label"]
}
}
]
}
In this example, the annotation "images/label" applies to the field "images/image".
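As a rough sketch of how a consumer might read this, assuming the JSON-LD has already been loaded into plain Python dicts: the helper `field_annotations` below is hypothetical (it is not part of any Croissant library), and the properties elided with `...` above are simply left out.

```python
def field_annotations(record_set):
    """Map each annotated field's @id to the @ids of its annotations."""
    result = {}
    for field in record_set.get("field", []):
        ann = field.get("annotation")
        if ann is None:
            continue
        # Assumption: "annotation" may hold a single object or a list of them.
        anns = ann if isinstance(ann, list) else [ann]
        result[field["@id"]] = [a["@id"] for a in anns]
    return result


# The RecordSet from the example above, without the elided properties.
images = {
    "@type": "cr:RecordSet",
    "@id": "images",
    "field": [
        {
            "@type": "cr:Field",
            "@id": "images/image",
            "annotation": {
                "@type": "cr:Field",
                "@id": "images/label",
                "dataType": ["sc:Text", "cr:Label"],
            },
        }
    ],
}

print(field_annotations(images))  # {'images/image': ['images/label']}
```

Because the annotation is nested inside the field it describes, the target of the annotation is implied by its container and no separate reference is needed.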
Annotations can also appear at the level of a RecordSet. A RecordSet-level annotation applies to the entire record. For example:
{
"@type": "cr:RecordSet",
"@id": "movies",
"field": [
{ "@type": "cr:Field", "@id": "movies/movie_id", ...},
{ "@type": "cr:Field", "@id": "movies/title", ...},
{ "@type": "cr:Field", "@id": "movies/genre", ...}
],
"annotation" : {
"@type": "cr:Field", "@id": "movies/ratings",
"subField": [
{ "@type": "cr:Field", "@id": "movies/ratings/user_id", ...},
{ "@type": "cr:Field", "@id": "movies/ratings/rating", ...}
]
}
}
In this example, ratings is a structured annotation that contains a user_id and a rating.
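To illustrate the "annotates its container" rule for both levels, here is a minimal sketch that walks a RecordSet loaded as Python dicts and reports which container each annotation applies to. The function name `iter_annotations` is hypothetical, the elided `...` properties are omitted, and allowing a list of annotations is an assumption rather than something the proposal states.

```python
def iter_annotations(node):
    """Yield (container_id, annotation) for every annotation under node."""
    ann = node.get("annotation")
    if ann is not None:
        # Assumption: "annotation" may hold a single object or a list of them.
        for a in (ann if isinstance(ann, list) else [ann]):
            yield node["@id"], a
    # Descend into regular fields and sub-fields, which may carry annotations.
    for key in ("field", "subField"):
        for child in node.get(key, []):
            yield from iter_annotations(child)


# The RecordSet from the example above, without the elided properties.
movies = {
    "@type": "cr:RecordSet",
    "@id": "movies",
    "field": [
        {"@type": "cr:Field", "@id": "movies/movie_id"},
        {"@type": "cr:Field", "@id": "movies/title"},
        {"@type": "cr:Field", "@id": "movies/genre"},
    ],
    "annotation": {
        "@type": "cr:Field",
        "@id": "movies/ratings",
        "subField": [
            {"@type": "cr:Field", "@id": "movies/ratings/user_id"},
            {"@type": "cr:Field", "@id": "movies/ratings/rating"},
        ],
    },
}

for container_id, annotation in iter_annotations(movies):
    sub_ids = [s["@id"] for s in annotation.get("subField", [])]
    print(container_id, "<-", annotation["@id"], sub_ids)
```

Here the only annotation found is `movies/ratings`, attached to the `movies` RecordSet itself, with `user_id` and `rating` as its sub-fields.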
Some examples of NetCDF files with hierarchical data annotations:
- https://huggingface.co/Prithvi-WxC/prithvi.wxc.2300m.v1/blob/main/climatology/climate_surface_doy001_hour00.nc
- Zarr, an analysis-ready, cloud-optimized data format (roughly speaking, a more advanced version of NetCDF): https://github.com/google-research/arco-era5?tab=readme-ov-file#raw-cloud-optimized-data. This was mentioned in one of the NeurIPS review comments.