nasaharvest/cropharvest

Question of data format explanation

LL0912 opened this issue · 1 comments

Hello, I am trying to use the dataset, which I downloaded from Zenodo. However, I could not find any explanation of the data format, such as the meaning of the file names in "features" and of the dictionary keys in "labels.geojson". I can only guess their meaning from the code. Where can I find an official description of the dataset, including the file naming? Can you help me?

Hi there!

Apologies for the delayed reply. I'll add this to the main README but in the meantime:

labels.geojson

>>> import geopandas
>>> labels = geopandas.read_file("labels.geojson")
>>> labels.columns
Index(['harvest_date', 'planting_date', 'label', 'classification_label',
       'index', 'is_crop', 'lat', 'lon', 'dataset', 'collection_date',
       'export_end_date', 'is_test', 'geometry'],
      dtype='object')

There are two types of columns: RequiredColumns, which must be filled for all rows, and NullableColumns, which can have null values (see here).

Required Columns
  • index - the index of the row
  • is_crop - a boolean indicating whether the point being described contains cropland (at the date described by export_end_date)
  • lat - the latitude of the point
  • lon - the longitude of the point
  • dataset - the dataset which the point comes from
  • collection_date - the date at which the point was collected
  • export_end_date - we collect a year of data for each point. This value defines the last month for which data is exported, and therefore the entire timeseries, since data is collected for the year up to that point.
  • geometry - the geometry of the point. This may be a polygon (in which case lat/lon will be the central point of that field) or a point
  • is_test - a boolean indicating whether or not the point is part of the test data
Nullable Columns
  • harvest_date - the harvest date of the crop described at the lat/lon
  • planting_date - the planting date of the crop described at the lat/lon
  • label - the label - this will be the higher level agricultural land cover label describing the land use at the lat/lon for the given export_end_date
  • classification_label - the higher level classification of label, defined by the FAO's indicative crop classification (e.g. if a row has label="maize", then it would have classification_label="cereals")
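
As a rough illustration of the required/nullable split described above, here is a minimal sketch that checks a GeoJSON feature for missing required values. It uses only the standard json module rather than geopandas so that it is self-contained. The helper name missing_required, the sample row, its dataset name and its date strings are all made up for illustration; they are not taken from the real file.

```python
import json

# Column groups as listed in this thread.
REQUIRED_COLUMNS = [
    "index", "is_crop", "lat", "lon", "dataset",
    "collection_date", "export_end_date", "geometry", "is_test",
]
NULLABLE_COLUMNS = [
    "harvest_date", "planting_date", "label", "classification_label",
]


def missing_required(feature):
    """Return the required columns that are absent or null in one GeoJSON feature.

    Hypothetical helper, not part of cropharvest.
    """
    row = dict(feature.get("properties") or {})
    # In GeoJSON, the geometry is a sibling of "properties", not a property itself.
    row["geometry"] = feature.get("geometry")
    return [col for col in REQUIRED_COLUMNS if row.get(col) is None]


# Made-up sample with a single row; the values and date format are assumptions.
# Real data would be loaded from labels.geojson with json.load instead.
sample = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "geometry": {"type": "Point", "coordinates": [1.0, 8.0]},
   "properties": {"index": 0, "is_crop": true, "lat": 8.0, "lon": 1.0,
                  "dataset": "example", "collection_date": "2020-01-01",
                  "export_end_date": "2021-02-01", "is_test": false,
                  "harvest_date": null, "planting_date": null,
                  "label": null, "classification_label": null}}
]}
""")

for feature in sample["features"]:
    # An empty list means every required column is filled for this row.
    print(missing_required(feature))
```

The nullable columns are deliberately left out of the check, since null values are allowed there.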

features

All features have the following naming convention: {index}_{dataset}.h5 - where these two values are defined above. So each feature is associated with a row in the labels.geojson.
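
The {index}_{dataset}.h5 convention can be sketched as a pair of small helpers for going from a labels row to a feature file name and back. The function names are hypothetical, and the sketch assumes that index is an integer and that only the first underscore separates it from the dataset name (so dataset names containing underscores still parse); neither assumption is confirmed in this thread.

```python
def feature_filename(index, dataset):
    """Build the feature file name for a labels row (hypothetical helper)."""
    return f"{index}_{dataset}.h5"


def parse_feature_filename(filename):
    """Split a feature file name back into (index, dataset).

    Assumes index is an integer and splits on the first underscore only.
    """
    suffix = ".h5"
    if not filename.endswith(suffix):
        raise ValueError(f"not an .h5 feature file: {filename!r}")
    stem = filename[: -len(suffix)]
    index_str, dataset = stem.split("_", 1)
    return int(index_str), dataset


# "example_dataset" is a made-up dataset name used only for illustration.
name = feature_filename(12, "example_dataset")
print(name)
print(parse_feature_filename(name))
```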

We are currently in the process of changing this convention so that names instead follow the format f"min_lat={min_lat}_min_lon={min_lon}_max_lat={max_lat}_max_lon={max_lon}_dates={start_date}_{end_date}_all".
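
For the bounding-box naming format, a minimal sketch of building and parsing such a name might look like the following. The helper names are hypothetical. The thread does not say how the dates are formatted or which file extension is used, so the sketch assumes ISO-style date strings without underscores and works on the name without an extension.

```python
import re


def bbox_name(min_lat, min_lon, max_lat, max_lon, start_date, end_date):
    """Build a name in the bounding-box format quoted above (hypothetical helper)."""
    return (
        f"min_lat={min_lat}_min_lon={min_lon}_max_lat={max_lat}"
        f"_max_lon={max_lon}_dates={start_date}_{end_date}_all"
    )


# Assumes plain decimal coordinates and dates that contain no underscores.
_NUMBER = r"-?\d+(?:\.\d+)?"
_PATTERN = re.compile(
    rf"min_lat=(?P<min_lat>{_NUMBER})_min_lon=(?P<min_lon>{_NUMBER})"
    rf"_max_lat=(?P<max_lat>{_NUMBER})_max_lon=(?P<max_lon>{_NUMBER})"
    r"_dates=(?P<start_date>[^_]+)_(?P<end_date>[^_]+)_all"
)


def parse_bbox_name(name):
    """Parse a bounding-box style name into a dict of its parts."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised name: {name!r}")
    parts = match.groupdict()
    for key in ("min_lat", "min_lon", "max_lat", "max_lon"):
        parts[key] = float(parts[key])
    return parts


# Made-up coordinates and dates, for illustration only.
name = bbox_name(7.5, 0.5, 8.5, 1.5, "2020-02-01", "2021-02-01")
print(name)
print(parse_bbox_name(name))
```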

Let me know if I can provide any further clarifications!