nasaharvest/cropharvest

Generalizing exporter

ivanzvonkov opened this issue · 3 comments

Exporting labels

  • (1) labels should be passed as an input to the export_for_labels() function
  • Why: crop-mask needs the ability to use the exporter for files outside the public labels.geojson
  • (2) labels should be checked for "start_date" and "end_date", and those values should be used if they are found
  • Why: The way start and end dates are computed may vary by dataset. Storing this information in the labels before exporting keeps the exporter agnostic to it
  • (3) The exporter class should have a dest_bucket parameter where the user can specify the destination GCP bucket
  • Why: Storing tifs on Google Cloud is easier than Google Drive
  • (4) Exported tifs should have a canonical name
  • Suggested: f"min_lat={min_lat}_min_lon={min_lon}_max_lat={max_lat}_max_lon={max_lon}_dates={start_date}_{end_date}_all" (where the trailing "all" indicates that all bands are being exported, not just Sentinel 2)
  • Why: Making the tifs agnostic to the datasets they are derived from makes it possible to change the underlying dataset without having to reexport all tifs (example use case: partially labeled CEO project csv -> fully labeled CEO project csv)
  • (5) export_for_labels should have a check_gcp option that checks whether the tif about to be exported already exists on Google Cloud
  • Why: There's no need to re-export a file that has already been exported. (This is like a checkpoint, but for cloud storage)
  • (6) export_for_labels should have a check_ee option that checks whether the tif about to be exported is already in the Earth Engine queue
  • Why: There's no need to export tifs that are already in the Earth Engine queue
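
A rough sketch of how points (2), (4), (5) and (6) could fit together, to make the proposal concrete. Only the file name pattern comes from this issue. The helper names (as_date, label_dates, canonical_name, names_to_export), the label keys min_lat/min_lon/max_lat/max_lon, and the two callables standing in for check_gcp and check_ee are hypothetical and are not part of the current cropharvest API.

```python
from datetime import date
from typing import Any, Callable, Iterable, List, Mapping, Tuple, Union

DateLike = Union[date, str]


def as_date(value: DateLike) -> date:
    """Accept a date object or an ISO "YYYY-MM-DD" string."""
    return value if isinstance(value, date) else date.fromisoformat(value)


def label_dates(
    label: Mapping[str, Any], default: Tuple[date, date]
) -> Tuple[date, date]:
    """Point (2): use dates stored on the label if both are present."""
    if "start_date" in label and "end_date" in label:
        return as_date(label["start_date"]), as_date(label["end_date"])
    # Assumption: the caller supplies whatever fallback the dataset needs.
    return default


def canonical_name(
    min_lat: float,
    min_lon: float,
    max_lat: float,
    max_lon: float,
    start_date: date,
    end_date: date,
) -> str:
    """Point (4): the naming pattern suggested in this issue."""
    return (
        f"min_lat={min_lat}_min_lon={min_lon}"
        f"_max_lat={max_lat}_max_lon={max_lon}"
        f"_dates={start_date}_{end_date}_all"
    )


def names_to_export(
    labels: Iterable[Mapping[str, Any]],
    default_dates: Tuple[date, date],
    exists_in_bucket: Callable[[str], bool],
    is_queued: Callable[[str], bool],
) -> List[str]:
    """Points (5) and (6): skip tifs already in the bucket or already queued."""
    names = []
    for label in labels:
        start, end = label_dates(label, default_dates)
        name = canonical_name(
            label["min_lat"], label["min_lon"],
            label["max_lat"], label["max_lon"],
            start, end,
        )
        if exists_in_bucket(name) or is_queued(name):
            continue
        names.append(name)
    return names
```

In a real exporter, exists_in_bucket would presumably query the destination bucket and is_queued the Earth Engine task list. They are plain callables here so the sketch runs without credentials.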

Exporting region

I can work on 3, 5, and 6

I'll work on removing the concept of a data_folder, as per #56 (comment)

Here's the "label management" code in crop-mask for reference: https://github.com/nasaharvest/crop-mask/blob/96b56c50e238e836cab00699e522bb28812469a0/src/ETL/dataset.py#L47