openclimatefix/Satip

Create dataset of UK-cropped satellite data from Europe dataset

devsjc opened this issue · 5 comments

Summary

There currently exists a ~40TB satellite image dataset on GCP (and on Leonardo). For ease of ML training, a more manageably sized ~100GB dataset containing purely UK image data would be beneficial. As such, we want to read in the existing dataset, crop the images down so they cover the UK alone, and write the result out as a new dataset.

Data structure

The dataset in GCP is stored in the bucket solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/v4.

The satellite dataset consists of several years of data, stored as a grid of chunks, with each chunk containing twelve 5-minute timesteps, i.e. an hour's worth of imagery.

The bounds used to specify the UK in Satip are "UK": (-16, 45, 10, 62).
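
A minimal sketch of turning those bounds into something the data can be sliced with, assuming the imagery is on the standard SEVIRI geostationary grid (the proj string, and the 9.5°E sub-satellite longitude for RSS, should be checked against the dataset's own attributes):

```python
from pyproj import Transformer

# Assumed SEVIRI RSS geostationary projection; verify against the
# dataset's own attributes before relying on it.
GEOS_PROJ = "+proj=geos +lon_0=9.5 +h=35785831 +a=6378169 +b=6356583.8 +units=m +no_defs"

lon_min, lat_min, lon_max, lat_max = (-16, 45, 10, 62)  # Satip's "UK" bounds

# Convert the lon/lat corners to geostationary x/y metres. Slicing by the
# transformed corners only gives an approximate bounding box, since a lon/lat
# rectangle is not rectangular in the geostationary projection.
to_geos = Transformer.from_crs("EPSG:4326", GEOS_PROJ, always_xy=True)
x_min, y_min = to_geos.transform(lon_min, lat_min)
x_max, y_max = to_geos.transform(lon_max, lat_max)
```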

Method (Work in progress)

  1. Pull and decompress the current data, x timesteps at a time
  2. Copy/save the metadata to avoid losing it
  3. Extract the images from the chunks and write them to the new dataset (see the sketch after this list)
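
A minimal sketch of that loop, assuming xarray over Zarr, a batch size of 12 timesteps (one source chunk), and x_geostationary/y_geostationary dim names; the destination path is a placeholder, and reading gs:// paths needs gcsfs installed:

```python
import ocf_blosc2  # noqa: F401 -- registers the blosc2 codec so the chunks decode (see gotchas)
import xarray as xr
from pyproj import Transformer

SRC = "gs://solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/v4"  # assumed store root
DST = "uk_cropped.zarr"  # hypothetical destination
BATCH = 12  # "x timesteps at a time": one source chunk, i.e. one hour

# Satip's UK lon/lat bounds converted to geostationary metres, as in the sketch above.
GEOS_PROJ = "+proj=geos +lon_0=9.5 +h=35785831 +a=6378169 +b=6356583.8 +units=m +no_defs"
to_geos = Transformer.from_crs("EPSG:4326", GEOS_PROJ, always_xy=True)
x_min, y_min = to_geos.transform(-16, 45)
x_max, y_max = to_geos.transform(10, 62)

src = xr.open_zarr(SRC)
for start in range(0, src.sizes["time"], BATCH):
    window = src.isel(time=slice(start, start + BATCH))  # step 1: pull one batch
    cropped = window.sel(                                # step 3: crop to the UK
        x_geostationary=slice(x_min, x_max),             # (dim names are assumptions;
        y_geostationary=slice(y_min, y_max),             #  y may be stored descending)
    ).load()
    for name in cropped.variables:
        cropped[name].encoding.pop("chunks", None)  # drop stale chunk encoding from the source
    if start == 0:
        cropped.to_zarr(DST, mode="w")           # first batch creates the new store
    else:
        cropped.to_zarr(DST, append_dim="time")  # later batches append along time
```

Step 2 (the metadata copy) is left out here; it is covered under the gotchas below.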

Known gotchas

  • xarray will often delete Zarr attribute files when writing new data: make sure to copy them explicitly into the new dataset
  • Decoding will require OCF's blosc2 Python library (see the sketch after this list)
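
A sketch of how both gotchas might be handled, assuming the library in question is the ocf_blosc2 package (where, as I understand it, simply importing it registers the codec) and re-applying the attributes in memory rather than copying .zattrs files around by hand:

```python
import ocf_blosc2  # noqa: F401 -- importing registers the blosc2 codec, enabling decoding
import xarray as xr

SRC = "gs://solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/v4"  # assumed store root

ds = xr.open_zarr(SRC)

# Snapshot the global and per-variable attributes up front...
global_attrs = dict(ds.attrs)
var_attrs = {name: dict(var.attrs) for name, var in ds.data_vars.items()}

# ...and re-apply them to the cropped data before every write, so that nothing
# is silently dropped from the new store.
cropped = ds.isel(time=slice(0, 12))  # stand-in for the real UK crop
cropped.attrs.update(global_attrs)
for name, attrs in var_attrs.items():
    cropped[name].attrs.update(attrs)
cropped.to_zarr("uk_cropped.zarr", mode="w")  # hypothetical destination
```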

You might want to rechunk the dataset as well, primarily in the x and y dims, to better match the spatial extent.

I seem to recall that the images in this dataset were chunked using a 4x4 grid? If x and y are each only split into 4 on the large image dataset, and these cropped images are expected to be ~100x smaller, won't one entire cropped image be significantly smaller than a single x/y chunk was previously, and hence we might not even need to chunk x/y at all?

Forgive me if/as my lack of understanding renders this question nonsensical...!

Yeah, I agree! But you might have to explicitly rechunk the data to that size.
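
Something like this, perhaps; a sketch assuming the cropped store from the steps above and the usual geostationary dim names:

```python
import xarray as xr

cropped = xr.open_zarr("uk_cropped.zarr")  # hypothetical store from the steps above

# One spatial chunk per hour of data: the whole (small) UK image fits in a single
# x/y chunk (-1 means "use the full dim"), while time keeps its 12-step chunking.
cropped = cropped.chunk({"time": 12, "y_geostationary": -1, "x_geostationary": -1})

# Drop the chunk encoding inherited from the source store, otherwise to_zarr
# can complain that the encoded chunks conflict with the new dask chunking.
for name in cropped.variables:
    cropped[name].encoding.pop("chunks", None)

cropped.to_zarr("uk_rechunked.zarr", mode="w")  # hypothetical destination
```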

@devsjc Is this complete now? I.e. has the code to do this been merged?

This could be linked to #180