openclimatefix/Satip

Create dataset of UK-cropped satellite data from Europe dataset

devsjc opened this issue · 5 comments

Summary

There currently exists a ~40TB satellite image dataset on GCP (and on Leonardo). For ease of ML training, a more manageably sized ~100GB dataset containing purely UK image data would be beneficial. As such, we want to read in the existing dataset, crop the images down so they cover the UK alone, and write the result out as a new dataset.

Data structure

The dataset in GCP is stored in the bucket solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/v4.

The satellite dataset consists of several years of data, stored as a grid of chunks, with each chunk containing twelve 5-minute timesteps, i.e. an hour's worth of imagery.

The bounds used to specify the UK in Satip are "UK": (-16, 45, 10, 62).
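
A minimal sketch of turning those bounds into something the data can be sliced with, assuming the imagery is on the standard SEVIRI geostationary grid (the proj string, and the 9.5°E sub-satellite longitude for RSS, should be checked against the dataset's own attributes):

```python
from pyproj import Transformer

# Assumed SEVIRI RSS geostationary projection; verify against the
# dataset's own attributes before relying on it.
GEOS_PROJ = "+proj=geos +lon_0=9.5 +h=35785831 +a=6378169 +b=6356583.8 +units=m +no_defs"

lon_min, lat_min, lon_max, lat_max = (-16, 45, 10, 62)  # Satip's "UK" bounds

# Convert the lon/lat corners to geostationary x/y metres. Slicing by the
# transformed corners only gives an approximate bounding box, since a lon/lat
# rectangle is not rectangular in the geostationary projection.
to_geos = Transformer.from_crs("EPSG:4326", GEOS_PROJ, always_xy=True)
x_min, y_min = to_geos.transform(lon_min, lat_min)
x_max, y_max = to_geos.transform(lon_max, lat_max)
```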

Method (Work in progress)

  1. Pull and decompress the current data, x timesteps at a time
  2. Copy/save the metadata to avoid losing it
  3. Extract the images from the chunks and write them to the new dataset (see the sketch after this list)
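
A minimal sketch of that loop, assuming xarray over Zarr, a batch size of 12 timesteps (one source chunk), and x_geostationary/y_geostationary dim names; the destination path is a placeholder, and reading gs:// paths needs gcsfs installed:

```python
import ocf_blosc2  # noqa: F401 -- registers the blosc2 codec so the chunks decode (see gotchas)
import xarray as xr
from pyproj import Transformer

SRC = "gs://solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/v4"  # assumed store root
DST = "uk_cropped.zarr"  # hypothetical destination
BATCH = 12  # "x timesteps at a time": one source chunk, i.e. one hour

# Satip's UK lon/lat bounds converted to geostationary metres, as in the sketch above.
GEOS_PROJ = "+proj=geos +lon_0=9.5 +h=35785831 +a=6378169 +b=6356583.8 +units=m +no_defs"
to_geos = Transformer.from_crs("EPSG:4326", GEOS_PROJ, always_xy=True)
x_min, y_min = to_geos.transform(-16, 45)
x_max, y_max = to_geos.transform(10, 62)

src = xr.open_zarr(SRC)
for start in range(0, src.sizes["time"], BATCH):
    window = src.isel(time=slice(start, start + BATCH))  # step 1: pull one batch
    cropped = window.sel(                                # step 3: crop to the UK
        x_geostationary=slice(x_min, x_max),             # (dim names are assumptions;
        y_geostationary=slice(y_min, y_max),             #  y may be stored descending)
    ).load()
    for name in cropped.variables:
        cropped[name].encoding.pop("chunks", None)  # drop stale chunk encoding from the source
    if start == 0:
        cropped.to_zarr(DST, mode="w")           # first batch creates the new store
    else:
        cropped.to_zarr(DST, append_dim="time")  # later batches append along time
```

Step 2 (the metadata copy) is left out here; it is covered under the gotchas below.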

Known gotchas

  • xarray will often delete Zarr attribute files when writing new data: make sure to copy them explicitly into the new dataset
  • Decoding will require OCF's blosc2 Python library (see the sketch after this list)
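
A sketch of how both gotchas might be handled, assuming the library in question is the ocf_blosc2 package (where, as I understand it, simply importing it registers the codec) and re-applying the attributes in memory rather than copying .zattrs files around by hand:

```python
import ocf_blosc2  # noqa: F401 -- importing registers the blosc2 codec, enabling decoding
import xarray as xr

SRC = "gs://solar-pv-nowcasting-data/satellite/EUMETSAT/SEVIRI_RSS/v4"  # assumed store root

ds = xr.open_zarr(SRC)

# Snapshot the global and per-variable attributes up front...
global_attrs = dict(ds.attrs)
var_attrs = {name: dict(var.attrs) for name, var in ds.data_vars.items()}

# ...and re-apply them to the cropped data before every write, so that nothing
# is silently dropped from the new store.
cropped = ds.isel(time=slice(0, 12))  # stand-in for the real UK crop
cropped.attrs.update(global_attrs)
for name, attrs in var_attrs.items():
    cropped[name].attrs.update(attrs)
cropped.to_zarr("uk_cropped.zarr", mode="w")  # hypothetical destination
```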

You might want to rechunk the dataset as well, primarily in the x and y dims, to better match the spatial extent.

I seem to recall that the images in this dataset were chunked using a 4x4 grid? If x and y are each only split into 4 on the large image dataset, and these cropped images are expected to be ~100x smaller, won't one entire cropped image be significantly smaller than a single x/y chunk was previously, and hence we might not even need to chunk x/y at all?

Forgive me if/as my lack of understanding renders this question nonsensical...!

Yeah, I agree! But you might have to explicitly rechunk the data to that size.
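
Something like this, perhaps; a sketch assuming the cropped store from the steps above and the usual geostationary dim names:

```python
import xarray as xr

cropped = xr.open_zarr("uk_cropped.zarr")  # hypothetical store from the steps above

# One spatial chunk per hour of data: the whole (small) UK image fits in a single
# x/y chunk (-1 means "use the full dim"), while time keeps its 12-step chunking.
cropped = cropped.chunk({"time": 12, "y_geostationary": -1, "x_geostationary": -1})

# Drop the chunk encoding inherited from the source store, otherwise to_zarr
# can complain that the encoded chunks conflict with the new dask chunking.
for name in cropped.variables:
    cropped[name].encoding.pop("chunks", None)

cropped.to_zarr("uk_rechunked.zarr", mode="w")  # hypothetical destination
```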

@devsjc Is this complete now? I.e. has the code to do this been merged?

This could be linked to #180