nasaharvest/cropharvest

Newbie question: How to open satellite image associated to the dataset?

kolrocket opened this issue · 2 comments

Hello! I'm relatively new to the Python language and to EO analysis. I found CropHarvest an interesting package to work with. However, I'm having some issues in understanding where things are.

I know datalabels.geojson is where datasets are and by subsetting through ['lem-brazil'] I can access data specificaly from this source and can plot the points where each sample is through plot. I would like to know how to plot the associated (NDVI) satellite image from each data sample, like Figure 3 from the CropHarvest paper (CropHarvest: a global satellite dataset for crop type
classification). Is it possible or did I misunderstand and it is not possible to work with the satellite image to work with time series images? Also, for example, meteorological data available in the same function? My guess is that everything is in cropharvest.datasets but I'm struggling to find it out.

Thank you very much.

Hi @kolrocket ,

Thanks for your interest in the dataset 😃 CropHarvest is a dataset of pixel timeseries, not images. This means that unlike Figure 3 (which is just illustrative), we are measuring (e.g.) NDVI for a single pixel over time (as opposed to NDVI over a large area for a single timestep).

If you did want to retrieve NDVI for a particular row in the labels geosjon, here is how you might go about doing it:

Firstly, I will load the geosjon. I'll do it through a CropHarvestLabels object, which is just a wrapper around the geojson but provides some nice utility functions.

>>> from cropharvest.datasets import CropHarvestLabels
>>>
>>> labels = CropHarvestLabels("cropharvest/data")
>>> labels_geojson = labels.as_geojson()
>>> labels_geojson.head()
  harvest_date planting_date  ... is_test                                           geometry
0         None          None  ...   False  POLYGON ((37.08252 10.71274, 37.08348 10.71291...
1         None          None  ...   False  POLYGON ((37.08721 10.72398, 37.08714 10.72429...
2         None          None  ...   False  POLYGON ((37.08498 10.71371, 37.08481 10.71393...
3         None          None  ...   False  POLYGON ((37.09021 10.71320, 37.09014 10.71341...
4         None          None  ...   False  POLYGON ((37.08307 10.72160, 37.08281 10.72197...

[5 rows x 13 columns]

Now, I can use the labels object to retrieve the filepath of the processed satellite data for each row in the labels geojson:

>>> path_to_file = labels._path_from_row(labels_geojson.iloc[0])

This processed satellite data is stored as h5py files, so I can load it up as follows:

>>> import h5py
>>> h5py_file = h5py.File(path_to_file, "r")
>>> x = h5py_file.get("array")[:]
>>> x.shape
(12, 18)

The shape of x represents 12 timesteps and 18 bands. To retrieve the band I am interested in:

>>> from cropharvest.bands import BANDS
>>> x[:, BANDS.index("NDVI")]
array([0.28992072, 0.28838343, 0.26833579, 0.22577633, 0.27138986,
       0.06584114, 0.498998  , 0.50147203, 0.50437743, 0.44326343,
       0.33735849, 0.28375967])

These are 12 NDVI values, corresponding to the 12 months captured in this timeseries. To find out exactly which month each timestep represents, I can do

>>> labels_geojson.iloc[0].export_end_date
'2021-02-01T00:00:00'

Wich tells me that the last timestep represents January 2021. I can work backwards from there.

I am not sure if this satisfies your use case - once again, this is a dataset of pixel timeseries, not images. In any case, I hope it helps!

Oh, sorry for the confusion! It is indeed pixel time series what I am working with :).
Thank you very much, it will help me a lot.