Inconsistent dims of the one-hot encoded labels in the CoastTrain dataset

Question

FlorisCalkoen opened this issue 2 years ago · 9 comments

Hi all, many thanks for your great work, but unfortunately I experience some trouble reproducing the results because of an issue with the labels in the training data.

The "label" (mask) array is presented as one-hot encoded vectors per pixel, but I find that the length of the one-hot dimension is not consistent across images, i.e., the shape of the label is (height, width, n), with n somewhere in range(1, 11). As a result, the one-hot encoded labels cannot be compared or stacked across images. Please find an example below.

I downloaded the associated CoastTrain dataset (Wernette et al 2022) from https://cmgds.marine.usgs.gov/data-releases/datarelease/10.5066-P91NP87I/. Specifically, I'm looking at the S2 data available at this url: https://cmgds.marine.usgs.gov/data-releases/media/2022/10.5066-P91NP87I/bdc18f4f38004538af974aa0540d468c/Sentinel2_11_001.zip

To get the data:

mkdir -p ~/tmp/coasttrain
wget https://cmgds.marine.usgs.gov/data-releases/media/2022/10.5066-P91NP87I/bdc18f4f38004538af974aa0540d468c/Sentinel2_11_001.zip -O ~/tmp/coasttrain/data.zip
cd ~/tmp/coasttrain
unzip data.zip
rm data.zip

import pathlib
import pandas as pd
import numpy as np

coasttrain_dir = pathlib.Path.home().joinpath("tmp", "coasttrain")
metadata_fp = coasttrain_dir.joinpath("Sentinel2_11_001.csv")

metadata = pd.read_csv(metadata_fp)

# take random sample fp from df
fp = coasttrain_dir.joinpath(
    metadata.sample(1)["images"].iloc[0].split("/")[1].rsplit(".", 1)[0] + ".npz"
)

# load data into python
data = np.load(fp)
# (height, width, label) - run this a few times to see the diffs in label dimension
print(data["label"].shape)  # e.g., (320, 184, 7) or (750, 956, 8) or (468, 134, 5)

In this case, 7, 8, and 5 refer to the length of the one-hot encoded label dimension.
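To see how widespread the variation is, one could tally the one-hot depth over every .npz file in the folder instead of sampling one at a time. The following is a minimal sketch, not part of the original report: the helper name label_depth_counts is hypothetical, and the demo writes small synthetic .npz files to a temporary directory rather than reading the real download.

```python
import collections
import pathlib
import tempfile

import numpy as np


def label_depth_counts(directory):
    """Count how many .npz files have each one-hot depth (last axis of "label")."""
    counts = collections.Counter()
    for fp in sorted(pathlib.Path(directory).glob("*.npz")):
        with np.load(fp) as data:
            counts[data["label"].shape[-1]] += 1
    return counts


# Demo on synthetic files (assumption: real files store the mask under "label").
with tempfile.TemporaryDirectory() as tmp:
    for name, depth in [("a", 5), ("b", 7), ("c", 7)]:
        label = np.zeros((4, 3, depth), dtype=np.uint8)
        np.savez(str(pathlib.Path(tmp) / f"{name}.npz"), label=label)
    demo_counts = label_depth_counts(tmp)

print(dict(demo_counts))
```

On the real data, calling label_depth_counts(coasttrain_dir) should show the full distribution of label depths in one go.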

Any chance you have an updated dataset with matching labels? Or how do you get the labels for the training data?

Answer 1 · 2022-10-28T13:16:18.000Z

Hi @FlorisCalkoen ,
thanks for providing a code snippet here, I was able to exactly reproduce your issue.

You are correct, the one-hot labels change size. This occurs because each image does not necessarily contain all the classes that a labeler could use (11 in this case). If you look at the .csv file in the .zip file, you will see some helpful columns for checking this:

classes_array and num_classes are the names and the number of classes available to the annotator for the image
classes_present_array and classes_present_integer list the names and number of classes used by the annotator for that particular image.

This results in the varying size of the one-hot labels dimension for all the labels.npy files.
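As a rough illustration of how the two class lists could be used to bring every label to the same depth, here is a minimal sketch. It is not the Doodler implementation. It assumes (unverified) that channel k of a label corresponds to the k-th name in classes_present_array, and the helper name expand_one_hot and the toy class lists are hypothetical.

```python
import numpy as np


def expand_one_hot(label, present, all_classes):
    """Scatter the channels of `label` into a fixed-depth array ordered like `all_classes`.

    Assumption: channel k of `label` belongs to the class named present[k].
    """
    if label.shape[-1] != len(present):
        raise ValueError("label depth does not match the number of present classes")
    full = np.zeros(label.shape[:2] + (len(all_classes),), dtype=label.dtype)
    for k, name in enumerate(present):
        full[..., all_classes.index(name)] = label[..., k]
    return full


# Toy example: 4 possible classes, only 2 used in this image.
all_classes = ["water", "whitewater", "sediment", "other"]
present = ["water", "sediment"]

label = np.zeros((2, 3, 2), dtype=np.uint8)
label[..., 0] = 1      # every pixel is "water" ...
label[0, 0] = [0, 1]   # ... except the top-left pixel, which is "sediment"

full = expand_one_hot(label, present, all_classes)
integer_label = np.argmax(full, axis=-1)  # one class index per pixel

print(full.shape)
print(integer_label)
```

After this step every image has the same last dimension, so the class index means the same thing in every label. Whether the channel order really follows classes_present_array should be checked against the overlay images before relying on it.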

We note in the CoastTrain data summary (on the landing page) that the easiest way to wrangle the data into ML format is to use Doodler, and I recommend this approach. This will get all the labels correctly encoded. I just verified that it works.

So, can you follow this workflow and see how it goes:

Set up doodler on your machine
activate the doodler conda environment
navigate back into the doodler directory
navigate into /utils/
run gen_images_and_labels.py, and select the folder of CoastTrain .npz files you want to process (~/tmp/coasttrain)
Once the script is done, you should see several new folders in ~/tmp/coasttrain. Two will be 'images' and 'labels'; the files in them will be named correctly and the labels will be correctly encoded. You can look at the Overlays folder to check the result.
copy/paste those two folders into your Gym directory, and you will be good to go.

this is an 'official' way to get images and labels from Doodler (CoastTrain is from Doodler) -> gym..

https://doodleverse.github.io/dash_doodler/docs/tutorial-extras/next-steps

some examples of the overlay files, which are not used in gym but are good for visually checking the results:

Answer 2 · 2022-10-30T17:20:20.000Z

Ok, thank you for suggesting that workflow! But.. unfortunately I now have an issue with pydensecrf.. Have you experienced that earlier?

As a workaround I'll try to take some useful snippets from that gen_images_and_labels.py. Any chance you also have a snippet to associate the geographic coordinates from the metadata with the numpy arrays? I.e., numpy arrays + metadata to rioxarray dataset.

Although the package issue should probably be a separate issue on the Doodler page, please find some info below. The errors mostly indicate that the package is not available on the conda channel. I tried to install it in several ways:

conda create --name dashdoodler python=3.8
conda install -c conda-forge pydensecrf cairo cairosvg scikit-learn scikit-image psutil dash flask-caching requests pandas matplotlib ipython tqdm  # PackagesNotFoundError: The following packages are not available from current channels:
  - pydensecrf

conda create --name dashdoodler pydensecrf # PackagesNotFoundError: The following packages are not available from current channels:
  - pydensecrf

conda create --name dashdoodler python=3.8
pip install pydensecrf

results in:

❯ pip install pydensecrf
Collecting pydensecrf
  Using cached pydensecrf-1.0rc3.tar.gz (1.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pydensecrf
  Building wheel for pydensecrf (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [255 lines of output]
  
      fatal error: too many errors emitted, stopping now [-ferror-limit=]
      9 warnings and 20 errors generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> pydensecrf

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Answer 3 · 2022-10-30T18:29:21.000Z

@ebgoldstein @dbuscombe-usgs another tiny thing I just noticed:

The classes_array contains different arrays for the same dataset. See example below:

import pathlib

import pandas as pd


def tokenize(s):
    return [i.strip() for i in s.replace("'", "").split(",")]


coasttrain_dir = pathlib.Path.home().joinpath(
    "data", "src", "coasttrain", "Sentinel2_11_001"
)
metadata_fp = coasttrain_dir.joinpath("Sentinel2_11_001.csv")
metadata = pd.read_csv(metadata_fp)

# two different class labels present in dataset
print(len(metadata["classes_array"].unique())) ## 2
classes1, classes2 = metadata["classes_array"].unique()

# so there are 5 rows with other class labels than other 335
print(metadata.loc[metadata["classes_array"] == classes1].shape)
print(metadata.loc[metadata["classes_array"] == classes2].shape)

# the classes that differ are:
cl1, cl2 = tokenize(classes1), tokenize(classes2)
print(list(set(cl1) - set(cl2)))
print(list(set(cl2) - set(cl1)))

## ['cloud', 'vegetated_surface', 'other_natural_terrain']
## ['terrestrial_vegetation', 'marsh_vegetation', 'other_bare_natural_terrain']

I think I'd better just drop those 5 rows, right :)?

Answer 4 · 2022-10-31T12:32:49.000Z

Hi @FlorisCalkoen -

i moved the pydensecrf issue to the doodler repository: Doodleverse/dash_doodler#45... Can we continue that discussion in Doodler?
Yes I see those 5 examples mixed in with the 340 others:

the classes array for those 5:

'water', 'whitewater', 'sediment', 'other_bare_natural_terrain', 'marsh_vegetation', 'terrestrial_vegetation', 'agricultural', 'development', 'nodata', 'unusual', 'unknown'

and the typical classes array is:
'water', 'whitewater', 'sediment', 'other_natural_terrain', 'vegetated_surface', 'agricultural', 'development', 'cloud', 'nodata', 'unusual', 'unknown'

I think in the short term, yes @FlorisCalkoen , i recommend dropping those rows.
in the long term, @dbuscombe-usgs what do you think?
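For the short-term fix, dropping the odd rows could look roughly like the sketch below. It keeps only the rows whose classes_array string matches the most common one. The toy DataFrame stands in for the real Sentinel2_11_001.csv, and its values are made up for illustration; only the column names come from this thread.

```python
import pandas as pd

# Toy stand-in for the metadata table (values are illustrative only).
metadata = pd.DataFrame(
    {
        "images": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "classes_array": [
            "'water', 'cloud'",
            "'water', 'cloud'",
            "'water', 'marsh_vegetation'",
            "'water', 'cloud'",
        ],
    }
)

# The most frequent class list is treated as the "typical" one.
majority = metadata["classes_array"].mode().iloc[0]

# Keep only rows that use the typical class list.
kept = metadata.loc[metadata["classes_array"] == majority].reset_index(drop=True)

print(len(metadata) - len(kept), "rows dropped")
```

The same two lines (mode, then boolean filter) should work unchanged on the DataFrame returned by pd.read_csv(metadata_fp), provided the class lists are stored as identical strings for identical class sets.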

(Also, @dbuscombe-usgs , should we transfer this issue to the CoastTrain repository? https://github.com/CoastTrain/CoastTrain)

Answer 5 · 2022-10-31T12:37:13.000Z

Ok, thanks, I'll keep you posted when I continue with the coasttrain dataset. For now I'll remove that pip install output from my previous comment to keep this issue a bit more readable :)

Answer 6 · 2022-10-31T12:38:46.000Z

you can keep it here, it's totally fine (and might even be useful). We will also be able to see it by looking at past versions of the issue, i.e., using the 'edited' dropdown.

Answer 7 · 2022-10-31T16:39:37.000Z

@FlorisCalkoen thanks for reporting the issue with the mismatch in labels for those 5 samples. I will take a look at this and see what is going on in the next few days

Answer 8 · 2023-02-27T13:59:44.000Z

@dbuscombe-usgs, can you try to 'Transfer' this issue to https://github.com/CoastTrain/CoastTrain? (I would do it, but I don't have permissions in CT.) If that doesn't work, perhaps we could close this and open a new issue in CT that references this issue? Just to get it out of Gym.

Answer 9 · 2023-02-27T15:44:49.000Z

I can only transfer to other repos in this org. I can close and reopen in CT?