npucino/sandpyper

[JOSS REVIEW] Example notebook minor suggestions

Closed this issue · 3 comments

Hi @npucino,

Great work creating these example notebooks; it was pretty easy for me to work through all three and understand how everything in the package works.

My comments below are mainly minor suggestions about providing some additional explanation for anyone using these notebooks. Comments are for openjournals/joss-reviews#3666 (comment). I've created two other issues for some trickier problems (#7, #8).

1 - Introduction and data preparation.ipynb

  1. Can you define the term "LoD" before you use it here? I had to scroll down a bit before I got the definition.

    "* __ProfileSet__ class: manages the elevation and colour data extraction and LoD analysis\n",

  2. When talking about the demo data, I think it should be clearer that these are the inputs that sandpyper needs (and not the output that sandpyper generates). It would also be helpful to briefly describe how each input is typically created, i.e. orthos and DEMs are stitched together using photogrammetry software (Pix4D), while transect and mask files are manually created in GIS software (QGIS, ArcGIS). You've provided examples later in the notebook; it'd be good to just add a summary sentence or two here as well.

    "The __demo_data.rar__ archive includes all the data you need to get you started and understand how Sandpyper works. This include:\n",

  3. We need to change a lot of path variables, depending on where we've saved the test_data. I suggest defining the parent path first and constructing the other paths from it. Something like:

from pathlib import Path

test_data_folder = Path(r"C:\Users\Chris\Desktop\sandpyper\examples\test_data")

ortho_path = test_data_folder / "orthos_1m" / "leo_20180920_ortho_resampled_1m.tif"
watermasks_path = test_data_folder / "clean" / "watermasks.gpkg"
shoremasks_path = test_data_folder / "clean" / "shoremasks.gpkg"
label_corr_path = test_data_folder / "clean" / "label_corrections.gpkg"
transect_path = test_data_folder / "transects" / "leo_transects.gpkg"
transect_lod_path = test_data_folder / "lod_transects" / "leo_lod_transects.gpkg"
# use str(ortho_path) etc. if a function requires a plain string path

"ortho_path=r\"C:\\my_packages\\sandpyper\\examples\\test_data\\orthos_1m\\leo_20180920_ortho_resampled_1m.tif\"\n",
"watermasks_path=r\"C:\\my_packages\\sandpyper\\examples\\test_data\\clean\\watermasks.gpkg\"\n",
"shoremasks_path=r\"C:\\my_packages\\sandpyper\\examples\\test_data\\clean\\shoremasks.gpkg\"\n",
"label_corr_path=r\"C:\\my_packages\\sandpyper\\examples\\test_data\\clean\\label_corrections.gpkg\"\n",
"transect_path=r\"C:\\my_packages\\sandpyper\\examples\\test_data\\transects\\leo_transects.gpkg\"\n",
"transect_lod_path=r\"C:\\my_packages\\sandpyper\\examples\\test_data\\lod_transects\\leo_lod_transects.gpkg\"\n",

  1. Can you double check that the leo features in the label corrections file (examples/test_data/clean/label_corrections.gpkg) are in the correct location? It looks like a CRS mistake when I try to load it in QGIS. The mar features are in the correct location though.

    "<font size=\"5\"><center> <b>Label correction file</b></center></font>\n",

  2. I think an additional sentence explaining why points are being classified here would be good; a minimal sketch of what a correction entry looks like is included after this quote, for context. Also update the reference to example notebook 2.

    "Label correction files (geopackages or shapefiles) are digitised over points which have cluster labels (assigned by KMeans algorithm) which we are not totally happy with. The attribute __target_label_k__ specifies which label k will be affected by the correction, leaving untouched all other points falling within the polygon but having different label k. This is useful to fine-tune the point classification, as it is covered in the notebook __AAAAAAA__. If you want to apply the correction to all the points, regardless of the label k, just assign 999 to this field.\n",

2 - Profiles extraction, unsupervised sand labelling and cleaning.ipynb

  1. Again, it's easier to use the pathlib.Path approach here to define the parent folder rather than changing each variable (a short sketch follows this list).

    "dirNameDSM=r'C:\\my_packages\\sandpyper\\examples\\test_data\\dsm_1m'\n",
    "dirNameOrtho=r'C:\\my_packages\\sandpyper\\examples\\test_data\\orthos_1m'\n",
    "dirNameTrans=r'C:\\my_packages\\sandpyper\\examples\\test_data\\transects'\n",

  2. Add the command to export the dataframe to csv: P.profiles.to_csv('profiles.csv')
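For item 1, the same pathlib strategy as above would apply (test_data_folder shown with a hypothetical location):

from pathlib import Path

test_data_folder = Path(r"C:\Users\Chris\Desktop\sandpyper\examples\test_data")

dirNameDSM = Path(test_data_folder, "dsm_1m")
dirNameOrtho = Path(test_data_folder, "orthos_1m")
dirNameTrans = Path(test_data_folder, "transects")
# use str(dirNameDSM) etc. if a downstream function expects a string path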

3 - Profile dynamics.ipynb

(I couldn't link to line numbers in this file - perhaps the notebook is too big for GitHub?)

  1. Section: "Plot: LoD normality check": The normality check plots show that the distributions fail the normality checks. Can you talk about what (if any) implications this has?

  2. Section: "Plot: multi-scale MECS and volumetrics": The plots for D.plot_transect_mecs(location='mar',tr_id=10) and D.plot_transect_mecs(location='mar', lod=D.lod_df, tr_id=10) seem fairly similar (though I see some variation in the shaded area). Could you add a couple of sentences discussing the difference between using LoD and not?

Thanks @chrisleaman for this review! Very useful and constructive suggestions that I will implement very soon.

For now I just want to reply to this point:

Can you double check that the leo features in the label corrections file (examples/test_data/clean/label_corrections.gpkg) are in the correct location? It looks like a CRS mistake when I try to load it in QGIS. The mar features are in the correct location though.

Unfortunately, storing all locations (multiple locations with multiple CRSs) in one single GeoPandas GeoDataFrame allows only one CRS. Thus, I digitised all correction polygons in EPSG 32754, which is the correct one for Marengo but the wrong one for St. Leonards (EPSG 32755). To handle this properly, we would need either:

  1. one GeoDataFrame per location, each with that location's actual CRS, or
  2. a GeoDataFrame that stores multiple CRSs, say, some rows in EPSG 32754 and others in EPSG 32755, which is not possible to my knowledge.

The CRS dictionary takes care of the imagery coordinates and of all overlay operations that need matching CRSs, but the single-CRS GeoDataFrame is a limitation of current GeoPandas, unless there is a multi-CRS GeoDataFrame variant I am not aware of!
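For illustration, option 1 could look like the sketch below (hypothetical file names; the EPSG codes come from the CRS dictionary):

import geopandas as gpd

# Hypothetical per-location correction files, each digitised in its own CRS
crs_dict = {"mar": 32754, "leo": 32755}
corrections = {
    "mar": gpd.read_file("label_corrections_mar.gpkg"),
    "leo": gpd.read_file("label_corrections_leo.gpkg"),
}

# Reproject a location's corrections to match the imagery CRS before any overlay
leo_corr = corrections["leo"].to_crs(epsg=crs_dict["leo"])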

I must make this clear somewhere; it's very important. Thanks for reporting it!

Here are the suggestions as implemented, currently in the conda_only branch, where I am concentrating all the JOSS modifications.

Sorry for the rudimentary image snapshots; I don't know how to reference notebook lines as links and couldn't find a solution in five minutes of googling.

1.1 Earlier definition of LoD

I added its definition directly in the introduction.


1.2 Better definition of input data

I added a better explanation of each input, making it clear that these are inputs and not what Sandpyper produces.


1.3 & 2.1 Implementation of pathlib Path to manage paths

I used the strategy you suggested and it is way better.


1.4 CRS misplacement

Explained here: #9 (comment)

1.5 Additional explanation of point classification importance and updated link to notebook 2

I wrote an extra paragraph where I introduce the challenges and why I coded it this way (for now), and I updated the link to notebook 2.


2.2 Save-to-CSV command

Added.

3.1 Normality check importance and consequences explanation

I explained in more detail why it is important to test the normality of the error distributions and how the result affects the choice of the most suitable error metric, with a reference to one of the key papers explaining this concept for this type of data.
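As a rough illustration of the reasoning (not the notebook code; the error array below is synthetic), the decision could be sketched like this:

import numpy as np
from scipy import stats

# Synthetic stand-in for the LoD elevation errors over stable areas
errors = np.random.default_rng(0).normal(0.0, 0.05, 500)

# Shapiro-Wilk test: a small p-value suggests the errors are not normally distributed
stat, p = stats.shapiro(errors)

if p < 0.05:
    # Non-normal errors: robust metrics such as the NMAD are preferable
    nmad = 1.4826 * np.median(np.abs(errors - np.median(errors)))
    print(f"Non-normal errors, NMAD = {nmad:.3f} m")
else:
    # Approximately normal errors: standard deviation / RMSE are appropriate
    print(f"Approximately normal errors, std = {errors.std():.3f} m")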


3.2 Better explanation of plot_transect_mecs with and without LoD

I added a few words on the importance of using LoD despite the difference being small.


Cheers

Hi @npucino, great work with these changes! I really like the additional guidance and discussion you've provided, especially in notebook 3. It'll be particularly helpful for those who aren't familiar with these LoD concepts (like me!).

This issue was addressed with the following commits: