npucino/sandpyper

[JOSS REVIEW] Cannot run P.cleanit()

Opened this issue · 8 comments

Comments are for openjournals/joss-reviews#3666 (comment).

Following the notebook 2 - Profiles extraction, unsupervised sand labelling and cleaning.ipynb, I get a MergeError (Merge keys are not unique in right dataset; not a one-to-one merge) when running P.cleanit.

"P.cleanit(l_dicts=l_dicts,\n",
" watermasks_path=watermasks_path,\n",
" shoremasks_path=shoremasks_path,\n",
" label_corrections_path=label_corrections_path)"

Looks like the validation when merging the dataframes is failing, potentially because of multiple matches in the right dataframe?

sandpyper/sandpyper/common.py

Lines 2268 to 2269 in ce542c6

classed_df_finetuned=to_clean_classified.merge(right=to_update_finetune.loc[:,['point_id','finetuned_label']], # Left Join
                                               how='left', validate='one_to_one')

I saved the to_clean_classified and to_update_finetune dataframes before the error is thrown:
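In case it helps the debugging, a quick way to see which keys break the validate='one_to_one' check is to look for duplicated point_id values in the right dataframe. A minimal sketch, assuming the two frames were pickled (the file names are just placeholders):

import pandas as pd

# Load the two frames saved just before the failing merge
# (pickle paths are placeholders for wherever they were saved).
to_clean_classified = pd.read_pickle("to_clean_classified.pkl")
to_update_finetune = pd.read_pickle("to_update_finetune.pkl")

# validate='one_to_one' raises MergeError if the merge key is duplicated
# on either side; here the duplicates are in the right frame.
dupes = to_update_finetune[to_update_finetune.point_id.duplicated(keep=False)]
print(dupes[["point_id", "finetuned_label"]].sort_values("point_id"))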

Thanks @chrisleaman for reporting this.
By following what I wrote here you should be able to make it work.

But I still need to figure out this error and why I get duplicates. I think it is related to the fact that some label_k values are present in the polygons used to correct the data but not in the actual profile data.

Ok, I found out why some points got duplicated when the number of k was set to 10 for all the surveys clustered with KMeans. It was a tricky one, but luckily you spotted it.
Basically it was an overlap issue with the label correction polygons. Let me explain.

Label correction polygons can overlap because they only operate on a survey basis (raw date and location) and only target those points within their boundaries whose label_k attribute equals the polygon's target_label_k attribute.
However, if polygons targeting the same label_k overlap and their overlap area includes points with that target_label_k, the new class MUST BE THE SAME, otherwise it is ambiguous which new class a point in that shared area should take!

What happened was that I digitised the label correction polygons by looking at the label_k values returned by KMeans with the correct number of k (the opt_k dictionary). My polygons overlapped, which was fine, because no overlapping area contained points with the same label_k but different new classes. But when running with the wrong number of k, the label_k of the points changed and the polygons suddenly became irrelevant! Then, in one survey (leo_20190731), there was one point with label_k=6 which happened to sit right in an overlap area of two label correction polygons with target_label_k=6 but two different new_class attributes (sand and veg). This caused the duplication of that point, the non-uniqueness of its ID, and the failure of the merge, all the way to the error.
See the image below; that point caused the error, as it sits right in the overlap of the yellow and purple polygons.

[image: the point sitting in the overlap of the yellow and purple label correction polygons]

When using the correct number of k for each survey, the point above is still in the overlap area, but its label_k is not 6, so these polygons do not operate on it and no error arises.
Of course, this wouldn't happen if the user did not rerun KMeans with a different random seed or a different number of k per survey!
I hope I explained it well.
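Just to illustrate the mechanism with a toy example (this is plain geopandas, not the sandpyper internals, and the point_id, coordinates and CRS are made up): one point falling inside two correction polygons that target its label_k comes back twice from a spatial join, and those duplicated point_id rows with conflicting classes are exactly what breaks the one-to-one merge downstream.

import geopandas as gpd
from shapely.geometry import Point, Polygon

# One profile point with label_k=6 sitting in the overlap of two correction
# polygons that both target label_k=6 but assign different new classes.
pts = gpd.GeoDataFrame(
    {"point_id": ["leo_20190731_0001"], "label_k": [6]},
    geometry=[Point(1.0, 1.0)], crs="EPSG:32755")

polys = gpd.GeoDataFrame(
    {"target_label_k": [6, 6], "new_class": ["sand", "veg"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
              Polygon([(0.5, 0.5), (3, 0.5), (3, 3), (0.5, 3)])],
    crs="EPSG:32755")

# The point matches both polygons, so it comes back twice with conflicting
# new_class values: a duplicated point_id, and a later merge with
# validate='one_to_one' raises MergeError. (geopandas >= 0.10 for 'predicate')
matches = gpd.sjoin(pts, polys, predicate="within")
print(matches[["point_id", "label_k", "new_class"]])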

Here is what I did to correct this error.
Instead of simply writing a warning and relying on users not to overlap label correction polygons that target the same label_k but assign different new classes, I decided to write a function that checks for these inconsistencies after polygon creation, together with clearly stating this rule in the DOC. This allows users to be more flexible, because when the number of points is large (if, for example, you sample a point every 10 cm along a transect), points can get dense and it is quicker to just let the polygons overlap.

Here is the function that takes care of that and checks the aforementioned rule:

from itertools import combinations as comb

import pandas as pd
import geopandas as gpd
from geopandas import overlay

# coords_to_points is the sandpyper helper that converts a coordinate string
# into a shapely Point, already available in common.py.


def check_overlaps_poly_label(label_corrections, profiles, crs):
    """
    Check that overlapping areas of label correction polygons targeting the same label_k in the same survey but assigning different new classes do not contain points that would be affected by those polygons. Raises a ValueError if they do.

    Args:
        label_corrections (gpd.GeoDataFrame): GeoDataFrame of the label correction polygons.
        profiles (pd.DataFrame, gpd.GeoDataFrame): DataFrame or GeoDataFrame of the extracted elevation profiles.
        crs (dict, int): Either an EPSG code (int) or a dictionary. If a dictionary, it must store location codes as keys and crs information as values, in dictionary form (example: {'init': 'epsg:4326'}).
    """
    for loc in label_corrections.location.unique():
        for raw_date in label_corrections.query(f"location=='{loc}'").raw_date.unique():
            for target_label_k in label_corrections.query(
                    f"location=='{loc}' and raw_date=={raw_date}").target_label_k.unique():

                date_labelk_subset = label_corrections.query(
                    f"location=='{loc}' and raw_date=={raw_date} and target_label_k=={int(target_label_k)}")

                # if more than one polygon targets the same label_k, check whether any pair overlaps
                if len(date_labelk_subset) > 1:

                    for i, z in comb(range(len(date_labelk_subset)), 2):
                        intersection_gdf = overlay(date_labelk_subset.iloc[[i]],
                                                   date_labelk_subset.iloc[[z]],
                                                   how='intersection')

                        if not intersection_gdf.empty:

                            # check if the overlapping polygons assign different new classes
                            if any(intersection_gdf.new_class_1 != intersection_gdf.new_class_2):

                                # if the overlap area assigns different classes, check whether it contains
                                # points: if it does, raise an error as the assignment is ambiguous and the
                                # polygons must be corrected by the user

                                pts = profiles.query(f"location=='{loc}' and raw_date=={raw_date}").copy()

                                # GeoDataFrame must be checked first, as it is a subclass of pd.DataFrame
                                if isinstance(pts, gpd.GeoDataFrame):
                                    pts_gdf = pts
                                elif isinstance(pts, pd.DataFrame):
                                    pts['coordinates'] = pts.coordinates.apply(coords_to_points)
                                    if isinstance(crs, dict):
                                        pts_gdf = gpd.GeoDataFrame(pts, geometry='coordinates', crs=crs[loc])
                                    elif isinstance(crs, int):
                                        crs_adhoc = {'init': f'epsg:{crs}'}
                                        pts_gdf = gpd.GeoDataFrame(pts, geometry='coordinates', crs=crs_adhoc)
                                else:
                                    raise ValueError(
                                        f"profiles must be either a Pandas DataFrame or a GeoPandas GeoDataFrame. Found {type(profiles)} type instead.")

                                # flag every point that falls fully within the overlap area
                                fully_contains = [intersection_gdf.geometry.contains(point_geom).any()
                                                  for point_geom in pts_gdf.geometry]

                                if True in fully_contains:
                                    idx_true = [j for j, x in enumerate(fully_contains) if x]
                                    raise ValueError(
                                        f"There are {len(idx_true)} points in the overlap area of two label correction polygons "
                                        f"(location: {loc}, raw_date: {raw_date}, target_label_k = {target_label_k}) which assign "
                                        f"two different classes: {intersection_gdf.new_class_1.iloc[0], intersection_gdf.new_class_2.iloc[0]}. "
                                        "This doesn't make sense, please correct your label correction polygons. "
                                        "You can have overlapping polygons which act on the same target label k, "
                                        "but if they overlap points with such target_label_k, then they MUST assign the same new class.")

    print("Check of label correction polygons overlap inconsistencies terminated successfully")

Now users can create label correction polygons which target the same label_k without stressing too much about being super precise and avoiding overlaps. If the overlap area of two polygons which target the same label_k but assign different classes contains a point with that target_label_k, this function raises a ValueError pointing to the location, the date and the target_label_k that created the problem, allowing the user to quickly return to the GIS and correct the polygons. The message looks like this one:

ValueError: There are 1 points in the overlap area of two label correction polygons (location: leo, raw_date: 20190731, target_label_k = 6) which assign two different classes: ('sand', 'veg'). This doesn't make sense, please correct your label correction polygons. You can have overlapping polygons which act on the same target label k, but if they overlap points with such target_label_k, then they MUST assign the same new class.

I am now updating the doc with this warning, then I will place this check directly in the cleanit method.
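Just to sketch where the call would go (this is only indicative: the ProfileSet class name matches the tutorial object P, but self.profiles and self.crs_dict_string are placeholder attribute names, not final code):

import geopandas as gpd

class ProfileSet:  # sketch only, not the real class body
    def cleanit(self, l_dicts, watermasks_path=None, shoremasks_path=None,
                label_corrections_path=None):
        if label_corrections_path is not None:
            label_corrections = gpd.read_file(label_corrections_path)
            # fail fast if overlapping polygons assign conflicting classes
            # to points that actually carry the targeted label_k
            check_overlaps_poly_label(label_corrections, self.profiles, self.crs_dict_string)
        # ... watermasking, shoremasking and label finetuning would follow ...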

Looks like a tricky error to find! Glad you managed to sort it out - a warning to the user, like you suggested, should work well.

Tricky but important. Now I am struggling to run the function in the GitHub environment while testing, even though it works perfectly locally. As soon as I find what is causing this error, I will update you here. Cheers

If you agree a warning is enough (for now), I added a big note in the DOC, in the label correction file section of the data cleaning chapter.

[image: the note added to the DOC, in the label correction file section of the data cleaning chapter]

However, the function I created works perfectly in my local Jupyter notebook setup, but for some reason it doesn't work when run in the GitHub VM during package testing.
Therefore I decided to stop trying to make it work for now and to adopt your suggestion of the warning, but I added the automatic check as a roadmap milestone, because even if it might not be crucial at the moment, I think it is beneficial.

So I will take the time to make this check work in the near future.

Hi @npucino, I agree with your approach - a warning is fine for now if you're struggling to get it to work on Github 👍

Hi, thanks @chrisleaman. I see that you fetched the commits related to the issues; I am sorry I didn't do it myself.
Now I was trying to see the updated part of the DOC, but it seems that the GitHub Action never rendered the new version. I'll look into it now!

Here is the commit when I added the Note message: 12e5965

Cheers