pfmc-assessments/nwfscSurvey

separator choice for new column names based off of input column name

kellijohnson-NOAA opened this issue · 4 comments

I am trying to decide on a "universal" separator for column names that will be created by functions. For example, if we create a new column recording which bin a length belongs in and the length column is Length_cm, should the resulting new column be Length_cm_bin, or should a different, more distinctive separator be used between cm and bin? Other ideas are .., !, etc. At first I thought we should introduce a less-used separator so that it would be clear which columns are created by dplyr-based functions in {nwfscSurvey} and so that they could be differentiated from common column names like Length_cm. But after creating a few functions, I am wondering if we should just use _ to be more consistent? 🤦 🤷

FWIW, the Data Warehouse uses "_" for the most part. The other convention applies to groups of related fields. For instance, all taxonomy fields share a group prefix that is separated from the field name by $, as in:

best_available_taxonomy_dim$class_30
best_available_taxonomy_dim$family_50
best_available_taxonomy_dim$genus_70

Using this convention and applying it to a hypothetical set of length bins, one might use:

Length_cm
Length_cm$bin_1_cm
Length_cm$bin_5_cm
Length_cm$bin_10_cm
...
Given the length unit is already specified, you could also remove the "_cm" suffix.

I suspect the $ convention is a best practice for Data Warehouses and perhaps REST APIs, though I'm not sure. 🤷🏼‍♂️
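
To make the grouping convention concrete, here is a small sketch in Python. It is an illustration only, not Data Warehouse or {nwfscSurvey} code. It splits each name on the first $ into a group prefix and a field name, then collects the fields by group. The helper name split_group is hypothetical, and the column names are the examples quoted above.

```python
from typing import Dict, List, Optional, Tuple


def split_group(column: str, sep: str = "$") -> Tuple[Optional[str], str]:
    """Split 'group$field' into (group, field); hypothetical helper.

    Names without the separator are returned with a group of None.
    """
    group, found, field = column.partition(sep)
    if not found:
        return None, column
    return group, field


columns = [
    "best_available_taxonomy_dim$class_30",
    "best_available_taxonomy_dim$family_50",
    "best_available_taxonomy_dim$genus_70",
    "Length_cm",
    "Length_cm$bin_5_cm",
]

# Collect field names under their group prefix.
grouped: Dict[Optional[str], List[str]] = {}
for name in columns:
    group, field = split_group(name)
    grouped.setdefault(group, []).append(field)
```

One practical point for R users: because $ is also R's extraction operator, a name such as Length_cm$bin_5_cm is non-syntactic and has to be wrapped in backticks when it is referenced in code.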

This is a good question and one I struggle with as well. I think there are two parts here. First, in the long term I would like to update the nwfscSurvey package so that it no longer modifies data warehouse column names from lower case to upper case (e.g., modified "Length_cm" versus warehouse "length_cm"). If I ever get to this task, it would potentially eliminate the ability to separate original and added columns based on upper versus lower case. Second, I prefer to keep the same separator for the original data and the added columns, which keeps people from having to remember which separator applies to a given column (e.g., length_bin_cm versus length.bin.cm when the original column is length_cm). For consistency, my preference would be to use the "_" separator, even if it reduces the ability to identify added columns. In this example, for an added length bin column, I think "length_cm_bin" would be fine.

+1 to Chantel's comments.
I like underscores and I don't see it as that important to provide users a clear separation between original data warehouse headers vs new ones added by the package. The data warehouse is publicly available and the package code is open source, and the documentation could help with this distinction if needed.

Thanks @Curt-Whitmire-NOAA, @chantelwetzel-noaa, and @iantaylor-NOAA. I will go with "_" as the default but allow users to change it if they want. Also, the background information that @Curt-Whitmire-NOAA provided helps clarify so many things. I never understood the use of $ in the names of columns in the data warehouse prior to now.
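
A minimal sketch of that decision, written in Python purely for illustration and not taken from {nwfscSurvey}: a helper builds the new column name from the input column name and a suffix, with "_" as the default separator and an argument to override it. The function name new_column_name and its parameters are hypothetical.

```python
def new_column_name(input_column: str, suffix: str, sep: str = "_") -> str:
    """Build a derived column name such as 'Length_cm_bin'.

    Hypothetical helper; the default separator follows the underscore
    convention already used in the input column names.
    """
    if not input_column or not suffix:
        raise ValueError("input_column and suffix must be non-empty")
    return f"{input_column}{sep}{suffix}"


# Default: the added column blends in with the existing naming style.
default_name = new_column_name("Length_cm", "bin")

# Override: a user who wants added columns to stand out picks another separator.
custom_name = new_column_name("Length_cm", "bin", sep="..")
```

With the default, the result is Length_cm_bin. Passing sep=".." gives Length_cm..bin, one of the alternatives raised at the start of the thread.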