Recommend site, sample, and measure ID names
DougManuel opened this issue · 6 comments
Should ODM include recommended or proposed names for site, sample, and measure IDs? ODM data validation and transformation tools could also generate IDs if they are missing.
SiteID - Europe has a site naming convention. See here. It wasn't clear whether there are naming conventions in other countries. Currently, there is a wide range of approaches that wastewater surveillance programs use for naming sites.
SampleID - Recommended default is `site_date_sample#`. Sample IDs should be simple. However, it was raised whether the ID should have additional info or hints in it. It was recognized that sample IDs could get complex with concepts such as different sample types, children, pooling, etc.
MeasureReportID - Recommended default is `sampleID_date_index`.
For the sampleID, I just want to clarify how best to do this, since siteIDs in Ontario have a wide variety of character counts, with or without underscores (some are 20+ characters).
Should the site portion be the full siteID, or could it just be a fixed 5-character abbreviation for the site, e.g. for Ashbridge's Bay: ASHBR_20220116_01?
We can allow more characters for sampleID if we need or want.
For siteID, I've been wondering whether we want to add an additional variable for a siteID abbreviation for purposes such as this.
We could just keep it simple and use the first 5 characters (excluding spaces) of the siteName, so it avoids needing to add in another field.
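As a rough sketch of the simple option discussed above (first 5 characters of the site name, excluding spaces, combined as `site_date_sample#`), something like the following could generate a sample ID. The function names, the upper-casing, the `YYYYMMDD` date layout, and the two-digit sample number are assumptions inferred from the `ASHBR_20220116_01` example, not an agreed ODM rule.

```python
from datetime import date


def site_abbreviation(site_name: str, length: int = 5) -> str:
    """Return the first `length` characters of the name, ignoring spaces, upper-cased."""
    return site_name.replace(" ", "")[:length].upper()


def make_sample_id(site_name: str, collected: date, sample_number: int) -> str:
    """Build site_date_sample# (assumed layout: ABBRV_YYYYMMDD_NN)."""
    abbreviation = site_abbreviation(site_name)
    return f"{abbreviation}_{collected:%Y%m%d}_{sample_number:02d}"


print(make_sample_id("Ashbridge's Bay", date(2022, 1, 16), 1))
```

Note that two sites whose names share the same first 5 characters would collide under this scheme, which is one argument for a dedicated abbreviation field.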
Looking ahead, it seems like it might be prudent to provide a guide/suggestion for other user-generated IDs as well. Building off our discussion, I've made some suggestions below. The IDs would use SnakeCase to avoid excess symbols that might cause difficulties in processing. Another point is that these IDs should ideally be generated from pieces of metadata already present in the tables, so that if they're absent it is easy to machine-generate them.
- Site ID: first 3 letters of `name` + first 2 numbers of `geoLat` + first 2 numbers of `geoLong`, ex: `rop4575`
- Sample ID: `siteID` + `aDateStart` + `index`, ex: `rop45752210093`
- Measure Report ID: `sampleID` + `date` + `index`, ex: `rop457522100932210122`
- Instrument ID: first 3 letters of `model` + first 3 letters of `name`, ex: `GraHer`
- Organization ID: `addID` + first 3 letters of `name`, ex: `ottK1Jdel`
- Contact ID: `orgID` + first 3 letters of `role`, ex: `ottK1JdelRes`
- Polygon ID: `orgID` + `siteID`, ex: `ottK1Jdelrop4575`
- Address ID: first 3 letters of `city` + first 3 letters of `pCode`, ex: `ottK1J`
- Dataset ID: have `datasetID` == `orgID`?
- Measure Set Report ID: `measRepID` + the number of measures in the set to 2 digits, ex: `rop45752210093221012202`
- Step ID: `orgID` + last 3 letters of `methodID`, ex: `ottK1JdelVol`
- Step Index ID: `stepID` + `index` to two digits, ex: `ottK1JdelVol02`
- Method Set ID: `setVers` + `lastEdited`, ex: `01221009`
- Method Set Report ID: `methSetID` + `stepIndexID`, ex: `01221009ottK1JdelVol02`
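To illustrate how the first three formulas in the list chain together, here is a minimal sketch. The function names, the input values (site name `ROPEC`, latitude 45.45, longitude -75.59), the `yymmdd` date layout, and the choice to drop the sign and decimal point when taking "the first 2 numbers" are all assumptions made to reproduce the examples; none of them is settled in this thread.

```python
from datetime import date


def first_digits(value: float, count: int = 2) -> str:
    """Return the first `count` digits of a coordinate, ignoring sign and decimal point."""
    digits = [ch for ch in format(abs(value), "f") if ch.isdigit()]
    return "".join(digits[:count])


def make_site_id(name: str, geo_lat: float, geo_long: float) -> str:
    """Site ID: first 3 letters of name + first 2 digits of geoLat + first 2 of geoLong."""
    return name[:3].lower() + first_digits(geo_lat) + first_digits(geo_long)


def make_sample_id(site_id: str, a_date_start: date, index: int) -> str:
    """Sample ID: siteID + aDateStart (assumed yymmdd) + index."""
    return f"{site_id}{a_date_start:%y%m%d}{index}"


def make_measure_report_id(sample_id: str, report_date: date, index: int) -> str:
    """Measure Report ID: sampleID + date (assumed yymmdd) + index."""
    return f"{sample_id}{report_date:%y%m%d}{index}"


site_id = make_site_id("ROPEC", 45.45, -75.59)  # hypothetical inputs
sample_id = make_sample_id(site_id, date(2022, 10, 9), 3)
report_id = make_measure_report_id(sample_id, date(2022, 10, 12), 2)
print(site_id, sample_id, report_id)
```

Because the sign is dropped, two sites at mirrored coordinates would get the same coordinate digits, and an unpadded index makes the ID boundaries ambiguous once it reaches two digits.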
These last few method ones are more things that I think we discussed and approved previously, but I don't know that they had been explicitly captured in writing. Essentially, `stepID` is a unique identifier for a method step (a single row in the method step table), while `stepIndexID` is just the `index` field and the `stepID` merged together, and `methodSetID` is the unique identifier for a set of methods, and then `stepIndexID` and `methodSetID` are merged together (each is half of the composite key) to make the `methodSetReportID` (the primary key/complete composite key), which is the unique ID for a method set.
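The composite key described above could be assembled roughly as follows. This is only a sketch: the function names, the two-digit zero padding of `setVers`, and the `yymmdd` layout for `lastEdited` are assumptions chosen to match the `01221009ottK1JdelVol02` example.

```python
from datetime import date


def make_step_index_id(step_id: str, index: int) -> str:
    """stepIndexID: stepID followed by the index, padded to two digits."""
    return f"{step_id}{index:02d}"


def make_method_set_id(set_vers: int, last_edited: date) -> str:
    """methodSetID: setVers (assumed two digits) + lastEdited (assumed yymmdd)."""
    return f"{set_vers:02d}{last_edited:%y%m%d}"


def make_method_set_report_id(method_set_id: str, step_index_id: str) -> str:
    """methodSetReportID: the two halves of the composite key joined together."""
    return method_set_id + step_index_id


step_index_id = make_step_index_id("ottK1JdelVol", 2)
method_set_id = make_method_set_id(1, date(2022, 10, 9))
print(make_method_set_report_id(method_set_id, step_index_id))
```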
I am thinking that these IDs are getting quite long and ungainly, so more creative ID sources (i.e. not generated from table metadata) might be required.
Please feel free to tear these apart, or even reject the premise that there needs to be a recommended template for these. I just wanted to start off some brainstorming.
Based on Thursday's meeting, it seems like there is a need to potentially conceptualize a slightly different approach to naming recommendations depending on:
- whether the user-generated IDs are being used by the lab internally for tracking, or being used for a single lab's data, or
- whether the data is being aggregated or combined from multiple labs/data custodians.
To address this first issue, it was decided that `dataID` would be added to all user-generated IDs as a prefix so that they remain unique despite the shared formula for ID generation.
It was also brought up that the character limit for ID length should ideally be under 30 characters - which may be another reason to avoid using dashes and underscores in IDs.
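A minimal sketch of the prefixing rule together with the suggested length limit might look like this. The constant name, the function name, the sample `dataID` value `lab1`, and the decision to raise an error (rather than truncate) are all hypothetical; the thread only says IDs should ideally stay under 30 characters.

```python
MAX_ID_LENGTH = 30  # IDs should ideally be under this many characters


def with_data_id(data_id: str, local_id: str, max_length: int = MAX_ID_LENGTH) -> str:
    """Prefix a user-generated ID with dataID and check the suggested length limit."""
    full_id = f"{data_id}{local_id}"
    if len(full_id) >= max_length:
        raise ValueError(
            f"ID '{full_id}' has {len(full_id)} characters; expected under {max_length}"
        )
    return full_id


print(with_data_id("lab1", "rop45752210093"))  # "lab1" is a made-up dataID
```

With a longer prefix, the 21-character measure report ID from the earlier comment would already hit the limit, which supports the concern that metadata-derived IDs grow quickly.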
As for revisions to the ID formulas proposed in the previous comment:
- siteID: stakeholders mentioned that it was important to capture negative values in the latitude and longitude (perhaps by using the letter n?) to preserve uniqueness, and suggested maybe using the first three numbers instead of only 2. It would also perhaps benefit from an auto-index to avoid duplication. Alternatively, it was suggested to use ISO country codes (https://en.wikipedia.org/wiki/ISO_3166-2) or the first two letters of the country name + the first two letters of the county or province + the first 3 or 4 letters of the site name. Ex: `CaOnOpec`. Then there's no need to record latitude and longitude in the ID (preferred).
- Sample ID: instead use `siteID` + `collDT` or `collDTEnd` (depending on what's available) + `index`, with `index` being a new and arbitrary field added here.
Additional edits to follow next week.
Building off of last week's meeting, we arrived at:
- SiteID: Use country code (ISO) + province code (ISO) + first 2-3 letters of municipality + site name (set a general max of 15 characters) ex: CaOnOttRopec
- Measure Report ID: `sampleID` + `measureID` + row number, ex: CaOnOttRope1020221y505h001
- Instrument ID: first 3 letters of `manufacturer` + first 3 letters of `name` + first 3 letters of `model` + `index`, ex: BioGraHer1
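The revised SiteID rule from this list could be sketched as below. The function name, the parameter names, the capitalization of each part, and the choice to truncate at 15 characters are assumptions made to reproduce the `CaOnOttRopec` example; the country and province codes are passed in by the caller rather than looked up.

```python
def make_site_id(
    country_code: str,
    province_code: str,
    municipality: str,
    site_name: str,
    municipality_letters: int = 3,
    max_length: int = 15,
) -> str:
    """Join country + province + municipality prefix + site name, capped at max_length."""
    parts = [
        country_code,
        province_code,
        municipality[:municipality_letters],
        site_name,
    ]
    # Remove spaces and capitalize each part so the boundaries stay readable.
    joined = "".join(part.replace(" ", "").capitalize() for part in parts)
    return joined[:max_length]


print(make_site_id("CA", "ON", "Ottawa", "ROPEC"))
```

Truncating long site names keeps the ID within the general maximum, at the cost of possible collisions between sites in the same municipality with similar names.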
Started to discuss additional IDs,
- Address ID: Use country code + province code + 2-3 municipality letters + row # ex: CaONott3
Will continue with `Organization ID` and `Dataset ID` next week.