Big-Life-Lab/PHES-ODM

Recommend site, sample, and measure ID names

DougManuel opened this issue · 6 comments

Should ODM include recommended or proposed names for site, sample, and measure IDs? ODM data validation and transformation tools could also generate IDs if they are missing.

SiteID - Europe has a site naming convention. See here. It wasn't clear whether there are naming conventions in other countries. Currently, there is a wide range of approaches that wastewater surveillance programs use for naming sites.

SampleID - Recommended default is site_date_sample#. Sample IDs should be simple. However, the question was raised whether the ID should carry additional info or hints. It was recognized that sample IDs could get complex with concepts such as different sample types, children, pooling, etc.

MeasureReportID - Recommended default sampleID_date_index.

For the sampleID, just want to clarify how best to do this, since siteIDs in Ontario have a wide variety of character counts, with or without underscores (could be 20+ characters).

Should the site portion be the full siteID, or could it just be a fixed 5-character abbreviation for the site, e.g. for Ashbridge's Bay: ASHBR_20220116_01?

We can allow more characters for sampleID if we need or want.
For siteID, I've been wondering whether we want to add an additional variable for a siteID abbreviation for purposes such as this.

We could just keep it simple and use the first 5 characters (excluding spaces) of the siteName, so it avoids needing to add in another field.
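As a rough sketch of that idea (not an agreed ODM rule), the sampleID could be built from the siteName, the date, and a sample number. The function names are hypothetical, and the upper-casing and two-digit sample number are assumptions taken from the ASHBR_20220116_01 example above.

```python
from datetime import date


def site_abbreviation(site_name: str, length: int = 5) -> str:
    """First `length` characters of the site name, ignoring spaces, upper-cased."""
    compact = site_name.replace(" ", "")
    return compact[:length].upper()


def make_sample_id(site_name: str, collected: date, sample_number: int) -> str:
    """Build site_date_sample# with underscores between the three parts."""
    # Assumption: date as YYYYMMDD and a zero-padded two-digit sample number.
    return f"{site_abbreviation(site_name)}_{collected:%Y%m%d}_{sample_number:02d}"


print(make_sample_id("Ashbridge's Bay", date(2022, 1, 16), 1))  # ASHBR_20220116_01
```

Note that two sites whose names share the same first five characters would collide under this scheme, so a separate abbreviation field may still be worth considering.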

Looking ahead, it seems like it might be prudent to provide a guide/suggestion for other user-generated IDs as well. Building off our discussion, I've made some suggestions below. The IDs would use SnakeCase to avoid using excess symbols that might cause difficulties in processing. Another point is that these IDs should ideally be generated from pieces of metadata already present in the tables, so that if they're absent it is easy to machine-generate them.

  • Site ID: first 3 letters of name + first 2 numbers of geoLat + first 2 numbers of geoLong ex: rop4575
  • Sample ID: siteID + aDateStart + index ex: rop45752210093
  • Measure Report ID: sampleID + date + index ex: rop457522100932210122
  • Instrument ID: first 3 letters of model + first 3 letters of name ex: GraHer
  • Organization ID: addID + first 3 letters of name ex: ottK1Jdel
  • Contact ID: orgID + first 3 letters of role ex: ottK1JdelRes
  • Polygon ID: orgID + siteID ex: ottK1Jdelrop4575
  • Address ID: first 3 letters of city + first 3 letters of pCode ex: ottK1J
  • Dataset ID: have datasetID == orgID ?
  • Measure Set Report ID: measRepID + the number of measures in the set to 2 digits ex: rop45752210093221012202
  • Step ID: orgID + last 3 letters of methodID ex: ottK1JdelVol
  • Step Index ID: stepID + index to two digits ex: ottK1JdelVol02
  • Method Set ID: setVers + lastEdited ex: 01221009
  • Method Set Report ID: methSetID + stepIndexID ex: 01221009ottK1JdelVol02
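To illustrate how the first three formulas in the list could be machine-generated from existing metadata, here is a minimal sketch. The function names and the coordinates are hypothetical; the YYMMDD date format and the unpadded index are assumptions read off the examples in the list.

```python
from datetime import date


def first_digits(value: float, count: int = 2) -> str:
    """First `count` digits of a coordinate, ignoring sign and decimal point."""
    digits = [ch for ch in f"{abs(value):.6f}" if ch.isdigit()]
    return "".join(digits[:count])


def make_site_id(name: str, geo_lat: float, geo_long: float) -> str:
    """First 3 letters of name + first 2 digits of geoLat + first 2 of geoLong."""
    return name.replace(" ", "")[:3].lower() + first_digits(geo_lat) + first_digits(geo_long)


def make_sample_id(site_id: str, start: date, index: int) -> str:
    """siteID + start date (assumed YYMMDD) + index."""
    return f"{site_id}{start:%y%m%d}{index}"


def make_measure_report_id(sample_id: str, reported: date, index: int) -> str:
    """sampleID + date (assumed YYMMDD) + index."""
    return f"{sample_id}{reported:%y%m%d}{index}"


# Hypothetical coordinates, chosen only to reproduce the examples in the list.
site_id = make_site_id("Ropec", 45.46, -75.59)
sample_id = make_sample_id(site_id, date(2022, 10, 9), 3)
report_id = make_measure_report_id(sample_id, date(2022, 10, 12), 2)
print(site_id)    # rop4575
print(sample_id)  # rop45752210093
print(report_id)  # rop457522100932210122
```

Because the sign of the coordinate is dropped, sites at 45, 75 and 45, -75 would get the same digits, and an unpadded index makes the end of the ID ambiguous once it passes 9.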

These last few method IDs are things that I think we discussed and approved previously, but I don't know that they were ever explicitly captured in writing. Essentially, stepID is a unique identifier for a method step (a single row in the method step table), and stepIndexID is the stepID merged with the index field. methodSetID is the unique identifier for a set of methods. stepIndexID and methodSetID are then merged (each is half of the composite key) to make the methodSetReportID (the primary key / complete composite key), which is the unique ID for a method set.
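The way these method IDs compose can be sketched as below. This is only an illustration of the formulas in the list: the function names and the methodID value are hypothetical, and the two-digit padding of setVers and the YYMMDD date are assumptions based on the 01221009 example.

```python
from datetime import date


def make_step_id(org_id: str, method_id: str) -> str:
    """orgID + last 3 letters of methodID."""
    return org_id + method_id[-3:]


def make_step_index_id(step_id: str, index: int) -> str:
    """stepID + index, zero-padded to two digits."""
    return f"{step_id}{index:02d}"


def make_method_set_id(set_vers: int, last_edited: date) -> str:
    """setVers (assumed two digits) + lastEdited (assumed YYMMDD)."""
    return f"{set_vers:02d}{last_edited:%y%m%d}"


def make_method_set_report_id(method_set_id: str, step_index_id: str) -> str:
    """Composite key: methodSetID followed by stepIndexID."""
    return method_set_id + step_index_id


# "sampleVol" is a made-up methodID that happens to end in "Vol".
step_id = make_step_id("ottK1Jdel", "sampleVol")
step_index_id = make_step_index_id(step_id, 2)
method_set_id = make_method_set_id(1, date(2022, 10, 9))
print(step_index_id)                                            # ottK1JdelVol02
print(make_method_set_report_id(method_set_id, step_index_id))  # 01221009ottK1JdelVol02
```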

I am thinking that these IDs are getting quite long and ungainly, so more creative ID sources (i.e. not generated from table metadata) might be required.

Please feel free to tear these apart, or even reject the premise that there needs to be a recommended template for these. I just wanted to start off some brainstorming.

Based on Thursday's meeting, it seems like there is a need to potentially conceptualize a slightly different approach to naming recommendations depending on:

  1. whether the user-generated IDs are being used by the lab internally for tracking, or being used for a single lab's data, or
  2. whether the data is being aggregated or combined from multiple labs/data custodians.

To address this first issue, it was decided that dataID would be added as a prefix to all user-generated IDs so that they remain unique despite the shared formula for ID generation.

It was also brought up that the character limit for ID length should ideally be under 30 characters - which may be another reason to avoid using dashes and underscores in IDs.
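A small sketch of how the prefix and the length limit interact, assuming "under 30 characters" means a total length below 30 including the prefix (that reading is my assumption, as are the function name and the dataID value used here):

```python
MAX_ID_LENGTH = 30  # assumption: "under 30 characters" is read as len(id) < 30


def qualify_id(data_id: str, local_id: str) -> str:
    """Prefix a user-generated ID with the dataID and enforce the length limit."""
    full_id = f"{data_id}{local_id}"
    if len(full_id) >= MAX_ID_LENGTH:
        raise ValueError(f"{full_id!r} is {len(full_id)} characters; limit is under {MAX_ID_LENGTH}")
    return full_id


# Hypothetical dataID; the local IDs are examples from the earlier comment.
print(qualify_id("ottK1Jdel", "rop45752210093"))  # ottK1Jdelrop45752210093

try:
    qualify_id("ottK1Jdel", "rop45752210093221012202")
except ValueError as err:
    print(err)  # the prefixed measure set report ID is 32 characters
```

With a 9-character prefix, the longer concatenated IDs from the earlier list already exceed the limit, which supports the concern about their length.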

As for revisions to the ID formulas proposed in the previous comment:

  • siteID: stakeholders mentioned that it was important to capture negative values in the latitude and longitude (perhaps by using the letter n) to preserve uniqueness, and suggested maybe using the first three numbers instead of only two. It would perhaps also benefit from an auto-index to avoid duplication. Alternatively, it was suggested to use ISO country codes (https://en.wikipedia.org/wiki/ISO_3166-2), or the first two letters of the country name, + the first two letters of the county or province + the first 3 or 4 letters of the site name, ex: CaOnOpec. Then there is no need to record latitude and longitude in the ID (preferred).
  • Sample ID: instead use siteID + collDT or collDTEnd (depending on what's available) + index, with index being a new and arbitrary field added here.

Additional edits to follow next week.

Building off of last week's meeting, we arrived at:

  • SiteID: Use country code (ISO) + province code (ISO) + first 2-3 letters of municipality + site name (set a general max of 15 characters) ex: CaOnOttRopec
  • Measure Report ID: sampleID + measureID + row number ex: CaOnOttRope1020221y505h001
  • Instrument ID: first 3 letters of manufacturer + first 3 letters of name + first 3 letters of model + index ex: BioGraHer1
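For reference, the revised SiteID and Instrument ID formulas could be generated roughly like this. It is a sketch only: the function names are hypothetical, the instrument manufacturer, name and model are made up to reproduce BioGraHer1, and trimming the site name to fit the 15-character maximum is my reading of "set a general max of 15 characters".

```python
def compact(text: str) -> str:
    """Drop spaces and punctuation so only letters and digits remain."""
    return "".join(ch for ch in text if ch.isalnum())


def make_site_id(country_code: str, province_code: str, municipality: str,
                 site_name: str, max_length: int = 15) -> str:
    """ISO country code + ISO province code + 3 municipality letters + site name."""
    prefix = (country_code.capitalize() + province_code.capitalize()
              + compact(municipality)[:3].capitalize())
    # Assumption: the site name is cut so the whole ID stays within max_length.
    remaining = max_length - len(prefix)
    return prefix + compact(site_name)[:remaining].capitalize()


def make_instrument_id(manufacturer: str, name: str, model: str, index: int) -> str:
    """First 3 letters each of manufacturer, name and model, then an index."""
    parts = [compact(part)[:3].capitalize() for part in (manufacturer, name, model)]
    return "".join(parts) + str(index)


print(make_site_id("CA", "ON", "Ottawa", "ROPEC"))                       # CaOnOttRopec
print(make_instrument_id("Biotech", "Gradient cycler", "Hermes", 1))     # BioGraHer1
```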

We also started to discuss additional IDs:

  • Address ID: Use country code + province code + 2-3 municipality letters + row # ex: CaONott3

Will continue with Organization ID and Dataset ID next week.