microbiomedata/nmdc-schema

Add `aliases` / `mappings` / `annotations` for GOLD platform / model CV

sujaypatil96 opened this issue · 15 comments

Objective:

Store mappings recorded in this issue comment as "mappings" on InstrumentVendorEnum and InstrumentModelEnum.

Decide the LinkML construct to be used to store these mappings, i.e., whether to use aliases, mappings or annotations.


Implications:

This is required for the work being done to "migrate" the GOLD translator to make it conformant with the berkeley schema. See microbiomedata/nmdc-runtime#656

The MVP here should be Illumina models; more work is needed across the project for PacBio and Oxford Nanopore.

| JGI DW | InstrumentVendorEnum | InstrumentModelEnum |
| --- | --- | --- |
| Illumina HiSeq | illumina | hiseq |
| Illumina HiSeq-HO | illumina | hiseq |
| Illumina HiSeq-Rapid | illumina | hiseq |
| Illumina HiSeq-1TB | illumina | hiseq |
| Illumina HiSeq2500 | illumina | hiseq_2500 |
| Illumina HiSeq 2500-1TB | illumina | hiseq_2500 |
| Illumina MiSeq | illumina | miseq |
| Illumina NextSeq-MO | illumina | nextseq_500 |
| Illumina NextSeq-HO | illumina | nextseq_500 |
| Illumina X10 | illumina | hiseq_x_ten |
| Illumina NovaSeq | illumina | novaseq |
| Illumina NovaSeq SP | illumina | novaseq_6000 |
| Illumina NovaSeq S4 | illumina | novaseq_6000 |
| Illumina NovaSeq S2 | illumina | novaseq_6000 |

NovaSeqX 10B *need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley
NovaSeqX 25B *need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley
NovaSeqX 1.5B *need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley

Thanks for this info, @aclum. Where did you get the controlled vocabulary (so @sujay and I can revisit it in the future)?

I want to move forward quickly on this, but there are at least two problems:

  1. The MVP list of models is really a mixture of instrument names, instrument + kit/flowcell names, and instrument families. We need to be consistent, preferably aligning to an ontology like OBI.
  2. We have an impedance mismatch: berkeley-schema-fy24 represents instruments and models, while GOLD seems to represent platforms, families, models and more. We need to be intentional about how we align those two spaces, which have different levels of granularity.

I'm sure we can work this out. @sujaypatil96 and I have been consulting the Illumina support page and OBI's representation of sequencing instruments. We'll have more insights soon.

Remember, the schema is frozen, so that may affect turnaround time!

The CV for GOLD comes from a query of an internal JGI database table. I updated my comment to a table that includes what each value should map to. I hope this will inform whether we should use structured aliases or add a slot to Instrument. My preference is to use the enums, since what is in GOLD is a display value and not how JGI names its individual instruments internally. @turbomam @kheal @sujaypatil96

Thanks @aclum. Could you share a rawer form of the instrument values from the GOLD database contents? Maybe the unique values (with or without counts), and the corresponding query? I know most of us wouldn't be able to run the query, but it would be good to have a record of it.

Example: https://gold.jgi.doe.gov/project?id=Gp0127656

But the relevant field (Sequencing Technology = "Illumina HiSeq 2500-1TB") isn't included in https://gold-ws.jgi.doe.gov/api/v1/projects?projectGoldId=Gp0127656 ?

The instrument vocabulary doesn't seem to be included in GOLD CVs Excel from https://gold.jgi.doe.gov/downloads

And the Sequencing Project tab in Public Studies/Biosamples/SPs/APs/Organisms Excel doesn't seem to have a column for the values we're talking about

I guess that's why you did a database query

A StructuredAlias solution would look something like this:

name: InstrumentModelEnum
permissible_values:
  hiseq_2500:
    meaning: OBI:0002002
    aliases:
      - Illumina HiSeq 2500
    structured_aliases:
      - literal_form: Illumina HiSeq-1TB
        alias_predicate: RELATED_SYNONYM
        alias_contexts:
          - https://gold.jgi.doe.gov/
      - literal_form: Illumina HiSeq2500
        alias_predicate: EXACT_SYNONYM
        alias_contexts:
          - https://gold.jgi.doe.gov/

I recommend providing mappings from the GOLD strings to both the InstrumentModelEnum and the InstrumentVendorEnum

If we used StructuredAliases like that, then the ETL application might have to iterate through all of the structured aliases in both enums to find the appropriate vendor and model values.
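As a rough illustration of that linear scan, here is a minimal Python sketch. It assumes both enums have already been parsed into plain dictionaries shaped like the YAML above. The aliases on the vendor enum and the helper names (`gold_alias`, `find_permissible_value`, `translate`) are hypothetical; the thread only shows aliases on `InstrumentModelEnum`.

```python
GOLD_CONTEXT = "https://gold.jgi.doe.gov/"


def gold_alias(literal_form):
    """Build one structured alias entry in the shape used in the YAML above."""
    return {"literal_form": literal_form, "alias_contexts": [GOLD_CONTEXT]}


# Stand-in for InstrumentModelEnum after parsing the schema YAML.
MODEL_ENUM = {
    "permissible_values": {
        "hiseq_2500": {
            "structured_aliases": [
                gold_alias("Illumina HiSeq-1TB"),
                gold_alias("Illumina HiSeq2500"),
            ]
        },
        "miseq": {"structured_aliases": [gold_alias("Illumina MiSeq")]},
    }
}

# Hypothetical: the thread does not show aliases on InstrumentVendorEnum.
VENDOR_ENUM = {
    "permissible_values": {
        "illumina": {
            "structured_aliases": [
                gold_alias("Illumina HiSeq-1TB"),
                gold_alias("Illumina HiSeq2500"),
                gold_alias("Illumina MiSeq"),
            ]
        }
    }
}


def find_permissible_value(enum, gold_string):
    """Scan every alias of every permissible value; return the first match."""
    for pv_name, pv in enum["permissible_values"].items():
        for alias in pv.get("structured_aliases", []):
            if (alias["literal_form"] == gold_string
                    and GOLD_CONTEXT in alias.get("alias_contexts", [])):
                return pv_name
    return None


def translate(gold_string):
    """Look the GOLD string up in both enums, one full scan per enum."""
    return (
        find_permissible_value(VENDOR_ENUM, gold_string),
        find_permissible_value(MODEL_ENUM, gold_string),
    )


print(translate("Illumina HiSeq2500"))  # ('illumina', 'hiseq_2500')
```

Every call walks both enums from the start, which is fine for a handful of values but is the cost the comment above is pointing at.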

Or the ETL could pre-generate a data structure, in memory, in the opposite direction like

Illumina HiSeq-1TB:
  vendor_pv: illumina
  model_pv: hiseq_2500
Illumina HiSeq2500:
  vendor_pv: illumina
  model_pv: hiseq_2500
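One way the ETL might pre-generate that reverse structure is sketched below. This is only an illustration under assumptions: the enums are taken to be already parsed into dictionaries, the vendor enum is assumed to carry the same GOLD strings as structured aliases (the thread does not show this), the vendor value is spelled `illumina` as in the mapping table earlier in the thread, and `build_gold_index` is a made-up name.

```python
def build_gold_index(vendor_enum, model_enum):
    """Invert both enums into {gold_string: {"vendor_pv": ..., "model_pv": ...}}."""
    index = {}
    for key, enum in (("vendor_pv", vendor_enum), ("model_pv", model_enum)):
        for pv_name, pv in enum["permissible_values"].items():
            for alias in pv.get("structured_aliases", []):
                literal = alias["literal_form"]
                entry = index.setdefault(literal, {})
                # A GOLD string claimed by two permissible values is ambiguous.
                if key in entry and entry[key] != pv_name:
                    raise ValueError(
                        f"{literal!r} maps to both {entry[key]!r} and {pv_name!r}"
                    )
                entry[key] = pv_name
    return index


# Illustrative stand-ins for the parsed enums (vendor aliases are hypothetical).
GOLD_STRINGS = ["Illumina HiSeq-1TB", "Illumina HiSeq2500"]
ALIASES = [{"literal_form": s} for s in GOLD_STRINGS]
MODEL_ENUM = {"permissible_values": {"hiseq_2500": {"structured_aliases": ALIASES}}}
VENDOR_ENUM = {"permissible_values": {"illumina": {"structured_aliases": ALIASES}}}

GOLD_INDEX = build_gold_index(VENDOR_ENUM, MODEL_ENUM)
print(GOLD_INDEX["Illumina HiSeq2500"])
# {'vendor_pv': 'illumina', 'model_pv': 'hiseq_2500'}
```

Building the index once turns each translation into a single dictionary lookup, and the conflict check surfaces ambiguous aliases at load time rather than during the ETL run.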

@turbomam the info is in the response body as `seqMethod`. The CV table is not from GOLD but from a system called the data warehouse (JGI DW), which is why you aren't seeing it listed as a GOLD CV.

> info is in the response body as seqMethod

Ha, I was searching for "sequence". Should have searched for the value, "Illumina HiSeq 2500-1TB"
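For anyone wanting unique values with counts from the API side, a small sketch of tallying `seqMethod` over already-fetched project records follows. The record shape is an assumption: the thread only confirms that `seqMethod` appears in the response body, so the helper accepts either a single string or a list of strings. The `Gp-hypothetical-*` records are invented for illustration.

```python
from collections import Counter


def seq_methods(project):
    """Return the seqMethod values of one project record as a list."""
    value = project.get("seqMethod")
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def tally_seq_methods(projects):
    """Count how often each seqMethod value occurs across project records."""
    return Counter(m for p in projects for m in seq_methods(p))


# Only the first record reflects a value mentioned in this thread;
# the rest are invented to exercise the function.
sample_projects = [
    {"projectGoldId": "Gp0127656", "seqMethod": ["Illumina HiSeq 2500-1TB"]},
    {"projectGoldId": "Gp-hypothetical-1", "seqMethod": "Illumina HiSeq 2500-1TB"},
    {"projectGoldId": "Gp-hypothetical-2", "seqMethod": ["Illumina NovaSeq S4"]},
    {"projectGoldId": "Gp-hypothetical-3"},
]

for value, count in tally_seq_methods(sample_projects).most_common():
    print(count, value)
```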

These are the counts for Metagenome Drafts internally. Note that GOLD sometimes does some small manipulation of the values, and the CV table is not internally consistent with what is in the all-inclusive report.

| count | sdm_actual_seq_model |
| --- | --- |
| 18 | NextSeq MO |
| 41 | NextSeq HO |
| 100 | HiSeq-2500 Rapid V2 |
| 134 | HiSeq-2500 |
| 153 | HiSeq-2500 Rapid |
| 370 | MiSeq |
| 1203 | HiSeq-2000 |
| 1614 | HiSeq-2000 1TB |
| 2069 | NovaSeq |
| 2205 | NovaSeqX |
| 3316 | HiSeq-2500 1TB |
| 11694 | NovaSeq S4 |

Update: GOLD curates a list that is different from the JGI DW list. We will use the subset of GOLD values that are common according to the JGI DW counts for the structured aliases.
Illumina
Illumina GA
Illumina GAII
Illumina GAIIe
Illumina GAIIx
Illumina HiScanSQ
Illumina HiSeq
Illumina HiSeq 1000
Illumina HiSeq 1500
Illumina HiSeq 2000
Illumina HiSeq 2500
Illumina HiSeq 2500-1TB
Illumina HiSeq 2500-Rapid
Illumina HiSeq 3000
Illumina HiSeq 4000
Illumina HiSeq X Ten
Illumina iSeq 100
Illumina MiniSeq
Illumina MiSeq
Illumina NextSeq
Illumina NextSeq 500
Illumina NextSeq 550
Illumina NextSeq-HO
Illumina NextSeq-MO
Illumina NovaSeq
Illumina NovaSeq 6000
Illumina NovaSeq S2
Illumina NovaSeq S4
Illumina NovaSeq SP
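The thread does not say how the subset will be picked. One possible approach, sketched here as an assumption, is to normalize both spellings (lowercase, drop a leading "Illumina", strip punctuation and spaces), keep the GOLD values that also appear in the JGI DW counts, and rank them by count. The `normalize` and `common_values` helpers are hypothetical.

```python
import re

# JGI DW counts for Metagenome Drafts, as posted earlier in the thread.
JGI_DW_COUNTS = {
    "NextSeq MO": 18, "NextSeq HO": 41, "HiSeq-2500 Rapid V2": 100,
    "HiSeq-2500": 134, "HiSeq-2500 Rapid": 153, "MiSeq": 370,
    "HiSeq-2000": 1203, "HiSeq-2000 1TB": 1614, "NovaSeq": 2069,
    "NovaSeqX": 2205, "HiSeq-2500 1TB": 3316, "NovaSeq S4": 11694,
}

# GOLD's curated list, as posted above.
GOLD_VALUES = """\
Illumina
Illumina GA
Illumina GAII
Illumina GAIIe
Illumina GAIIx
Illumina HiScanSQ
Illumina HiSeq
Illumina HiSeq 1000
Illumina HiSeq 1500
Illumina HiSeq 2000
Illumina HiSeq 2500
Illumina HiSeq 2500-1TB
Illumina HiSeq 2500-Rapid
Illumina HiSeq 3000
Illumina HiSeq 4000
Illumina HiSeq X Ten
Illumina iSeq 100
Illumina MiniSeq
Illumina MiSeq
Illumina NextSeq
Illumina NextSeq 500
Illumina NextSeq 550
Illumina NextSeq-HO
Illumina NextSeq-MO
Illumina NovaSeq
Illumina NovaSeq 6000
Illumina NovaSeq S2
Illumina NovaSeq S4
Illumina NovaSeq SP
""".splitlines()


def normalize(name):
    """Lowercase, drop a leading 'illumina', and strip non-alphanumerics."""
    name = re.sub(r"^illumina\b", "", name.lower())
    return re.sub(r"[^a-z0-9]", "", name)


def common_values(dw_counts, gold_values):
    """Return (gold_value, dw_value, count) for values in both lists, by count."""
    gold_by_key = {normalize(g): g for g in gold_values}
    ranked = sorted(dw_counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (gold_by_key[normalize(dw)], dw, count)
        for dw, count in ranked
        if normalize(dw) in gold_by_key
    ]


for gold_value, dw_value, count in common_values(JGI_DW_COUNTS, GOLD_VALUES):
    print(f"{count:>6}  {gold_value}  (JGI DW: {dw_value})")
```

Under this normalization, `NovaSeqX`, `HiSeq-2000 1TB` and `HiSeq-2500 Rapid V2` have no counterpart in the GOLD list, so they would need to be handled by hand.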