Add `aliases` / `mappings` / `annotations` for GOLD platform / model CV
sujaypatil96 opened this issue · 15 comments
Objective:
Store mappings recorded in this issue comment as "mappings" on InstrumentVendorEnum and InstrumentModelEnum.
Decide the LinkML construct to be used to store these mappings, i.e., whether to use `aliases`, `mappings` or `annotations`.
Implications:
This is required for the work being done to "migrate" the GOLD translator to make it conformant with the berkeley schema. See microbiomedata/nmdc-runtime#656
MVP here should be Illumina models; more work is needed across the project for PacBio and Oxford.
JGI DW | InstrumentVendorEnum | InstrumentModelEnum |
---|---|---|
Illumina HiSeq | illumina | hiseq |
Illumina HiSeq-HO | illumina | hiseq |
Illumina HiSeq-Rapid | illumina | hiseq |
Illumina HiSeq-1TB | illumina | hiseq |
Illumina HiSeq2500 | illumina | hiseq_2500 |
Illumina HiSeq 2500-1TB | illumina | hiseq_2500 |
Illumina MiSeq | illumina | miseq |
Illumina NextSeq-MO | illumina | nextseq_500 |
Illumina NextSeq-HO | illumina | nextseq_500 |
Illumina X10 | illumina | hiseq_x_ten |
Illumina NovaSeq | illumina | novaseq |
Illumina NovaSeq SP | illumina | novaseq_6000 |
Illumina NovaSeq S4 | illumina | novaseq_6000 |
Illumina NovaSeq S2 | illumina | novaseq_6000 |
NovaSeqX 10B *need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley
NovaSeqX 25B *need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley
NovaSeqX 1.5B *need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley
Thanks for this info, @aclum. Where did you get the controlled vocabulary (so @sujay and I can revisit it in the future)?
I want to move forward quickly on this, but there are at least two problems:
- the MVP list of models is really a mixture of instrument names, instruments + kit/flowcell names and instrument families. We need to be consistent, preferably aligning to an ontology like OBI
- we have an impedance mismatch: berkeley-schema-fy24 represents instruments and models. GOLD seems to represent platforms, families, models and more. We need to be intentional about how we align those two spaces with different levels.
I'm sure we can work this out. @sujaypatil96 and I have been consulting the Illumina support page and OBI's representation of sequencing instruments. We'll have more insights soon.
Remember, the schema is frozen, so that may affect turnaround time!
use structured aliases
https://linkml.io/linkml-model/latest/docs/structured_aliases/
The CV for GOLD comes from a query of an internal JGI database table. I updated this to a table that includes what each value should map to. I hope this will inform whether we should use structured aliases or add a slot to Instrument. My preference is to use the enums, since what is in GOLD is a display value and not how JGI names its individual instruments internally. @turbomam @kheal @sujaypatil96
Thanks @aclum . Could you share a rawer form of the instrument values from the GOLD database contents? Maybe unique values (with or without counts). And the corresponding query? I know most of us wouldn't be able to run the query, but it would be good to have a record of it.
Example: https://gold.jgi.doe.gov/project?id=Gp0127656
But the relevant field (Sequencing Technology = "Illumina HiSeq 2500-1TB") isn't included in https://gold-ws.jgi.doe.gov/api/v1/projects?projectGoldId=Gp0127656 ?
see also
The instrument vocabulary doesn't seem to be included in GOLD CVs Excel from https://gold.jgi.doe.gov/downloads
And the Sequencing Project tab in Public Studies/Biosamples/SPs/APs/Organisms Excel doesn't seem to have a column for the values we're talking about
I guess that's why you did a database query
A StructuredAlias solution would look something like this:

    name: InstrumentModelEnum
    permissible_values:
      hiseq_2500:
        meaning: OBI:0002002
        aliases:
          - Illumina HiSeq 2500
        structured_aliases:
          - literal_form: Illumina HiSeq-1TB
            alias_contexts:
              - https://gold.jgi.doe.gov/
            alias_predicate: RELATED_SYNONYM
          - literal_form: Illumina HiSeq2500
            alias_predicate: EXACT_SYNONYM
            alias_contexts:
              - https://gold.jgi.doe.gov/

I recommend providing mappings from the GOLD strings to both the InstrumentModelEnum and the InstrumentVendorEnum
If we used StructuredAliases like that, then the ETL application might have to iterate through all of the structured aliases in both enums to find the appropriate vendor and model values.
Or the ETL could pre-generate a data structure, in memory, in the opposite direction, like:

    - Illumina HiSeq-1TB
      vendor_pv: Illumina
      model_pv: hiseq_2500
    - Illumina HiSeq2500
      vendor_pv: Illumina
      model_pv: hiseq_2500

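For reference, here is a minimal sketch of how that reverse structure could be pre-generated once, so the ETL does a single dictionary lookup per GOLD string instead of scanning both enums each time. Everything named here is an assumption for illustration: the enums are shown as plain dicts (roughly what a YAML loader would return), the helper names `invert_aliases` and `build_gold_lookup` are hypothetical, and the alias strings are taken from the mapping table at the top of this issue, not from the actual schema.

```python
# Hypothetical, trimmed stand-ins for the two enums as plain dicts.
# The real enums live in the schema; these entries only mirror rows
# from the mapping table in this issue.
VENDOR_ENUM = {
    "permissible_values": {
        "illumina": {
            "structured_aliases": [
                {"literal_form": "Illumina HiSeq2500"},
                {"literal_form": "Illumina HiSeq 2500-1TB"},
                {"literal_form": "Illumina MiSeq"},
            ]
        }
    }
}
MODEL_ENUM = {
    "permissible_values": {
        "hiseq_2500": {
            "structured_aliases": [
                {"literal_form": "Illumina HiSeq2500"},
                {"literal_form": "Illumina HiSeq 2500-1TB"},
            ]
        },
        "miseq": {
            "structured_aliases": [{"literal_form": "Illumina MiSeq"}]
        },
    }
}


def invert_aliases(enum_def):
    """Map each structured-alias literal_form to its permissible value name."""
    index = {}
    for pv_name, pv in enum_def["permissible_values"].items():
        for alias in (pv or {}).get("structured_aliases", []):
            literal = alias["literal_form"]
            # Refuse ambiguous aliases instead of silently overwriting.
            if index.get(literal, pv_name) != pv_name:
                raise ValueError(f"{literal!r} maps to more than one value")
            index[literal] = pv_name
    return index


def build_gold_lookup(vendor_enum, model_enum):
    """Combine both inverted enums into one GOLD string -> (vendor, model) table."""
    vendors = invert_aliases(vendor_enum)
    models = invert_aliases(model_enum)
    return {
        literal: {
            "vendor_pv": vendors.get(literal),
            "model_pv": models.get(literal),
        }
        for literal in vendors.keys() | models.keys()
    }


GOLD_LOOKUP = build_gold_lookup(VENDOR_ENUM, MODEL_ENUM)
```

With this in place, `GOLD_LOOKUP["Illumina MiSeq"]` yields the vendor and model permissible values in one step. A GOLD string that is aliased in only one of the two enums gets `None` for the other side, which the ETL would need to handle explicitly.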
@turbomam the info is in the response body as `seqMethod`. The CV table is not from GOLD but from a system called the data warehouse, which is why you aren't seeing it listed as a GOLD CV.
> info is in the response body as seqMethod
Ha, I was searching for "sequence". Should have searched for the value, "Illumina HiSeq 2500-1TB"
These are the counts for Metagenome Drafts internally. Note that GOLD sometimes does some small manipulation of the values, and there is also no internal consistency between the CV table and what is in the all-inclusive report.
JGI DW count | sdm_actual_seq_model |
---|---|
18 | NextSeq MO |
41 | NextSeq HO |
100 | HiSeq-2500 Rapid V2 |
134 | HiSeq-2500 |
153 | HiSeq-2500 Rapid |
370 | MiSeq |
1203 | HiSeq-2000 |
1614 | HiSeq-2000 1TB |
2069 | NovaSeq |
2205 | NovaSeqX |
3316 | HiSeq-2500 1TB |
11694 | NovaSeq S4 |
Update: GOLD curates a list that is different from the JGI DW one. We will use the subset of common values, based on the JGI DW counts, for structured aliases. The GOLD list:
Illumina
Illumina GA
Illumina GAII
Illumina GAIIe
Illumina GAIIx
Illumina HiScanSQ
Illumina HiSeq
Illumina HiSeq 1000
Illumina HiSeq 1500
Illumina HiSeq 2000
Illumina HiSeq 2500
Illumina HiSeq 2500-1TB
Illumina HiSeq 2500-Rapid
Illumina HiSeq 3000
Illumina HiSeq 4000
Illumina HiSeq X Ten
Illumina iSeq 100
Illumina MiniSeq
Illumina MiSeq
Illumina NextSeq
Illumina NextSeq 500
Illumina NextSeq 550
Illumina NextSeq-HO
Illumina NextSeq-MO
Illumina NovaSeq
Illumina NovaSeq 6000
Illumina NovaSeq S2
Illumina NovaSeq S4
Illumina NovaSeq SP
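One way the "subset of common ones" could be found is sketched below. The matching rule is purely an assumption (drop the "Illumina" prefix, lowercase, strip spaces and hyphens); the thread does not say how the two vocabularies are to be reconciled, and the function names are hypothetical. The values are copied from the JGI DW counts table and the GOLD list above, and any match should still be reviewed by hand.

```python
import re

# JGI DW values and counts, copied from the counts table in this thread.
JGI_DW_COUNTS = {
    "NextSeq MO": 18, "NextSeq HO": 41, "HiSeq-2500 Rapid V2": 100,
    "HiSeq-2500": 134, "HiSeq-2500 Rapid": 153, "MiSeq": 370,
    "HiSeq-2000": 1203, "HiSeq-2000 1TB": 1614, "NovaSeq": 2069,
    "NovaSeqX": 2205, "HiSeq-2500 1TB": 3316, "NovaSeq S4": 11694,
}

# GOLD-curated values, copied from the list in this thread.
GOLD_VALUES = [
    "Illumina", "Illumina GA", "Illumina GAII", "Illumina GAIIe",
    "Illumina GAIIx", "Illumina HiScanSQ", "Illumina HiSeq",
    "Illumina HiSeq 1000", "Illumina HiSeq 1500", "Illumina HiSeq 2000",
    "Illumina HiSeq 2500", "Illumina HiSeq 2500-1TB",
    "Illumina HiSeq 2500-Rapid", "Illumina HiSeq 3000",
    "Illumina HiSeq 4000", "Illumina HiSeq X Ten", "Illumina iSeq 100",
    "Illumina MiniSeq", "Illumina MiSeq", "Illumina NextSeq",
    "Illumina NextSeq 500", "Illumina NextSeq 550", "Illumina NextSeq-HO",
    "Illumina NextSeq-MO", "Illumina NovaSeq", "Illumina NovaSeq 6000",
    "Illumina NovaSeq S2", "Illumina NovaSeq S4", "Illumina NovaSeq SP",
]


def normalize(value, vendor_prefix="Illumina"):
    """Assumed matching key: no vendor prefix, lowercase, alphanumerics only."""
    value = value.strip()
    if value.lower().startswith(vendor_prefix.lower()):
        value = value[len(vendor_prefix):]
    return re.sub(r"[^a-z0-9]", "", value.lower())


def common_values(gold_values, dw_counts):
    """Return (gold_value, dw_value, count) for values found in both lists,
    most frequent first."""
    gold_by_key = {normalize(g): g for g in gold_values}
    matches = []
    for dw_value, count in dw_counts.items():
        gold_value = gold_by_key.get(normalize(dw_value))
        if gold_value is not None:
            matches.append((gold_value, dw_value, count))
    return sorted(matches, key=lambda m: m[2], reverse=True)


for gold_value, dw_value, count in common_values(GOLD_VALUES, JGI_DW_COUNTS):
    print(f"{count:>6}  {gold_value}  <-  {dw_value}")
```

Under this rule, "HiSeq-2500 Rapid V2", "HiSeq-2000 1TB" and "NovaSeqX" have no GOLD counterpart and would need a manual decision.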