This is a proof-of-concept for embedding sample or experimental metadata in mzML or a compatible data model.
The idea behind this program was to see what kind of information might be portable between
an SDRF file and an mzML file. The SDRF file includes a comment[data file]
column which
tells you which raw file (or mzML file) a row describes. We can build the sampleList
for an mzML file from the SDRF rows that correspond to its source RAW file.
msconvert path/to/RAW | mzmeta path/to/sdrf > path/to/mzML
PXD017710 has been annotated with an SDRF file, describing
a TMT multiplexing experiment with multiple samples per MS run. If mzmeta
were run on 20200219_KKL_SARS_CoV2_pool1_F1.raw
being converted to mzML, it would pull in details describing each of the samples based upon the rows listed for that data file
in the SDRF file.
<sampleList count="9">
<sample id="sample_1" name="Sample 1">
<cvParam accession="BFO:0000040" cvRef="BFO" name="material type" value="cell"/>
<userParam type="xsd:string" name="assay name" value="run 1"/>
<cvParam accession="EFO:0005521" cvRef="EFO" name="technology type" value="proteomic profiling by mass spectrometry"/>
<cvParam accession="OBI:0100026" cvRef="OBI" name="organism" value="Homo sapiens"/>
<cvParam accession="EFO:0000635" cvRef="EFO" name="organism part" value="colon"/>
<userParam type="xsd:string" name="characteristics[sex]" value="male"/>
<cvParam accession="EFO:0000246" cvRef="EFO" name="age" value="72"/>
<cvParam accession="EFO:0000399" cvRef="EFO" name="developmental stage" value="adult"/>
<cvParam accession="HANCESTRO:0000004" cvRef="HANCESTRO" name="ancestry category" value="caucasian"/>
<cvParam accession="EFO:0000324" cvRef="EFO" name="cell type" value="not available"/>
<cvParam accession="EFO:0000408" cvRef="EFO" name="disease" value="colon cancer"/>
<userParam type="xsd:string" name="characteristics[cell line]" value="CaCo-2"/>
<userParam type="xsd:string" name="characteristics[infect]" value="bridge mixed pool"/>
<cvParam accession="EFO:0000721" cvRef="EFO" name="time" value="none"/>
<cvParam accession="EFO:0002091" cvRef="EFO" name="biological replicate" value="1"/>
<cvParam accession="MS:1002621" cvRef="MS" name="TMT reagent 131"/>
<cvParam accession="PRIDE:0000577" cvRef="PRIDE" name="file uri" value="https://ftp.pride.ebi.ac.uk/pride/data/archive/2020/03/PXD017710/20200219_KKL_SARS_CoV2_pool1_F1.raw"/>
<cvParam accession="MS:1000858" cvRef="MS" name="fraction identifier" value="1"/>
<cvParam accession="MS:1001808" cvRef="MS" name="technical replicate" value="1"/>
<userParam type="xsd:string" name="comment[modification parameters]" value="NT=TMT6plex;PP=Any N-term;AC=UNIMOD:737;MT=fixed"/>
<userParam type="xsd:string" name="comment[modification parameters]" value="NT=Carbamidomethyl;TA=C;AC=UNIMOD:4;MT=fixed"/>
<userParam type="xsd:string" name="comment[modification parameters]" value="NT=Oxidation;TA=M;AC=UNIMOD:35;MT=variable"/>
<userParam type="xsd:string" name="comment[modification parameters]" value="NT=13C6-15N4;TA=R;AC=UNIMOD:267;MT=variable"/>
<userParam type="xsd:string" name="comment[cleavage agent details]" value="NT=Trypsin"/>
<userParam type="xsd:string" name="comment[cleavage agent details]" value="NT=Lys-C"/>
<userParam type="xsd:string" name="comment[fragment mass tolerance]" value="not available"/>
<userParam type="xsd:string" name="comment[precursor mass tolerance]" value="not available"/>
<userParam type="xsd:string" name="factor value[infect]" value="bridge mixed pool"/>
<userParam type="xsd:string" name="factor value[time]" value="none"/>
</sample>
In some cases, I could map the columns to controlled vocabulary terms. In others, I chose to just roundtrip all the SDRF field details as-is lacking a clear non-lossy mechanism for encoding that information.