dfermin/lucXor

PepXML reader multiple hits for same spectrum_query

Closed this issue · 7 comments

Hi @dfermin @chhh :

In this line

curPSM = new PSM();
the reader creates a PSM from the pepXML. However, every PSM (spectrum_query) can have a set of hits associated with it (at least, pepXML allows this), and the reader only captures one of them (the last one, I guess from the logic), while the PTMs of all the hits are added to it.

In summary, if more than one hit is reported per PSM, the algorithm will fail.

Can we create a LucXor PSM for every hit instead, @dfermin @chhh?

Regards
Yasset
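To make the reported problem concrete, here is a minimal Python sketch (not lucXor's Java reader) that contrasts the two behaviours on a trimmed-down spectrum_query with two hits. The snippet, the spectrum name, and the function names `mods_per_query` and `mods_per_hit` are illustrative assumptions. Real pepXML files also declare an XML namespace, which this sketch leaves out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down spectrum_query holding two hits for one spectrum.
SNIPPET = """
<spectrum_query spectrum="example.02091.02091.2">
  <search_result>
    <search_hit hit_rank="1" peptide="AYLLLSLEGRWS">
      <modification_info>
        <mod_aminoacid_mass position="2" mass="243.029660"/>
        <mod_aminoacid_mass position="6" mass="166.998359"/>
      </modification_info>
    </search_hit>
    <search_hit hit_rank="1" peptide="AYLLLSLEGRWS">
      <modification_info>
        <mod_aminoacid_mass position="1" mass="151.003445"/>
        <mod_aminoacid_mass position="6" mass="166.998359"/>
      </modification_info>
    </search_hit>
  </search_result>
</spectrum_query>
""".strip()


def mods_per_query(query):
    """One record per spectrum_query: mods from every hit pile up together."""
    return [
        (int(m.get("position")), float(m.get("mass")))
        for m in query.iter("mod_aminoacid_mass")
    ]


def mods_per_hit(query):
    """One record per search_hit: each hit keeps only its own mods."""
    return [
        {
            "peptide": hit.get("peptide"),
            "mods": {
                int(m.get("position")): float(m.get("mass"))
                for m in hit.iter("mod_aminoacid_mass")
            },
        }
        for hit in query.iter("search_hit")
    ]


query = ET.fromstring(SNIPPET)
merged = mods_per_query(query)   # four mods attributed to a single PSM
per_hit = mods_per_hit(query)    # two records, two mods each
```

With one record per spectrum_query, the peptide ends up carrying modifications at positions 1, 2, and 6 (position 6 twice), which matches neither hit. With one record per search_hit, each candidate keeps its own modification set.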

This is the correct behavior. Luciphor was written to take as input a PeptideProphet XML file, which contains only a single hit per PSM. You are meant to run the TPP tools at least as far as producing the PeptideProphet XML file. So if you are getting this error, you haven't finished running the pipeline.

I have an interact file from TPP, and this is one of its spectrum_query sections:

<spectrum_query spectrum="SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.02091.02091.2" spectrumNativeID="controllerType=0 controllerNumber=1 scan=2091" start_scan="2091" end_scan="2091" precursor_neutral_mass="1744.609837" assumed_charge="2" index="140" retention_time_sec="682.3">
<search_result>
<search_hit hit_rank="1" peptide="AYLLLSLEGRWS" peptide_prev_aa="R" peptide_next_aa="G" protein="DECOY_sp|DHE3_BOVIN|" num_tot_proteins="1" num_matched_ions="1" tot_num_ions="22" calc_neutral_pep_mass="1742.616025" massdiff="1.993812" num_tol_term="1" num_missed_cleavages="1" num_matched_peptides="20305">
<modification_info modified_peptide="AY[243]LLLS[167]LEGRW[202]S[247]">
<mod_aminoacid_mass position="2" mass="243.029660" variable="79.966331" source="param"/>
<mod_aminoacid_mass position="6" mass="166.998359" variable="79.966331" source="param"/>
<mod_aminoacid_mass position="11" mass="202.074213" variable="15.994900" source="param"/>
<mod_aminoacid_mass position="12" mass="246.964690" variable="159.932662" source="param"/>
</modification_info>
<search_score name="xcorr" value="0.250"/>
<search_score name="deltacn" value="0.000"/>
<search_score name="deltacnstar" value="0.001"/>
<search_score name="spscore" value="1.7"/>
<search_score name="sprank" value="8"/>
<search_score name="expect" value="3.05E+01"/>
<analysis_result analysis="peptideprophet">
<peptideprophet_result probability="0.0000" all_ntt_prob="(0.0000,0.0000,0.0000)">
<search_score_summary>
<parameter name="fval" value="-8.4177"/>
<parameter name="ntt" value="1"/>
<parameter name="nmc" value="1"/>
<parameter name="massd" value="-7.384"/>
<parameter name="isomassd" value="2"/>
</search_score_summary>
</peptideprophet_result>
</analysis_result>
<analysis_result analysis="interprophet">
<interprophet_result probability="0" all_ntt_prob="(0,0,0)">
<search_score_summary>
<parameter name="nrs" value="0"/>
<parameter name="nsi" value="0"/>
<parameter name="nsm" value="0"/>
<parameter name="nsp" value="2.3533"/>
</search_score_summary>
</interprophet_result>
</analysis_result>
</search_hit>
<search_hit hit_rank="1" peptide="AYLLLSLEGRWS" peptide_prev_aa="R" peptide_next_aa="G" protein="DECOY_sp|DHE3_BOVIN|" num_tot_proteins="1" num_matched_ions="1" tot_num_ions="22" calc_neutral_pep_mass="1742.616025" massdiff="1.993812" num_tol_term="1" num_missed_cleavages="1" num_matched_peptides="20305">
<modification_info modified_peptide="A[151]YLLLS[167]LEGRW[202]S[247]">
<mod_aminoacid_mass position="1" mass="151.003445" variable="79.966331" source="param"/>
<mod_aminoacid_mass position="6" mass="166.998359" variable="79.966331" source="param"/>
<mod_aminoacid_mass position="11" mass="202.074213" variable="15.994900" source="param"/>
<mod_aminoacid_mass position="12" mass="246.964690" variable="159.932662" source="param"/>
</modification_info>
<search_score name="xcorr" value="0.250"/>
<search_score name="deltacn" value="0.000"/>
<search_score name="deltacnstar" value="0.000"/>
<search_score name="spscore" value="1.7"/>
<search_score name="sprank" value="8"/>
<search_score name="expect" value="3.05E+01"/>
</search_hit>
<search_hit hit_rank="1" peptide="CTVNALEVERQAQ" peptide_prev_aa="R" peptide_next_aa="H" protein="sp|K1H7_HUMAN|" num_tot_proteins="2" num_matched_ions="1" tot_num_ions="24" calc_neutral_pep_mass="1741.570867" massdiff="3.038970" num_tol_term="1" num_missed_cleavages="1" num_matched_peptides="20305">
<alternative_protein protein="sp|K1H8_HUMAN|" num_tol_term="1" peptide_prev_aa="R" peptide_next_aa="H"/>
<modification_info modified_peptide="C[143]T[261]VN[115]ALEVERQA[151]Q[129]">
<mod_aminoacid_mass position="1" mass="143.004100" static="57.021464" variable="-17.026549" source="param"/>
<mod_aminoacid_mass position="2" mass="260.980341" variable="159.932662" source="param"/>
<mod_aminoacid_mass position="4" mass="115.026943" variable="0.984016" source="param"/>
<mod_aminoacid_mass position="12" mass="151.003445" variable="79.966331" source="param"/>
<mod_aminoacid_mass position="13" mass="129.042594" variable="0.984016" source="param"/>
</modification_info>
<search_score name="xcorr" value="0.250"/>
<search_score name="deltacn" value="0.000"/>
<search_score name="deltacnstar" value="0.001"/>
<search_score name="spscore" value="1.6"/>
<search_score name="sprank" value="12"/>
<search_score name="expect" value="3.05E+01"/>
</search_hit>
<search_hit hit_rank="1" peptide="CTVNALEVERQAQ" peptide_prev_aa="R" peptide_next_aa="H" protein="sp|K1H7_HUMAN|" num_tot_proteins="2" num_matched_ions="1" tot_num_ions="24" calc_neutral_pep_mass="1741.570867" massdiff="3.038970" num_tol_term="1" num_missed_cleavages="1" num_matched_peptides="20305">
<alternative_protein protein="sp|K1H8_HUMAN|" num_tol_term="1" peptide_prev_aa="R" peptide_next_aa="H"/>
<modification_info modified_peptide="C[143]T[261]VN[115]ALEVERQ[129]A[151]Q">
<mod_aminoacid_mass position="1" mass="143.004100" static="57.021464" variable="-17.026549" source="param"/>
<mod_aminoacid_mass position="2" mass="260.980341" variable="159.932662" source="param"/>
<mod_aminoacid_mass position="4" mass="115.026943" variable="0.984016" source="param"/>
<mod_aminoacid_mass position="11" mass="129.042594" variable="0.984016" source="param"/>
<mod_aminoacid_mass position="12" mass="151.003445" variable="79.966331" source="param"/>
</modification_info>
<search_score name="xcorr" value="0.250"/>
<search_score name="deltacn" value="0.000"/>
<search_score name="deltacnstar" value="0.000"/>
<search_score name="spscore" value="1.6"/>
<search_score name="sprank" value="12"/>
<search_score name="expect" value="3.05E+01"/>
</search_hit>
<search_hit hit_rank="1" peptide="IEPADTNYRDRR" peptide_prev_aa="R" peptide_next_aa="T" protein="DECOY_sp|GELS_HUMAN|" num_tot_proteins="1" num_matched_ions="1" tot_num_ions="22" calc_neutral_pep_mass="1744.637246" massdiff="-0.027410" num_tol_term="2" num_missed_cleavages="1" num_matched_peptides="20305">
<modification_info modified_peptide="IEPADT[181]NY[323]RDRR">
<mod_aminoacid_mass position="6" mass="181.014010" variable="79.966331" source="param"/>
<mod_aminoacid_mass position="8" mass="322.995991" variable="159.932662" source="param"/>
</modification_info>
<search_score name="xcorr" value="0.250"/>
<search_score name="deltacn" value="1.000"/>
<search_score name="deltacnstar" value="0.000"/>
<search_score name="spscore" value="1.7"/>
<search_score name="sprank" value="9"/>
<search_score name="expect" value="3.05E+01"/>
</search_hit>
</search_result>
<

This output looks strange.
The <peptideprophet_result probability="0.0000" all_ntt_prob="(0.0000,0.0000,0.0000)"> line should only appear once per spectrum_query and should only represent a single "best hit" for a PSM.

This instance wouldn't even be read by Luciphor because the peptide probability is zero. Do you have a spectrum_query example with a probability > 0.95?

Looking at this file it should be fine. The first spectrum_query looks like this:

<spectrum_query spectrum="SF_200217_pPeptideLibrary_pool1_HCDOT_rep1.01643.01643.3" spectrumNativeID="controllerType=0 c
<search_result>
<search_hit hit_rank="1" peptide="SPQCTRVGFPPS" peptide_prev_aa="R" peptide_next_aa="L" protein="DECOY_sp|EGF_HUMAN|" nu
<modification_info modified_peptide="S[167]P[113]QCT[181]RVGFPP[113]S[167]">
<mod_aminoacid_mass position="1" mass="166.998359" variable="79.966331" source="param"/>
<mod_aminoacid_mass position="2" mass="113.047664" variable="15.994900" source="param"/>
<mod_aminoacid_mass position="4" mass="160.030649" static="57.021464"/>
<mod_aminoacid_mass position="5" mass="181.014010" variable="79.966331" source="param"/>
<mod_aminoacid_mass position="11" mass="113.047664" variable="15.994900" source="param"/>
<mod_aminoacid_mass position="12" mass="166.998359" variable="79.966331" source="param"/>
</modification_info>
<search_score name="xcorr" value="0.493"/>
<search_score name="deltacn" value="0.177"/>
<search_score name="deltacnstar" value="0.000"/>
<search_score name="spscore" value="6.5"/>
<search_score name="sprank" value="1"/>
<search_score name="expect" value="8.54E-01"/>
<analysis_result analysis="peptideprophet">
<peptideprophet_result probability="0.1439" all_ntt_prob="(0.0000,0.1439,0.6246)">
<search_score_summary>
<parameter name="fval" value="-4.8422"/>
<parameter name="ntt" value="1"/>
<parameter name="nmc" value="1"/>
<parameter name="massd" value="2.480"/>
<parameter name="isomassd" value="1"/>
</search_score_summary>
</peptideprophet_result>
</analysis_result>
<analysis_result analysis="interprophet">
<interprophet_result probability="0.0289756" all_ntt_prob="(0,0.0289756,0.228023)">
<search_score_summary>
<parameter name="nrs" value="-2.3909"/>
<parameter name="nsi" value="0"/>
<parameter name="nsm" value="0"/>
<parameter name="nsp" value="4.5023"/>
</search_score_summary>
</interprophet_result>
</analysis_result>
</search_hit>
</search_result>
</spectrum_query>

The extracted probability for this PSM should be 0.1439
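As a rough illustration of that extraction, the following Python sketch pulls the PeptideProphet probability out of a single search_hit. It is an assumption-laden sketch, not Luciphor's actual parser: the function name `peptideprophet_probability` and the shortened spectrum name are made up, the snippet keeps only the relevant elements, and the pepXML namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Reduced version of the spectrum_query above; only the analysis results are kept.
SNIPPET = """
<spectrum_query spectrum="example.01643.01643.3">
  <search_result>
    <search_hit hit_rank="1" peptide="SPQCTRVGFPPS">
      <analysis_result analysis="peptideprophet">
        <peptideprophet_result probability="0.1439" all_ntt_prob="(0.0000,0.1439,0.6246)"/>
      </analysis_result>
      <analysis_result analysis="interprophet">
        <interprophet_result probability="0.0289756" all_ntt_prob="(0,0.0289756,0.228023)"/>
      </analysis_result>
    </search_hit>
  </search_result>
</spectrum_query>
""".strip()


def peptideprophet_probability(hit):
    """Return the PeptideProphet probability of a search_hit, or None if it has none."""
    result = hit.find(
        "./analysis_result[@analysis='peptideprophet']/peptideprophet_result"
    )
    return None if result is None else float(result.get("probability"))


hit = ET.fromstring(SNIPPET).find("./search_result/search_hit")
probability = peptideprophet_probability(hit)
```

The lookup is restricted to the `analysis="peptideprophet"` block so that the iProphet probability (0.0289756 here) is not picked up by mistake.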

If a spectrum_query contains multiple search_hit entries, that would break the code, since it assumes there is only ever one search_hit entry per spectrum_query.

@dfermin so your point is that the code should fail if multiple hits are found? I'm asking because I'm changing the pepXML reader for the library that @chhh develops, and that is where I found this. I will reproduce the same logic. Another possibility is to take the rank-1 hit with the best probability score.

@dfermin @chhh after talking with the TPP guys: apparently multiple hits can be reported, and only one of them will have the peptideprophet_result. I have adapted the code in my branch to handle that case by importing only the hit that has a PeptideProphet probability.
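The selection rule described here could look roughly like the Python sketch below. This is not the code from the branch, which is Java and not shown in this thread. The function name `hit_with_peptideprophet`, the reduced snippet, and the choice to raise an error when more than one hit is scored are all assumptions, and the pepXML namespace is again omitted.

```python
import xml.etree.ElementTree as ET

# Reduced spectrum_query: three hits, only the first has a peptideprophet_result.
SNIPPET = """
<spectrum_query spectrum="example.02091.02091.2">
  <search_result>
    <search_hit hit_rank="1" peptide="AYLLLSLEGRWS">
      <analysis_result analysis="peptideprophet">
        <peptideprophet_result probability="0.0000"/>
      </analysis_result>
    </search_hit>
    <search_hit hit_rank="1" peptide="AYLLLSLEGRWS"/>
    <search_hit hit_rank="1" peptide="CTVNALEVERQAQ"/>
  </search_result>
</spectrum_query>
""".strip()


def hit_with_peptideprophet(query):
    """Return the single search_hit carrying a peptideprophet_result, or None."""
    scored = [
        hit
        for hit in query.iter("search_hit")
        if hit.find(".//peptideprophet_result") is not None
    ]
    if len(scored) > 1:
        # Assumption: more than one scored hit is treated as malformed input.
        raise ValueError("more than one search_hit has a peptideprophet_result")
    return scored[0] if scored else None


query = ET.fromstring(SNIPPET)
chosen = hit_with_peptideprophet(query)
chosen_probability = float(chosen.find(".//peptideprophet_result").get("probability"))
```

Hits without a peptideprophet_result are skipped entirely, so their modifications never get attached to the imported PSM.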