bigbio/quantms

SAGE search engine score is missing after psm re-scoring using percolator

Closed this issue · 22 comments

Description of the bug

SAGE's search engine score should be the hyperscore. pyOpenMS can extract it from the idXMLs produced by the search engine step, but after PSM re-scoring with Percolator it is missing from the idXMLs.

idXML before PSM re-scoring:

<PeptideIdentification score_type="hyperscore" higher_score_better="true" significance_threshold="0.0" MZ="988.485223533228918" RT="2776.47440000000006" spectrum_reference="controllerType=0 controllerNumber=1 scan=16570" >
	<PeptideHit score="4.372861652440132" sequence="LLGPSLTSTTPASSSSGSSSR" charge="2" aa_before="R" aa_after="G" start="363" end="383" protein_refs="PH_0" >
		<UserParam type="string" name="target_decoy" value="target"/>
		<UserParam type="string" name="ln(-poisson)" value="3.14389626700959"/>
		<UserParam type="string" name="ln(delta_best)" value="0.0"/>
		<UserParam type="string" name="ln(delta_next)" value="3.819337728782414"/>
		<UserParam type="string" name="ln(matched_intensity_pct)" value="3.5826106"/>
		<UserParam type="string" name="longest_b" value="9"/>
		<UserParam type="string" name="longest_y" value="18"/>
		<UserParam type="string" name="longest_y_pct" value="0.85714287"/>
		<UserParam type="string" name="matched_peaks" value="27"/>
		<UserParam type="string" name="scored_candidates" value="8222"/>
		<UserParam type="string" name="protein_references" value="unique"/>
	</PeptideHit>
	<UserParam type="string" name="PinSpecId" value="312"/>
</PeptideIdentification>

idXML after PSM re-scoring:

<PeptideIdentification score_type="Posterior Error Probability" higher_score_better="false" significance_threshold="0.0" MZ="988.485223533228918" RT="2776.47440000000006" spectrum_reference="controllerType=0 controllerNumber=1 scan=16570" >
	<PeptideHit score="4.70852e-08" sequence="LLGPSLTSTTPASSSSGSSSR" charge="2" aa_before="R" aa_after="G" start="363" end="383" protein_refs="PH_7718" >
		<UserParam type="string" name="target_decoy" value="target"/>
		<UserParam type="string" name="ln(-poisson)" value="3.14389626700959"/>
		<UserParam type="string" name="ln(delta_best)" value="0.0"/>
		<UserParam type="string" name="ln(delta_next)" value="3.819337728782414"/>
		<UserParam type="string" name="ln(matched_intensity_pct)" value="3.5826106"/>
		<UserParam type="string" name="longest_b" value="9"/>
		<UserParam type="string" name="longest_y" value="18"/>
		<UserParam type="string" name="longest_y_pct" value="0.85714287"/>
		<UserParam type="string" name="matched_peaks" value="27"/>
		<UserParam type="string" name="scored_candidates" value="8222"/>
		<UserParam type="string" name="protein_references" value="unique"/>
		<UserParam type="float" name="MS:1001492" value="2.65249"/>
		<UserParam type="float" name="MS:1001491" value="7.304600000000001e-04"/>
		<UserParam type="float" name="MS:1001493" value="4.70852e-08"/>
	</PeptideHit>
	<UserParam type="string" name="PinSpecId" value="312"/>
</PeptideIdentification>
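
A quick way to confirm the difference between the two records is to check where a given score name appears, either as the main `score_type` or as a `UserParam` on the hit. The sketch below uses only Python's standard `xml.etree.ElementTree` on trimmed copies of the two fragments above (it does not use pyOpenMS); `summarize_hit` and `has_score` are hypothetical helper names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two records above; most attributes and UserParams are omitted.
BEFORE = """<PeptideIdentification score_type="hyperscore" higher_score_better="true">
  <PeptideHit score="4.372861652440132" sequence="LLGPSLTSTTPASSSSGSSSR" charge="2">
    <UserParam type="string" name="matched_peaks" value="27"/>
  </PeptideHit>
</PeptideIdentification>"""

AFTER = """<PeptideIdentification score_type="Posterior Error Probability" higher_score_better="false">
  <PeptideHit score="4.70852e-08" sequence="LLGPSLTSTTPASSSSGSSSR" charge="2">
    <UserParam type="string" name="matched_peaks" value="27"/>
    <UserParam type="float" name="MS:1001492" value="2.65249"/>
    <UserParam type="float" name="MS:1001491" value="7.304600000000001e-04"/>
    <UserParam type="float" name="MS:1001493" value="4.70852e-08"/>
  </PeptideHit>
</PeptideIdentification>"""


def summarize_hit(fragment):
    """Collect the main score and the UserParam names of the first PeptideHit."""
    pep_id = ET.fromstring(fragment)
    hit = pep_id.find("PeptideHit")
    return {
        "main_score_type": pep_id.get("score_type"),
        "main_score": float(hit.get("score")),
        "meta_names": [p.get("name") for p in hit.findall("UserParam")],
    }


def has_score(summary, name):
    """True if `name` is the main score type or is stored as a UserParam."""
    return summary["main_score_type"] == name or name in summary["meta_names"]


if __name__ == "__main__":
    for label, fragment in (("before", BEFORE), ("after", AFTER)):
        summary = summarize_hit(fragment)
        print(label, summary["main_score_type"], has_score(summary, "hyperscore"))
```

On these fragments the check finds `hyperscore` only in the record before re-scoring: afterwards the main score is the posterior error probability, and no `UserParam` carries the hyperscore.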

Command used and terminal output

No response

Relevant files

No response

System information

No response

How is this with other search engines?

It might be because PSMFeatureExtractor can be and is skipped with Sage.

I guess we are not taking the SAGE output but the pin file from percolator?

@jpfeuffer @ypriverol It should be the SAGE search output. Comet and MSGF+ have their search scores stored as a MetaValue on every PeptideHit, but SAGE does not.

What does an idXML for Comet look like after PSMFeatureExtractor?

The Comet search engine score is xcorr -> MetaValue MS:1002252. It already exists before PSM re-scoring.

<PeptideIdentification score_type="Posterior Error Probability" higher_score_better="false" significance_threshold="0.0" MZ="474.761474031899979" RT="1815.299999999999955" spectrum_reference="controllerType=0 controllerNumber=1 scan=3727" >
	<PeptideHit score="0.990159" sequence="LSGATLQMK" charge="2" aa_before="K" aa_after="R" start="48" end="56" protein_refs="PH_1080" >
		<UserParam type="string" name="target_decoy" value="decoy"/>
		<UserParam type="string" name="MS:1002258" value="6"/>
		<UserParam type="string" name="MS:1002259" value="16"/>
		<UserParam type="string" name="num_matched_peptides" value="1060"/>
		<UserParam type="int" name="isotope_error" value="0"/>
		<UserParam type="float" name="MS:1002252" value="1.116"/>
		<UserParam type="float" name="MS:1002253" value="1.0"/>
		<UserParam type="float" name="MS:1002254" value="0.0"/>
		<UserParam type="float" name="MS:1002255" value="113.900000000000006"/>
		<UserParam type="float" name="MS:1002256" value="11.0"/>
		<UserParam type="float" name="MS:1002257" value="2.89"/>
		<UserParam type="string" name="protein_references" value="unique"/>
		<UserParam type="float" name="COMET:deltCn" value="1.0"/>
		<UserParam type="float" name="COMET:deltLCn" value="0.0"/>
		<UserParam type="float" name="COMET:lnExpect" value="1.061256502124341"/>
		<UserParam type="float" name="COMET:lnNumSP" value="6.966024187106113"/>
		<UserParam type="float" name="COMET:lnRankSP" value="2.397895272798371"/>
		<UserParam type="float" name="COMET:IonFrac" value="0.375"/>
		<UserParam type="float" name="MS:1001492" value="-0.641415"/>
		<UserParam type="float" name="MS:1001491" value="0.270715"/>
		<UserParam type="float" name="MS:1001493" value="0.990159"/>
	</PeptideHit>
</PeptideIdentification>

But this is after rescoring. I need to see before.

<PeptideIdentification score_type="expect" higher_score_better="false" significance_threshold="0.0" MZ="474.761474031899979" RT="1815.299999999999955" spectrum_reference="controllerType=0 controllerNumber=1 scan=3727" >
	<PeptideHit score="2.89" sequence="LSGATLQMK" charge="2" aa_before="K" aa_after="R" start="48" end="56" protein_refs="PH_9943" >
		<UserParam type="string" name="MS:1002258" value="6"/>
		<UserParam type="string" name="MS:1002259" value="16"/>
		<UserParam type="string" name="num_matched_peptides" value="1060"/>
		<UserParam type="int" name="isotope_error" value="0"/>
		<UserParam type="float" name="MS:1002252" value="1.116"/>
		<UserParam type="float" name="MS:1002253" value="1.0"/>
		<UserParam type="float" name="MS:1002254" value="0.0"/>
		<UserParam type="float" name="MS:1002255" value="113.900000000000006"/>
		<UserParam type="float" name="MS:1002256" value="11.0"/>
		<UserParam type="float" name="MS:1002257" value="2.89"/>
		<UserParam type="string" name="target_decoy" value="decoy"/>
		<UserParam type="string" name="protein_references" value="unique"/>
	</PeptideHit>
</PeptideIdentification>

Yes so the problem is that we actually use the Comet e-value as main score.
So you are just lucky that you picked a score that is not a main score for the other search engines.
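
The observation above suggests one possible workaround: copy the main score into a meta value before re-scoring, so that it survives when the main score is replaced. The following is a minimal sketch of that idea using only Python's standard library on a trimmed fragment. It is not what PercolatorAdapter or quantms does, `keep_main_score` is a hypothetical helper, and storing the value under the `score_type` name (here `hyperscore`) is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the SAGE record from the issue description.
SAGE_HIT = """<PeptideIdentification score_type="hyperscore" higher_score_better="true">
  <PeptideHit score="4.372861652440132" sequence="LLGPSLTSTTPASSSSGSSSR" charge="2">
    <UserParam type="string" name="matched_peaks" value="27"/>
  </PeptideHit>
</PeptideIdentification>"""


def keep_main_score(fragment):
    """Copy each hit's main score into a UserParam named after score_type.

    Hypothetical helper: the UserParam name is an assumption, not an OpenMS convention.
    """
    pep_id = ET.fromstring(fragment)
    score_type = pep_id.get("score_type")
    for hit in pep_id.findall("PeptideHit"):
        existing = {p.get("name") for p in hit.findall("UserParam")}
        if score_type not in existing:  # do not add the same score twice
            ET.SubElement(
                hit,
                "UserParam",
                {"type": "float", "name": score_type, "value": hit.get("score")},
            )
    return ET.tostring(pep_id, encoding="unicode")


if __name__ == "__main__":
    print(keep_main_score(SAGE_HIT))
```

With the score duplicated this way, a later step that overwrites `score` and `score_type` would leave the original value readable from the hit's meta values, which is the situation the Comet xcorr is already in.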

I think this problem is solved. @WangHong007, you should take the data from the SAGE id folder.

according to pipeline_info it is still using the old container

This is percolator no?

yes PercolatorAdapter

I honestly think we should just override the containers for all openms labelled processes until the release. I.e. make the dev profile active by default. Otherwise someone will always forget to change a process.

I actually think we should turn the containers used in every process into variables. Would that be possible? Something like:

openms_conda_string = "bioconda::openms=2.9.1"
openms_singularity_string = "ghcr.io/openms/openms-executables-sif:latest"
openms_docker_string = "ghcr.io/openms/openms-executables:latest"

I don't like it very much. You will just get confused because suddenly conda uses something different from docker etc. It also confuses users with yet another THREE parameters.
The only thing you will ever want is dev or latest. Nothing else.

I'm not sure what you have in mind. How can you make a profile the default? Can you send me an example so I can do it?

yes. just put it in base.config. The thing is just to remember to remove it when releasing

I was actually thinking of leaving it there and then, in nextflow.config, importing it or not depending on the release cycle. For example, in nextflow.config:

includeConfig 'conf/dev.config'

What do you think?

Yes but you need to find out if and how nextflow knows about its release cycle ;)
If it cannot know about it, then having to change one line every release is not much better than just changing 3 lines.