Create high confidence datasets for ER signal sequence (true positives and true negatives)
around 100? in each set?
I suggest using well-known proteins that already have a SigPep for the positive set.
The negative set should be fairly easy too.
Here are 111 likely true positives:
https://www.pombase.org/results/from/id/9f10f4a5-a24d-4da0-8e22-ebd15948e650
Can you see if you agree?
They seem good to me!
Here is a set of true negatives.
https://www.pombase.org/results/from/id/e8866138-2f9e-4896-8c2b-1501b53b2ae1
I included some mitochondrial proteins, as the predictors seem quite bad at those.
Phobius on my desktop predicts 76 of the 111 likely true positives have signal peptides.
It predicts that 2 of the true negatives have signal peptides (SPAC3G9.01 and SPBC16C6.08c), but SignalP predicts that those two don't.
SignalP in "fast" mode predicts 68. It finds one that Phobius doesn't (SPAC959.05c) and there are 9 that Phobius finds that "fast" SignalP doesn't:
- SPAC12G12.12
- SPAC1486.02c
- SPAC630.12
- SPAC8F11.10c
- SPBC1105.05
- SPBC4C3.08
- SPBC530.16
- SPBC8D2.17
- SPCC736.04c
In slow/accurate mode SignalP finds fewer matches: 63.
In that mode there are 14 found by Phobius that SignalP doesn't report:
- SPAC12G12.12
- SPAC1486.02c
- SPAC17G8.08c
- SPAC630.12
- SPAC8F11.10c
- SPBC1683.08
- SPBC4B4.08
- SPBC4C3.08
- SPBC530.16
- SPBC8D2.17
- SPBC947.10
- SPCC1235.14
- SPCC548.07c
- SPCC736.04c
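To see how the two Phobius-only lists differ between SignalP modes, here is a minimal Python sketch using set operations. The variable names are illustrative only; the gene IDs are copied from the two lists above.

```python
# Genes found by Phobius but not reported by SignalP, copied from the lists above.
missed_fast = {
    "SPAC12G12.12", "SPAC1486.02c", "SPAC630.12", "SPAC8F11.10c",
    "SPBC1105.05", "SPBC4C3.08", "SPBC530.16", "SPBC8D2.17", "SPCC736.04c",
}
missed_slow = {
    "SPAC12G12.12", "SPAC1486.02c", "SPAC17G8.08c", "SPAC630.12",
    "SPAC8F11.10c", "SPBC1683.08", "SPBC4B4.08", "SPBC4C3.08",
    "SPBC530.16", "SPBC8D2.17", "SPBC947.10", "SPCC1235.14",
    "SPCC548.07c", "SPCC736.04c",
}

# Missed in both modes.
missed_in_both = missed_fast & missed_slow
# Missed only in slow/accurate mode (fast mode reported these).
only_missed_in_slow = missed_slow - missed_fast
# Missed only in fast mode (slow/accurate mode reported these).
only_missed_in_fast = missed_fast - missed_slow

print("missed in both modes:", len(missed_in_both))
print("missed only in slow mode:", sorted(only_missed_in_slow))
print("missed only in fast mode:", sorted(only_missed_in_fast))
```

By this comparison, 8 genes are missed in both modes, slow/accurate mode picks up SPBC1105.05, and it drops 6 that fast mode reported.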
These are all expected to have signal peptides
and what is the threshold? Or is it a binary cut-off?
Phobius just reports signal peptide or not, but I haven't investigated whether there are any command-line options to tweak things.
For SignalP the cutoff seems to be a likelihood of 0.5.
Here are the likelihood scores for the 14 genes that SignalP says don't have signal peptides. Mostly very low. SignalP doesn't report the coordinates if the likelihood is less than the cutoff (0.5).
gene | likelihood |
---|---|
SPAC17G8.08c SPAC17G8.08c_gdt2_Golgi Ca_2___H___ antiporter Gdt2 | 0.4427 |
SPCC548.07c SPCC548.07c_ght1_plasma membrane high-affinity glucose_proton symporter Ght1 | 0.3333 |
SPCC736.04c SPCC736.04c_gma12_Golgi alpha-1,2-galactosyltransferase Gma12 | 0.2774 |
SPAC1486.02c SPAC1486.02c_dsc2_Golgi Dsc E3 ligase complex transmembrane subunit, C-terminal UBA domain Dsc2 | 0.1697 |
SPBC8D2.17 SPBC8D2.17_gmh4_Golgi alpha-1,6-galactosyltransferase Gmh4 | 0.1669 |
SPBC947.10 SPBC947.10_dsc1_Golgi Dsc E3 ligase complex subunit Dsc1 | 0.1666 |
SPAC630.12 SPAC630.12_ted2_GPI-remodelling mannose-ethanolamine phosphate phosphodiesterase Ted2 | 0.1666 |
SPCC1235.14 SPCC1235.14_ght5_plasma membrane high-affinity glucose_fructose_proton symporter Ght5 | 0.1251 |
SPBC1683.08 SPBC1683.08_ght4_plasma membrane hexose_proton symporter, unknown specificity Ght4 | 0.0122 |
SPAC12G12.12 SPAC12G12.12_gms2_Golgi UDP-galactose transmembrane transporter Gms2 | 0.0000 |
SPBC4B4.08 SPBC4B4.08_ght2_plasma membrane glucose_fructose_proton symporter Ght2 | 0.0000 |
SPAC8F11.10c SPAC8F11.10c_pvg1_Golgi pyruvyltransferase Pvg1 | 0.0000 |
SPBC4C3.08 SPBC4C3.08_otg2_alpha-1,3-galactosyltransferase Otg2 | 0.0000 |
SPBC530.16 SPBC530.16_ksh1_Golgi kish family protein Ksh1 | 0.0000 |
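To make the effect of the cutoff concrete, here is a small Python sketch that applies a threshold to the likelihoods in the table above. The function name and the lower cutoff of 0.3 are hypothetical, chosen only for illustration; this is not a SignalP option I have checked.

```python
# SignalP likelihoods copied from the table above (gene ID -> likelihood).
likelihoods = {
    "SPAC17G8.08c": 0.4427,
    "SPCC548.07c": 0.3333,
    "SPCC736.04c": 0.2774,
    "SPAC1486.02c": 0.1697,
    "SPBC8D2.17": 0.1669,
    "SPBC947.10": 0.1666,
    "SPAC630.12": 0.1666,
    "SPCC1235.14": 0.1251,
    "SPBC1683.08": 0.0122,
    "SPAC12G12.12": 0.0000,
    "SPBC4B4.08": 0.0000,
    "SPAC8F11.10c": 0.0000,
    "SPBC4C3.08": 0.0000,
    "SPBC530.16": 0.0000,
}

SIGNALP_CUTOFF = 0.5       # the cutoff SignalP seems to use
HYPOTHETICAL_CUTOFF = 0.3  # illustrative only, not a recommended value


def called_positive(scores, cutoff):
    """Return the sorted gene IDs whose likelihood is at or above the cutoff."""
    return sorted(gene for gene, score in scores.items() if score >= cutoff)


at_default = called_positive(likelihoods, SIGNALP_CUTOFF)
at_lower = called_positive(likelihoods, HYPOTHETICAL_CUTOFF)

print("called at 0.5:", at_default)
print("called at 0.3:", at_lower)
```

Even at the hypothetical 0.3 cutoff only SPAC17G8.08c and SPCC548.07c would be called, so lowering the threshold would not rescue most of these.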
I've just tried the "111 likely true positives" sequences in DeepSig and there were 69 matches. All of those were also predicted by Phobius. There were 17 predicted by Phobius that were not predicted by DeepSig.
Disappointing!
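For a quick side-by-side, here is a rough Python sketch of the sensitivity (fraction of the positive set predicted) for each tool, using the counts reported in this thread. It assumes all 111 proteins really are true positives; the dictionary layout is just illustrative.

```python
# Counts reported in this thread for the 111 likely true positives.
TOTAL_POSITIVES = 111  # assumption: every protein in the set is a true positive
predicted = {
    "Phobius": 76,
    "SignalP (fast)": 68,
    "SignalP (slow/accurate)": 63,
    "DeepSig": 69,
}

# Sensitivity = predicted positives / all true positives.
sensitivity = {tool: count / TOTAL_POSITIVES for tool, count in predicted.items()}

# Print the tools from most to least sensitive.
for tool, value in sorted(sensitivity.items(), key=lambda item: item[1], reverse=True):
    print(f"{tool}: {value:.1%}")
```

On these numbers Phobius comes out highest, and all four results fall between roughly 57% and 68%.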