pombase/pombase-chado

Create high confidence datasets for ER signal sequence (true positives and true negatives)

Opened this issue · 12 comments

Around 100 in each set?

Suggest use well known proteins that already have a SigPep for positive set

The negative set should be fairly easy too

Here are 111 likely true positives:
https://www.pombase.org/results/from/id/9f10f4a5-a24d-4da0-8e22-ebd15948e650
Can you see if you agree?

They seem good to me!

Here is a set of true negatives.
https://www.pombase.org/results/from/id/e8866138-2f9e-4896-8c2b-1501b53b2ae1
I included some mitochondrial proteins, as the predictors seem quite bad at those.

Phobius on my desktop predicts 76 of the 111 likely true positives have signal peptides.
It predicts that 2 of the true negatives have signal peptides (SPAC3G9.01 and SPBC16C6.08c), but SignalP predicts that those two don't.

SignalP in "fast" mode predicts 68. It finds one that Phobius doesn't (SPAC959.05c) and there are 9 that Phobius finds that "fast" SignalP doesn't:

  • SPAC12G12.12
  • SPAC1486.02c
  • SPAC630.12
  • SPAC8F11.10c
  • SPBC1105.05
  • SPBC4C3.08
  • SPBC530.16
  • SPBC8D2.17
  • SPCC736.04c
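
The counts above pin down the overlap between the two predictors. A minimal sketch of that arithmetic in Python; the variable names are illustrative and only the four counts come from this thread:

```python
# Counts reported in this thread for the 111 likely true positives.
TOTAL_POSITIVES = 111
phobius_total = 76   # predicted by Phobius
signalp_total = 68   # predicted by SignalP in "fast" mode
phobius_only = 9     # found by Phobius but not by "fast" SignalP
signalp_only = 1     # found by "fast" SignalP but not by Phobius (SPAC959.05c)

# Both routes to the shared count must agree if the reported numbers are consistent.
shared = phobius_total - phobius_only
assert shared == signalp_total - signalp_only

found_by_either = shared + phobius_only + signalp_only
missed_by_both = TOTAL_POSITIVES - found_by_either

print(f"shared: {shared}, either: {found_by_either}, missed by both: {missed_by_both}")
# prints "shared: 67, either: 77, missed by both: 34"
```

So, if the counts are right, 34 of the 111 expected signal peptides are missed by both Phobius and "fast" SignalP.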

In slow/accurate mode SignalP finds fewer matches: 63.

In that mode there are 14 found by Phobius that SignalP doesn't report:

  • SPAC12G12.12
  • SPAC1486.02c
  • SPAC17G8.08c
  • SPAC630.12
  • SPAC8F11.10c
  • SPBC1683.08
  • SPBC4B4.08
  • SPBC4C3.08
  • SPBC530.16
  • SPBC8D2.17
  • SPBC947.10
  • SPCC1235.14
  • SPCC548.07c
  • SPCC736.04c
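
To see what changes between the two SignalP modes, the two "found by Phobius only" lists can be compared as sets. This is a rough sketch; the gene IDs are copied from the lists in this thread and the variable names are made up for illustration:

```python
# Genes found by Phobius but not reported by SignalP, per mode.
missed_fast = {
    "SPAC12G12.12", "SPAC1486.02c", "SPAC630.12", "SPAC8F11.10c",
    "SPBC1105.05", "SPBC4C3.08", "SPBC530.16", "SPBC8D2.17", "SPCC736.04c",
}
missed_slow = {
    "SPAC12G12.12", "SPAC1486.02c", "SPAC17G8.08c", "SPAC630.12",
    "SPAC8F11.10c", "SPBC1683.08", "SPBC4B4.08", "SPBC4C3.08",
    "SPBC530.16", "SPBC8D2.17", "SPBC947.10", "SPCC1235.14",
    "SPCC548.07c", "SPCC736.04c",
}

missed_in_both_modes = missed_fast & missed_slow
newly_missed_in_slow = missed_slow - missed_fast  # lost when switching to slow/accurate
recovered_in_slow = missed_fast - missed_slow     # gained when switching to slow/accurate

for label, genes in [
    ("missed in both modes", missed_in_both_modes),
    ("newly missed in slow mode", newly_missed_in_slow),
    ("recovered in slow mode", recovered_in_slow),
]:
    print(f"{label}: {len(genes)} {sorted(genes)}")
```

On these lists, slow/accurate mode recovers only SPBC1105.05 and loses six genes that "fast" mode reported, which matches the drop from 68 to 63.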

These are all expected to have signal peptides

And what is the threshold? Or is it a binary cut-off?

Phobius just reports whether or not there is a signal peptide, but I haven't investigated whether there are any command-line options to tweak things.

For SignalP the cutoff seems to be a likelihood of 0.5.

Here are the likelihood scores for the 14 genes that SignalP says don't have signal peptides; most are very low. SignalP doesn't report the coordinates if the likelihood is below the cutoff (0.5).

gene          likelihood  description
SPAC17G8.08c  0.4427      SPAC17G8.08c_gdt2_Golgi Ca_2___H___ antiporter Gdt2
SPCC548.07c   0.3333      SPCC548.07c_ght1_plasma membrane high-affinity glucose_proton symporter Ght1
SPCC736.04c   0.2774      SPCC736.04c_gma12_Golgi alpha-1,2-galactosyltransferase Gma12
SPAC1486.02c  0.1697      SPAC1486.02c_dsc2_Golgi Dsc E3 ligase complex transmembrane subunit, C-terminal UBA domain Dsc2
SPBC8D2.17    0.1669      SPBC8D2.17_gmh4_Golgi alpha-1,6-galactosyltransferase Gmh4
SPBC947.10    0.1666      SPBC947.10_dsc1_Golgi Dsc E3 ligase complex subunit Dsc1
SPAC630.12    0.1666      SPAC630.12_ted2_GPI-remodelling mannose-ethanolamine phosphate phosphodiesterase Ted2
SPCC1235.14   0.1251      SPCC1235.14_ght5_plasma membrane high-affinity glucose_fructose_proton symporter Ght5
SPBC1683.08   0.0122      SPBC1683.08_ght4_plasma membrane hexose_proton symporter, unknown specificity Ght4
SPAC12G12.12  0.0000      SPAC12G12.12_gms2_Golgi UDP-galactose transmembrane transporter Gms2
SPBC4B4.08    0.0000      SPBC4B4.08_ght2_plasma membrane glucose_fructose_proton symporter Ght2
SPAC8F11.10c  0.0000      SPAC8F11.10c_pvg1_Golgi pyruvyltransferase Pvg1
SPBC4C3.08    0.0000      SPBC4C3.08_otg2_alpha-1,3-galactosyltransferase Otg2
SPBC530.16    0.0000      SPBC530.16_ksh1_Golgi kish family protein Ksh1
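
To get a feel for how far the cutoff would have to drop to pick up these genes, the likelihood scores listed above can be swept against a few thresholds. This is only a sketch: the `genes_passing` helper is hypothetical, and treating a score at or above the cutoff as a hit is an assumption about how SignalP applies it. It also says nothing about how many true negatives a lower cutoff would let through.

```python
# SignalP likelihoods for the 14 genes, copied from the scores above.
likelihoods = {
    "SPAC17G8.08c": 0.4427, "SPCC548.07c": 0.3333, "SPCC736.04c": 0.2774,
    "SPAC1486.02c": 0.1697, "SPBC8D2.17": 0.1669, "SPBC947.10": 0.1666,
    "SPAC630.12": 0.1666, "SPCC1235.14": 0.1251, "SPBC1683.08": 0.0122,
    "SPAC12G12.12": 0.0000, "SPBC4B4.08": 0.0000, "SPAC8F11.10c": 0.0000,
    "SPBC4C3.08": 0.0000, "SPBC530.16": 0.0000,
}


def genes_passing(scores, cutoff):
    """Return gene IDs scoring at or above the cutoff, highest score first."""
    # Assumption: a score equal to the cutoff counts as a hit.
    passing = [gene for gene, score in scores.items() if score >= cutoff]
    return sorted(passing, key=lambda gene: scores[gene], reverse=True)


for cutoff in (0.5, 0.4, 0.3, 0.2, 0.1):
    hits = genes_passing(likelihoods, cutoff)
    print(f"cutoff {cutoff:.1f}: {len(hits)} of {len(likelihoods)} recovered")
```

Even at a cutoff of 0.1 only 8 of the 14 would be recovered, since five of them score 0.0000 and one scores 0.0122.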

I've just tried the "111 likely true positives" sequences in DeepSig and there were 69 matches. All of those were also predicted by Phobius. There were 17 predicted by Phobius that were not predicted by DeepSig.

Disappointing!
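
For a side-by-side view, the hit counts reported in this thread can be turned into sensitivities on the 111 likely true positives. A quick sketch, where the dictionary layout and labels are just for illustration and only the counts come from the comments above:

```python
# Number of the 111 likely true positives each tool predicted to have a signal peptide.
TOTAL_POSITIVES = 111
predicted = {
    "Phobius": 76,
    'SignalP ("fast")': 68,
    "SignalP (slow/accurate)": 63,
    "DeepSig": 69,
}

# Sensitivity = fraction of expected positives that the tool recovers.
sensitivity = {tool: hits / TOTAL_POSITIVES for tool, hits in predicted.items()}

for tool, value in sorted(sensitivity.items(), key=lambda item: item[1], reverse=True):
    print(f"{tool:<24} {predicted[tool]:>3}/{TOTAL_POSITIVES}  {value:.1%}")
```

This only covers the positive set; the false-positive side would need the same counts for the true negatives.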