smortezah/smashpp

How to optimize for my genome of interest???

ondrej77 opened this issue · 6 comments

Hi,
I really enjoyed reading your paper and thought smashpp would be very useful for some of my investigations. I tested the program on viral sequences of interest, but I am not getting the results I expected. Briefly, smashpp does not seem to recognize many parts of the two viral genomes being compared, so it does not produce a visualization connecting all the viral blocks. I am probably not setting the program parameters correctly, so I wanted to ask how to optimize the smashpp parameters for my viral sequences of interest. I ran smashpp with default settings and could not figure out how to tweak the model parameters. I also ran Mauve and got what I expected (image attached). Could you please help me get similar results with smashpp? Example files are attached. Your help is much appreciated.

info:

  • comparing reference HBV genome (gtC.fasta) to integrated HBV DNA in human (seq_sample1.fasta)
  • divergence could be up to 15-20% for the viral portion
  • the integrated HBV DNA FASTA file (seq_sample1.fasta) is part viral and part human

Sincerely, Ondrej
SMASH_example.zip

joqb commented

Hi,
I would also appreciate some more details on the parameters. For example, what is the unit used for the -m parameter, and how should one test different values for -rm and -tm?

Cheers,
Jonathan

Hi, any advice please?

Dear Ondrej and Jonathan,

You may try the following for the first two questions:

smashpp -v -rm 14,0,0.1,0.95/8,0,0.1,0.95 -rb 10 -re 10 -tb 10 -te 10 -f 30 -m 3 -th 1.92 -t seq_sample1.fasta -e 1.0 -r gtC.fasta
smashpp -viz -o output.svg gtC.fasta.seq_sample1.fasta.pos

(the procedure should be identical for the other cases)
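In general, the workflow has two steps, following the pattern of the example above: the first command compares the reference (-r) and target (-t) and writes a position file named <reference>.<target>.pos, and the second renders that file to SVG. With placeholder file names:

smashpp -v [model and filtering options] -r REFERENCE.fasta -t TARGET.fasta
smashpp -viz -o OUTPUT.svg REFERENCE.fasta.TARGET.fasta.pos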

More info:

-rm 14,0,0.1,0.95/8,0,0.1,0.95
Uses a mixture of two models:

  • a regular context model of size 14 with an estimator parameter of 0.1
  • a substitution-tolerant context model of size 14, allowing 8 substitutions (out of 14 bases), with an estimator parameter of 0.1.

-rm [NB_C],0,0.1,[NB_G]/[NB_S],0,0.1,0.95

       [NB_C]: (integer [1;20]) order size of the regular context
               model. Higher values use more RAM but are usually
               associated with a better compression score.

       [NB_G]: (real [0;1)) real number defining gamma. This value
               represents the decaying forgetting factor of the
               regular context model being defined.

       [NB_S]: (integer [0;20]) maximum number of edits allowed
               for a substitution-tolerant model that shares the
               memory of the regular context model with order size
               equal to [NB_C]. The value 0 turns the tolerant
               context model off. When the model is on, it pauses
               when the number of edits exceeds [NB_S], and it is
               turned on again when a complete match of size [NB_C]
               is seen. This probabilistic-algorithmic model is very
               useful for handling the highly substitutional nature
               of genomic sequences. When [NB_S] > 0, the compressor
               uses more processing time, but the same RAM, and
               usually achieves a substantially higher compression
               ratio. The impact of this model is usually only
               noticeable for [NB_C] >= 14.
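To test different -rm settings in practice (Jonathan's question above), here is a minimal sketch of a parameter sweep over the tolerant-model edit threshold [NB_S]; the other flags and file names are copied from the command above, and the sweep values are only illustrative:

#!/bin/bash
# Sweep the tolerant-model edit threshold [NB_S] and render one map per value.
for s in 0 2 4 8; do
  smashpp -v -rm 14,0,0.1,0.95/$s,0,0.1,0.95 -rb 10 -re 10 -tb 10 -te 10 \
    -f 30 -m 3 -th 1.92 -e 1.0 -r gtC.fasta -t seq_sample1.fasta
  # Each run overwrites the same .pos file, so save a separate SVG per value.
  smashpp -viz -o output_s$s.svg gtC.fasta.seq_sample1.fasta.pos
done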

Also, you may find more information in the article's supplementary material.

Best regards,
Diogo

Dear Diogo,

I have a similar question about running smashpp. The pair of sequences I am working with are assembled chromosomes from plants, each around 40 Mbp long. With MUMmer4 I did an alignment-based analysis and found abundant SNPs, small indels, and SVs. I have run smashpp using the parameters for human from your paper (-rm 14,0,0.001,0.95). Is this reasonable in your opinion?
I am also not sure about the settings for -f, -d, and -m. In your Python script, the settings for "QUANTITY" and "QUALITY" make a huge difference in the results. Currently I use "-f 100 -d 10000 -m 100000". Is this reasonable if my focus is finding large inversions (100 Kbp and above)?
Many thanks in advance,
Kind regards.
Weihong

Dear all,

An update: for the plants, the following parameters worked well:

#!/bin/bash
./smashpp -v -nr -rm 20,0,0.001,0.95/5,0,0.001,0.95 -f 100000 -d 1 -m 20000 -th $3 -r $1 -t $2
./smashpp -viz -nrr -nr -m 10000 -o $1.$2.$3.svg $1.$2.pos
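Here $1 is the reference FASTA, $2 the target FASTA, and $3 the threshold passed to -th. A hypothetical invocation, assuming the script is saved as run_smashpp.sh (the file names are placeholders):

./run_smashpp.sh chrA.fasta chrB.fasta 1.9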

Best regards,
Diogo