How to optimize for my genome of interest???
ondrej77 opened this issue · 6 comments
Hi,
I really enjoyed reading your paper and I thought smashpp would be very useful for some of my investigations. I tested the program on viral sequences of interest, but I am not getting the results I expected. Briefly, it seems that smashpp is not recognizing many parts of the two compared viral genomes and is not producing a visualization that connects all the viral blocks. I am probably not setting the program parameters correctly, so I wanted to ask how to optimize the smashpp parameters for my viral sequences. I ran smashpp with default settings and could not figure out how to tweak the model parameters. I also ran Mauve and got what I expected (image attached). Could you please help me get similar results with smashpp? Example files are attached. Your help is much appreciated.
info:
- comparing reference HBV genome (gtC.fasta) to integrated HBV DNA in human (seq_sample1.fasta)
- divergence could be up to 15-20% for the viral portion
- the integrated HBV DNA FASTA file (seq_sample1.fasta) is part viral and part human
Sincerely, Ondrej
SMASH_example.zip
Hi,
I would also appreciate some more details on the parameters. For example, what is the unit used for the -m parameter, and how should one test different parameters for -rm and -tm?
Cheers,
Jonathan
Hi, any advice please?
Dear Ondrej and Jonathan,
you may try the following for the first two:
smashpp -v -rm 14,0,0.1,0.95/8,0,0.1,0.95 -rb 10 -re 10 -tb 10 -te 10 -f 30 -m 3 -th 1.92 -t seq_sample1.fasta -e 1.0 -r gtC.fasta ;
smashpp -viz -o output.svg gtC.fasta.seq_sample1.fasta.pos
(the result should be comparable to what you obtained with the other tools)
More info:
-rm 14,0,0.1,0.95/8,0,0.1,0.95
Uses a mixture of two models:
- a regular context model of order 14 with an estimator parameter of 0.1;
- a substitution-tolerant context model of the same order (14), allowing up to 8 substitutions (out of the 14 bases), also with an estimator parameter of 0.1.
-rm [NB_C],0,0.1,[NB_G]/[NB_S],0,0.1,0.95
[NB_C]: (integer [1;20]) order size of the regular context
        model. Higher values use more RAM but are usually
        associated with a better compression score.
[NB_G]: (real [0;1)) real number defining gamma. This value
        represents the decaying forgetting factor of the
        regular context model being defined.
[NB_S]: (integer [0;20]) maximum number of editions allowed
        in a substitution-tolerant model that shares the
        memory of the regular context model with order size
        equal to [NB_C]. The value 0 turns the tolerant
        context model off. When the model is on, it pauses
        when the number of editions exceeds [NB_S], and it is
        turned on again when a complete match of size [NB_C]
        is seen. This is a probabilistic-algorithmic model
        that is very useful for handling the highly
        substitutional nature of genomic sequences. When
        [NB_S] > 0, the compressor uses more processing time
        but the same RAM, and usually achieves a substantially
        higher compression ratio. The impact of this model is
        usually only noticeable for [NB_C] >= 14.
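If you want to test different -rm settings systematically (as asked above), a minimal sketch along these lines could help; the specific [NB_C]/[NB_S] values and output names below are only illustrative, not recommendations:

#!/bin/bash
# Sketch: sweep a few regular/tolerant model combinations on the files
# from this thread and keep one SVG per combination (values illustrative).
for k in 12 14 16; do          # [NB_C]: order of the regular context model
  for s in 3 5 8; do           # [NB_S]: editions allowed by the tolerant model
    ./smashpp -v -rm $k,0,0.1,0.95/$s,0,0.1,0.95 -f 30 -m 3 -th 1.92 \
              -r gtC.fasta -t seq_sample1.fasta
    ./smashpp -viz -o gtC_vs_sample1_k${k}_s${s}.svg gtC.fasta.seq_sample1.fasta.pos
  done
done

Each run writes the same .pos file, so the SVG is renamed per combination to keep the resulting maps apart.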
You can also find more information in the supplementary material of the article.
Best regards,
Diogo
Dear Diogo,
I have a similar question about running smashpp. The pair of sequences I am working with are assembled plant chromosomes, each around 40 Mbp long. With MUMmer4 I did an alignment-based analysis and found abundant SNPs, small indels and SVs. I have run smashpp using the parameters for human from your paper (-rm 14,0,0.001,0.95). Is this reasonable in your opinion?
I am also not sure about the settings for -f, -d and -m. In your Python script, the settings for "QUANTITY" and "QUALITY" make a huge difference in the results. Currently I use "-f 100 -d 10000 -m 100000". Is this reasonable if my focus is on finding large inversions (100 kbp and above)?
Many thanks in advance,
Kind regards.
Weihong
Dear all,
An update: for the plants, the following parameters worked well:
#!/bin/bash
# $1 = reference FASTA (-r), $2 = target FASTA (-t), $3 = threshold (-th)
./smashpp -v -nr -rm 20,0,0.001,0.95/5,0,0.001,0.95 -f 100000 -d 1 -m 20000 -th $3 -r $1 -t $2
./smashpp -viz -nrr -nr -m 10000 -o $1.$2.$3.svg $1.$2.pos
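Assuming the script above is saved as, say, run_smashpp.sh (the script name, file names and threshold here are purely illustrative), a call would look like

./run_smashpp.sh chr1_ref.fasta chr1_query.fasta 1.9

i.e. reference FASTA, target FASTA and the -th threshold as the three positional arguments.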
Best regards,
Diogo