pro scan context_length = 9 failed, shorter `-c` induces systematic errors
ruolin opened this issue · 4 comments
Hello, I am using msisensor-pro to scan homopolymer sites. When I increase the context length from 5 (the default) to 9, the homopolymer A results from *_all are dropped. With a context length of 8, however, everything is fine. There is an intrinsic problem with using a short context length, which is why I increased it to 9. If you are interested, we can discuss that problem, but it is a separate topic.
The commands I use (pro version 1.2.0):
msisensor-pro scan -d $HG19 -o hg19_hp.tsv -l 8 -c 9 -p 1
msisensor-pro pro -d hg19_hp.tsv -t $BAM -c 1 -x 1 -b 4 -o regular -e $BED -i 0.1 -l 5
I just checked the code. It seems that the context length cannot be larger than 8, since you use bit16 to store the context (presumably at 2 bits per base, so 16 bits hold at most 8 bases):
bit16_t flankH = 0;
bit16_t flankT = 0;
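To illustrate the suspected limit, here is a minimal sketch, assuming each base is packed into 2 bits of a 16-bit value. The names `baseCode` and `packContext16` and the A/C/G/T code assignment are hypothetical and not taken from the msisensor-pro source. With 9 bases, the first base is shifted out of the 16 bits, so contexts that differ only in that base collapse to the same value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical 2-bit code per base: A=0, C=1, G=2, T=3.
// This mapping is an assumption for illustration only.
inline std::uint16_t baseCode(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:
            assert(b == 'T');  // only A/C/G/T are handled in this sketch
            return 3;
    }
}

// Pack a flanking context into 16 bits, 2 bits per base.
// Anything beyond 8 bases pushes the leading bases out of the value.
inline std::uint16_t packContext16(const std::string& ctx) {
    std::uint16_t packed = 0;
    for (char b : ctx) {
        packed = static_cast<std::uint16_t>((packed << 2) | baseCode(b));
    }
    return packed;
}
```

Under these assumptions, `packContext16("CAAAAAAA")` and `packContext16("GAAAAAAA")` differ, while the 9-base contexts `"CAAAAAAAA"` and `"GAAAAAAAA"` pack to the same value. A wider integer type (for example 32 bits) would be one way to allow longer contexts.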
Thanks, I will update this in the next version, and you are welcome to open a pull request!
This is the problem with using a short context to scan for a pattern, in this case the 5-mer used by default. When a context occurs more than once in a read, the match becomes ambiguous.
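A small sketch of what I mean, using a made-up read and a hypothetical helper `countOccurrences` (neither comes from msisensor-pro). It only counts how often a flanking context appears in a read. A 5-mer can easily appear twice, while a longer flank around the same site may be unique.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count (possibly overlapping) occurrences of `context` in `read`.
// Hypothetical helper for illustration only.
inline std::size_t countOccurrences(const std::string& read,
                                    const std::string& context) {
    assert(!context.empty());  // an empty context is not meaningful here
    std::size_t count = 0;
    for (std::size_t pos = read.find(context);
         pos != std::string::npos;
         pos = read.find(context, pos + 1)) {
        ++count;
    }
    return count;
}

// Made-up read with two poly-A runs, each preceded by the 5-mer "GATTC".
inline std::string exampleRead() {
    return "ACGATTCAAAAAGATTCAAAAAAAATGCA";
}
```

In this made-up read, `countOccurrences(exampleRead(), "GATTC")` finds the 5-mer twice, so it cannot tell the two poly-A runs apart, whereas the 9-mer `"AAAAGATTC"` in front of the second run appears only once.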
The scan model currently has difficulty handling this kind of complex region in the genome, but this has little effect on MSI detection. If you have some ideas, I would be very glad to discuss them with you!