Incorrect insertions positions with VCF
Opened this issue · 2 comments
Hello Nils,
I have been using DWGSIM version 0.1.14.
My goal is to simulate Insertion ranging from length 10 to 200, and being able to add specific sequences of insertion
First test simulation with VCF
In my first try, I used a VCF with -v
option and an insertion length of 198, the simulation gave me the correct location for my insertion after aligning the reads with BWA and called the variant with variant callers.
I only put one insertion in my VCF.
Second test of Simulation with VCF
My next try was to used different insertions length with the same workflow tested with my first try. I unfortunately noticed that none of the new insertions were correctly simulated.
- The genomic coordinates were shifted upward to the location originally present in the
VCF
, and the inserted bases got modified compared to my inputted ones. - the recaptured inserted sequences were not the one originally put in the
VCF
.
### Test simulation with BED
Based on previous issues described in this repo, I decided to give a shot to the BED file as well to simulate my insertions.
After trial and errors in making the BED
file, I got the message:
Error: insertion of length 36 exceeded the maximum supported length of 26
but when using the command dwgsim --help
, there is a note that mentions that the longest insertion supported is 4294967295, I cite:
Note: The longest supported insertion is 4294967295.
I tried with an insertion of 25 and it worked as expected (I shrank my original insertion to a length of 25 (26 = 1 + 25; i.e., the REF plus all the new 25 inserted bases)).
case 1
if the bed file has the following line,
chr13 28034117 28034143 GATCATATTCATATTCTCTGAAATCA INSERT
G is the REF at position 28034118, here the first G, and the inserted bases I want in are ATCATATTCATATTCTCTGAAATCA
, i.e. 25 bases. I get this output in the dwgsim's mutations.vcf file:
chr13 28034117 . A AAATCATATTCATATTCTCTGAAATC 100 PASS AF=1.0;pl=3;mt=INSERT
case 2
If I put the following line in the bed file,
chr13 28034118 28034144 GATCATATTCATATTCTCTGAAATCA INSERT
I get this output in the dwgsim's mutations.vcf file:
chr13 28034119 . A AGATCATATTCATATTCTCTGAAATCA 100 PASS AF=1.0;pl=3;mt=INSERT
case 3
If I put the following line in the bed file
chr13 28034118 **28034143** ATCATATTCATATTCTCTGAAATCA INSERT
I get this output in the dwgsim's mutations.vcf file:
chr13 28034119 . A AATCATATTCATATTCTCTGAAATCA 100 PASS AF=1.0;pl=3;mt=INSERT
Knowing that BED file is 0-based, the case 1 above should have been the correct case to use to get the insertion being inserted right after the G on position 28034118.
Impact of the SEED
value on the AF
and pl
values in the dwgsim
's mutations.vcf
file
I have been using the -F
to control the allele frequency specifically;
With using a BED file and -b
option, I noticed that the values of AF
were not the 0.5 all the time, but that value changes when using a different seed value; I woulf have expected not to be modified since I stiplated on the command line that I wanted AF=0.5 specifically using the -F
option; Unless I am mistaken on the correlation between the two values of -F
and AF
in the mutations output file.
Questions:
- Is there anything wrong with the way the INSERT line in the BED file is constructed?
- Is there any option that could make
dwgsim
working with a BED file and with insertions longer than 26? - Is there any option that should be adjusted to make it work with a VCF using
-v
? - Is
AF
only fixed and not dependent on the seed value when using the-v
option and not the-b
? - Is the longest insertion of 4 billion bases only supported when using a VCF? otherwise what is the max length of insertion supported when using a VCF?
- Could you let me know what would be the best way (i.e.
dwgsim
command) to simulate exactly these insertions? Same exact coordinate and same exact sequence as the one in the provided BED or VCF. Thanks.
I am providing an example of one of my VCF with an insertion of 81 bases.
The Reference genome was only the complete sequence of the chromosome 13.
Homo sapiens chromosome 13, GRCh38.p14 Primary Assembly
NCBI Reference Sequence: NC_000013.11
Example of dwgsim command
dwgsim -d 486 -s 95 -1 150 -2 150 -C 20.0 -y 0 -R 0 -X 0 -I 30 -n 0 -r 0.00000000000000001 -c 0 -M -S 2 -z 28034113 -o 1 -F 0.50 -b insertion81.bed GRCh38_chr13_complete_sequence.fa myoutPrefix
Let me know what further information you may need.
Thanks for your feedback,
Chris.
inputs_insertion.zip
I’ll be honest that I have very little time for open source unpaid support given my work and family commitments. My apologies in advance, and feel free to reach out to www.fulcrumgenomics.com if you’d like paid support, though we are also booked out pretty far. I’ll keep this open so I can get to it if I am able to come up for air.
Thanks for your quick and honest reply.