Illegal character in reference sequence (W base and possibly other noncanonical bases)
dantaki opened this issue · 2 comments
dantaki commented
I was running Paragraph on hg38 genomes and ran into this error:
Exception: chr3:90549400:<INV> illegal character in reference sequence
Traceback:
Traceback (most recent call last):
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 199, in add_record
    alt_sequence = ref_sequence[0] + reverse_complement(inv_ref)
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 436, in reverse_complement
    return ''.join([complement[x] for x in seq[::-1]])
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 436, in <listcomp>
    return ''.join([complement[x] for x in seq[::-1]])
KeyError: 'W'
$ samtools faidx /home/dantakli/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa chr3:90549400-91081922 | grep "W"
AAGTTTCTGAGAATCATTCTCTCTTGTTTTTCTGTGAAGWTATTGCCTTTTCTACCATAG
For now I will likely skip this SV, but I wanted to let you know that Paragraph doesn't seem to support noncanonical bases.
traxexx commented
You're right. Noncanonical bases are not supported by Paragraph. Graph alignment against these bases needs more careful handling.
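For illustration only, the reverse-complement step that raised the KeyError could in principle be made aware of IUPAC ambiguity codes. The sketch below is not Paragraph's code: the name `reverse_complement_iupac` and the translation table are assumptions, and it only covers the string operation, not the graph alignment concern mentioned above.

```python
# Hypothetical helper, not part of Paragraph.
# Maps every IUPAC nucleotide code to its complement:
# A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W and N are self-complementary.
_IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement_iupac(seq):
    """Return the reverse complement of seq, keeping IUPAC ambiguity codes.

    Characters missing from the table are passed through unchanged,
    so callers may still want to validate their input separately.
    """
    return seq.translate(_IUPAC_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # A short stretch around the 'W' from the samtools output in this issue.
    print(reverse_complement_iupac("GAAGWTATT"))
```

Whether emitting ambiguity codes into the ALT sequence is actually safe for the downstream aligner is a separate question that this sketch does not answer.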
dantaki commented
Perhaps it's best to warn the user and then skip the variant instead of crashing?
Thanks!
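A minimal sketch of the warn-and-skip behaviour suggested above, under stated assumptions: the function names, the record tuple layout, and the `fetch_ref` callable are all hypothetical and not taken from Paragraph, and treating `ACGTN` as the supported alphabet is a guess (the traceback only shows that `W` is rejected).

```python
import logging

# Assumption: bases the tool can handle. Only 'W' is known to fail.
SUPPORTED_BASES = frozenset("ACGTN")


def find_unsupported_bases(seq, allowed=SUPPORTED_BASES):
    """Return the sorted list of distinct characters in seq outside `allowed`."""
    return sorted(set(seq.upper()) - allowed)


def filter_records(records, fetch_ref):
    """Keep only records whose reference span uses supported bases.

    records   -- iterable of (variant_id, chrom, start, end) tuples (hypothetical layout)
    fetch_ref -- hypothetical callable (chrom, start, end) -> reference sequence string
    """
    kept = []
    for variant_id, chrom, start, end in records:
        bad = find_unsupported_bases(fetch_ref(chrom, start, end))
        if bad:
            # Warn and skip instead of letting a KeyError abort the whole run.
            logging.warning(
                "Skipping %s: unsupported base(s) %s in reference sequence",
                variant_id, ",".join(bad),
            )
            continue
        kept.append((variant_id, chrom, start, end))
    return kept
```

Checking the reference span before building the graph keeps the failure local to one variant, so the remaining records in the VCF can still be processed.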