Illegal character in reference sequence (W base and possibly other noncanonical bases)

Question

dantaki opened this issue 5 years ago · 2 comments

I was running Paragraph on hg38 genomes and ran into this error:

Exception: chr3:90549400:<INV> illegal character in reference sequence

Traceback:

Traceback (most recent call last):
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 199, in add_record
    alt_sequence = ref_sequence[0] + reverse_complement(inv_ref)
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 436, in reverse_complement
    return ''.join([complement[x] for x in seq[::-1]])
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 436, in <listcomp>
    return ''.join([complement[x] for x in seq[::-1]])
KeyError: 'W'

$ samtools faidx /home/dantakli/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa chr3:90549400-91081922 | grep "W"
AAGTTTCTGAGAATCATTCTCTCTTGTTTTTCTGTGAAGWTATTGCCTTTTCTACCATAG
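
The grep above shows the offending line but not its coordinate. As a rough sketch (not part of Paragraph), the FASTA text printed by `samtools faidx` could be scanned in Python to report where each noncanonical base sits. The function name `noncanonical_positions` is hypothetical, and treating only A, C, G, T and N as canonical is an assumption.

```python
import re

# Assumption: only A, C, G, T and N (either case) count as canonical.
NONCANONICAL = re.compile(r"[^ACGTNacgtn]")


def noncanonical_positions(fasta_text, region_start):
    """Return (1-based position, base) for every noncanonical base.

    fasta_text is the text printed by `samtools faidx REF REGION`;
    region_start is the 1-based start coordinate of REGION.
    """
    # Drop the ">" header line and join the wrapped sequence lines.
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )
    return [(region_start + m.start(), m.group())
            for m in NONCANONICAL.finditer(seq)]
```

For the region in this issue, `region_start` would be 90549400, with the `samtools faidx` output captured as a string.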

For now I will likely skip this SV, but I wanted to let you know that Paragraph doesn't seem to support noncanonical bases.

Answer 1 · 2019-10-10T21:12:00.000Z

You're right. Noncanonical bases are not supported by Paragraph. Graph alignment against these bases would need more careful handling before they can be supported.
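
For the reverse-complement step alone, the `KeyError: 'W'` in the traceback could in principle be avoided with a complement table that covers the IUPAC ambiguity codes. The sketch below is an illustration only, not Paragraph's code: `reverse_complement_iupac` is a hypothetical name, and it does nothing about the graph alignment concern mentioned above.

```python
# IUPAC nucleotide codes and their complements, in matching order:
# A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W and N map to themselves.
_IUPAC = "ACGTRYSWKMBDHVN"
_IUPAC_COMP = "TGCAYRSWMKVHDBN"
_TABLE = str.maketrans(_IUPAC + _IUPAC.lower(),
                       _IUPAC_COMP + _IUPAC_COMP.lower())
_ALLOWED = frozenset(_IUPAC + _IUPAC.lower())


def reverse_complement_iupac(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    unknown = set(seq) - _ALLOWED
    if unknown:
        # Anything outside the IUPAC alphabet is still treated as an error.
        raise ValueError(
            "illegal character(s) in sequence: " + ", ".join(sorted(unknown)))
    # Complement every base, then reverse the result.
    return seq.translate(_TABLE)[::-1]
```

With this table, `W` (A or T) complements to `W`, so a fragment such as `GAAGWTATT` from the region above would no longer raise.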

Answer 2 · 2019-10-16T00:27:23.000Z

Perhaps it's best to warn the user and then skip the variant instead of crashing?

Thanks!
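
A minimal sketch of the warn-and-skip behaviour suggested here, under stated assumptions: the record tuples, `fetch_ref` callback, and function names are hypothetical and do not reflect Paragraph's actual internals, and treating A, C, G, T and N as the only legal bases is also an assumption.

```python
import logging

logger = logging.getLogger("vcfgraph_sketch")  # hypothetical logger name

# Assumption: A, C, G, T and N are the only bases considered legal.
CANONICAL = frozenset("ACGTN")


def find_illegal_bases(seq):
    """Return the sorted list of distinct noncanonical bases in seq."""
    return sorted(set(seq.upper()) - CANONICAL)


def filter_records(records, fetch_ref):
    """Keep records whose reference span has only canonical bases.

    records: iterable of (record_id, chrom, start, end) tuples.
    fetch_ref: callable (chrom, start, end) -> reference sequence string.
    """
    kept = []
    for record_id, chrom, start, end in records:
        bad = find_illegal_bases(fetch_ref(chrom, start, end))
        if bad:
            # Warn and move on instead of raising an exception.
            logger.warning(
                "%s: skipping variant, illegal character(s) %s in "
                "reference sequence", record_id, ",".join(bad))
            continue
        kept.append((record_id, chrom, start, end))
    return kept
```

Checking the reference span before building the graph keeps the skip decision in one place, so a single noncanonical base does not abort the whole run.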