Illumina/paragraph

Illegal character in reference sequence (W base and possibly other noncanonical bases)

dantaki opened this issue · 2 comments

I was running Paragraph on hg38 genomes and ran into this error:

Exception: chr3:90549400:<INV> illegal character in reference sequence

Traceback:

Traceback (most recent call last):
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 199, in add_record
    alt_sequence = ref_sequence[0] + reverse_complement(inv_ref)
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 436, in reverse_complement
    return ''.join([complement[x] for x in seq[::-1]])
  File "/home/dantakli/paragraph/lib/python3/grm/vcfgraph/vcfgraph.py", line 436, in <listcomp>
    return ''.join([complement[x] for x in seq[::-1]])
KeyError: 'W'

$ samtools faidx /home/dantakli/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa chr3:90549400-91081922 | grep "W"
AAGTTTCTGAGAATCATTCTCTCTTGTTTTTCTGTGAAGWTATTGCCTTTTCTACCATAG
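
The grep above only catches W. A minimal Python sketch of a more general check might look like the following. It assumes that only A, C, G, T and N count as supported bases; that alphabet and the helper name `find_noncanonical` are assumptions for illustration, not something taken from Paragraph's code.

```python
import re

# Assumption: the supported alphabet is A, C, G, T and N (case-insensitive).
# Anything else (W, R, Y, ...) is reported as noncanonical.
NONCANONICAL = re.compile(r"[^ACGTN]", re.IGNORECASE)


def find_noncanonical(seq):
    """Return a list of (0-based offset, character) for every unsupported base."""
    return [(m.start(), m.group()) for m in NONCANONICAL.finditer(seq)]


# The FASTA line reported by samtools above.
line = "AAGTTTCTGAGAATCATTCTCTCTTGTTTTTCTGTGAAGWTATTGCCTTTTCTACCATAG"

for offset, base in find_noncanonical(line):
    print(f"noncanonical base {base!r} at offset {offset} of this line")
```

Running this over each variant's reference span before calling Paragraph would make it possible to filter out affected SVs up front.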

For now I will likely skip this SV, but I wanted to let you all know that Paragraph doesn't seem to support noncanonical bases.

You're right. Noncanonical bases are not supported by Paragraph. Graph alignment to these bases needs more careful treatment.

Perhaps it's best to warn the user and then skip the variant instead of crashing?
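
A rough sketch of what warn-and-skip could look like, for illustration only. This is not Paragraph's actual code: the complement table, the `IllegalBaseError` class, the `build_inversion_alts` helper and the `(variant_id, ref_sequence)` record shape are all hypothetical, and it assumes the inverted part is everything after the first (padding) base of a non-empty reference sequence.

```python
import logging

logger = logging.getLogger("vcfgraph_sketch")

# Assumption: only canonical bases plus N can be complemented.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


class IllegalBaseError(ValueError):
    """Raised when a sequence contains a base outside COMPLEMENT."""


def reverse_complement(seq):
    """Reverse-complement seq (upper-cased first), rejecting unsupported bases."""
    try:
        return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))
    except KeyError as err:
        raise IllegalBaseError(
            f"illegal character {err.args[0]!r} in reference sequence"
        ) from err


def build_inversion_alts(records):
    """Build ALT sequences for inversions, skipping records that cannot be handled.

    records: iterable of (variant_id, ref_sequence) pairs (hypothetical shape).
    """
    alts = {}
    for variant_id, ref_sequence in records:
        try:
            # Keep the padding base, invert the rest.
            alts[variant_id] = ref_sequence[0] + reverse_complement(ref_sequence[1:])
        except IllegalBaseError as err:
            # Warn and move on instead of aborting the whole run.
            logger.warning("Skipping %s: %s", variant_id, err)
    return alts
```

With this shape, a record such as `("chr3:90549400:<INV>", "GAAGWTAT")` would produce a warning and be left out of the result, while the remaining variants are still processed.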

Thanks!