ArtPoon/pangolin

TypeError in datafunk

I encountered the following exception while attempting to process a recent dump of the GISAID CoV database:

(pangolin) art@orolo:~/git/covizu/data$ datafunk sam_2_fasta \
    -s /home/art/git/covizu/data/reference_mapped.sam \
    -r /home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangolin-2.0.4-py3.6.egg/pangolin/data/reference.fasta \
    -o /home/art/git/covizu/data/post_qc_query.aligned.fasta \
    -t [265:29674] \
    --pad \
    --log-inserts
Traceback (most recent call last):
  File "/home/art/miniconda3/envs/pangolin/bin/datafunk", line 8, in <module>
    sys.exit(main())
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/__main__.py", line 1010, in main
    args.func(args)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/subcommands/sam_2_fasta.py", line 87, in run
    trimend = trimend)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 269, in sam_2_fasta
    seq = get_seq_from_block(sam_block = one_querys_alignment_lines, rlen = RLEN, log_inserts = log, pad = pad)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 201, in get_seq_from_block
    seq_flat_no_internal_gaps = swap_in_gaps_Ns(block_lines_sites_list[0], pad = pad)
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/site-packages/datafunk/sam_2_fasta.py", line 172, in swap_in_gaps_Ns
    for x in re.findall(r_internal, seq):
  File "/home/art/miniconda3/envs/pangolin/lib/python3.6/re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

To find the offending record, I wrapped the call starting at line 268 of datafunk/sam_2_fasta.py in a try/except that prints the query before re-raising:

        try:
            seq = get_seq_from_block(sam_block = one_querys_alignment_lines, rlen = RLEN, log_inserts = log, pad = pad)
        except:
            print(query_seq_name)
            print(one_querys_alignment_lines)
            raise

which yielded:

hCoV-19/pangolin/Guangxi/P4L/2017|EPI_ISL_410538|2017
[<pysam.libcalignedsegment.AlignedSegment object at 0x7f885a4b7ac8>]

So, yeah, let's not try to classify genomes sampled from non-human hosts!

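My guess is that a query that fails to align to the reference ends up with None in place of its sequence in the AlignedSegment object, so that re.findall is handed None instead of a string. datafunk should handle such cases explicitly, for example by skipping or reporting the unaligned query, rather than crashing with a TypeError.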
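
A minimal sketch of such a guard, assuming the failure really is a missing sequence on an unaligned record (this has not been confirmed against the datafunk source). The helper name usable_segments is hypothetical and not part of datafunk. Stand-in objects are used so the example runs without pysam; is_unmapped, query_sequence and query_name are the corresponding attributes on pysam.AlignedSegment.

```python
from types import SimpleNamespace


def usable_segments(segments):
    """Split records into those safe to process and the names of those skipped.

    Hypothetical helper: a record is skipped if it is flagged as unmapped
    or carries no sequence string (the assumed cause of the TypeError).
    """
    kept, skipped = [], []
    for seg in segments:
        if seg.is_unmapped or seg.query_sequence is None:
            skipped.append(seg.query_name)
        else:
            kept.append(seg)
    return kept, skipped


# Stand-ins for pysam.AlignedSegment records (assumed shapes, not real data).
mapped = SimpleNamespace(
    query_name="query_1", is_unmapped=False, query_sequence="ACGT"
)
unmapped = SimpleNamespace(
    query_name="query_2", is_unmapped=True, query_sequence=None
)

kept, skipped = usable_segments([mapped, unmapped])
print([seg.query_name for seg in kept])  # ['query_1']
print(skipped)                           # ['query_2']
```

Filtering before the call to get_seq_from_block, and writing the skipped names to a log, would let the rest of the batch finish instead of aborting on the first unaligned query.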