nextstrain/nextclade

output TSV column(s) for missing bases at beginning and end of sequence?

AngieHinrichs opened this issue · 1 comment

Sorry if I have missed this among nextclade CLI's many output columns from --output-tsv, but is there a column that tells how many bases are missing at the beginning and end of a sequence? For example, I have a reference.fasta (NC_007362) of 1760 bases, and a test sequence with 1704 bases. From the fasta output's - characters, I can see that there are 21 bases missing at the beginning and 32 bases missing at the end, as well as a deletion of 1 base and a deletion of 3 bases:

>Consensus_SRR28752446_HA_cns_threshold_0.5_quality_20
---------------------ATGGAGAACATAGTACTACTTCTTGCAATAGTTAGCCTTGTTAAAAGTGATCAGATTTGCATTGGTTACCATGCAAACAATTCGACAGAGCAAGTTGACACGATAATGGAAAAGAACGTCACTGTTACACATGCCCAAGACATACTGGAAAAAACACACAACGGGAAGCTATGCGACCTAAATGGGGTGAAGCCACTGATTTTAAAGGACTGCAGTGTAGCTGGATGGCTCCTCGGAAACCCAATGTGCGACGAATTCATCAGAGTGCCGGAATGGTCTTACATAGTGGAGCGGGCTAACCCAGCTAATGACCTCTGTTACCCAGGGAGCCTCAATGACTATGAAGAACTGAAACACATGTTGAGCAGAATAAATCATTTTGAGAAGATTCAGATCATTCCCAAGAGTTCCTGGCCAAATCATGAAACATCACTAGGGGTGAGCGCAGCTTGTCCATACCA-GGGAGACCCTCCTTTTTCAGAAATGTGGTGTGGCTTATCAAAAAGAACGATGCATACCCAACAATAAAGATAAGCTACAATAATACTAATCGGGAAGATCTCTTGATACTGTGGGGGATTCATCATTCCAACAATGCAGAAGAGCAGACAAATCTCTACAAAAACCCAATCACCTACATTTCAGTTGGAACATCAACTTTAAACCAGAGGTTGGCACCAAAAATAGCTACTAGATCCCAAGTAAACGGGCAACGTGGAAGAATGGACTTCTTCTGGACAATCTTAAAACCAGATGATGCAATCCATTTCGAGAGTAACGGAAATTTCATTGCTCCAGAATATGCATACAAAATTGTTAAGAAAGGGGACTCGACAATTATGAAAAGTGGAGTGGAATATGGCCATTGCAACACCAAATGTCAAACCCCAGTAGGTGCGATAAATTCTAGTATGCCATTTCACAACATACATCCTCTCACCATTGGGGAATGCCCCAAATACGTGAAATCAAACAAGTTGGTCCTTGCGACTGGGCTCAGAAATAGTCCTCTAAGAGAAAAGAGAAGAAAA---AGAGGTCTGTTTGGGGCGATAGCAGGGTTTATAGAGGGAGGATGGCAGGGAATGGTTGATGGTTGGTATGGGTACCATCATAGCAATGAGCAGGGGAGTGGGTACGCTGCGGACAAAGAATCCACCCAAAAGGCAATAGATGGAGTTACCAATAAGGTCAACTCAATCATTGACAAAATGAACACTCAATTTGAGGCAGTTGGAAGGGAGTTTAATAACTTAGAAAGGAGGATAGAGAATTTGAACAAGAAAATGGAAGACGGATTCCTAGATGTCTGGACATATAATGCTGAACTTCTAGTTCTCATGGAAAACGAGAGGACTCTAGATTTCCATGATTCAAATGTCAAGAACCTTTACGACAAAGTCAGATTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGCTGTTTCGAATTCTATCACAAATGTGATAATGAATGTATGGAAAGTGTGAGAAATGGGACGTATGACTACCCTCAGTATTCAGAAGAAGCAAGATTAAAAAGAGAAGAAATAAGCGGAGTGAAATTAGAATCAGTAGGAACTTACCAGATACTGTCAATTTATTCAACAGCGGCAAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGATGTGCTCCAATGGGTCGTTACAATGCAGAATTTGCATTTAG--------------------------------

Nextclade CLI shows the two deletions in the deletions output TSV column. The missing column is empty, I suppose because there are no Ns in the sequence. Is there a column that will tell me about the 21 missing bases at the beginning and 32 at the end? Or should I just parse it out of the fasta? (So I can tell UShER that those bases are unknown.) Thanks!
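For the "parse it out of the fasta" route, a minimal sketch in Python might look like the following. It assumes the aligned sequence uses `-` for gap positions, as in the record above; the function name `count_terminal_gaps` is hypothetical and is not part of Nextclade.

```python
from typing import Tuple


def count_terminal_gaps(aligned_seq: str, gap: str = "-") -> Tuple[int, int]:
    """Return (leading, trailing) gap counts for one aligned sequence.

    Assumes `aligned_seq` is a single aligned record with the FASTA
    header line already removed and line breaks joined.
    """
    seq = aligned_seq.strip()
    # Characters removed by lstrip/rstrip are exactly the terminal gap runs.
    leading = len(seq) - len(seq.lstrip(gap))
    if leading == len(seq):
        # The sequence is all gaps (or empty): report it once, as leading.
        return leading, 0
    trailing = len(seq) - len(seq.rstrip(gap))
    return leading, trailing


if __name__ == "__main__":
    # Toy input, not the real record: 3 leading gaps, 1 internal gap, 2 trailing.
    print(count_terminal_gaps("---ACG-T--"))
```

Internal gaps (true deletions) are left alone, so only the unaligned ends are counted.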

Oops, I see them now: alignmentStart and alignmentEnd. Never mind. :)
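For anyone landing here later, a rough sketch of turning those two columns into missing-base counts follows. It assumes that `alignmentStart` and `alignmentEnd` are 1-based, inclusive reference positions, which is worth checking against the Nextclade output documentation for your version. The helper name `terminal_missing` and the sample values 22 and 1728 are hypothetical, chosen only to match the 21/32 example above for a 1760-base reference.

```python
import csv
import io
from typing import Dict, Tuple


def terminal_missing(tsv_text: str, ref_len: int) -> Dict[str, Tuple[int, int]]:
    """Map seqName -> (missing at start, missing at end).

    Assumption: alignmentStart/alignmentEnd are 1-based and inclusive.
    Rows with empty alignment columns (e.g. failed alignments) are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        start = row.get("alignmentStart")
        end = row.get("alignmentEnd")
        if not start or not end:
            continue
        result[row["seqName"]] = (int(start) - 1, ref_len - int(end))
    return result


if __name__ == "__main__":
    # Hypothetical TSV excerpt containing only the columns used here.
    sample = "seqName\talignmentStart\talignmentEnd\nseq1\t22\t1728\n"
    print(terminal_missing(sample, ref_len=1760))
```

If the coordinates turn out to follow a different convention (for example 0-based start), only the two arithmetic expressions need adjusting.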