output TSV column(s) for missing bases at beginning and end of sequence?
AngieHinrichs opened this issue · 1 comments
Sorry if I have missed this among nextclade CLI's many output columns from --output-tsv, but is there a column that tells how many bases are missing at the beginning and end of a sequence? For example, I have a reference.fasta (NC_007362) of 1760 bases, and a test sequence with 1704 bases. From the `-` characters in the fasta output, I can see that there are 21 bases missing at the beginning and 32 bases missing at the end, as well as a deletion of 1 base and a deletion of 3 bases:
>Consensus_SRR28752446_HA_cns_threshold_0.5_quality_20
---------------------ATGGAGAACATAGTACTACTTCTTGCAATAGTTAGCCTTGTTAAAAGTGATCAGATTTGCATTGGTTACCATGCAAACAATTCGACAGAGCAAGTTGACACGATAATGGAAAAGAACGTCACTGTTACACATGCCCAAGACATACTGGAAAAAACACACAACGGGAAGCTATGCGACCTAAATGGGGTGAAGCCACTGATTTTAAAGGACTGCAGTGTAGCTGGATGGCTCCTCGGAAACCCAATGTGCGACGAATTCATCAGAGTGCCGGAATGGTCTTACATAGTGGAGCGGGCTAACCCAGCTAATGACCTCTGTTACCCAGGGAGCCTCAATGACTATGAAGAACTGAAACACATGTTGAGCAGAATAAATCATTTTGAGAAGATTCAGATCATTCCCAAGAGTTCCTGGCCAAATCATGAAACATCACTAGGGGTGAGCGCAGCTTGTCCATACCA-GGGAGACCCTCCTTTTTCAGAAATGTGGTGTGGCTTATCAAAAAGAACGATGCATACCCAACAATAAAGATAAGCTACAATAATACTAATCGGGAAGATCTCTTGATACTGTGGGGGATTCATCATTCCAACAATGCAGAAGAGCAGACAAATCTCTACAAAAACCCAATCACCTACATTTCAGTTGGAACATCAACTTTAAACCAGAGGTTGGCACCAAAAATAGCTACTAGATCCCAAGTAAACGGGCAACGTGGAAGAATGGACTTCTTCTGGACAATCTTAAAACCAGATGATGCAATCCATTTCGAGAGTAACGGAAATTTCATTGCTCCAGAATATGCATACAAAATTGTTAAGAAAGGGGACTCGACAATTATGAAAAGTGGAGTGGAATATGGCCATTGCAACACCAAATGTCAAACCCCAGTAGGTGCGATAAATTCTAGTATGCCATTTCACAACATACATCCTCTCACCATTGGGGAATGCCCCAAATACGTGAAATCAAACAAGTTGGTCCTTGCGACTGGGCTCAGAAATAGTCCTCTAAGAGAAAAGAGAAGAAAA---AGAGGTCTGTTTGGGGCGATAGCAGGGTTTATAGAGGGAGGATGGCAGGGAATGGTTGATGGTTGGTATGGGTACCATCATAGCAATGAGCAGGGGAGTGGGTACGCTGCGGACAAAGAATCCACCCAAAAGGCAATAGATGGAGTTACCAATAAGGTCAACTCAATCATTGACAAAATGAACACTCAATTTGAGGCAGTTGGAAGGGAGTTTAATAACTTAGAAAGGAGGATAGAGAATTTGAACAAGAAAATGGAAGACGGATTCCTAGATGTCTGGACATATAATGCTGAACTTCTAGTTCTCATGGAAAACGAGAGGACTCTAGATTTCCATGATTCAAATGTCAAGAACCTTTACGACAAAGTCAGATTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGCTGTTTCGAATTCTATCACAAATGTGATAATGAATGTATGGAAAGTGTGAGAAATGGGACGTATGACTACCCTCAGTATTCAGAAGAAGCAAGATTAAAAAGAGAAGAAATAAGCGGAGTGAAATTAGAATCAGTAGGAACTTACCAGATACTGTCAATTTATTCAACAGCGGCAAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGATGTGCTCCAATGGGTCGTTACAATGCAGAATTTGCATTTAG--------------------------------
Nextclade CLI shows the two deletions in the `deletions` output TSV column. The `missing` column is empty, I suppose because there are no Ns in the sequence. Is there a column that will tell me about the 21 missing bases at the beginning and 32 at the end? Or should I just parse it out of the fasta? (So I can tell UShER that those bases are unknown.) Thanks!
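For the "parse it out of the fasta" route, a minimal Python sketch could look like the following. It assumes the aligned sequence is available as a single string in which unaligned positions at either end are written as `-`, as in the record above; the function name `count_terminal_gaps` is made up for illustration and is not part of Nextclade.

```python
def count_terminal_gaps(aligned_seq: str) -> tuple[int, int]:
    """Return (leading, trailing) counts of '-' in an aligned sequence."""
    seq = aligned_seq.strip()  # drop surrounding whitespace and newlines
    leading = len(seq) - len(seq.lstrip("-"))
    if leading == len(seq):
        # The whole record is gaps: report it all as leading, none as trailing.
        return leading, 0
    trailing = len(seq) - len(seq.rstrip("-"))
    return leading, trailing


# Toy string shaped like the record above: 21 gaps, some bases with one
# internal deletion, then 32 gaps. Internal '-' characters are not counted.
toy = "-" * 21 + "ATGGAG-AACATA" + "-" * 32
print(count_terminal_gaps(toy))
```

Only the runs of `-` at the two ends are counted, so internal deletions such as the 1-base and 3-base ones stay out of the totals.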
Oops I see them now, alignmentStart and alignmentEnd. Never mind. :)
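For anyone landing here later, this is a rough sketch of turning those two columns into counts of missing bases at each end. It rests on assumptions that should be checked against the Nextclade output documentation: that `alignmentStart` and `alignmentEnd` in the TSV are 1-based, inclusive reference positions, and that the sequence name is in a `seqName` column. The function name `terminal_missing` and the demo row are made up for illustration.

```python
import csv
import io

REF_LENGTH = 1760  # reference length quoted in the issue (NC_007362)


def terminal_missing(tsv_handle, ref_length):
    """Yield (seqName, missing_at_start, missing_at_end) for each TSV row."""
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        # Skip rows without alignment coordinates (e.g. failed alignments).
        if not row.get("alignmentStart") or not row.get("alignmentEnd"):
            continue
        # Assumption: both values are 1-based, inclusive reference positions.
        start = int(row["alignmentStart"])
        end = int(row["alignmentEnd"])
        yield row["seqName"], start - 1, ref_length - end


# Hypothetical one-row TSV derived from the numbers in the issue,
# not actual Nextclade output.
demo = "seqName\talignmentStart\talignmentEnd\nexample\t22\t1728\n"
for name, lead, trail in terminal_missing(io.StringIO(demo), REF_LENGTH):
    print(name, lead, trail)
```

If the coordinates turn out to be 0-based or half-open, the two subtractions need adjusting accordingly.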