missing fields in outfile
Closed this issue · 5 comments
Hi Mike,
The tophit.m8 should have these columns:
query | target | evalue | pident | fident | nident | mismatch | qcov | tcov | qstart | qend | qlen | tstart | tend | tlen | alnlen | bits | qheader | theader | taxid | taxname | lineage
But the last 3 columns are empty: taxid, taxname, lineage. They will be important for parsing the contig taxonomy by kingdom, family, etc.
From: rule PRIMARY_AA_taxonomy_assignment
Kathie
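For anyone wanting to confirm which columns are blank in their own run, here is a minimal Python sketch. It assumes the tophit file is tab-separated, has no header row, and uses the column order listed above; the function name `empty_column_counts` is made up for illustration and is not part of the pipeline.

```python
import csv
import io

# Column order as listed in this issue (assumed to match the file on disk).
TOPHIT_COLUMNS = [
    "query", "target", "evalue", "pident", "fident", "nident", "mismatch",
    "qcov", "tcov", "qstart", "qend", "qlen", "tstart", "tend", "tlen",
    "alnlen", "bits", "qheader", "theader", "taxid", "taxname", "lineage",
]


def empty_column_counts(handle, columns=TOPHIT_COLUMNS):
    """Count, per column, how many rows have a missing or blank value."""
    counts = {name: 0 for name in columns}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so missing trailing fields are counted as empty.
        padded = row + [""] * (len(columns) - len(row))
        for name, value in zip(columns, padded):
            if value.strip() == "":
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical row with only the first 19 fields filled in.
    sample = "\t".join(["x"] * 19) + "\n"
    counts = empty_column_counts(io.StringIO(sample))
    print({name: n for name, n in counts.items() if n > 0})
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` sample.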
You should probably be using the tophits from the secondary output, unless you're talking about the contig annotations as opposed to the seqtable annotations?
I am talking about contig annotations.
Also, I found that the contigSeqTable.tsv has 14 columns for samples that DON'T have taxonomy, but 19 for those that do:
contigID seqID start stop len qual count CPM alnType taxMethod kingdom phylum class order family genus species baltimoreType baltimoreGroup
contig_1000 169-06-08-13-12_CAGATC:1:140311 11 252 241 17 NA NA NA NA NA NA NA NA
contig_1000 120-06-02-24-12_ATCACG:3:171191 214 406 192 0 3 2.989033237 nt LCA Viruses Cressdnaviricota Arfiviricetes Cirlivirales Circoviridae Circovirus Circovirus sp. ssDNA II
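As a stopgap for parsing tables like this, a small Python sketch that pads short rows out to the header width. It assumes the file is tab-separated, that the first line is the header, and that the missing fields are always the trailing ones (as in the 14-column row above). The name `pad_ragged_tsv` is hypothetical, not something the pipeline provides.

```python
def pad_ragged_tsv(lines, fill="NA"):
    """Return rows as lists of fields, each padded to the header's width."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return []
    width = len(rows[0])  # the header row defines the expected column count
    padded = []
    for number, row in enumerate(rows, start=1):
        if len(row) > width:
            raise ValueError(
                f"row {number} has {len(row)} fields, expected at most {width}"
            )
        # Assumption: missing fields are trailing, so pad on the right.
        padded.append(row + [fill] * (width - len(row)))
    return padded


if __name__ == "__main__":
    # Hypothetical three-column table with one short row.
    demo = ["contigID\tseqID\tfamily\n", "contig_1\tread_1\n"]
    for row in pad_ragged_tsv(demo):
        print("\t".join(row))
```

Padding on the right is only safe if the short rows are truncated at the end; if fields could be missing from the middle, the rows would need to be fixed at the source instead.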
Hi Kathy,
That issue with the contigSeqTable is fixed in the dev branch and will be in the next release.
The rule PRIMARY_AA_taxonomy_assignment is part of the read-based annotations and its only real purpose is to find sequences that look like a virus so that they can be analysed in the secondary search. You should take the annotations from the secondary searches. If you look at the secondary AA mmseqs directory, the file MMSEQS_AA_SECONDARY_tophit_aln_sorted should have all the columns.
The direct contig annotations at the moment are a bit simplistic, but those files should be in ASSEMBLY/CONTIG_DICTIONARY/FLYE/results. It's simplistic because it currently only uses the primary nt database, not the secondary nt database.
Should be fixed in the new release.