High level of duplicated protein sequences
hegardon opened this issue · 1 comments
Hi,
I am using PLASS (v4.687d7) on a set of metagenomes from ~100 cheese samples and it works very well, but still, I have some questions.
In each dataset, a large fraction of the protein sequences (on average 30%) is duplicated (with 100% identity and coverage). I understand that some sequences could be duplicated (originating from closely related species), but 30% seems quite high.
Another issue is the total number of assembled amino acids. For example, from an initial dataset of 18 million reads (2x150 bp paired-end reads, 2.7 Gbp in total), 7 million proteins are assembled (2e+9 aa in total, almost as many as the total number of nucleotides, which seems to me like more amino acids than expected).
Is there an explanation for these results?
I am using PLASS with the following command (other parameters at their defaults):
plass assemble METAG_R1.fastq.gz METAG_R2.fastq.gz METAG_out.fasta -e 0.001 --num-iterations 12 --filter-proteins 1 --remove-tmp-files 1
Thanks
Helene
Plass can reuse each read in every iteration, so it tends to create a lot of variants that are not necessarily useful. We generally use mmseqs linclust to remove fragments afterwards.
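As a rough sketch of that post-processing (not necessarily the exact settings we use), the two helper functions below first count exact duplicates in the assembly and then collapse fragments with `mmseqs easy-linclust`. The function names, the `METAG_clu` output prefix, the `tmp` directory and the thresholds (`--min-seq-id 0.9 -c 0.95 --cov-mode 1`) are assumptions for illustration; tune them to your data.

```shell
# Sketch only: helper names, output prefix and thresholds are illustrative assumptions.

# Count total and distinct protein sequences in an uncompressed FASTA file.
# Multi-line records are joined; "duplicated" is the number of redundant
# copies (total minus distinct), using exact string equality.
count_exact_duplicates() {
    awk '
        /^>/ {
            if (seq != "") { total++; if (!(seq in seen)) { seen[seq] = 1; uniq++ } }
            seq = ""
            next
        }
        { seq = seq $0 }
        END {
            if (seq != "") { total++; if (!(seq in seen)) { uniq++ } }
            printf "total=%d unique=%d duplicated=%d\n", total, uniq, total - uniq
        }
    ' "$1"
}

# Collapse fragments and near-identical variants with MMseqs2 Linclust.
# Arguments: input FASTA, output prefix, temporary directory.
# --cov-mode 1 applies the coverage threshold to the target sequence, so a
# fragment almost fully covered by a longer representative joins its cluster.
# Representatives are written to <prefix>_rep_seq.fasta.
cluster_fragments() {
    mmseqs easy-linclust "$1" "$2" "$3" \
        --min-seq-id 0.9 -c 0.95 --cov-mode 1
}

# Example usage (requires mmseqs on PATH for the second call):
# count_exact_duplicates METAG_out.fasta
# cluster_fragments METAG_out.fasta METAG_clu tmp
# count_exact_duplicates METAG_clu_rep_seq.fasta
```

Comparing the counts before and after clustering gives a quick idea of how much of the assembly was redundant.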