marbl/verkko

please clarify: duplicated sequences

Closed this issue · 2 comments

Hi,
can you please clarify the following (Verkko v1.4 plus some fixes)?
Quite a few sequences in the disconnected FASTA output also appear in the rDNA and EBV FASTA output files. The overlap is only partial, though: for example, not all sequences in the rDNA output are in the disconnected FASTA (I only checked one sample).

  1. Is it safe to deduplicate the disconnected FASTA and reduce it to unique sequences?
  2. Can there be more duplicated sequences in the main (hap1, hap2) output FASTAs relative to, say, the rDNA output file that are just not obvious to spot because the contig naming is different?

Example of a sequence name that appears in both files:

$ grep unassigned-0001389 assembly.rdna.fasta
>unassigned-0001389
$ grep unassigned-0001389 assembly.disconnected.fasta
>unassigned-0001389

Thanks!

+Peter

skoren commented

The disconnected analysis and the rDNA/EBV analysis run independently. The disconnected analysis uses only length and graph structure, while the rDNA/EBV analysis uses identity to the provided reference together with graph structure. So it is possible for some nodes to meet the criteria of both. In either case, they are excluded from the final assembly.fasta file.
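
Since the two analyses can select the same node, one way to see the overlap for a given sample is to compare record names between the two output files. The following is a minimal sketch, not part of Verkko: the function names `fasta_names` and `shared_names` and the script name in the usage comment are hypothetical, and it assumes that an identical record name in both files refers to the same node.

```python
import sys


def fasta_names(lines):
    """Collect record names (the first word after '>') from FASTA lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare '>' header with no name
                names.add(fields[0])
    return names


def shared_names(lines_a, lines_b):
    """Return the record names present in both FASTA inputs, sorted."""
    return sorted(fasta_names(lines_a) & fasta_names(lines_b))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage:
    #   python shared_names.py assembly.disconnected.fasta assembly.rdna.fasta
    with open(sys.argv[1]) as fasta_a, open(sys.argv[2]) as fasta_b:
        for name in shared_names(fasta_a, fasta_b):
            print(name)
```

Comparing names only is cheap, but it does not verify that the sequences themselves are identical; that would need a separate comparison of the sequence lines.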

  1. Yes
  2. No, there shouldn't be full contig duplication between hap1 and hap2, but there would certainly be sequence duplication in homozygous regions.
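
For point 1, one possible way to deduplicate is to drop from the disconnected FASTA every record whose name also appears in another output file, such as the rDNA one. This is only a sketch under the assumption that matching names denote the same sequence; the helper names (`record_name`, `read_names`, `drop_records`) and the script name in the usage comment are hypothetical and not part of Verkko.

```python
import sys


def record_name(header):
    """Return the first word after '>' in a FASTA header line ('' if none)."""
    fields = header[1:].split()
    return fields[0] if fields else ""


def read_names(lines):
    """Collect the record names found in an iterable of FASTA lines."""
    return {record_name(line) for line in lines if line.startswith(">")}


def drop_records(lines, exclude):
    """Yield FASTA lines, skipping whole records whose name is in `exclude`."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # Decide once per record; the sequence lines that follow inherit it.
            keep = record_name(line) not in exclude
        if keep:
            yield line


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage:
    #   python drop_shared.py assembly.disconnected.fasta assembly.rdna.fasta > unique.fasta
    with open(sys.argv[2]) as other:
        exclude = read_names(other)
    with open(sys.argv[1]) as disconnected:
        sys.stdout.writelines(drop_records(disconnected, exclude))
```

The filter works on whole records, so multi-line sequences are kept or removed together with their header.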

OK, thanks for confirming.