please clarify: duplicated sequences
Closed this issue · 2 comments
ptrebert commented
Hi,
can you please clarify the following (v1.4 plus some fixes):
Quite a number of sequences in the disconnected FASTA output are duplicated in the rDNA and EBV FASTA output files, but, e.g., not all sequences in the rDNA output are in the disconnected FASTA (only checked for one sample).
- is it safe to deduplicate the disconnected FASTA and reduce it to unique sequences?
- can there be more duplicated sequences in the main (hap1, hap2) output FASTAs relative to, say, the rDNA output file that are just not obvious to spot because the contig naming is different?
$ grep unassigned-0001389 assembly.rdna.fasta
>unassigned-0001389
$ grep unassigned-0001389 assembly.disconnected.fasta
>unassigned-0001389
Thanks!
+Peter
skoren commented
The disconnected analysis and the rDNA/EBV analysis run independently. One uses only length and graph structure while the other uses identity to the provided reference and graph structure. So, it's possible for some nodes to meet criteria in both. They would be excluded from the final assembly.fasta file in both cases.
- Yes
- No, there shouldn't be full contig duplication between hap1 and hap2, but there would certainly be sequence duplication in homozygous regions.
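For readers who want to do the deduplication confirmed above, here is a minimal sketch in plain Python. It is not part of the assembler. It assumes that "duplicate" means an identical sequence string (case-insensitive, same orientation; reverse complements are not checked), and the helper names `parse_fasta`, `seq_digest` and `drop_duplicates` are made up for illustration. The record name `unassigned-0000002` in the demo is also invented.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (name, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def seq_digest(seq: str) -> str:
    """Hash the upper-cased sequence so long contigs are not kept in memory."""
    return hashlib.sha256(seq.upper().encode("ascii")).hexdigest()


def drop_duplicates(disconnected: Iterable[Record],
                    *others: Iterable[Record]) -> List[Record]:
    """Keep records from `disconnected` whose sequence does not occur in any
    of `others` and has not already been seen earlier in `disconnected`."""
    seen = {seq_digest(seq) for records in others for _, seq in records}
    unique = []
    for name, seq in disconnected:
        digest = seq_digest(seq)
        if digest not in seen:
            seen.add(digest)
            unique.append((name, seq))
    return unique


if __name__ == "__main__":
    # In-memory stand-ins for assembly.disconnected.fasta and assembly.rdna.fasta.
    disconnected = parse_fasta([">unassigned-0001389", "ACGT",
                                ">unassigned-0000002", "GGCC"])
    rdna = parse_fasta([">unassigned-0001389", "ACGT"])
    for name, seq in drop_duplicates(disconnected, rdna):
        print(f">{name}\n{seq}")
```

With real files, each argument can be `parse_fasta(open(path))`, for example for `assembly.disconnected.fasta` and `assembly.rdna.fasta`; the EBV output can be passed as a further argument. Comparing sequence content rather than contig names also catches identical sequences that carry different names.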
ptrebert commented
ok, thanks for confirming.