bcgsc/arks

Scaffolds with a single ambiguous (N) base


Hello,
I get scaffolds showing up at the end of my assembly that contain a single N as their sequence.

e.g.

scaffold4685,1,f4740Z1
N

This is in the *.scaffold.fa file after running arks+LINKS a second time after removing potential contaminant sequences. This isn't necessarily an issue since these scaffolds can easily be removed, but I'm curious to know why they are showing up and if it might be related to how I've processed my data. I'll outline the steps I've taken below in case it helps:

  1. I ran Supernova on my raw 10x data at 56x coverage
  2. I ran Tigmint+arks+LINKS using window=1000 span=20 k=61 m=30-6000 and a=0.5 with the Supernova assembly and raw 10x data.
  3. I BLASTed the assembly against the nt database using blastn to identify scaffolds with hits to contaminants
  4. I filtered scaffolds with contaminants from the assembly and similarly cleaned up the raw 10x data using a number of approaches: removed low-quality regions using filterbytile (BBtools); quality and adapter trimmed; and removed sequencing artifacts, phiX sequence, known endosymbionts, and all other potential contaminants identified using Centrifuge against RefSeq bacterial, viral, human, fungal and protozoan genomes (all with BBduk).
  5. Lastly, I reran arks+LINKS using k=61 m=18-4000 and a=0.5 with the previous contaminant filtered arks+LINKS assembly and contaminant filtered 10x data from (4).

I only noticed the problematic scaffolds after attempting to run Sealer to fill gaps, which produced several warnings identifying these scaffolds (e.g. abyss-sealer: Warning: sequence ends with an N: scaffold4667,1,f4752Z1). My justification for doing things as outlined above is that, since 10xG recommends that Supernova be run on raw data, contaminant removal had to occur after the initial assembly. Rerunning arks+LINKS after contaminant removal was an attempt to recover good contigs that may have been removed because they were previously misassembled with contaminant sequences.

Anyway, I'm particularly interested to know if this could have resulted from my contaminant filtering steps (eg. quality/adapter trimming).

Thanks,
Bryan

Hi @bbrunet,

My guess is that these pieces are due to the Tigmint step - I've seen this before.
When Tigmint finds a putatively misassembled region, it cuts in two places (see the Tigmint paper for details on how it finds those cut sites). Once it makes the cuts, it strips terminal Ns from the resulting pieces. If a sequence would end up empty, Tigmint replaces it with a single 'N' to avoid corrupting the FASTA file. You can double-check that this is the case by looking for that sequence in the Tigmint output. A simple way to get rid of those is to filter out sequences that are below a certain length.
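As a minimal sketch of that length filter (not part of Tigmint, arks, or LINKS), the plain-Python snippet below reads FASTA records and keeps only those at or above a chosen length. The function names `read_fasta`, `filter_short`, and `filter_fasta_file`, the file names, and the threshold are all hypothetical; pick a cutoff that suits your assembly.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records: Iterable[Tuple[str, str]],
                 min_len: int) -> Iterator[Tuple[str, str]]:
    """Keep records whose sequence has at least min_len bases."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


def filter_fasta_file(in_path: str, out_path: str, min_len: int) -> None:
    """Write the records of in_path that pass the length filter to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in filter_short(read_fasta(src), min_len):
            dst.write(f">{header}\n{seq}\n")
```

For example, `filter_fasta_file("assembly.scaffold.fa", "assembly.filtered.fa", 500)` would drop the single-N records along with anything else shorter than 500 bp; both file names and the 500 bp cutoff are placeholders. Note that the output is written with one sequence line per record, without line wrapping.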

Hope that helps!
Lauren

Thanks Lauren! I checked the Tigmint output and, just as you indicated, the single 'N's show up there too. This was good to know since there are several other scaffolds of short length that were also introduced at this point that I wasn't aware of.
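For anyone wanting to repeat that check, here is a rough sketch that lists the records in a FASTA file whose sequence consists only of Ns. The helper name `n_only_records` and the file name in the usage note are hypothetical, and it assumes a standard FASTA layout with `>` header lines.

```python
from typing import Dict, Iterable


def n_only_records(lines: Iterable[str]) -> Dict[str, int]:
    """Map header -> sequence length for records made up solely of N/n."""
    found: Dict[str, int] = {}
    header, chunks = None, []

    def flush() -> None:
        # Record the current entry if its sequence is non-empty and all N.
        seq = "".join(chunks)
        if header is not None and seq and set(seq.upper()) == {"N"}:
            found[header] = len(seq)

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            flush()
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    flush()  # handle the final record
    return found
```

Calling it as `with open("tigmint_output.fa") as fh: print(n_only_records(fh))` (placeholder file name) prints the affected headers with their lengths.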

Thanks again,
Bryan

Glad I could help! :)