bcgsc/arks

barcodes not in the barcode multiplicity file?

Closed this issue · 3 comments

Asutu commented

Hi,

I'm running into a warning with arks: "WARNING:: Your chromium read file has 13071618 read pairs that have barcodes not in the barcode multiplicity file.Cumulative memory usage: 4621292". My understanding was that the barcode multiplicity file is generated from the read file itself, so I'm probably misunderstanding something in arks; the warning is a bit cryptic to me.

I'm also seeing that a large chunk of reads is being skipped (discarded?) by arks because they apparently don't have a good contig ("Skipped reads pairs without a good contig: 162242712"). Is this expected? Would it make sense to tune the parameters to include more reads in the analysis?

I'm running arks with default parameters, except for a minimum contig length of 1 kb. The full command is:

arks-make arks time=1 draft=$draft reads=$reads threads=8 z=1000 k=30

Thanks,
Pedro

arks.log

Hi Pedro,

Don't worry about this warning -- I suspect it is just due to your input read set having a number of reads that do not have an associated barcode. For reference, I saw this line in a recent run of ARKS:

WARNING:: Your chromium read file has 27759471 read pairs that have barcodes not in the barcode multiplicity file.Cumulative memory usage: 1452348

And exactly that number of read pairs do not have associated barcodes:

[lcoombe@hpce705 Tigmint-ARKS]$ gunzip -c chromium.fq.gz |grep "HISEQ" |grep -v "BX:Z:" |wc -l
55518942
[lcoombe@hpce705 Tigmint-ARKS]$ echo $(( 55518942/2 ))
27759471
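
The grep above relies on the header lines containing the instrument name "HISEQ", which will differ between datasets. A minimal sketch of a more general check, assuming standard four-line FASTQ records and barcodes stored as a BX:Z: tag in the header line; the function name count_unbarcoded_reads is hypothetical and not part of ARKS:

```shell
# Count FASTQ records whose header line lacks a BX:Z: barcode tag.
# Assumes 4-line FASTQ records, so header lines are lines 1, 5, 9, ...
# Selecting by line position avoids miscounting quality lines that start with "@".
count_unbarcoded_reads() {
    gunzip -c "$1" | awk 'NR % 4 == 1 && !/BX:Z:/ { n++ } END { print n + 0 }'
}

# Usage (chromium.fq.gz is the interleaved read file from the example above):
#   reads=$(count_unbarcoded_reads chromium.fq.gz)
#   echo $(( reads / 2 ))   # read pairs, assuming both mates lack the tag
```

If the result divided by two matches the number in the warning, the warning is most likely counting read pairs without any barcode.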

I do agree that the warning itself is a little cryptic; we could be clearer about whether the barcode is missing from the provided multiplicity file or the read pair just doesn't have a barcode at all.

And yes, it is also expected that a good number of reads will be marked as not having a 'good contig'. This can happen for a number of reasons, including the two reads in a pair not mapping to the same contig, or the Jaccard index of a read pair not being above the threshold for any contig.

As for your parameters, they look fine to me, although you could also try a slightly higher k -- I haven't run ARKS with a k-mer size of less than 40. I find that k is a good parameter to sweep; the optimal k differs depending on the input assembly.
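
A sweep over k can be scripted as a dry run. The sketch below reuses the arks-make command from the original post and only prints one command per k value; the function name arks_k_sweep and the example k values are illustrative assumptions, not recommended settings.

```shell
# Print one arks-make command per k value (dry run).
# The body runs in a subshell, so its variables do not leak into the caller.
arks_k_sweep() (
    draft=$1
    reads=$2
    shift 2
    for k in "$@"; do
        # Same options as the command in the original post; only k varies.
        echo "arks-make arks time=1 draft=$draft reads=$reads threads=8 z=1000 k=$k"
    done
)

# Usage (k values are examples only):
#   arks_k_sweep "$draft" "$reads" 40 50 60 70 80
```

Pipe the output to sh to actually run the commands. It is worth checking beforehand that runs with different k do not overwrite each other's output files.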

Hope that helps!
Lauren

Asutu commented

Hi Lauren,

Many thanks, that was really helpful. I'm now testing other values of k to see if there are improvements.

I'll close this issue now as my questions have been addressed.

I'm glad that was helpful! Just a heads up too - we clarified that warning message in 41694a5.