About the read order in each cluster in the final_clusters.tsv file

Question

Closed this issue 3 years ago · 6 comments

Hello!

I have a question regarding the read order in each cluster in the final_clusters.tsv file and would appreciate your advice.

My understanding is that isONclust first sorts the reads such that longer reads with higher quality scores appear earlier, and then processes the reads one-by-one in the sorted order. So in the final_clusters.tsv file, are the reads in each cluster also reported in the sorted order (i.e. longer reads with higher quality scores appear earlier)?

Thank you very much!

Answer 1 · 2022-02-15T07:15:25.000Z

Hi @lauraht

No, from what I recall the order is not guaranteed. When several threads are used the clustering is iterative, so clusters might be merged in later iterations. I'm also not sure the order is guaranteed even for single-core clustering.

One way to investigate this is to compare the order of the reads reported in final_clusters.tsv with their order in sorted.fastq (which is ordered) in the output folder, e.g. using grep with line numbers (grep -n).
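
A minimal Python sketch of that comparison, as an alternative to grep. It assumes that final_clusters.tsv has two tab-separated columns (cluster ID, then read accession) and that the accession matches the first whitespace-delimited token of the header lines in sorted.fastq; if the accessions are written differently in your files, the parsing needs adjusting. All function names are hypothetical helpers, not part of isONclust.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def fastq_rank(lines: Iterable[str]) -> Dict[str, int]:
    """Map each read accession to its 0-based position in a 4-line-per-record FASTQ."""
    rank: Dict[str, int] = {}
    for i, line in enumerate(lines):
        # Every fourth line (0, 4, 8, ...) is a record header starting with '@'.
        if i % 4 == 0 and line.strip():
            accession = line.strip()[1:].split()[0]  # assumption: first token is the accession
            rank[accession] = len(rank)
    return rank


def clusters_in_reported_order(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the accessions of each cluster in the order they appear in the TSV."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:  # assumption: column 1 = cluster ID, column 2 = accession
            clusters[fields[0]].append(fields[1])
    return dict(clusters)


def out_of_order_clusters(clusters: Dict[str, List[str]], rank: Dict[str, int]) -> List[str]:
    """Return IDs of clusters whose reads do not follow the sorted.fastq order."""
    bad = []
    for cluster_id, accessions in clusters.items():
        positions = [rank[a] for a in accessions if a in rank]
        if positions != sorted(positions):
            bad.append(cluster_id)
    return bad
```

With real files, the inputs would be open file handles, e.g. fastq_rank(open("sorted.fastq")) and clusters_in_reported_order(open("final_clusters.tsv")). An empty result from out_of_order_clusters would mean every cluster lists its reads in sorted order for that particular run; it would not show that the order is guaranteed in general.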

Answer 2 · 2022-02-17T05:24:16.000Z

Hi Kristoffer,

Thank you so much for your advice!

You also mentioned that when using multiple threads, clusters might be merged. I was just wondering what is the criterion to merge clusters?
For the same dataset, would the clustering result from using multiple threads be the same as the result from using a single thread, regardless of the reporting order?

Thank you very much!

Answer 3 · 2022-02-17T07:39:24.000Z

Hi @lauraht,

It is not guaranteed to be the same output for single and multiple cores. For the results in the publication, a single core/thread was always used.

Multithreading was a feature that I added afterwards to speed up clustering of some datasets. I observed relatively minor differences in quality, but the output was not identical.
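
To see how much a single-thread and a multi-thread run differ on the same dataset, the two final_clusters.tsv files can be compared as partitions of the reads, ignoring cluster IDs and reporting order. The sketch below assumes the two-column layout (cluster ID, tab, read accession); the function names are hypothetical helpers and not part of isONclust.

```python
from collections import defaultdict
from typing import FrozenSet, Iterable, Set, Tuple

Partition = Set[FrozenSet[str]]


def read_partition(lines: Iterable[str]) -> Partition:
    """Turn cluster-ID/accession rows into a set of clusters (each a frozenset of reads)."""
    members = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:  # assumption: column 1 = cluster ID, column 2 = accession
            members[fields[0]].add(fields[1])
    # Cluster IDs are dropped, so only the grouping of reads matters.
    return {frozenset(reads) for reads in members.values()}


def compare_partitions(a: Partition, b: Partition) -> Tuple[int, int, int]:
    """Return (clusters identical in both, clusters only in a, clusters only in b)."""
    return len(a & b), len(a - b), len(b - a)
```

For example, compare_partitions(read_partition(open("single/final_clusters.tsv")), read_partition(open("multi/final_clusters.tsv"))) counts the clusters that are exactly the same in both runs; the directory names here are placeholders. This is a strict comparison: a cluster that differs by a single read counts as different, so it only flags where to look and does not measure clustering quality.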

Answer 4 · 2022-02-17T08:44:51.000Z

Hi Kristoffer,

Thank you very much for your information!

So the quality differences are minor; in other words, the clustering result would not deteriorate from using multithreading. Is that right?

I have a dataset with 3 million full-length reads, so I have to use multithreading. I found that the top ~10 clusters created by isONclust are very large (~18,000 reads in each cluster). Based on your experience, do you think these sizes are reasonable? Since isONclust does gene-level clustering, do you think this many reads for a single gene is normal in ONT RNA-seq (cDNA)?

I used 64 cores for multithreading. These large cluster sizes should not be due to using 64 threads, right?

Thank you so much!

Answer 5 · 2022-03-16T21:36:59.000Z

Completely forgot about this Q, sorry about that.

Yes, the differences should be minor.

~18,000 sounds completely reasonable.

Yes, it should not be because of 64 threads. 18k sounds reasonable. I recall hearing that gene expression roughly follows a power law distribution, in which case a few very large clusters would make sense (sequencing amplification is also a bit biased, etc.). All in all, it sounds reasonable.
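
To put the largest clusters in context, the cluster sizes can be tallied directly from final_clusters.tsv. The sketch below assumes the two-column layout (cluster ID, tab, read accession); cluster_sizes and size_summary are hypothetical helper names, not part of isONclust.

```python
from collections import Counter
from typing import Dict, Iterable


def cluster_sizes(lines: Iterable[str]) -> Counter:
    """Count the number of reads assigned to each cluster ID."""
    sizes: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:  # assumption: column 1 = cluster ID, column 2 = accession
            sizes[fields[0]] += 1
    return sizes


def size_summary(sizes: Counter, top_n: int = 10) -> Dict[str, object]:
    """Summarize how many reads fall into the top_n largest clusters."""
    ordered = sorted(sizes.values(), reverse=True)
    total = sum(ordered)
    top = ordered[:top_n]
    return {
        "clusters": len(ordered),
        "reads": total,
        "top_sizes": top,
        # Fraction of all clustered reads that sit in the top_n clusters.
        "top_fraction": sum(top) / total if total else 0.0,
    }
```

Calling size_summary(cluster_sizes(open("final_clusters.tsv"))) shows what share of the reads the ten largest clusters hold. A heavy-tailed distribution, with a few very large clusters and many small ones, would be in line with the rough power law mentioned above.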

Answer 6 · 2022-03-24T05:51:09.000Z

Thank you very much Kristoffer!