ksahlin/isONclust

Working with huge datasets

timyerg opened this issue · 7 comments

Hello!
Thank you for the tool. I ran it with some samples, and the results are promising.
Currently, I am testing this tool for a pipeline that should handle many ONT barcoded samples.
I have already found that one needs to concatenate the samples into one file to cluster reads from different samples. The issue is that there may easily be a couple of hundred samples, so I am looking for ways to reduce the memory requirements.

So, I have the following questions:

  1. Is it possible to add support for gzipped files?
  2. What if, instead of clustering all pooled reads from many samples, I cluster sample by sample, use --consensus to get representative sequences (or use my own scripts to do that), pool the representative sequences across samples, and cluster them again to create clusters that are shared among all samples?
  3. I got the following header for consensus sequences: ">consensus_cl_id:448_total_supporting_reads:20057". However, for this test I only got cluster IDs up to 104, and 448 is definitely not a cluster ID. Am I misreading it?

I am especially interested in question 2; please share your opinion on that approach.

Best,

Hi @timyerg ,

Are the ONT barcoded samples transcriptomic reads or amplicons from genomic regions?

  1. It's possible, but it is not guaranteed that it will reduce peak memory, as we may need an uncompressed version of the reads at the end of the clustering. (CC @aljpetri, who is working on a continuation of isONclust)
  2. Yes, sounds good. What you describe is almost exactly how multi-threaded clustering works with isONclust (clustering batches of reads, then clustering representatives between clusters), so I think it's a good suggestion to overcome the memory issue. Note that the coverage of a sequence from one sample may be low, so you probably want to set a low coverage threshold for the consensus sequences to be outputted. A rough sketch of this two-pass setup follows after this list.
  3. I cannot explain this. Looking at lines 116-137 in
    for c_id, all_read_acc in sorted(clusters.items(), key = lambda x: (len(x[1]),representatives[x[0]][5]), reverse=True):
    as well as in the function detect_reverse_complements that this code segment calls, I cannot see that the variable c_id goes beyond the current cluster IDs from the clustering step. Is it possible this file is from an older clustering attempt that was not removed before a new clustering was run?
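Roughly, the two-pass setup from point 2 could look like the sketch below. It is only a sketch: the per-sample representatives file is a placeholder for however you extract one sequence per cluster (e.g. your own scripts), and you should double-check the flags against `isONclust --help`.

```python
# Hedged sketch of the per-sample approach from question 2.
# Assumes isONclust is on PATH; "representatives.fastq" is a placeholder,
# not a file isONclust is guaranteed to write.
import subprocess
from pathlib import Path

samples = sorted(Path("demultiplexed").glob("*.fastq"))
pooled = Path("pooled_representatives.fastq")

with pooled.open("w") as out:
    for fq in samples:
        outdir = Path("per_sample") / fq.stem
        # Pass 1: cluster each barcoded sample on its own.
        subprocess.run(
            ["isONclust", "--ont", "--fastq", str(fq), "--outfolder", str(outdir)],
            check=True,
        )
        # Collect one representative/consensus sequence per cluster
        # (placeholder path; produce this however suits your pipeline).
        reps = outdir / "representatives.fastq"
        if reps.exists():
            out.write(reps.read_text())

# Pass 2: cluster the pooled representatives to link clusters across samples.
subprocess.run(
    ["isONclust", "--ont", "--fastq", str(pooled), "--outfolder", "pooled_clustering"],
    check=True,
)
```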

Are the ONT barcoded samples transcriptomic reads or amplicons from genomic regions?

Yes, they are amplicons.

It's possible, but it is not guaranteed that it will reduce peak memory

Yes, I suspected that, but I wanted to save some space. This part is minor, though; I tested it with 200 samples, and the merged file was not very large.
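For the gzip part, my current workaround is to stream the compressed per-sample files into the single pooled FASTQ, so only the merged file sits uncompressed on disk. A minimal sketch (the directory layout is just an assumption):

```python
# Hedged sketch: stream gzipped barcoded FASTQs into one pooled plain FASTQ,
# so the individual samples can stay compressed on disk. Paths are assumptions.
import gzip
from pathlib import Path

with open("pooled.fastq", "w") as out:
    for gz in sorted(Path("demultiplexed").glob("*.fastq.gz")):
        with gzip.open(gz, "rt") as fh:
            for line in fh:
                out.write(line)
```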

Yes, sounds good. What you describe is almost exactly how multi-threaded clustering works with isONclust (clustering batches of reads, then clustering representatives between clusters), so I think it's a good suggestion to overcome the memory issue. Note that the coverage of a sequence from one sample may be low, so you probably want to set a low coverage threshold for the consensus sequences to be outputted.

Initially, I thought it was a crazy idea. Thank you for your answer!
It turned out that your tool handles big datasets better than I expected, so I will perform more tests to decide whether to pool everything or run it sequentially.

Is it possible this file is from an older clustering attempt that was not removed before a new clustering was run?

No, I reran it a couple of times with the outputs deleted. But I will subsample the clusters anyway to avoid running spoa with too many sequences (>500,000), so I will just run spoa (or another algorithm) outside this tool.
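Something like the sketch below is what I have in mind for that step. It assumes one FASTQ per cluster and that the spoa binary prints a consensus FASTA to stdout; both are assumptions on my side.

```python
# Hedged sketch: cap reads per cluster, then run spoa outside isONclust.
# Assumes one FASTQ per cluster in clusters_fastq/ and spoa on PATH.
import random
import subprocess
from pathlib import Path

MAX_READS = 500  # subsample size per cluster
outdir = Path("consensus_out")
outdir.mkdir(exist_ok=True)

def read_fastq(path):
    """Yield 4-line FASTQ records as lists of lines."""
    with open(path) as fh:
        while True:
            rec = [fh.readline() for _ in range(4)]
            if not rec[0]:
                return
            yield rec

for cluster_fq in sorted(Path("clusters_fastq").glob("*.fastq")):
    records = list(read_fastq(cluster_fq))
    if len(records) > MAX_READS:
        records = random.sample(records, MAX_READS)
    sub = outdir / (cluster_fq.stem + ".subsampled.fastq")
    sub.write_text("".join("".join(rec) for rec in records))
    # spoa is assumed to print the consensus to stdout.
    result = subprocess.run(["spoa", str(sub)], capture_output=True, text=True, check=True)
    (outdir / (cluster_fq.stem + ".consensus.fasta")).write_text(result.stdout)
```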

Thank you for your quick response!

By the way, is it OK if I publish a pipeline/tool with your tool as part of the pipeline? Of course, with proper references.

Alright! You may want to check out https://github.com/ksahlin/NGSpeciesID (a spin-off from isONclust for amplicon data), which could be more specialised for what you are considering. You can, for example, control how many sequences are run through spoa with --max_seqs_for_consensus, as well as many other amplicon-related things.
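A run capping the reads used per consensus might look roughly like the sketch below; --max_seqs_for_consensus is the flag mentioned above, while the remaining flags are from memory, so please verify them against `NGSpeciesID --help`.

```python
# Hedged sketch of an NGSpeciesID run that caps reads per consensus.
# Only --max_seqs_for_consensus is taken from the comment above; the other
# flags are assumptions, so check `NGSpeciesID --help` before relying on them.
import subprocess

subprocess.run(
    [
        "NGSpeciesID",
        "--ont",
        "--consensus",
        "--fastq", "sample.fastq",
        "--outfolder", "ngspeciesid_out",
        "--max_seqs_for_consensus", "500",
    ],
    check=True,
)
```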

By the way, is it OK if I publish a pipeline/tool with your tool as part of the pipeline? Of course, with proper references.

Yes.

Wow, that's great!
Thank you

Sorry for asking more questions,

I checked NGSpeciesID, and it looks like a great tool, but there are several steps that I would like to change.
Does NGSpeciesID use the same clustering algorithm as isONclust? If so, I would like to use isONclust, since it would be easier for me to work with its output.

If the clustering algorithm is mostly the same, could you please advise whether I should change these parameters to optimize it for amplicons:

--min_shared (default: 5)
--mapped_threshold (default: 0.7)
--aligned_threshold (default: 0.4)
--batch_type "total_nt", "nr_reads", or "weighted" (default: total_nt)
--min_fraction (default: 0.8)
--min_prob_no_hits (default: 0.1)

Or are the clustering algorithms different, meaning I should run NGSpeciesID to get clusters and then use the clusters from its output?

Does NGSpeciesID use the same clustering algorithm as isONclust?

Yes.

If the clustering algorithm is mostly the same, could you please advise whether I should change these parameters to optimize it for amplicons:

You can check which parameters NGSpeciesID uses - I think it uses exactly the same values for the parameters you list, which means you don't have to change anything.
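If you ever want to spell them out explicitly anyway, a call with the defaults above would look roughly like the sketch below; only the six clustering parameters come from your list, and the I/O flags are assumptions to verify with `isONclust --help`.

```python
# Hedged sketch: an isONclust call with the six clustering parameters set to
# the defaults listed above. The --ont/--fastq/--outfolder flags are assumed.
import subprocess

subprocess.run(
    [
        "isONclust",
        "--ont",
        "--fastq", "pooled_amplicons.fastq",
        "--outfolder", "isonclust_out",
        "--min_shared", "5",
        "--mapped_threshold", "0.7",
        "--aligned_threshold", "0.4",
        "--batch_type", "total_nt",
        "--min_fraction", "0.8",
        "--min_prob_no_hits", "0.1",
    ],
    check=True,
)
```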