ksahlin/isONclust

Is it possible to use isONclust to help determine whether two reads came from the same isoform?

lauraht opened this issue · 15 comments

Hi Kristoffer,

I understand that each cluster generated by isONclust represents all reads that came from the same gene.

However, I was looking for a de novo tool/method that could determine whether two overlapped reads (that have a set of matching minimizers) came from the same isoform for ONT reads. So I was wondering if there is any way in which I may use isONclust to help determine this? Or is it possible to use isONclust in an alternative or modified way to cluster reads that came from the same isoform?

Since ONT reads have high error rates, those unmatched gaps/regions (with no matching minimizers) between two matched/overlapped reads may be caused by either sequencing errors or alternative splicing, so it seems challenging to determine whether two reads came from the same isoform for ONT reads. I would appreciate your advice.

Thank you very much!

Hi @lauraht,

You are right that it can be challenging to distinguish between structural differences and accumulated errors in regions using only the minimizers.

There is no tailored way to do this in isONclust. However, if runtime is not an issue for you (e.g., relatively small datasets), you can probably do what you want by setting

--mapped_threshold 0.95 --aligned_threshold 0.95

or even --mapped_threshold 0.99 or 1 to always invoke alignment. This will be slower but should work 'ok' because a base-level alignment will always be invoked between a read and a "representative" (a representative is an isoform in this setting).

At the base-level alignment, distinguishing between structural differences and read errors tends to be much easier. With --aligned_threshold 0.95, the two sequences will be treated as the same isoform if there are no larger differences between them (at most 5% structurally different, i.e., longer indels). You can also set --aligned_threshold 0.97 or 0.99 here.

To go further, you can in this run-setting also call the experimental parameter --consensus to make isONclust produce a consensus sequence of the reads representing the same isoform. This is however experimental and has not been evaluated.

Best,
Kristoffer

Hi Kristoffer,

Thank you so much for your advice! I really appreciate it.

I will give it a try then.

Thank you very much again!

Hi Kristoffer,

I tried a dataset with 450 ONT reads. This dataset contains a query read (let’s call it read-1) and its matched reads (449 matched reads). These matched reads were obtained by running minimap2 read-to-read overlaps (-x ava-ont). Minimap2 reports the number of matching minimizers for each read-pair, so I got the numbers of matching minimizers between read-1 and each of its matched reads.
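The per-overlap minimizer counts mentioned above can be pulled from minimap2's PAF output. A sketch, assuming the cm:i tag (number of chained minimizer anchors) is what was used as the "matching minimizers" count; the PAF line and read ids below are made up for illustration:

```python
# Hedged sketch: parse minimap2 all-vs-all PAF output (from `minimap2 -x ava-ont`)
# and collect, for each overlap involving the query read, the cm:i tag
# (number of chained minimizer anchors). The example line is hypothetical.

def minimizer_matches(paf_lines, query_id):
    """Map partner read id -> number of chained minimizers shared with query_id."""
    matches = {}
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:  # 12 mandatory PAF columns, then optional tags
            continue
        qname, tname = cols[0], cols[5]
        if query_id not in (qname, tname):
            continue
        partner = tname if qname == query_id else qname
        for tag in cols[12:]:
            if tag.startswith("cm:i:"):
                matches[partner] = int(tag.split(":")[2])
    return matches

paf = [
    "read-1\t1000\t0\t900\t+\tread-2\t1100\t100\t1000\t850\t900\t60"
    "\tNM:i:50\ttp:A:S\tcm:i:42\ts1:i:300",
]
print(minimizer_matches(paf, "read-1"))  # {'read-2': 42}
```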

I first tried --mapped_threshold 0.95 --aligned_threshold 0.95.
“Passed alignment criteria” is always 0. But there are many “Passed mapping criteria”.
There are 57 clusters larger than 1. And read-1 is in cluster 5 that contains 15 reads.

I then tried --mapped_threshold 1.0 --aligned_threshold 0.95 to always invoke alignment.
“Passed mapping criteria” is always 0 as expected. There is only 1 “Passed alignment criteria”.
So 449 clusters are singletons.

I then tried --mapped_threshold 1.0 --aligned_threshold 0.9.
“Passed mapping criteria” is always 0. There are some (15) “Passed alignment criteria”.
There are 10 clusters larger than 1, but their cluster sizes are just 2.

So I tried --mapped_threshold 1.0 --aligned_threshold 0.8.
“Passed mapping criteria” is always 0. But there are many “Passed alignment criteria”.
There are 49 clusters larger than 1. And read-1 is in cluster 3 that contains 19 reads.

I also used --ont in all above runs.

My goal is to identify a bunch of matched reads that came from the same isoform as read-1 did.
Which of the above settings do you think makes more sense for this purpose?

I further looked into those reads that are in the same cluster as read-1. I found that in both the first setting (0.95, 0.95) and the last setting (1.0, 0.8), the reads that share the same cluster with read-1 have medium or low numbers of matching minimizers with read-1 (the numbers of matching minimizers were obtained from minimap2). However, those reads that have the highest numbers of matching minimizers with read-1 are not in the same cluster as read-1. I was wondering why this is the case? I thought the matched reads with more matching minimizers with read-1 should be more likely in the same cluster as read-1, since they are more likely to come from the same isoform as read-1. So I don’t quite understand why the result is a bit opposite. I would appreciate your advice.

I am attaching the fastq file of these 450 reads. The first read in the file is read-1. read1_matched_reads.fastq.gz

Thank you very much for your help!

I think I know what is going on.

First off, isONclust does not handle reverse complements; therefore, it will put all the forward-oriented reads in one cluster and the reverse-oriented reads in another. I should implement an option to merge them. This is part of why you get at least 2 clusters with isONclust. I did get 2 clusters with your data running

./isONclust --fastq read1_matched_reads.fastq --k 10 --w 11 --outfolder read1_matched_reads_out/

This gives 2 large clusters (clusters 0 and 1) and some singletons. I ran the command below to produce a fastq file for each separate cluster:

./isONclust write_fastq --fastq read1_matched_reads.fastq --clusters  read1_matched_reads_out/final_clusters.tsv --outfolder read1_matched_reads_out/fastq/

For the parameters I suggested, you probably get very fragmented results because your reads contain poly-A tails and adapters/barcodes after the poly-A sequences. The poly-A tails are of very different lengths (see your fast[q/a] files). Another way to see this is by taking an example read from each formed cluster and aligning it to the human genome using BLAT.

So, I would suggest running pychopper to remove the poly-A tails and the barcodes. This should result in higher similarity. You can also run isONcorrect on the two/four largest clusters after pychopper to get even higher identity.

Also, your dataset is small enough to inspect by eye. If you take the fastq files of isONclust clusters 0 and 1 from the command above, you can multi-align the reads in either cluster with seaview. This helps intuition.

I also tried --mapped_threshold 1.0 --aligned_threshold 0.7 --k 10 --w 11. This gives 4 large clusters (two sequences, each represented by its forward and reverse strands) and some singletons.

As for the aligned_threshold in isONclust: isONclust calculates aligned coverage with respect to the cluster representative.

true transcript:     ----------------------------------
read1:                    ----------------------------  (assume representative)
readX:               --------------------------------

In the scenario above, readX will have an alignment coverage of, say, 0.85, because read1 does not cover the start of readX.

If we flip the scenario (see below figure), the coverage of read1 to readX may be 0.95.

true transcript:     ----------------------------------
readX:               --------------------------------  (assume representative)
read1:                    ---------------------------- 

The coverage gets worse if you have differences in the ends (poly-A tail lengths / barcodes):

true transcript:     ----------------------------------
read1:        XXXXXXXXX----------------------------  (assume representative)
readX:     XXXXXXXXXX--------------------------------

In the figure above, the X's denote differing end sequences from poly-A tails etc.
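The asymmetry in the scenarios above can be made concrete with a toy computation. The intervals below are hypothetical positions on the true transcript, not values from the actual data:

```python
# Toy sketch: aligned coverage is computed w.r.t. the cluster representative,
# so the result depends on which read happens to be the representative.

def aligned_coverage(member, representative):
    """Fraction of `member` covered by its overlap with `representative`."""
    overlap = max(0, min(member[1], representative[1]) - max(member[0], representative[0]))
    return overlap / (member[1] - member[0])

read1 = (5, 33)   # shorter read, missing the transcript start
readX = (0, 32)   # near-full-length read

# read1 as representative: readX is only partially covered
print(round(aligned_coverage(readX, read1), 2))  # 0.84
# readX as representative: read1 is almost fully covered
print(round(aligned_coverage(read1, readX), 2))  # 0.96
```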

Also, about the number of shared minimizers: it is difficult to state anything definitive about this.

You can try a denser sampling of minimizers (as proposed in the previous message). Your long poly-A (poly-T) tails may get you in trouble here though, as they may all or none be selected as minimizers (depending on the implementation in minimap2 and isONclust, e.g., masking of frequent k-mers etc.).
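To see why poly-A tails interact badly with minimizer sampling, here is a toy minimizer selector (lexicographically smallest k-mer per window; real implementations typically hash k-mers and mask frequent ones, so this is only illustrative):

```python
# Toy minimizer sketch: every window over a poly-A tail selects the same k-mer,
# so the tail contributes essentially no discriminative minimizers.

def minimizers(seq, k, w):
    """Set of minimizers, with w = window length in bases (isONclust-style w)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    span = w - k + 1  # number of consecutive k-mers per window
    for i in range(len(kmers) - span + 1):
        selected.add(min(kmers[i:i + span]))  # lexicographic stand-in for a hash
    return selected

print(len(minimizers("A" * 40, 10, 11)))  # 1: the whole tail collapses to "AAAAAAAAAA"
print(len(minimizers("ACGTTGCATGACGTAGCTAGGCTAACGTGCATGCAACGTG", 10, 11)))
```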

Hi Kristoffer,

Thank you so much for your detailed explanation and advice!

I will run pychopper to remove the poly-A and the barcodes then.

I saw that you used --k 10 --w 11 instead of --ont (which sets --k 13 --w 20). I was wondering if there is any special consideration behind this choice of k and w values besides a denser sampling of minimizers?

If I use --mapped_threshold 1.0 --aligned_threshold 0.7 to always invoke alignment, do the k and w values for minimizers still matter?

According to what you explained about the alignment coverage w.r.t. the cluster representative, it seems that the clustering result may be impacted if different reads are used as the cluster representative. In this case, should I just use a looser --aligned_threshold (e.g. 0.7 or 0.75) to cluster the reads from the same isoforms?

Thank you very much!

Hi @lauraht,

The k and w are used to find initial candidate clusters, so yes, they matter. An alignment won't be tested between a read and a representative that shares no (or very few) minimizers. I'm not sure whether it has any effect on your data, but since your dataset is so small, you can afford a very low k and dense sampling with a low w.

Yes, a looser aligned_threshold is probably good now that I have seen what your data looks like.

Best,
K

Thank you so much Kristoffer!

I just wanted to confirm one thing with you: since w cannot be smaller than k, the definition of w in isONclust is a bit different from that in minimap2, right? In minimap2 w specifies a window of w consecutive k-mers, while in isONclust w specifies the actual window length, right? So --k 10 --w 11 means the window has 2 consecutive k-mers, and --k 13 --w 20 means the window has 8 consecutive k-mers, right?

It seems that pychopper identifies, orients, and trims full-length cDNA reads. I need to keep and trim all reads including both full-length and non-full-length reads. I also have a dataset of direct RNA reads. So I was wondering if you know some good alternative tool that does trimming on all reads for Nanopore?

I also looked into porechop, which seems to trim adapters for Nanopore reads without limiting to full-length reads. But it seems that porechop only trims adapters (it did not mention poly-A tails). And porechop is no longer maintained, so it might be a bit out of date.

Thank you so much for all the help!

Yes, you are right, it is defined differently in isONclust - in the way you are describing it.
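In code, the relationship between the two conventions is just a one-liner: in isONclust, w is the window length in bases, so a window holds w - k + 1 consecutive k-mers (minimap2's w is that k-mer count directly):

```python
# Number of consecutive k-mers in an isONclust-style window of w bases.
def kmers_per_window(k, w):
    return w - k + 1

print(kmers_per_window(10, 11))  # 2
print(kmers_per_window(13, 20))  # 8
```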

Unfortunately, I don't know of such a tool. Someone should write one. Perhaps reach out to the author of porechop to see if it supports your scenario?

Hi Kristoffer,

Currently this small dataset (450 reads) took ~10 seconds to cluster with isONclust using 85 cores in parallel, which is fast. I have 800,000 datasets of roughly similar size (some have more reads, some fewer), so the estimated runtime for finishing all of them would be ~92.6 days, which does not seem feasible with our current computing resources.

If I could reduce the runtime to 1-2 seconds per dataset, it would be feasible. In fact, I do not need to cluster all the reads in this dataset; I only need to find out which reads are in the same cluster as read-1 (the read of interest). In this case, if I treat read-1 as the representative of a cluster and only find out which reads belong to this single cluster by using the alignment threshold, I guess the runtime would be reduced, right? I was wondering if there is a way to use isONclust for this reduced usage? Or would it be possible to use some part of isONclust for doing this partial clustering?

Another question is, given that a dataset with 450 reads took ~10 seconds to do full clustering with always invoking the alignment (using 85 cores), what would be the estimated runtime for a dataset of 800,000 reads for doing full clustering with always invoking the alignment (using 85 cores)?

Thank you so much for all your help!

Hi @lauraht,

I suggest you run each instance with one core and instead start multiple isONclust jobs (85 if you have 85 cores) in parallel. There is a lot of overhead in starting 85 processes (cores) with Python. I wouldn't advise parallelizing any dataset under 100k reads.
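One way to sketch this job-level parallelization (one single-threaded isONclust run per small dataset, many runs in flight at once). The file paths are hypothetical, and `dry_run` only builds the command lines rather than executing them:

```python
# Hedged sketch: launch one single-threaded isONclust job per dataset, with up
# to 85 jobs running concurrently. Paths and the isONclust location are
# placeholders; dry_run=True returns the commands instead of running them.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def isonclust_cmd(fastq, outfolder):
    return ["./isONclust", "--fastq", fastq, "--k", "10", "--w", "11",
            "--t", "1", "--outfolder", outfolder]

def run_all(fastqs, max_workers=85, dry_run=True):
    cmds = [isonclust_cmd(f, f + "_out/") for f in fastqs]
    if not dry_run:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(lambda c: subprocess.run(c, check=True), cmds))
    return cmds

cmds = run_all(["dataset_000001.fastq", "dataset_000002.fastq"])
print(cmds[0])
```

Threads suffice here because each worker just blocks on an external process; a shell script with `xargs -P` would do the same job.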

In fact, I ran

./isONclust --fastq read1_matched_reads.fastq --k 10 --w 11 --outfolder read1_matched_reads_out/ --t X

with X=1 and X=4. Running with X=1 (1.2 sec) is actually 2-3 times faster than X=4 (~2.8 sec). I'm assuming 85 cores is even slower (~11 sec, as you mention).

If your dataset above is an average dataset, and you can successfully parallelize the calls to isONclust (using e.g. a shell script or similar), then I get an approximate runtime of (800,000 * 1.2 / 85) / 3600 ≈ 3 hours. Even if you do not parallelize the job calls over your cores and run isONclust sequentially, it would be ~11 days.
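The back-of-the-envelope estimate above, spelled out (1.2 s per dataset, 800,000 datasets, 85 jobs in parallel vs. fully sequential):

```python
# Rough runtime estimate: per-dataset time scaled to the full workload.
per_dataset_s = 1.2
n_datasets = 800_000
cores = 85

parallel_hours = n_datasets * per_dataset_s / cores / 3600
sequential_days = n_datasets * per_dataset_s / 86_400

print(f"{parallel_hours:.1f} hours in parallel")   # ~3.1 hours
print(f"{sequential_days:.1f} days sequentially")  # ~11.1 days
```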

Best,
Kristoffer

Thank you so much Kristoffer!

I tried --t 1, and it indeed took only 1.16 seconds for --mapped_threshold 1.0 --aligned_threshold 0.7 --k 10 --w 11, and took 1.54 seconds for --mapped_threshold 1.0 --aligned_threshold 0.75 --k 10 --w 11. Really fast!

I also need to subset the fastq file, pulling the reads matched with the read of interest out of the original 800,000-read fastq file to form each small dataset before running isONclust. This subsetting also takes a couple of seconds, and it is performed sequentially 800,000 times while reading/processing the minimap2 output for the 800,000 reads. Each small dataset formed corresponds to one of the 800,000 reads.
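The subsetting step looks roughly like this (a sketch with a hypothetical in-memory fastq and made-up read ids; it assumes plain 4-line fastq records and matches on the first whitespace token of the @-line):

```python
# Hedged sketch: pull the reads matched to a read of interest out of a fastq by id.
# Streaming the big file once per query is the expensive part; keeping the file in
# memory (or indexing it once) avoids re-reading it 800,000 times.

def subset_fastq(lines, wanted):
    """Yield the 4-line records whose @-line id (first whitespace token) is in `wanted`."""
    for i in range(0, len(lines) - 3, 4):
        read_id = lines[i].split()[0].lstrip("@")
        if read_id in wanted:
            yield lines[i:i + 4]

fastq = ["@read-1 extra info", "ACGT", "+", "IIII",
         "@read-2", "TTTT", "+", "IIII"]
print(list(subset_fastq(fastq, {"read-2"})))  # [['@read-2', 'TTTT', '+', 'IIII']]
```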

So I was considering an alternative. Instead of performing isONclust on each of 800,000 small datasets, alternatively I could run isONclust on all 800,000 reads together. I think running isONclust on the small dataset would be easier to find the reads from the same isoform, as those reads in the small dataset were already identified as matched reads to the read of interest by minimap2 --- from the isONclust output I could then further identify those reads that came from the same isoform as the read of interest. Now, if I run isONclust on all 800,000 reads together with the same alignment threshold ( --mapped_threshold 1.0 --aligned_threshold 0.75), would the cluster that each read of interest belongs to in this case be similar to the cluster that each read of interest belongs to when running isONclust on each of those 800,000 small datasets (formed from its matched reads)?

I started running isONclust on all 800,000 reads together with --mapped_threshold 1.0 --aligned_threshold 0.75 --k 10 --w 11 --t 85. It has been running for 32 hours and is currently still running. I was wondering what the approximate time complexity of isONclust is? Is it quadratic or higher order?

I would appreciate your advice and thank you so much for all the help!

Because of the relatively stringent criteria that you apply (--mapped_threshold 1.0 --aligned_threshold 0.7), isONclust will form a lot of distinct but similar representatives. For each new read in the clustering, all representatives sharing some similarity with the current read need to be evaluated, which takes a lot of time (quadratic complexity in the worst case). In a very hand-waving fashion, isONclust has O(nm) complexity, where n is the number of reads and m is the number of representatives. isONclust thrives when there are relatively few representatives (small m).
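A toy model of that O(nm) behavior: each new read is compared against the existing representatives, and if none accepts it, it becomes a new representative itself. (This is a simplification of isONclust, which only evaluates representatives sharing minimizers with the read; the acceptance probability below is a made-up stand-in for how stringent the thresholds are.)

```python
# Toy simulation: stringent thresholds -> low acceptance probability -> many
# representatives -> the number of comparisons grows toward n^2 / 2.
import random

def simulate(n_reads, accept_prob, seed=0):
    rng = random.Random(seed)
    reps, comparisons = 0, 0
    for _ in range(n_reads):
        comparisons += reps              # compare against every representative
        if reps == 0 or rng.random() > accept_prob:
            reps += 1                    # no representative accepted the read
    return reps, comparisons

print(simulate(2000, accept_prob=0.99))  # loose thresholds: few reps, cheap
print(simulate(2000, accept_prob=0.50))  # stringent: many reps, expensive
```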

Perhaps you could run with default parameters on mapped_threshold and aligned_threshold (and with e.g. --k 12 --w 15) to cluster the reads into gene families, and then cluster with stringent parameters each isONclust "gene family" cluster?

Hi Kristoffer,

Thank you so much for your detailed explanation and suggestion!

One thing I noticed is that 85 cores were only used at the beginning; later on just 3-4 cores were used, and currently it is running with only 2 cores.

Another thing is, I was wondering if it would be possible to add a --shortname option for isONclust to output the read name with “the read id only” instead of the entire @ line (currently the entire @ line is concatenated together with “_” in the output file final_clusters.tsv)? This would save some runtime in post-processing the output file. Since the @ line also contains “_” in some part of the original text, the added “_” cannot be used as a delimiter to convert the long string of the read name back to its original form. So if it is possible to add this --shortname option, it would be really helpful. If you want me to submit this enhancement as a separate issue, please let me know.

Thank you very much for your help!

This is expected; the number of cores in use is reduced by 2 with each clustering pass. Perhaps the optimal number of cores for your data is 16 or 8 (or at least a power of 2).

Ok, I will consider this. Yes, please open a separate issue; just copy-pasting the text is sufficient.

Thanks!