How to proceed with the output after polishing the data?

Question

yeroslaviz opened this issue 2 years ago · 6 comments

I'm not sure I really understand the goal of this tool.

From what I have run so far, I have created a "clean" FASTQ file of the consensus transcripts from multiple clusters. I don't quite see how to find out how many clusters I have. The intermediate files have differing row counts that don't really fit together.
The last file after polishing has 700 rows. Does this mean that I have 175 clusters?

And now

What are these clusters? How can I continue with them?
Should this FASTQ be analyzed like a normal FASTQ file, as if it came straight out of the sequencing machine?

How can I match the clusters to genes or transcripts?

thanks

Assa

Answer 1 · 2022-08-23T09:55:06.000Z

Hi Assa,

The output of the polish step is a transcriptome. The cluster IDs are the same in the cluster and error-correction steps, and the polish step generates one transcript for each cluster.

the last file after polishing has 700 rows. Does this means that I have 175 clusters?
Yes. The clusters' IDs and numbers are provided in each header.
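
The arithmetic behind that answer (700 rows / 4 rows per FASTQ record = 175 records) can be checked with a minimal sketch like the one below. It assumes the polished file uses the strict four-line FASTQ layout with no wrapped sequences; the function name and the file name in the usage comment are illustrative, not part of RATTLE.

```python
def count_fastq_records(lines):
    """Return (record_count, headers) for an iterable of FASTQ lines.

    Assumes the strict 4-line layout: header, sequence, '+', qualities.
    """
    headers = []
    n_lines = 0
    for i, line in enumerate(lines):
        n_lines += 1
        if i % 4 == 0:
            # Every fourth line, starting with the first, must be a header.
            if not line.startswith("@"):
                raise ValueError(f"line {i + 1} is not a FASTQ header")
            headers.append(line.rstrip("\n")[1:])
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    return len(headers), headers


# Hypothetical usage (the file name is an assumption):
# with open("transcriptome.fq") as handle:
#     n_records, headers = count_fastq_records(handle)
#     print(n_records)
```

The returned headers can then be inspected by eye for the cluster IDs mentioned above; the sketch does not try to parse them, since the exact header layout is not shown in this thread.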

Thanks,
Eileen

Answer 2 · 2022-08-23T09:58:54.000Z

Thanks for the fast response.

What can I now do with these clusters?

Does it help to re-map them (e.g. with minimap2) to a reference?
How can I gain further information from this output?

Answer 3 · 2022-08-23T10:49:24.000Z

The output is a transcriptome, grouped into "gene bags", with the associated quantification (= number of reads giving rise to each transcript). What you can do with it depends on the sample and organism you're looking at. You could:

* map the transcriptome to a reference genome or annotation to identify potential new transcripts
* analyse the sequence of the transcripts to identify known and novel functional domains, repeat elements, etc.
* study the differential expression of transcripts across conditions, possibly in association with the novel sequences or interesting functions

Are you looking at multiple conditions? What organism are you looking at?

E.

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

Answer 4 · 2022-08-23T11:22:46.000Z

As this was only a test run with two samples, we don't really have different replicates or conditions yet, but the organism is C. elegans.

If I understand the tool correctly, each cluster should give back one transcript, or at least a group of transcripts unique to this cluster (Can one transcript be found in two or more clusters?).

This seems to me to be a very complicated method to do a differential expression analysis.

Creating the transcriptome for two different samples definitely won't produce the same consensus reads for each cluster (or gene bag, as you call it). So how can you compare them without re-mapping?

The logical next-step to me is to map the consensus transcriptome to a reference genome, but it wouldn't have the depth of the original fastq file, as a lot of the reads are gone and there are no real qualities in the fastq anymore (they are all "K").

Not really sure what you mean by mapping to annotation. BLASTing each transcript? That's also a lot of effort.

To me, the tool is missing the possibility to map each transcript to a gene, a transcript or something similar to quantify the results.
Sorry for being negative, but I would like to understand how to gain as much knowledge from this tool as possible.

thanks
Assa

Answer 5 · 2022-08-23T11:42:40.000Z

Hi Assa, I agree that for C. elegans you may not need to build a reference-free transcriptome, since you have a very good genome reference. Unless your hypothesis is that there might be transcripts produced from loci that are rearranged, polymorphic or mis-assembled in your particular individuals, or under certain conditions. You could uncover them this way, or use it as a prototype for studying a lesser-known C. elegans relative. Some answers below:

As this was only a test run with two samples, we don't really have different replicates or conditions yet, but the organism is C. elegans. If I understand the tool correctly, each cluster should give back one transcript, or at least a group of transcripts unique to this cluster (Can one transcript be found in two or more clusters?).

The output is a transcriptome, where each transcript is built from a cluster of reads. Transcripts are organised in gene clusters, like in a genome-based annotation, where each gene locus can have multiple transcripts. However, because of sequence similarity, with RATTLE very similar gene loci may end up in the same cluster. Each gene cluster can have one or more transcripts, but each transcript belongs to only one gene cluster. Also, each read is placed in only one transcript.

This seems to me to be a very complicated method to do a differential expression analysis.

That's right. The reference-based approach might be more effective in this case.

Creating the transcriptome for two different samples definitely won't produce the same consensus reads for each cluster (or gene bag, as you call it). So how can you compare them without re-mapping?

RATTLE can cluster reads from two or more samples, so that you can create one single transcriptome, and then extract the read counts for each transcript that belongs to each of the original samples. That enables the comparison across samples and conditions.
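
The counting idea described here can be sketched as follows. This is only an illustration under stated assumptions: the thread does not show how RATTLE reports read-to-transcript assignments, so both input dictionaries and the function name are hypothetical and would have to be built from your own files.

```python
from collections import Counter, defaultdict


def counts_per_sample(read_to_transcript, read_to_sample):
    """Build a {transcript: {sample: read_count}} table.

    Both arguments are hypothetical inputs (assumptions, not RATTLE output):
      read_to_transcript: read ID -> transcript ID of the joint transcriptome
      read_to_sample:     read ID -> name of the sample the read came from
    Each read is counted once, since each read belongs to one transcript.
    """
    counts = defaultdict(Counter)
    for read_id, transcript in read_to_transcript.items():
        counts[transcript][read_to_sample[read_id]] += 1
    return {transcript: dict(c) for transcript, c in counts.items()}


# Hypothetical usage with made-up IDs:
# table = counts_per_sample(
#     {"r1": "tx_a", "r2": "tx_a", "r3": "tx_b"},
#     {"r1": "sample1", "r2": "sample2", "r3": "sample1"},
# )
```

A table of this shape is what a downstream differential-expression tool would typically expect as input, one row per transcript and one column per sample.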

The logical next-step to me is to map the consensus transcriptome to a reference genome, but it wouldn't have the depth of the original fastq file, as a lot of the reads are gone and there are no real qualities in the fastq anymore (they are all "K").

This step is equivalent to mapping a list of transcript sequences (no read copies) to a genome. It won't be quantitative.

Not really sure what you mean by mapping to annotation. BLASTing each transcript? That's also a lot of effort.

For the computer ;-) Tools like BUSCO (https://busco.ezlab.org/) can annotate a transcriptome for potential gene functions (based on similarity with proteins)

To me, the tool is missing the possibility to map each transcript to a gene, a transcript or something similar to quantify the results.

RATTLE is a tool for reference-free transcriptome reconstruction. You are right that the next step should be something like performing comparisons with known annotations or annotations from other organisms, or something like BUSCO. All those types of analyses are already provided by other tools. What we solved with RATTLE was the construction of the transcriptome without a reference. The simplest way to perform those comparisons is to run minimap2 on the two FASTA files: the RATTLE transcriptome vs the annotation transcriptome. Another possibility is to map the RATTLE transcriptome to the C. elegans genome using minimap2 to confirm the accuracy of the built transcriptome. Both analyses could indicate whether you have something in the RATTLE transcriptome that is not yet known from the genome or the annotation, or could give you a benchmark for applying RATTLE to data from another worm species without a genome.
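
As a rough sketch of how the result of such a comparison could be screened, the function below reads minimap2's PAF output and lists RATTLE transcripts that have no alignment covering most of their length. The 0.9 coverage threshold, the function name and the file names in the comments are assumptions for illustration; the PAF column layout (query name, query length, query start, query end in the first four columns) is the standard one.

```python
def unmatched_transcripts(all_ids, paf_lines, min_query_cov=0.9):
    """Return sorted transcript IDs lacking a well-covering alignment.

    all_ids:       every transcript ID in the RATTLE transcriptome
                   (needed because PAF omits unaligned queries by default)
    paf_lines:     iterable of PAF lines, e.g. produced with something like
                   minimap2 -x map-ont annotation.fa rattle.fa > hits.paf
                   (file names are hypothetical)
    min_query_cov: assumed threshold on the fraction of the query aligned
    """
    best_cov = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        qname, qlen = fields[0], int(fields[1])
        qstart, qend = int(fields[2]), int(fields[3])
        if qlen == 0:
            continue
        cov = (qend - qstart) / qlen
        # Keep the best single-alignment coverage seen for this query.
        best_cov[qname] = max(best_cov.get(qname, 0.0), cov)
    return sorted(t for t in all_ids if best_cov.get(t, 0.0) < min_query_cov)
```

Transcripts returned by this function are only candidates for "something not yet known"; low coverage can also come from chimeric or poorly polished consensus sequences, so they would need a closer look.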

Sorry for being negative, but I would like to understand how to gain as much knowledge from this tool as possible.

The possibilities are unlimited. You only need to define the question. The limit is in the machine :-)

E.

Answer 6 · 2022-08-23T12:06:25.000Z

Thanks for the very elaborate answer.

I must admit that I started working with the tool mainly because the description also says "quantification". I know I don't need a reference-free transcriptome, but I was hoping to be able to do a quantification based on a reference-free transcriptome.
If I understand you correctly, a "simple" quantification would be easier and faster using a reference-dependent method such as NanoCount or bambu.

But the suggestion with BUSCO seems interesting and I'll give it a go to check the results.

thanks again.

Assa