Acribbs/tRNAnalysis

identified fragments are few and with non-normal length distribution

Opened this issue · 5 comments

Hi Adam,

if my method of getting the sequences of the identified fragments from the fasta file was correct, there might be something off in my application of the pipeline. I was not able to find any procedural errors, and also the standard output looks sensible at least.

However, the individual fragments produced are only a small fraction of what we identified using the MINTmap pipeline, and I noticed an interesting distribution of fragment lengths (see attached PDF).

nchars_seqs.pdf

Specifically, there are no fragments detected in certain areas where we usually see quite a lot of tRFs, such as 16-19 and 21-25. I am not sure how this can be rectified, but if you're interested I am attaching, of only one of the samples, the modified BED (extended by the sequence and MINTplate via my method) and the corresponding columns of the raw count tables resulting from the tRNAnalysis pipeline and the MINTmap (only the "exclusive" fragments) pipeline; maybe you can get some clues as to what I did wrong.

sample_40039_files.zip

Kind regards,
Sebastian

Hmm thanks for this. I will investigate further. You may not have done anything wrong and it could be due to conservative filtering/mapping implemented in our pipeline.

We tested our pipeline using our own data and an example online dataset and didn't observe that unusual distribution.

I will be away travelling for quite a lot of July but my student can have a look at this while im away. I suspect it could be down to the mapping procedure or filtering.

Do you have a small test input file I could use to investigate further (maybe a random downsampled file?)

No problem, thanks for looking into it! I have hundreds of samples from different experiments, just tell me exactly what would suit your purpose (i.e. approx. how many reads, how many samples, from the same experiment or different ones, we also have different read lengths..). Then I'll upload it in the coming days.

Kind regards,
Sebastian

Any file would do, maybe downsampled to a million reads.

My feeling is that it is related to the mapping step:

statement = """bowtie -n 3 -k 1 --best -e 800 --sam %(index_name)s %(fastq)s 2> pre_mapping_bams.dir/%(fastq_name)s.log |

Because I tested the code using quite short read sequencing and I think that specifying such a tight threshold for quality and setting -e 800 has removed some reads of certain length and could explain the unusual distribution.

BW,
Adam

subsample.fastq.gz

it is a human sample, from whole blood, already quality filtered with flexbar. hope this works!

kr, sebastian

Hi Sebastian,

Thanks for the sample.

The first thing I can say about your sample is that it doesn't have that many tRNAs in comparison to other data that I have seen. I ran benchmarking between MINTmap and tRNAnalysis and both cases there was quite a small proportion of the reads actually counting to tRNAs, quite a lot to miRNA though.

I have been benchmarking the alignment and I have improved the alignment for bowtie and I am now able to call twice as many tRNAs before. Also, when I look at the most significant expressed tRNAs they correlate well with MINTmap. I have made a push to the main repo so that bowtie can now be parametrised so the user has more control over the mapping options. I suspect this may have to be set for individual datasets given that the quality of mapping will depend on the distribution of the library.

I am now looking further into the fragment analysis.

I haven't seen many major issues when comparing with fragments between MINTmap and tRNAnalysis. However, I have noticed that there are some features in our bed file that are duplicated for certain tRNAs, but I wouldnt have thought that it would make too much of a difference. I will look at this when I get back from holiday and get back to you.