mhalushka/miRge3.0

Question about Read Length Distribution

Closed this issue · 3 comments

I run miRge3.0 like this,with Qiage UMI:
miRge3.0 -s SRR13077007.fastq -db miRBase -lib miRge3_Lib -on human -a AACTGTAGGCACCATCAAT --qiagenumi -umi 0,12 -o output_dir -cpu 10 -udd

Is the read length == len(pureSeq) + 12

Hi @hy101,

The read length distribution is based on the length of the templates/miRNAs and in your example it would be len(pureSeq). The distribution should have a higher depth around 20-25 nucleotides. If it is left skewed, then probably the adapters were not trimmed well. I hope this answers your question.

Thank you,
Arun.

final_seq = pureSeq + umi_seq

final_seq = fqreads.sequence + umi_seq
if int(len(final_seq)) >= int(min_len):
if str(final_seq) in readDict:
readDict[str(final_seq)]+=1
else:
readDict[str(final_seq)]=1

trimmed_pairs = list(readDict.items())
return (count, trimmed_pairs)

len(final_seq) is used to plot Read Length Distribution
for fqres_pairs in concurrent.futures.as_completed(future):
a , b = fqres_pairs.result()
count+= a
for each_list in b:
varx = list(each_list)
try:
visual_treat['rlen'][str(inFileBaseArray[index])].append(len(varx[0]))
except KeyError:
visual_treat['rlen'][str(inFileBaseArray[index])]= [len(varx[0])]

rlenDistID = "readLengthID_" + str(div_idnum)
val = visual_treat['rlen'][sample_files]
#print(sample_files)
maxVal = sorted(val)[-1]
minVal = sorted(val)[0]
hist, bins = np.histogram(val, bins=(maxVal-minVal))
hist = hist.tolist()
#print(hist, bins)
histData.readLenDist(rlenDistID, sample_files, str(hist), str(list(bins)))

@hy101,

You are right. It does take the length of both template and umi for plotting read length distribution. Thank you for bringing it to our attention. I will make a change for this in the next release. If you wish to make PR, please go ahead.

Thank you,
Arun.