Question about Read Length Distribution

I run miRge3.0 like this，with Qiage UMI:
miRge3.0 -s SRR13077007.fastq -db miRBase -lib miRge3_Lib -on human -a AACTGTAGGCACCATCAAT --qiagenumi -umi 0,12 -o output_dir -cpu 10 -udd

Is the read length == len(pureSeq) + 12

Hi @hy101,

The read length distribution is based on the length of the templates/miRNAs and in your example it would be len(pureSeq). The distribution should have a higher depth around 20-25 nucleotides. If it is left skewed, then probably the adapters were not trimmed well. I hope this answers your question.

Thank you,
Arun.

final_seq = pureSeq + umi_seq

miRge3.0/mirge/libs/digest.py

Lines 339 to 344 in 13c9eda

    
           final_seq = fqreads.sequence + umi_seq 
        
           if int(len(final_seq)) >= int(min_len): 
        
               if str(final_seq) in readDict: 
        
                   readDict[str(final_seq)]+=1 
        
               else: 
        
                   readDict[str(final_seq)]=1

miRge3.0/mirge/libs/digest.py

Lines 357 to 358 in 13c9eda

    
               trimmed_pairs = list(readDict.items()) 
        
           return (count, trimmed_pairs)

len(final_seq) is used to plot Read Length Distribution

miRge3.0/mirge/libs/digest.py

Lines 141 to 149 in 13c9eda

    
           for fqres_pairs in concurrent.futures.as_completed(future): 
        
               a , b = fqres_pairs.result() 
        
               count+= a 
        
               for each_list in b: 
        
                   varx = list(each_list) 
        
                   try: 
        
                       visual_treat['rlen'][str(inFileBaseArray[index])].append(len(varx[0])) 
        
                   except KeyError: 
        
                       visual_treat['rlen'][str(inFileBaseArray[index])]= [len(varx[0])]

miRge3.0/mirge/libs/digest.py

Lines 265 to 273 in 13c9eda

    
           rlenDistID = "readLengthID_" + str(div_idnum) 
        
           val = visual_treat['rlen'][sample_files] 
        
           #print(sample_files) 
        
           maxVal = sorted(val)[-1] 
        
           minVal = sorted(val)[0] 
        
           hist, bins = np.histogram(val, bins=(maxVal-minVal)) 
        
           hist = hist.tolist() 
        
           #print(hist, bins) 
        
           histData.readLenDist(rlenDistID, sample_files, str(hist), str(list(bins)))

@hy101,

You are right. It does take the length of both template and umi for plotting read length distribution. Thank you for bringing it to our attention. I will make a change for this in the next release. If you wish to make PR, please go ahead.

Thank you,
Arun.

	final_seq = fqreads.sequence + umi_seq
	if int(len(final_seq)) >= int(min_len):
	if str(final_seq) in readDict:
	readDict[str(final_seq)]+=1
	else:
	readDict[str(final_seq)]=1

	trimmed_pairs = list(readDict.items())
	return (count, trimmed_pairs)

	for fqres_pairs in concurrent.futures.as_completed(future):
	a , b = fqres_pairs.result()
	count+= a
	for each_list in b:
	varx = list(each_list)
	try:
	visual_treat['rlen'][str(inFileBaseArray[index])].append(len(varx[0]))
	except KeyError:
	visual_treat['rlen'][str(inFileBaseArray[index])]= [len(varx[0])]

	rlenDistID = "readLengthID_" + str(div_idnum)
	val = visual_treat['rlen'][sample_files]
	#print(sample_files)
	maxVal = sorted(val)[-1]
	minVal = sorted(val)[0]
	hist, bins = np.histogram(val, bins=(maxVal-minVal))
	hist = hist.tolist()
	#print(hist, bins)
	histData.readLenDist(rlenDistID, sample_files, str(hist), str(list(bins)))