speed up panphlan_map.py with intervaltree?
nick-youngblut opened this issue · 0 comments
In the code:
with open(reads_file, mode='r') as IN:
    for line in IN:
        words = line.strip().split('\t')
        # words = CONTIG, POSITION, REFERENCE BASE, COVERAGE, READ BASE, QUALITY
        contig, position, abundance = words[0], int(words[1]), int(words[3])
        # For each gene in the contig, if position in range of gene, increase its abundance
        if contig in contig2gene.keys():
            for gene, (fr,to) in contig2gene[contig].items():
                if position in range(fr, to+1):
                    genes_abundances[gene] += abundance
# WRITE
if args.output == None:
    for g in genes_abundances:
        if genes_abundances[g] > 0:
            sys.stdout.write(str(g) + '\t' + str(genes_abundances[g]) + '\n')
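Here, for every line of the pileup, you loop over all genes on the contig and test each one via `if position in range(fr, to+1)`. Changing `contig2gene` to an intervaltree data structure will likely substantially speed this up, since each position lookup would then avoid scanning every gene.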
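To make the idea concrete, here is a minimal sketch of an indexed lookup. It is not the intervaltree package itself: to keep the example dependency-free it swaps in a sorted list plus `bisect` from the standard library, which serves the same purpose of finding the genes that cover a position without testing every gene. The helper names `build_index` and `genes_at` and the toy data are hypothetical. It assumes `contig2gene` maps contig -> gene -> `(fr, to)` with inclusive coordinates, as in the loop above.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(contig2gene):
    """Per contig: genes sorted by start, plus a running maximum of gene ends."""
    index = {}
    for contig, genes in contig2gene.items():
        # (start, end, gene) tuples sorted by start; coordinates are inclusive
        spans = sorted((fr, to, gene) for gene, (fr, to) in genes.items())
        starts = [fr for fr, _, _ in spans]
        max_end, running = [], float('-inf')
        for _, to, _ in spans:
            running = max(running, to)
            max_end.append(running)
        index[contig] = (starts, spans, max_end)
    return index


def genes_at(index, contig, position):
    """Yield every gene on `contig` whose inclusive range [fr, to] covers `position`."""
    if contig not in index:
        return
    starts, spans, max_end = index[contig]
    # Only genes that start at or before `position` can cover it
    i = bisect_right(starts, position) - 1
    # Walk left; stop as soon as no earlier gene extends far enough to reach `position`
    while i >= 0 and max_end[i] >= position:
        fr, to, gene = spans[i]
        if to >= position:
            yield gene
        i -= 1


# Toy data (hypothetical): two overlapping genes and one separate gene
contig2gene = {'contig1': {'geneA': (1, 100), 'geneB': (50, 150), 'geneC': (200, 300)}}
index = build_index(contig2gene)

genes_abundances = defaultdict(int)
# (contig, position, coverage) triples standing in for parsed pileup lines
for contig, position, abundance in [('contig1', 75, 3), ('contig1', 250, 2), ('contig1', 175, 5)]:
    for gene in genes_at(index, contig, position):
        genes_abundances[gene] += abundance
```

The index is built once before reading the pileup, so each line costs one binary search plus a short walk over nearby genes. In the worst case (one very long gene overlapping many others) the walk can still touch many entries. With the intervaltree package, the per-contig index would instead be an `IntervalTree` queried with `tree[position]`; its intervals are half-open, so an inclusive gene end `to` would be stored as `to + 1`.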