SegataLab/panphlan

speed up panphlan_map.py with intervaltree?

nick-youngblut opened this issue · 0 comments

In the code:

        with open(reads_file, mode='r') as IN:
            for line in IN:
                words = line.strip().split('\t')
                # words = CONTIG, POSITION, REFERENCE BASE, COVERAGE, READ BASE, QUALITY
                contig, position, abundance = words[0], int(words[1]), int(words[3])
                # For each gene in the contig, if position in range of gene, increase its abundance
                if contig in contig2gene.keys():
                    for gene, (fr,to) in contig2gene[contig].items():
                        if position in range(fr, to+1):
                            genes_abundances[gene] += abundance
        # WRITE
        if args.output == None:
            for g in genes_abundances:
                if genes_abundances[g] > 0:
                    sys.stdout.write(str(g) + '\t' + str(genes_abundances[g]) + '\n')

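For every line of the reads file, you loop over all genes on the contig and test each one with if position in range(fr, to+1), so the work per position grows with the number of genes on the contig. Changing contig2gene to an intervaltree data structure will likely speed this up substantially.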
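To illustrate the idea without adding a dependency, here is a rough stdlib-only sketch that uses bisect on sorted gene starts as a stand-in for an interval tree. It is not intervaltree's API and not the panphlan code. The helper names build_index and genes_at are hypothetical, and I am assuming contig2gene maps contig -> {gene: (fr, to)} with inclusive coordinates, as in the snippet above.

```python
from bisect import bisect_right


def build_index(contig2gene):
    """Per contig: genes sorted by start, plus a running maximum of gene ends.

    Assumes contig2gene looks like {contig: {gene: (fr, to)}} with inclusive
    coordinates (an assumption based on the loop in panphlan_map.py).
    """
    index = {}
    for contig, genes in contig2gene.items():
        ordered = sorted((fr, to, gene) for gene, (fr, to) in genes.items())
        starts = [fr for fr, _, _ in ordered]
        max_ends, running = [], float('-inf')
        for _, to, _ in ordered:
            running = max(running, to)
            max_ends.append(running)
        index[contig] = (starts, max_ends, ordered)
    return index


def genes_at(index, contig, position):
    """Return the genes whose inclusive range [fr, to] contains position."""
    if contig not in index:
        return []
    starts, max_ends, ordered = index[contig]
    hits = []
    # Index of the last gene that starts at or before position.
    i = bisect_right(starts, position) - 1
    # Walk left only while some gene at or before i still reaches position.
    while i >= 0 and max_ends[i] >= position:
        fr, to, gene = ordered[i]
        if to >= position:
            hits.append(gene)
        i -= 1
    return hits


# Toy data (made up for illustration): g1 and g2 overlap, g3 is separate.
contig2gene = {'c1': {'g1': (1, 10), 'g2': (5, 20), 'g3': (30, 40)}}
index = build_index(contig2gene)

genes_abundances = {'g1': 0, 'g2': 0, 'g3': 0}
# (contig, position, coverage) rows standing in for parsed lines of reads_file.
rows = [('c1', 7, 3), ('c1', 25, 4), ('c1', 30, 2), ('c2', 1, 9)]
for contig, position, abundance in rows:
    for gene in genes_at(index, contig, position):
        genes_abundances[gene] += abundance
```

The index is built once before reading the file, so each position costs a binary search plus a short walk over nearby genes instead of a scan over every gene on the contig. With heavily nested genes the walk can still be long, which is where a real interval tree such as intervaltree would do better.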