OLC-Bioinformatics/ConFindr

Consider using SeqIO.index instead of looping through the whole FASTA.

Closed this issue · 3 comments

While generating the genus-specific database, you loop through every record in the combined rMLST FASTA:

def setup_allelespecific_database(database_folder, genus, allele_list):
    """
    Since some genera have some rMLST genes missing, or two copies of some genes, genus-specific databases are needed.
    This will take only the alleles known to be part of each genus and write them to a genus-specific file.
    :param database_folder: Path to folder where confindr databases are stored.
    :param genus: Genus of organism, as a string. First letter should be capitalized, everything else lowercase
    :param allele_list: allele list generated by find_genusspecific_allele_list
    """
    with open(os.path.join(database_folder, '{}_db.fasta'.format(genus)), 'w') as f:
        sequences = SeqIO.parse(os.path.join(database_folder, 'rMLST_combined.fasta'), 'fasta')
        for item in sequences:
            if item.id in allele_list:
                f.write('>' + item.id + '\n')
                f.write(str(item.seq) + '\n')

Although I have not tested it, I think it might be faster to index the FASTA:

from Bio import SeqIO

# rMLST_DB and GENUS_DB stand in for the input and output FASTA paths.
index = SeqIO.index(rMLST_DB, "fasta")
seqs = [index[s] for s in allele_list]
SeqIO.write(seqs, GENUS_DB, "fasta")

Best. Anders.

I'd never come across SeqIO.index before - I'll test this out and see if it is faster.

This gives a huge speedup: genus-specific database setup is now 4-5 times faster. Thanks for the suggestion; it's implemented in version 0.4.6!

Awesome. Happy it helped.
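
For anyone wondering where a speedup like this can come from: an index records where each record starts in the file, so only the requested records have to be read in full. The sketch below shows that idea using only the standard library. It is a simplified illustration, not Biopython's actual implementation, and the function names (build_fasta_index, read_record, write_subset) and the allele IDs are made up for the example. Unlike the index[s] lookup suggested above, which raises a KeyError for an ID that is not in the FASTA, this sketch skips missing IDs, as the original loop did.

```python
import os
import tempfile


def build_fasta_index(fasta_path):
    """Map each record ID to the byte offset of its '>' header line."""
    offsets = {}
    with open(fasta_path, 'rb') as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b'>'):
                # The ID is the first whitespace-delimited token after '>'.
                fields = line[1:].split()
                if fields:
                    offsets[fields[0].decode()] = position
    return offsets


def read_record(handle, offset):
    """Return the raw text of the record that starts at the given offset."""
    handle.seek(offset)
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b'>'):
            break
        lines.append(line)
    text = b''.join(lines).decode()
    # Make sure consecutive records never run together in the output.
    return text if text.endswith('\n') else text + '\n'


def write_subset(fasta_path, wanted_ids, output_path):
    """Copy the wanted records to output_path, in the order of wanted_ids."""
    offsets = build_fasta_index(fasta_path)
    written = 0
    with open(fasta_path, 'rb') as source, open(output_path, 'w') as out:
        for record_id in wanted_ids:
            if record_id in offsets:  # silently skip IDs that are not present
                out.write(read_record(source, offsets[record_id]))
                written += 1
    return written


# Small demonstration with hypothetical allele IDs.
with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, 'combined.fasta')
    subset_path = os.path.join(tmp, 'subset.fasta')
    with open(source_path, 'w') as f:
        f.write('>BACT000001_1\nACGT\n>BACT000001_2\nACGA\n>BACT000002_1\nTTGA\n')
    count = write_subset(
        source_path, ['BACT000002_1', 'BACT000001_1', 'missing'], subset_path)
    with open(subset_path) as f:
        subset_text = f.read()
```

Note that records come out in the order of the requested IDs rather than in file order, which is also true of the index-based snippet above.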