Consider using SeqIO.index instead of looping through the whole FASTA.
andersgs commented
While generating the genus-specific DB, you loop through the whole combined database:
def setup_allelespecific_database(database_folder, genus, allele_list):
    """
    Since some genera have some rMLST genes missing, or two copies of some genes, genus-specific databases are needed.
    This will take only the alleles known to be part of each genus and write them to a genus-specific file.
    :param database_folder: Path to folder where confindr databases are stored.
    :param genus: Genus of organism, as a string. First letter should be capitalized, everything else lowercase
    :param allele_list: allele list generated by find_genusspecific_allele_list
    """
    with open(os.path.join(database_folder, '{}_db.fasta'.format(genus)), 'w') as f:
        sequences = SeqIO.parse(os.path.join(database_folder, 'rMLST_combined.fasta'), 'fasta')
        for item in sequences:
            if item.id in allele_list:
                f.write('>' + item.id + '\n')
                f.write(str(item.seq) + '\n')
Although I have not tested it, I think it might be faster to index the FASTA:
index = SeqIO.index(rMLST_DB, "fasta")
seqs = [index[s] for s in allele_list]
SeqIO.write(seqs, GENUS_DB, "fasta")
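For reference, the three lines above can be fleshed out into a self-contained function. This is only a sketch, not the code that ended up in the project: the function name `write_genus_db` and its parameters are hypothetical, and silently skipping allele IDs that are missing from the combined file is an assumed policy (a plain `index[s]` lookup would raise `KeyError` instead).

```python
from Bio import SeqIO


def write_genus_db(combined_fasta, genus_fasta, allele_list):
    """Copy the records named in allele_list from combined_fasta to genus_fasta.

    Hypothetical helper for illustration. Returns the number of records written.
    """
    # SeqIO.index scans the file once and remembers where each record starts;
    # a SeqRecord is only built when its ID is looked up.
    index = SeqIO.index(combined_fasta, "fasta")
    try:
        # Assumed policy: ignore IDs that are not present in the combined file.
        records = (index[allele] for allele in allele_list if allele in index)
        # SeqIO.write consumes the generator while the index is still open
        # and returns how many records it wrote.
        return SeqIO.write(records, genus_fasta, "fasta")
    finally:
        # The index keeps a file handle open, so release it explicitly.
        index.close()
```

Two differences from the original loop are worth keeping in mind: records come out in the order of `allele_list` rather than file order, and the `"fasta"` writer wraps sequence lines at 60 characters instead of writing each sequence on a single line.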
Best. Anders.
lowandrew commented
I'd never come across SeqIO.index before - I'll test this out and see if it is faster.
lowandrew commented
This gives a huge speedup: genus-specific database setup is now 4-5 times faster. Thanks for the suggestion; it's implemented in version 0.4.6!
andersgs commented
Awesome. Happy it helped.