Parse Organism Identifier from FASTA data
In short, we should parse the OX field from the FASTA header and add it to the dataframe. This field will let us build our dataset much more efficiently, because we can make far fewer queries to the UniProt REST API.
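As a rough sketch of the parsing step, assuming the standard UniProt header layout in which the organism identifier appears as `OX=<digits>`: the function name `parse_organism_id` and the example header (accession, entry name, and protein name) are made up for illustration, and only the `OX=9606` (Homo sapiens) value is real.

```python
import re

# UniProt FASTA headers carry the organism identifier as "OX=<digits>".
OX_PATTERN = re.compile(r"\bOX=(\d+)")


def parse_organism_id(header):
    """Return the OX value from a FASTA header line as an int, or None if absent."""
    match = OX_PATTERN.search(header)
    return int(match.group(1)) if match else None


# Made-up header for illustration; not a real UniProt entry.
header = ">sp|P00000|FAKE_HUMAN Made-up protein OS=Homo sapiens OX=9606 GN=FAKE1 PE=1 SV=1"
organism_id = parse_organism_id(header)
print(organism_id)  # 9606
```

The returned value could then be stored in a new dataframe column alongside the existing per-sequence fields. Returning `None` for headers without an OX field keeps malformed records from aborting the whole parse.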
The OX field provides the Organism Identifier. The 569,213 sequences in the UniProt database map to only 14,403 organisms. By performing lookups with the Organism Identifier rather than the sequence Unique Identifier, we can therefore reduce our total query count by a factor of roughly 40 (569,213 / 14,403 ≈ 39.5).
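One way the per-organism lookup might be structured is to query once per distinct identifier and reuse the result for every sequence that shares it. This is only a sketch: `lookup_by_organism` and `fetch_organism` are hypothetical names, and `fetch_organism` stands in for whatever function ends up calling the REST API.

```python
def lookup_by_organism(organism_ids, fetch_organism):
    """Call fetch_organism once per distinct ID and return {id: result}."""
    cache = {}
    for organism_id in organism_ids:
        if organism_id not in cache:
            cache[organism_id] = fetch_organism(organism_id)
    return cache


# Stub standing in for the real API call, so the sketch runs offline.
calls = []

def fake_fetch(organism_id):
    calls.append(organism_id)
    return {"organism_id": organism_id}


# Five sequences, but only two distinct organisms.
results = lookup_by_organism([9606, 9606, 10090, 9606, 10090], fake_fetch)
print(len(calls))  # 2
```

The results can then be joined back onto the dataframe by the organism identifier column.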
To put this in perspective, if a single query takes 1 second to complete, it would take almost 7 days to perform 569,213 lookups (lookup by sequence) versus only about 4 hours to perform 14,403 lookups (lookup by organism).
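The estimate can be checked with a few lines of arithmetic. The 1-second query time is the assumption stated above, not a measurement, and the constant names are chosen here for illustration.

```python
from datetime import timedelta

SEQUENCES = 569_213      # lookups if we query per sequence
ORGANISMS = 14_403       # lookups if we query per organism
SECONDS_PER_QUERY = 1    # assumed query latency

reduction = SEQUENCES / ORGANISMS
by_sequence = timedelta(seconds=SEQUENCES * SECONDS_PER_QUERY)
by_organism = timedelta(seconds=ORGANISMS * SECONDS_PER_QUERY)

print(f"{reduction:.1f}x fewer queries")  # 39.5x fewer queries
print(by_sequence)                        # 6 days, 14:06:53
print(by_organism)                        # 4:00:03
```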
References: