RNAcentral/rnacentral-sequence-search

Set the Z value dynamically according to the database used

carlosribas opened this issue · 5 comments

If someone searches only in miRBase, the Z value should be miRBase-specific.

Hi @blakesweeney. Just for the record, I added the esl-seqstat command to rnacentral-import-pipeline. The idea is to put the resulting file somewhere I can download it and parse the results.
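For reference, here is a rough sketch of how the parsing side could look. It assumes the esl-seqstat summary contains a "Total # residues:" line and that Z should be given in megabases (as nhmmer's -Z option expects); both are assumptions on my part, and the function names are made up for illustration.

```python
import re

# Matches the summary line that esl-seqstat is assumed to print,
# e.g. "Total # residues:    2500000".
_TOTAL_RESIDUES = re.compile(r"^\s*Total # residues:\s*(\d+)")


def total_residues(seqstat_output: str) -> int:
    """Return the residue count found in an esl-seqstat summary."""
    for line in seqstat_output.splitlines():
        match = _TOTAL_RESIDUES.match(line)
        if match:
            return int(match.group(1))
    raise ValueError("no 'Total # residues:' line found")


def z_in_megabases(residues: int) -> float:
    """Convert a residue count to megabases (assumed unit for Z)."""
    return residues / 1_000_000
```

With one such value per database, the search could pick the Z that matches whichever databases the user selected.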

Hey @blakesweeney! There is a problem running the esl-seqstat command on the pdbe file:

$ esl-seqstat pdbe-0.fasta
Parse failed (sequence file pdbe-0.fasta):
Line 6316: illegal character F

We also have this F character on lines 7466 and 12603. Any suggestions on how to solve this without fixing it manually?

Without looking at those sequences, I'm betting they are tRNAs and the F character is the amino acid attached to them. There are likely other cases with different characters as well. The easiest thing to do would be to exclude those sequences from search, but I'm not sure that is a good idea. Another choice is to strip those characters off the sequence, which has other possible issues. I'd lean toward doing a very crude modification of the sequences to strip off anything that is not ACGU, from the end of tRNA sequences only, but that is something that @AntonPetrov would need to weigh in on.
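To make the "crude modification" concrete, something like the sketch below is what I have in mind. It is only an illustration: the function name is hypothetical, it assumes the sequence is already known to be a tRNA, and it would also drop a trailing ambiguity code such as N, which may or may not be acceptable.

```python
import re

# Any run of characters other than A, C, G or U at the very end of the sequence.
_TRAILING_NON_ACGU = re.compile(r"[^ACGU]+$", re.IGNORECASE)


def strip_trailing_non_acgu(sequence: str) -> str:
    """Remove non-ACGU characters from the 3' end only.

    Characters inside the sequence are left untouched, so a sequence
    with an internal illegal character would still fail to parse.
    """
    return _TRAILING_NON_ACGU.sub("", sequence)
```

For example, strip_trailing_non_acgu("GCGGAUUUAGCUCF") gives "GCGGAUUUAGCUC", while "ACFGU" comes back unchanged.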

This is not a new problem: in previous releases we generated a special fasta file for the old search (the _excluded file contained all the exceptional sequences): http://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/13.0/sequences/.internal/

Is it possible to continue excluding some sequences from sequence search as before?

Sure, we can exclude them like we do currently. I'll add that filtering step to this export as well.
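Roughly, the filtering step could look like the sketch below. This is not the actual pipeline code: the function names are hypothetical, and the set of allowed characters (IUPAC nucleotide codes) is my assumption about what esl-seqstat will accept.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)

# Assumed set of characters that are safe to search (IUPAC nucleotide codes).
ALLOWED = set("ACGTURYSWKMBDHVN")


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_searchable(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records into (searchable, excluded) by their characters."""
    kept, excluded = [], []
    for header, seq in records:
        if set(seq.upper()) <= ALLOWED:
            kept.append((header, seq))
        else:
            excluded.append((header, seq))
    return kept, excluded
```

The excluded list would play the same role as the old _excluded file, and only the kept records would be passed to esl-seqstat and the search.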