About the output when using -O1
ljl000 opened this issue · 10 comments
Hi, ogotoh
I have a question about using spaln to align my protein sequences to genome sequences with the -O1 option. The result looks like this:
So, I want to know whether M(=)50 means the number of identities between the query and the genome sequences. Also, why does the alignment shown seem to be only just good enough?
Thanks a lot!
Yes, M means the number of identities between the query and the genome sequences. You can simply count the columns in which the translated residue in the first line ('J' should be read as 'S') is identical to the query residue in the third line of each alignment block. I made a small mistake in my previous response: actually, P = 100 * M / (M + N + U) rather than P = 100 * M / (M + N + G). So P may look smaller than you would intuitively expect from the alignment when it contains several long gaps.
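To make the counting rule and the corrected formula concrete, here is a minimal Python sketch. It is not part of spaln. The function names are hypothetical, and the sketch assumes that N counts mismatched columns and U counts unpaired (gapped) residues, which this thread does not spell out. It also assumes gaps are written as '-' or a space in the alignment lines. The numbers in the usage lines are made up, apart from M = 50.

```python
def count_identities(translated, query):
    """Count columns where the translated residue equals the query residue.

    `translated` is the first line of an alignment block and `query` the
    third line, both given as plain strings of equal length (assumption).
    """
    matches = 0
    for t, q in zip(translated, query):
        if t == "J":
            t = "S"  # per the thread, 'J' should be read as 'S'
        # Gap characters ('-' or ' ') are an assumption about the layout.
        if t == q and t not in "- ":
            matches += 1
    return matches


def percent_identity(m, n, u):
    """P = 100 * M / (M + N + U), as corrected in this thread."""
    total = m + n + u
    return 100.0 * m / total if total else 0.0


# Made-up example: same M and N, with and without long gaps.
print(count_identities("MJKL-A", "MSKV-A"))
print(percent_identity(50, 10, 0))
print(percent_identity(50, 10, 40))
```

With M and N fixed, a larger U increases the denominator, which is why P drops when the alignment contains long gaps.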
Osamu,
I know that the (=) number means the identical residues between the query and the genome sequences.
Just like this. But my question is how to explain the residues in the first line that are not the same as the query residues in the third line, like this:
I mean, what do these pairs of residues mean? They can't be explained simply as identical residues. Can I treat them as positive matches, as in tblastn?
Looking forward to your comment! Thank you again!
Remember that blast (including tblastn) is a local alignment tool, whereas spaln generates a semi-global alignment (unless you set the -LS option), which means it tries to align all the query residues to a specific range of the genomic sequence. If the query is not the direct product of the gene in the genomic sequence but a product of a homologous (paralogous or orthologous) gene of the same or another species, it is quite common to observe such mismatched pairs in the alignment. Is this the answer you expected, or do you want to ask something else?
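The difference between local and semi-global alignment can be illustrated with a toy Python sketch. This is not spaln's algorithm (spaln is splice-aware and uses its own scoring). The function name and the match/mismatch/gap values are arbitrary assumptions chosen only for illustration. The local score may drop a poorly matching end of the query, while the semi-global score has to account for every query residue, so mismatched pairs appear in the alignment.

```python
def align_scores(query, target, match=1, mismatch=-1, gap=-2):
    """Return (local_score, semi_global_score) under toy linear-gap scoring."""
    m = len(target)
    # Local: every cell is floored at 0, so weak ends are simply dropped.
    local_prev = [0] * (m + 1)
    best_local = 0
    # Semi-global: leading target residues are free (row 0 is all zeros),
    # but each query residue must be consumed (column 0 pays gap penalties).
    semi_prev = [0] * (m + 1)
    for i in range(1, len(query) + 1):
        local_cur = [0] * (m + 1)
        semi_cur = [i * gap] + [0] * m
        for j in range(1, m + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            local_cur[j] = max(0,
                               local_prev[j - 1] + s,
                               local_prev[j] + gap,
                               local_cur[j - 1] + gap)
            best_local = max(best_local, local_cur[j])
            semi_cur[j] = max(semi_prev[j - 1] + s,
                              semi_prev[j] + gap,
                              semi_cur[j - 1] + gap)
        local_prev, semi_prev = local_cur, semi_cur
    # Trailing target residues are free: take the best cell in the last row.
    return best_local, max(semi_prev)


# The query ends in W, which has no counterpart in the target.
print(align_scores("ACDW", "XXACDYXX"))
```

In this example the local alignment keeps only the ACD block, whereas the semi-global alignment must also place W, here as a mismatched W/Y pair.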
Osamu
Yes, my query is indeed not the direct product of the gene in the genomic sequence but a homolog from another species. So, for this kind of alignment, do you recommend setting the -LS option? Or do you have any other advice?
The alignment shown above looks fine, suggesting that spaln captured the correct gene structure. However, this gene appears to be intronless and so is relatively easy to predict. I usually use the -LS option for mapping cDNA (EST in particular) sequences but rarely use it for mapping protein sequences.
I only pasted the good result above; the other results usually do not look as good, because they are often fragmented. How can I improve the performance when mapping homologous protein sequences to genomic sequences? Is there any advice you could offer?
Simply put, your reference protein sequences and the target genomic sequence appear to be too remote to be faithfully aligned with spaln. Please refer to my original paper in Bioinformatics (2008) to get a rough estimate of spaln's limitations, although the limits vary considerably with genome size, sequence quality, intron density, and so on.
Although I have not examined it extensively myself, one potential solution is to use spaln in protein database search mode:
spaln -Q7 -a prodb [-M_N_] [other options] genomic_segment
where prodb might be SwissProt or another protein sequence database pre-formatted with makeidx.pl -a, and genomic_segment is a segment of your genome which may encode one or several genes.
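If you script this, the command line above could be assembled as in the following Python sketch. The helper name and the example file names are hypothetical. Only the options that appear in the command above (-Q7 and -a) are used, and the command is printed rather than executed. Any further options are left to the caller.

```python
import shlex


def build_spaln_search_cmd(prodb, genomic_segment, other_options=()):
    """Build the argument list for the protein database search mode above.

    `prodb` is the pre-formatted protein database and `genomic_segment`
    the genomic sequence file; both are placeholders supplied by the caller.
    """
    return ["spaln", "-Q7", "-a", prodb, *other_options, genomic_segment]


# Hypothetical file names, for illustration only.
cmd = build_spaln_search_cmd("prodb", "genomic_segment.fa")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the database has been prepared.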
Well, I think you may have misunderstood my question: my query is protein sequences and my reference sequences are the genomic sequence data. So, when you suggest running spaln -Q7 -a prodb [-M_N_] [other options] genomic_segment, you actually mean spaln -Q7 -d xxxgnm protein_sequence. Am I right?
Thanks for your comment. Indeed, my situation is not so easy to describe, but you have really helped me a lot with using spaln to solve my problems. I've sorted it out. Thanks again!