About the output by using -O1

Question

ljl000 opened this issue 5 years ago · 10 comments

Hi, ogotoh
I have a question about using spaln with the -O1 option to align my protein sequences to genome sequences. The result looks like this:

So, I want to know whether M(=)50 means the number of identities between the query and the genome sequence. And why does the result shown seem to be aligned only just well enough?
Thanks a lot!

Answer 1 · 2019-09-05T03:12:38.000Z

Yes, M is the number of identities between the query and the genome sequence. You can simply count the columns in each alignment block where the translated residue in the first line ('J' should be read as 'S') is identical to the query residue in the third line. I made a small mistake in my previous response: P = 100 * M / (M + N + U), not P = 100 * M / (M + N + G). So P may look smaller than you would intuitively expect from the alignment when the alignment contains several long gaps.

Osamu
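The counting rule and the corrected formula from the reply above can be sketched in Python. This is only an illustration: the function names and the example rows are made up, and it assumes that N is the number of mismatched columns and U the number of unpaired (gapped) residues, which the thread does not define explicitly.

```python
def count_identities(translated: str, query: str) -> int:
    """Count columns where the translated residue equals the query residue."""
    if len(translated) != len(query):
        raise ValueError("alignment rows must have the same length")
    matches = 0
    for t, q in zip(translated, query):
        t = "S" if t == "J" else t  # per the reply, 'J' is read as 'S'
        if t == q and t.isalpha():  # skip gap or padding characters
            matches += 1
    return matches


def percent_identity(m: int, n: int, u: int) -> float:
    """P = 100 * M / (M + N + U), as stated in the reply."""
    total = m + n + u
    if total == 0:
        raise ValueError("M + N + U must be positive")
    return 100.0 * m / total


# Hypothetical rows and counts, not taken from real spaln output.
print(count_identities("MJKLV-A", "MSKIVGA"))  # 5
print(percent_identity(50, 30, 20))            # 50.0
```

With U in the denominator, every residue inside a long gap lowers P, which may be why an alignment whose aligned columns look good can still report a modest percentage.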

Answer 2 · 2019-09-05T04:35:39.000Z

I know that the (=) number is the count of identical residues between the query and the genome sequence,
just like this. But my question is how to interpret the residues in the first line that are not the same as the query residues in the third line, just like this.
I mean, what do these pairs of residues mean? They can't be explained simply as identical residues. Can I treat them as positive matches, as in tblastn?
I look forward to your comment. Thank you again!

Answer 3 · 2019-09-05T05:37:57.000Z

Remember that blast (including tblastn) is a local alignment tool, whereas spaln generates a semi-global alignment (unless you set the -LS option), meaning that it tries to align all the query residues to a specific range of the genomic sequence. If the query is not the direct product of the gene in the genomic sequence but the product of a homologous (paralogous or orthologous) gene of the same or another species, it is quite common to observe such mismatched pairs in the alignment. Is this the answer you expected, or do you want to ask something else?

Osamu

Answer 4 · 2019-09-05T07:40:03.000Z

Yes, my query is indeed not the direct product of the gene in the genomic sequence but the product of a homolog from another species. So, for this kind of alignment, do you recommend setting the -LS option? Or do you have any other advice?

Answer 5 · 2019-09-05T09:00:20.000Z

The alignment shown above looks fine, suggesting that spaln captured the correct gene structure. However, this gene appears to be intronless and so relatively easy to predict. I usually use the -LS option for mapping cDNA (EST in particular) sequences but rarely use it for mapping protein sequences.

Answer 6 · 2019-09-05T09:24:57.000Z

I only pasted a good result above; the other results often do not look good enough, because they are frequently fragmented. How can I improve the performance when mapping homologous protein sequences to genomic sequences? Is there any advice you can offer?

Answer 7 · 2019-09-06T09:08:26.000Z

Simply put, your reference protein sequences and the target genomic sequence appear to be too remote to be faithfully aligned with spaln. Please refer to my original paper in Bioinformatics (2008) for a rough estimate of the limitations of spaln, although the limits vary considerably with genome size, sequence quality, intron density, and so on.
Although I have not examined it extensively myself, one potential solution is to use spaln in protein database search mode:
spaln -Q7 -a prodb [-M_N_] [other options] genomic_segment
where prodb might be SwissProt or another protein sequence database pre-formatted with makeidx.pl -a, and genomic_segment is a segment of your genome that may encode one or several genes.

Answer 8 · 2019-09-06T11:15:55.000Z

Well, I think you may have misunderstood my question: my queries are protein sequences and my reference sequences are the genomic sequence data. So, when you suggest running spaln -Q7 -a prodb [-M_N_] [other options] genomic_segment, do you actually mean spaln -Q7 -d xxxgnm protein_sequence? Am I right?

Answer 9 · 2019-09-11T08:06:32.000Z

Sorry for the delay in responding. I was away from my office until this morning. I don't know your exact situation, so I suggested a potential alternative way to use spaln. I guess you are trying to solve a difficult gene annotation problem in which no close transcript reference sequences are available. What protein sequences are you using as the references? If you find good tblastn hits but fail to find good spaln hits, you may use the alignment-only mode of spaln as: `spaln -Q[0-3] -d your_genome -O1 -T table '$chromosome/contig_id from to [<]' reference_aa` where from and to give the range of the potential gene region on the chromosome, and the optional '<' means that the gene resides on the reverse strand. However, when the reference and the target genome are evolutionarily distant, reliable gene structure prediction is difficult, as I said the other day.
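For scripting this over many tblastn hits, the alignment-only command quoted above could be assembled as follows. This is a rough sketch that only builds the argument list and does not run spaln. The helper names, file names, and coordinates are hypothetical, and it assumes the range argument is written exactly as in the reply, space-separated with a leading `$` before the chromosome or contig id.

```python
import shlex


def region_spec(seq_id: str, start: int, end: int, reverse: bool = False) -> str:
    """Format the '$id from to [<]' range argument quoted in the reply."""
    if start > end:
        raise ValueError("start must not exceed end")
    spec = f"${seq_id} {start} {end}"
    if reverse:
        spec += " <"  # per the reply, '<' marks a gene on the reverse strand
    return spec


def alignment_only_argv(genome_db, table, seq_id, start, end,
                        query_file, reverse=False, q_mode=0):
    """Build the argument list; only options named in the reply are used."""
    if q_mode not in (0, 1, 2, 3):
        raise ValueError("the reply gives -Q[0-3] for alignment-only mode")
    return ["spaln", f"-Q{q_mode}", "-d", genome_db, "-O1", "-T", table,
            region_spec(seq_id, start, end, reverse), query_file]


# Hypothetical region taken from a tblastn hit on the reverse strand.
argv = alignment_only_argv("your_genome", "table", "chr1", 10000, 25000,
                           "reference_aa.fa", reverse=True)
print(shlex.join(argv))  # shell-quoted form for inspection or logging
```

Passing the list to `subprocess.run` rather than a shell string would keep the `$` and `<` characters from being interpreted by the shell.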

Answer 10 · 2019-09-12T04:50:03.000Z

Thanks for your comment. My situation is indeed not easy to describe, but you have really helped me a lot with using spaln to solve my problems. I've sorted it out. Thanks again!