EddyRivasLab/infernal

cmalign on 16S sequences generating insertions only

Closed this issue · 3 comments

Dear Eric,

I've been using Infernal and the ssu-align models for 16S-based analyses for a couple of years now, and the tool has simply been working great on all kinds of different datasets I tried. For the first time, however, I have now encountered a problem where an entire dataset of 16S sequences (V2 region, length ~230bp) simply does not align.

Here are the first few lines of an example FASTA file:

>AfElphSD3_1246037
CTGGACCGTCTCTCAGTTCCAATGTGGCCGTTCAACCTCTCAGTCCGGCTACTGATCGCAGACTTGGTGAGCCGTTACCTCACCAACTATCTAATCAGACGCGAGCCCATCCTTGCGCAATAAATCTTTGATAGCAAAAACATGCGATTTCGCTATATTATGCGGTATTAGCTTCCGTTTCCAGAAGGTATCCCCCACACAAGGGCAGGTTG
>AfElphSD3_1050060
CTGGACCGTGTCTCAGTTCCAGTGTGGCCGTTCACCCTCTCAGGTCGGCTACGCATCCTCGCCTTGGTGGGCTATTATCTCACCAACTAGCTAATGCGCCGCAGGTCCATCTTTCAGCGAGGCTTTAAAGGCCCCTTTTAGAGATCGGAGATGCCTCCAATCCAATTATTCGGTATTAGCCAAGGTTTCCCTCGGGTATCCCAAACTGTAAGGCAGGTTACCTACGTGTTA
>AfElphSD3_1160041
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTCACCCTCTCAGGCCGGCTACTGATCGTCGCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATCAGACGCGGGTCCATCGTATACCACCTCAGTTTTTCACACTAGACCATGCGGTCCTGTGCGCTTATGCGGTATTAGCACCTATTTCTAAGTGTTATCCCCCTGTATACGGCAGGTTACCCACGCGTTACTCACCCG
>AfElphSD3_1001485
CTGGACCGTGTCTCAGTTCCAATGTGGCTGATCGTCCTCTCAGACCAGCTATAGATCGTAGGCTTGGTAGGCCTTTACCCCACCAACTACCTAATCTTCCGCGGGCCGATATTATAGCGATAAATCTTTCCCCCTCAGGGATTATGCGGTATTAGCAGAAGTTTCCCTCTGTTGTCCCCCACTACAATGTACGTTCCCACGTGTTACTCACCAGTGCGCTACTCT
>AfElphSD3_1003807
CTGGTCCGTGTCTCAGTACCAGTGTGGGGGGTAAGCCTCTCAGCTCCCCTAGGCATCGTAGGCTTGGTGGGCCGTTGCCCCGCCAACTACCTAATGCCACGCATGCCCATCTAAAACCAAGGTTCTTTCCCCGCCCCCTCATGCGGGGGGACGGGAATATGGGGTATTAGTTCGGGTTTCCCCGAGTTATCCCCCGGTTGAAGGTAGGTTGCATACGCGTTACGCACCCGTGCGCCGGTCGCCAGCGGAGCGTTG

I BLASTed these sequences, and they produce hits to the NCBI 16S nr database over the full length. However, when I run the following command in cmalign (Infernal v1.1.2) against the bacteria-0p1.1.cm model from ssu-align, the alignment fails (almost everything is small letters == insertions in the output):

cmalign --sub --notrunc --cpu 20 --outformat Pfam -g -o aligned.tmp <bacteria-0p1.1.cm> tmp.fasta

In fact, the output looks like this (sorry for posting these long lines):

AfElphSD3_1246037         ........................................................cuggaccgucucucaguuccaauguggccguucaaccucucaguccggcuacugaucgcagacuuggugagccguuaccucaccaacuaucuaaucagacgcgagcccauccuugcgcaauaaaucuuugauagcaaaaacaugcgauuucgcuauauuaugcgguauuagcuuccguuuccagaagguaucccccacacaagggcaggUUG------------------.---------------------------.-------------------------------------------------...-------------.--------------------------------.------..-----------------------------------------------------..----------------.----------..----...--------.-...-----------.------------------.--..--------..-------.....----------------.----.-----.--------.---------.-------------------------------------.--------.--------...-..-.----------------------------------------.----------------.----.-----------------------.----..---------------.--------------.-------------.-----------------------------------..------------------------------------.---------------------------------------------------------...---.--------------------------------..------------------........--------------------------------------------------.-----------...-------------------..----------------------------------------.--------------------------------------------------...---------------------------.--.---------------------------------..--------.---...---.-------.----------------.---------------.------------------------------.------------......----.-----..-----------------.----..--.....----....................................................................................................................................................................................................................................-----.......--.-------------.---------------------------.----..------.------------.------------.--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------..----------------.---------------------------------------------------------------------------------------------------------------------------------------------------.-----------------------------------------
#=GR AfElphSD3_1246037 PP ........................................................********************************************************************************************************************************************************************************************************************.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
AfElphSD3_1050060         .....................................cuggaccgugucucaguuccaguguggccguucacccucucaggucggcuacgcauccucgccuuggugggcuauuaucucaccaacuagcuaaugcgccgcagguccaucuuucagcgaggcuuuaaaggccccuuuuagagaucggagaugccuccaauccaauuauucgguauuagccaagguuucccucggguaucccaaacuguaaggcagguuaccuacgugUUA------------------.---------------------------.-------------------------------------------------...-------------.--------------------------------.------..-----------------------------------------------------..----------------.----------..----...--------.-...-----------.------------------.--..--------..-------.....----------------.----.-----.--------.---------.-------------------------------------.--------.--------...-..-.----------------------------------------.----------------.----.-----------------------.----..---------------.--------------.-------------.-----------------------------------..------------------------------------.---------------------------------------------------------...---.--------------------------------..------------------........--------------------------------------------------.-----------...-------------------..----------------------------------------.--------------------------------------------------...---------------------------.--.---------------------------------..--------.---...---.-------.----------------.---------------.------------------------------.------------......----.-----..-----------------.----..--.....----....................................................................................................................................................................................................................................-----.......--.-------------.---------------------------.----..------.------------.------------.--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------..----------------.---------------------------------------------------------------------------------------------------------------------------------------------------.-----------------------------------------
#=GR AfElphSD3_1050060 PP .....................................***************************************************************************************************************************************************************************************************************************************.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

And the log:

     1  AfElphSD3_1246037     212        1        3     3'    -70.21   1.000       0.00       0.00       0.10      2.77
     2  AfElphSD3_1050060     231        1        3     3'    -74.94   1.000       0.00       0.00       0.21      3.03
     3  AfElphSD3_1160041     231        1        8     3'    -76.49   0.999       0.00       0.00       0.13      3.02
     4  AfElphSD3_1001485     225        1        7     3'    -73.53   1.000       0.00       0.00       0.16      2.93
     5  AfElphSD3_1003807     255        1        3     3'    -83.91   1.000       0.00       0.00       0.22      3.32
     6  AfElphSD3_1033555     226        1        7     3'    -73.83   1.000       0.00       0.00       0.16      2.99

I've tried playing with the -g option, I've left out the --sub --notrunc, I've even tried reversing the sequences (the log I posted is on reversed seqs), but to no avail. I don't see why the alignment would fail in principle, so I suspect that I am doing something wrong.

Best,

Sebastian

Hi Sebastian,

I believe the problem is that the sequences are reverse complemented 16S fragments. You said that you reversed the sequences and retried the cmalign call, but did you reverse-complement them? When I did that I get good, high-scoring alignments.

The way I discovered this problem was by running the sequences through SSU-ALIGN (e.g. 'ssu-align tmp.fasta schmidt'). That identifies high scoring bacterial SSU fragments. Then by looking at the file schmidt/schmidt.scores, I could tell that the hits were to reverse complements of the input sequences (stop position is 1, start position is 212, for example). Let me know if you have other questions.

Eric

Dear Eric,

I admit that I feel a little stupid not to have thought of that! I will blame over-working and sleep-deprivation for that, but what you write makes a lot of sense. I will reverse-complement my sequences, re-run and report if everything goes smoothly (and also in the unlikely event that it does not...).

Thanks for the fast reply, and thanks once again for providing these great and useful tools :-)

Best,

Sebastian

Yep, the reverse-complements work. Thanks again!