project-gemmi/gemmi

sequence aligning with align_sequence_to_polymer

rimmartin opened this issue · 10 comments

Hi @wojdyr & others,

Been trying to use the aligment tool on https://github.com/project-gemmi/gemmi/blob/master/tests/1orc.pdb

        gemmi::Structure st1 = gemmi::read_pdb_file(filePath);
...
                gemmi::AlignmentScoring scoring;
                gemmi::AlignmentResult result = gemmi::align_sequence_to_polymer(st1.entities[entityIndex].full_sequence, st1.models[0].chains[chainIndex].get_polymer(), gemmi::PolymerType::Unknown, &scoring);
                std::cout<<st1.models[0].chains[chainIndex].name<<" "<<st1.entities[entityIndex].name<<" match_count: "<<result.match_count<<" "<<result.score<<std::endl;

and get

A A match_count: 0 1844108996

The match count is zero.
Is there something I'm not setting up correctly?
align_sequence_to_polymer the right function to use?

I couldn't reproduce it:

>>> st = gemmi.read_structure('tests/1orc.pdb')
>>> result = gemmi.align_sequence_to_polymer(st.entities[0].full_sequence, st[0][0].get_polymer(), gemmi.PolymerType.Unknown)
>>> result.match_count
64

I also got the same result from:

gemmi align -v --query=A --target=+A tests/1orc.pdb

Could you post a minimal code that can be compiled?

rhel6gemmiapp.tar.gz

I see it toggle between 0 & 4 on repeated runs

This is on a 64 bit Centos 7 and compiled with gcc-10.3.0-04202021

Build and run script in tar

                gemmi::AlignmentScoring scoring={ 1, -1, -1, -1, 0, -2, {}, {} };

works better
while

                gemmi::AlignmentScoring scoring;

may not be initialized
Seemed to be better but still zero match count...

Yes, that's it. AlignmentScoring is uninitialized in C++. I'll add member initializers, to make it consistent with Python, but I'll wait with it until I change the minimal standard to C++14.
Alternatively, you can use gemmi::AlignmentScoring::simple() as the last arg for align_sequence_to_polymer().

No idea why you still get zero match count, though.

Agreed. And low priority for changes for me since there is a way to initialize.

Will keep experimenting to see why I'm not getting a good match count; using partial_model

0 0 A A match_count: 0 score :2097088
1orc 0 aligned: MEQRITLKDYAMRFGQ-TKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVY--AEEVKDGEVKPFP----
1orc        : ................ ...................................  .............     cigar 16M1I35M2I13M4I
1orc alignment completed 

From miminal rhel6gemmiapp I now get the match count 64

Looking to replicate in our app. Thanks for the help/conversation today

Working in our app; returned to using gemmi::PolymerType::Unknown instead of st1.entities[0].polymer_type as the minimal app was set to gemmi::PolymerType::Unknown :-)

one last thing to report: same cigar as the python test "2I64M5I"

Hi @wojdyr ,

With the tests/1orc.pdb example when I add

            std::string sequenceRepresentation=result.add_gaps(gemmi::one_letter_code(st.entities[0].full_sequence), 2);

the output shows

 sequenceRepresentation: 
--MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKDGEVKPFP-----
 match_string: 
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||      cigar:2I64M5I

comparing to the pdb's SEQRES the one letter code starts at full sequence MET GLU GLN ARG ILE THR... where as the pdb coordinates start at GLN ARG ILE THR...

Ah so Lance pointed out you probably increment on the polymer when dashes come out; looking the which =2 is for polymer sequences and 1 is for full sequence

                                    std::string sequenceRespresentation=result.add_gaps(gemmi::one_letter_code((*it_chains).get_polymer().extract_sequence()), 2);

yields

1orc+H 0 aligned: 
--QRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKDGEVKPFPSN-----
1orc+H match_string: 
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||      cigar:2I64M5I

an enumeration for the 1 & 2 would help guys like me:-)