sequence aligning with align_sequence_to_polymer
rimmartin opened this issue · 10 comments
Hi @wojdyr & others,
Been trying to use the aligment tool on https://github.com/project-gemmi/gemmi/blob/master/tests/1orc.pdb
gemmi::Structure st1 = gemmi::read_pdb_file(filePath);
...
gemmi::AlignmentScoring scoring;
gemmi::AlignmentResult result = gemmi::align_sequence_to_polymer(st1.entities[entityIndex].full_sequence, st1.models[0].chains[chainIndex].get_polymer(), gemmi::PolymerType::Unknown, &scoring);
std::cout<<st1.models[0].chains[chainIndex].name<<" "<<st1.entities[entityIndex].name<<" match_count: "<<result.match_count<<" "<<result.score<<std::endl;
and get
A A match_count: 0 1844108996
The match count is zero.
Is there something I'm not setting up correctly?
align_sequence_to_polymer the right function to use?
I couldn't reproduce it:
>>> st = gemmi.read_structure('tests/1orc.pdb')
>>> result = gemmi.align_sequence_to_polymer(st.entities[0].full_sequence, st[0][0].get_polymer(), gemmi.PolymerType.Unknown)
>>> result.match_count
64
I also got the same result from:
gemmi align -v --query=A --target=+A tests/1orc.pdb
Could you post a minimal code that can be compiled?
I see it toggle between 0 & 4 on repeated runs
This is on a 64 bit Centos 7 and compiled with gcc-10.3.0-04202021
Build and run script in tar
gemmi::AlignmentScoring scoring={ 1, -1, -1, -1, 0, -2, {}, {} };
works better
while
gemmi::AlignmentScoring scoring;
may not be initialized
Seemed to be better but still zero match count...
Yes, that's it. AlignmentScoring is uninitialized in C++. I'll add member initializers, to make it consistent with Python, but I'll wait with it until I change the minimal standard to C++14.
Alternatively, you can use gemmi::AlignmentScoring::simple()
as the last arg for align_sequence_to_polymer().
No idea why you still get zero match count, though.
Agreed. And low priority for changes for me since there is a way to initialize.
Will keep experimenting to see why I'm not getting a good match count; using partial_model
0 0 A A match_count: 0 score :2097088
1orc 0 aligned: MEQRITLKDYAMRFGQ-TKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVY--AEEVKDGEVKPFP----
1orc : ................ ................................... ............. cigar 16M1I35M2I13M4I
1orc alignment completed
From miminal rhel6gemmiapp I now get the match count 64
Looking to replicate in our app. Thanks for the help/conversation today
Working in our app; returned to using gemmi::PolymerType::Unknown
instead of st1.entities[0].polymer_type
as the minimal app was set to gemmi::PolymerType::Unknown
:-)
one last thing to report: same cigar as the python test "2I64M5I"
Hi @wojdyr ,
With the tests/1orc.pdb example when I add
std::string sequenceRepresentation=result.add_gaps(gemmi::one_letter_code(st.entities[0].full_sequence), 2);
the output shows
sequenceRepresentation:
--MEQRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKDGEVKPFP-----
match_string:
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| cigar:2I64M5I
comparing to the pdb's SEQRES the one letter code starts at full sequence MET GLU GLN ARG ILE THR... where as the pdb coordinates start at GLN ARG ILE THR...
Ah so Lance pointed out you probably increment on the polymer when dashes come out; looking the which =2 is for polymer sequences and 1 is for full sequence
std::string sequenceRespresentation=result.add_gaps(gemmi::one_letter_code((*it_chains).get_polymer().extract_sequence()), 2);
yields
1orc+H 0 aligned:
--QRITLKDYAMRFGQTKTAKDLGVYQSAINKAIHAGRKIFLTINADGSVYAEEVKDGEVKPFPSN-----
1orc+H match_string:
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| cigar:2I64M5I
an enumeration for the 1 & 2 would help guys like me:-)