amino3to1 in protein-protein complexes
karlafej opened this issue · 3 comments
Some PDB files contain several protein chains with different amino acid sequences, for example 5mtn. It would be great if amino3to1 took this into account and returned something like a dictionary of chain_ids and the corresponding series of 1-letter codes.
At the moment, amino3to1 for 5mtn returns
SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYTSVSSST
although the residues in the pdb file are
>5mtn.pdb chain A
SLEPEPWFFK NLSRKDAERQ LLAPGNTHGS FLIRESESTA GSFSLSVRDF DQGEVVKHYK
IRNLDNGGFY ISPRITFPGL HELVRHYT
>5mtn.pdb chain B
SVSSVPTKLE VVAATPTSLL ISWDAPAVTV VYYLITYGET GSPWPGGQAF EVPGSKSTAT
ISGLKPGVDY TITVYAHRSS YGYSENPISI NYRT
Thanks for pointing this out @karlafej !
I haven't worked with multi-chain proteins in recent projects and completely forgot to include them in the test cases, which should definitely be addressed, as you said. I also just noticed another problem in the current implementation: it assumes unique residue numbers, which is a bad assumption for multi-domain cases ... I will fix that :).
About the values returned by the amino3to1 function: I think a dictionary could be a good idea, like you suggested, but I would favor returning a list of string sequences to preserve the order in which the chains appear in the PDB file.
For example, for 5mtn it would return
['SLEPEPWFFK...', 'SVSSVPTKLE...']
and the chain IDs could be obtained via
pdb.df['ATOM']['chain_id'].unique()
if desired. For instance, one could iterate
for sequence, chain_id in zip(amino3to1_results, pdb.df['ATOM']['chain_id'].unique()):
    pass  # do something
Alternatively, amino3to1 could return a list of tuples
[('A', 'SLEPEPWFFK...'), ('B', 'SVSSVPTKLE...')]
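To make the list-of-tuples idea concrete, here is a minimal sketch of how per-chain sequences might be assembled. It is not the current amino3to1 implementation: the function name `chain_sequences`, the `THREE_TO_ONE` table, and the input format (one `(chain_id, residue_number, residue_name)` tuple per atom, in file order) are assumptions made for illustration. Grouping on the chain ID together with the residue number avoids relying on residue numbers being unique across chains.

```python
from itertools import groupby

# Standard 20 amino acids; anything else maps to '?' in this sketch.
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}


def chain_sequences(atom_records):
    """Return [(chain_id, sequence), ...] in order of first appearance.

    atom_records: iterable of (chain_id, residue_number, residue_name)
    tuples, one per atom, in the order they appear in the file.
    """
    sequences = {}  # plain dicts keep insertion order (Python 3.7+)
    # Consecutive atoms sharing the same (chain, number, name) are one residue.
    for key, _atoms in groupby(atom_records):
        chain_id, _residue_number, residue_name = key
        letter = THREE_TO_ONE.get(residue_name, '?')
        sequences[chain_id] = sequences.get(chain_id, '') + letter
    return list(sequences.items())


# Toy input: two chains that reuse the same residue numbers.
records = [
    ('A', 1, 'SER'), ('A', 1, 'SER'), ('A', 2, 'LEU'),
    ('B', 1, 'SER'), ('B', 1, 'SER'), ('B', 2, 'VAL'),
]
result = chain_sequences(records)
print(result)
```

With a loaded structure, the records could presumably be built by zipping the chain_id column with the residue number and residue name columns of pdb.df['ATOM']; the exact names of the latter two columns are an assumption here. Note that this sketch ignores insertion codes and alternate locations.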
Any thoughts?
Thank you!