ppdb.amino3to1()
tbc01 opened this issue · 10 comments
It appears to drop duplicates from df['ATOM'] on 'residue_number' only, rather than on 'residue_number' together with 'insertion'.
Could you explain or illustrate that with an example, i.e., the current behavior vs. what you would expect? That would be very helpful.
EDIT: This is only an issue when reading the PDB from a file, not when you use fetch_pdb.
ppdb = PandasPdb().read_pdb(PDB_PATH + '2d7t.pdb')
sequence = ppdb.amino3to1()
''.join(sequence.loc[sequence['chain_id']=='H','residue_name'])
Note that residues 82A=R, 82B=R, and 82C=L are missing. Dropping duplicates from ppdb.df['ATOM'] on 'residue_number' and 'insertion' (rather than just 'residue_number') gives the correct sequence.
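For illustration, here is a minimal sketch of that difference on a toy table. It is not the actual biopandas implementation: `atoms` is a hand-made stand-in for ppdb.df['ATOM'] using the column names mentioned in this thread, and `one_letter_sequence` and `THREE_TO_ONE` are hypothetical helpers introduced only for this example.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM'] (assumption: two atom rows per residue).
# Residues 82A=ARG, 82B=ARG, 82C=LEU on chain H, as reported above.
atoms = pd.DataFrame({
    "chain_id":       ["H"] * 6,
    "residue_number": [82] * 6,
    "insertion":      ["A", "A", "B", "B", "C", "C"],
    "residue_name":   ["ARG", "ARG", "ARG", "ARG", "LEU", "LEU"],
})

THREE_TO_ONE = {"ARG": "R", "LEU": "L"}  # only the codes needed here


def one_letter_sequence(df, key_columns):
    """Keep the first atom row per residue key, then map to one-letter codes."""
    residues = df.drop_duplicates(subset=key_columns)
    return "".join(residues["residue_name"].map(THREE_TO_ONE))


# Key without the insertion code: 82A, 82B and 82C collapse into one residue.
seq_number_only = one_letter_sequence(atoms, ["chain_id", "residue_number"])

# Key with the insertion code: all three residues survive.
seq_with_insertion = one_letter_sequence(
    atoms, ["chain_id", "residue_number", "insertion"]
)

print(seq_number_only)     # R
print(seq_with_insertion)  # RRL
```

With the insertion code in the key, the three residues that share number 82 are kept apart, which matches the sequence expected for chain H.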
I see now ...
I think doing that on insertion codes as well may indeed be useful. In the case of 2d7t as hosted in the RCSB PDB (the online version), it wouldn't help though, because that file doesn't use insertion codes.
Identifying unique residues based on chain ID and residue name, as you did, instead of on just the residue number might work though. I could easily add something like this to the amino3to1() function if that's helpful.
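One caveat that can be checked on a toy table: a key built from the residue name separates 52 from 52A (different amino acids), but it would still merge 82A and 82B, since both are arginine. The sketch below uses one row per residue with the residue names mentioned in this thread; putting 52/52A on chain H is an assumption made only for this example, and the code is not taken from biopandas.

```python
import pandas as pd

# One row per residue. Names follow the thread (52=N, 52A=P, 82A=R, 82B=R, 82C=L);
# the chain assignment for 52/52A is an assumption for illustration.
residues = pd.DataFrame({
    "chain_id":       ["H", "H", "H", "H", "H"],
    "residue_number": [52, 52, 82, 82, 82],
    "insertion":      ["", "A", "A", "B", "C"],
    "residue_name":   ["ASN", "PRO", "ARG", "ARG", "LEU"],
})

# Key on the residue name: 82A and 82B (both ARG) are treated as duplicates.
by_name = residues.drop_duplicates(
    subset=["chain_id", "residue_number", "residue_name"]
)

# Key on the insertion code: every residue is kept.
by_insertion = residues.drop_duplicates(
    subset=["chain_id", "residue_number", "insertion"]
)

print(list(by_name["residue_name"]))       # ['ASN', 'PRO', 'ARG', 'LEU']
print(list(by_insertion["residue_name"]))  # ['ASN', 'PRO', 'ARG', 'ARG', 'LEU']
```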
I remember now: the reason the insertion codes are in there is that I downloaded these files from the SAbDab database: http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/ ... I used the files that have been renumbered to Chothia numbering, which includes insertion codes.
Here is an example (I had to rename it to .txt) of the file I was talking about, where the current functionality appears to skip the 3 amino acids I discussed above:
2d7t.txt
I think if insertion codes are present, it would certainly be useful to consider them. I will modify the 3to1 method to include these, which should then at least help with some of the files. If you have any additional suggestions, please let me know!
That should do it. Thanks!
Let me reopen the issue until the PR is merged so we won't forget
Just to make sure that I got it right, the sequence you'd expect for the following code on your downloaded file would be
ppdb = PandasPdb().read_pdb('2d7t.pdb')
sequence = ppdb.amino3to1()
sequence[50:60]['residue_name']
383 I
391 N
399 P
406 K
415 S
421 G
425 D
433 T
440 N
448 Y
? Note that there is an insertion code after residue 52, so residue 52A is supposed to be the proline in the third position.
Correct: currently
ppdb = PandasPdb().read_pdb('2d7t.pdb')
sequence = ppdb.amino3to1()
sequence[50:60]['residue_name']
Generates:
383 I
391 N
406 K
415 S
421 G
425 D
433 T
440 N
448 Y
460 A
So it looks like what you have now is good to go.
Alright, I just made a minor version update (0.2.5) that includes the fix; it should be available from PyPI now. Thanks for the hint!