BioPandas/biopandas

ppdb.amino3to1()

tbc01 opened this issue · 10 comments

tbc01 commented

It appears to drop duplicates from df['ATOM'] on 'residue_number' only, rather than also including 'insertion'.

rasbt commented

Could you explain or illustrate that with an example, i.e., the current behavior vs. what you would expect? That would be very helpful.

tbc01 commented

EDIT: This is only an issue when reading the PDB from a file, not when using fetch_pdb.
from biopandas.pdb import PandasPdb
# PDB_PATH is the directory that holds the downloaded file
ppdb = PandasPdb().read_pdb(PDB_PATH + '2d7t.pdb')
sequence = ppdb.amino3to1()
''.join(sequence.loc[sequence['chain_id'] == 'H', 'residue_name'])
Note that residues 82A=R, 82B=R, and 82C=L are missing. Dropping duplicates from ppdb.df['ATOM'] on both 'residue_number' and 'insertion' (rather than just 'residue_number') gives the correct sequence.
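To illustrate the difference, here is a minimal sketch on a toy table. The column names mirror ppdb.df['ATOM'], but the rows are made up for illustration (in particular, the residue names at 82 and 83 are assumptions, not values read from 2d7t), and this is not the library's internal code.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM'] with several atom rows per residue.
# Illustrative values only; not taken from the actual 2d7t file.
atoms = pd.DataFrame({
    "chain_id": ["H"] * 8,
    "residue_number": [82, 82, 82, 82, 82, 82, 83, 83],
    "insertion": ["", "", "A", "A", "B", "C", "", ""],
    "residue_name": ["MET", "MET", "ARG", "ARG", "ARG", "LEU", "THR", "THR"],
})

# Keying on the residue number alone collapses 82, 82A, 82B, and 82C into one row.
by_number = atoms.drop_duplicates(subset=["chain_id", "residue_number"])

# Adding the insertion code to the key keeps each inserted residue.
by_number_and_insertion = atoms.drop_duplicates(
    subset=["chain_id", "residue_number", "insertion"]
)

print(by_number["residue_name"].tolist())
print(by_number_and_insertion["residue_name"].tolist())
```

With the residue number as the only key, the three inserted residues disappear; with the insertion code included, all five residues survive.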

rasbt commented

I see now ...

I think taking insertion codes into account as well may indeed be useful. In the case of 2d7b (the online version in the RCSB PDB repository), it wouldn't help, though, because that file doesn't use insertion codes.

Instead of identifying unique residues based on just the residue number, doing so based on chain ID and residue name, as you did, might work though. I could easily add something like this to the amino3to1() function if that's helpful.

tbc01 commented

I remember now: the reason the insertion codes are in there is that I downloaded these files from the SAbDab database: http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/ ... I used the files that have been renumbered to Chothia numbering, which includes insertion codes.

Here is the file I was talking about (I had to rename it to .txt), where the current functionality appears to skip the 3 amino acids I discussed above:
2d7t.txt

rasbt commented

I think if insertion codes are present, it would certainly be useful to consider them. I will modify the 3to1 method to include these, which should then at least help with some of the files. If you have any additional suggestions, please let me know!
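A rough sketch of what considering insertion codes could look like, not the actual BioPandas implementation: `residues_3to1` and the partial `THREE_TO_ONE` map are hypothetical names, and the residue numbers in the toy table are assumed for illustration.

```python
import pandas as pd

# Partial three-letter to one-letter map; only the residues used below.
THREE_TO_ONE = {"ILE": "I", "ASN": "N", "PRO": "P", "LYS": "K"}


def residues_3to1(atom_df):
    """Hypothetical helper: one row per residue, keyed on chain ID,
    residue number, and insertion code."""
    key = ["chain_id", "residue_number", "insertion"]
    residues = atom_df.drop_duplicates(subset=key)[["chain_id", "residue_name"]].copy()
    # Unknown residue names become '?' instead of NaN.
    residues["residue_name"] = residues["residue_name"].map(THREE_TO_ONE).fillna("?")
    return residues


# Toy atom table with an inserted residue 52A (numbering assumed).
atoms = pd.DataFrame({
    "chain_id": ["H"] * 6,
    "residue_number": [51, 51, 52, 52, 52, 53],
    "insertion": ["", "", "", "", "A", ""],
    "residue_name": ["ILE", "ILE", "ASN", "ASN", "PRO", "LYS"],
})

seq = residues_3to1(atoms)
print("".join(seq["residue_name"]))
```

Because drop_duplicates keeps the first row of each group, the returned index points at the first atom of every residue, and the inserted residue 52A is retained.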

tbc01 commented

That should do it. Thanks!

rasbt commented

Let me reopen the issue until the PR is merged so we won't forget.

rasbt commented

Just to make sure that I got it right, the sequence you'd expect for the following code on your downloaded file would be

ppdb = PandasPdb().read_pdb('2d7t.pdb')
sequence = ppdb.amino3to1()
sequence[50:60]['residue_name']
383    I
391    N
399    P
406    K
415    S
421    G
425    D
433    T
440    N
448    Y

Is that right? Note that the file has an insertion code after residue 52, so residue 52A is supposed to be the proline in the third position.

tbc01 commented

Correct: currently
ppdb = PandasPdb().read_pdb('2d7t.pdb')
sequence = ppdb.amino3to1()
sequence[50:60]['residue_name']

Generates:
383    I
391    N
406    K
415    S
421    G
425    D
433    T
440    N
448    Y
460    A

So it looks like what you have now is good to go.

rasbt commented

Alright, I just made a minor version update (0.2.5) that includes the fix; it should be available from PyPI now. Thanks for the hint!