ppdb.amino3to1()

Question

tbc01 opened this issue 5 years ago · 10 comments

amino3to1() appears to drop duplicates from df['ATOM'] on 'residue_number' only, rather than also including 'insertion'.

Answer 1 · 2019-07-09T17:12:48.000Z

Could you explain or illustrate that with an example, i.e., the current behavior vs. what you would expect? That would be very helpful.

Answer 2 · 2019-07-09T17:49:38.000Z

EDIT: This is only an issue when reading the PDB from a file, not when you use fetch_pdb.
from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb(PDB_PATH + '2d7t.pdb')  # PDB_PATH: directory containing the file
sequence = ppdb.amino3to1()
''.join(sequence.loc[sequence['chain_id']=='H','residue_name'])
Note that residues 82A=R, 82B=R, and 82C=L are missing. Dropping duplicates from ppdb.df['ATOM'] on both 'residue_number' and 'insertion' (rather than just 'residue_number') gives the correct sequence.
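To illustrate the difference, here is a minimal sketch on a toy DataFrame that stands in for ppdb.df['ATOM']. The column names match the ones used above, but the rows are made up for illustration (only 82A=R, 82B=R, 82C=L come from this thread), and this is not the actual amino3to1() implementation.

```python
import pandas as pd

# Toy stand-in for ppdb.df['ATOM']: two atoms per residue on chain H.
# Residues 82A, 82B, 82C share residue_number 82 and differ only in 'insertion'.
# The residue names for 81, 82, and 83 are invented for this example.
rows = [
    ("H", 81, "", "GLN", "N"), ("H", 81, "", "GLN", "CA"),
    ("H", 82, "", "MET", "N"), ("H", 82, "", "MET", "CA"),
    ("H", 82, "A", "ARG", "N"), ("H", 82, "A", "ARG", "CA"),
    ("H", 82, "B", "ARG", "N"), ("H", 82, "B", "ARG", "CA"),
    ("H", 82, "C", "LEU", "N"), ("H", 82, "C", "LEU", "CA"),
    ("H", 83, "", "THR", "N"), ("H", 83, "", "THR", "CA"),
]
atoms = pd.DataFrame(
    rows,
    columns=["chain_id", "residue_number", "insertion", "residue_name", "atom_name"],
)

# Only the residue names used in this toy example.
three_to_one = {"GLN": "Q", "MET": "M", "ARG": "R", "LEU": "L", "THR": "T"}

# Keying on residue_number alone collapses 82, 82A, 82B, 82C into one residue.
by_number = atoms.drop_duplicates(subset=["chain_id", "residue_number"])
seq_by_number = "".join(by_number["residue_name"].map(three_to_one))

# Adding 'insertion' to the key keeps one row per distinct residue.
by_both = atoms.drop_duplicates(subset=["chain_id", "residue_number", "insertion"])
seq_by_both = "".join(by_both["residue_name"].map(three_to_one))

print(seq_by_number)
print(seq_by_both)
```

With the insertion column in the key, the three inserted residues survive the de-duplication; without it, they are dropped.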

Answer 3 · 2019-07-09T18:17:08.000Z

I see now ...

I think doing that on insertion codes as well may indeed be useful. In the case of 2d7b (the online version in the RCSB PDB), it wouldn't help, though, because that entry doesn't use insertion codes.

Instead of identifying unique residues based on just the residue number, doing it based on chain ID and residue name, as you did, might work, though. I could easily add something like this to the amino3to1() function if that's helpful.

Answer 4 · 2019-07-09T18:24:46.000Z

I remember now: the reason the insertion codes are in there is that I downloaded these files from the SAbDab database: http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/ ... I used the files that have been renumbered to Chothia numbering, which includes insertion codes.

Here is an example (I had to rename it to .txt) of the file I was talking about, where the current functionality appears to skip the three amino acids I discussed above:
2d7t.txt
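For context, in the fixed-width PDB format the insertion code is the single character in column 27 of an ATOM record, directly after the four-column residue sequence number (columns 23-26). A minimal sketch of reading those fields follows; make_atom_line and parse_residue_id are hypothetical helpers written for this example, not part of biopandas, and the residue name for plain 82 is made up.

```python
def make_atom_line(serial, atom_name, res_name, chain_id, res_seq, i_code=""):
    """Build the first 30 columns of an ATOM record (hypothetical helper).

    Columns (1-indexed): 1-6 record name, 7-11 serial, 13-16 atom name,
    18-20 residue name, 22 chain ID, 23-26 residue number, 27 insertion code.
    """
    return (
        f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} "
        f"{chain_id:1}{res_seq:>4}{i_code:1}   "
    )


def parse_residue_id(line):
    """Return (chain_id, residue_number, insertion, residue_name) for one ATOM line."""
    chain_id = line[21]            # column 22
    res_seq = int(line[22:26])     # columns 23-26
    i_code = line[26].strip()      # column 27, blank when there is no insertion
    res_name = line[17:20].strip() # columns 18-20
    return chain_id, res_seq, i_code, res_name


lines = [
    make_atom_line(1, "CA", "SER", "H", 82),       # plain residue 82 (name invented)
    make_atom_line(2, "CA", "ARG", "H", 82, "A"),  # inserted residue 82A
]
ids = [parse_residue_id(line) for line in lines]
print(ids)
```

Both records carry residue number 82, so only the insertion field distinguishes them.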

Answer 5 · 2019-07-09T18:57:39.000Z

I think if insertion codes are present, it would certainly be useful to consider them. I will modify the 3to1 method to include these, which should then at least help with some of the files. If you have any additional suggestions, please let me know!

Answer 6 · 2019-07-09T19:03:10.000Z

That should do it. Thanks!

Answer 7 · 2019-07-09T19:03:49.000Z

Let me reopen the issue until the PR is merged so we won't forget

Answer 8 · 2019-07-09T19:23:24.000Z

Just to make sure that I got it right, the sequence you'd expect for the following code on your downloaded file would be

ppdb = PandasPdb().read_pdb('2d7t.pdb')
sequence = ppdb.amino3to1()
sequence[50:60]['residue_name']

? Note that there is an insertion code after residue 52, so residue 52A is supposed to be the proline at the third position.

Answer 9 · 2019-07-09T19:30:38.000Z

Correct: currently
ppdb = PandasPdb().read_pdb('2d7t.pdb')
sequence = ppdb.amino3to1()
sequence[50:60]['residue_name']

Generates:
383    I
391    N
406    K
415    S
421    G
425    D
433    T
440    N
448    Y
460    A

So looks like what you have now is good to go.

Answer 10 · 2019-07-09T20:30:17.000Z

Alright. I just made a minor version update (0.2.5) that includes the fix; it should be available from PyPI now. Thanks for the hint!