dauparas/ProteinMPNN

Indexing with Missing Residues

brucejwittmann opened this issue · 4 comments

Hey! Another quick question: How does indexing work for ProteinMPNN when there are missing residues present in the PDB file? It looks like it will add "X" to the output sequences where it identifies gaps in the residue indexing, but how is that reflected in indexing? For instance, if I had a PDB file with 2 residues present but 5 missing in between them, I would get sequences of the below format out:

NXXXXXN

The first "N" will have index 1, but what about the second? Is it "2" (because it's the second residue in the PDB file) or "7" (because it's the seventh index in the PDB file if we account for the missing residues). I'm getting some weird outputs when I tie residues together assuming the first indexing scheme (where the Ns are indices 1 and 2 rather than 1 and 7).

Extending the above, what if I have missing residues at the beginning or end of the PDB sequence? For instance, say in the same example above I have 2 residues missing from the beginning and 3 missing from the end in addition to the 3 in the middle. I believe I would get exactly the same result as the first case (NXXXXXN), but would the indices of the first "N" change? Would it be "3" now to account for the added residues at the front?

From what I've seen the second "N" would be at position 2. What I do is I just replace the "X" with "" and then count.

sequence_no_gaps = sequence_with_gaps.replace("X","")

That said, I forget if the index starts at 0 or at 1. I think it starts at 0. So

--position_list "0 1"

rather than

--position_list "1 2"

As a sanity check I always use a python script to go through the output .fa file and calculate the changes. A good Python library for that is from collections import Counter.

For example,

>>> from collections import Counter
>>> word = "mississippi"
>>> counter = {}

>>> for letter in word:
            counter[letter] = counter.get(letter, 0) + 1

>>> counter
{'m': 1, 'i': 4, 's': 4, 'p': 2}

Some cool code to handle everything: https://huggingface.co/spaces/simonduerr/ProteinMPNN/blob/main/app.py#L382

protein_mpnn_utils.py, line 125; I commented out a few of the places where 'gaps' are being added for sequence numbering.
Becomes this:

try:
    for resn in range(min_resn,max_resn+1):
      if resn in seq:
        for k in sorted(seq[resn]): seq_.append(aa_3_N.get(seq[resn][k],20))
      #else: seq_.append(20)
      if resn in xyz:
        for k in sorted(xyz[resn]):
          for atom in atoms:
            if atom in xyz[resn][k]: xyz_.append(xyz[resn][k][atom])
            else: xyz_.append(np.full(3,np.nan))
      #else:
      #  for atom in atoms: xyz_.append(np.full(3,np.nan))
    print("Seq_", seq_)

Thanks everyone. @jadolfbr, based on the code you referenced, it looks like the logic is as follows for indexing:

  1. Find the smallest and largest residue indices in the PDB file, extracting from the coordinates section.
  2. Scan over the range of integers between the smallest and largest integers, recording masks/pads for missing residues.

Based on that logic, in my example with "NXXXXXN", the first "N" would be index "1" and the second index "7" assuming that there were 5 indices between them. If I were to add missing residues at the beginning or end, I would get the same indexing scheme as the first case as those residues will not show up in the coordinates section of the PDB file.

@tony-res, I believe indexing should start at 1 for ProteinMPNN. If you look at example 4 you can see that noted in a comment. Also you can see when masking fixed positions that "1" is subtracted from the input array.

I think it's important to note that the index in the code @jadolfbr copied is not the same as the index for freezing positions. The index in the copied code is direct from the PDB file, while the indices in the example I linked are 1-indexed to the relative position of a residue in the file.