nrbennet/dl_binder_design

Different results from the AF2 initial guess for the same protein

jongseo-park opened this issue · 6 comments

Hi,

I attempted to create a PyRosetta-free version of this tool. The modified code works, but I encountered an issue.

For structure modeling, the original tool uses PyRosetta's dump_pdb to generate side chains for the ProteinMPNN-designed sequence (Line 125 in af2_util.py). I replaced dump_pdb with OpenMM's addResidue, obtained full-atom structures, and the initial guess ran fine.

However, the modeling result (pAE) is approximately 7-8 for the structure generated with dump_pdb, but about 27-28 for the one generated with addResidue. The only differences between the two structures are the side-chain atom coordinates; the backbone atoms are identical.

Moreover, when I substituted that temporary structure with one whose side chains were re-packed using Rosetta, the resulting pAE was also around 27-28, indicating a non-binder.

I am curious whether the PyRosetta-generated side-chain coordinates are essential for the AF2 initial guess.

Hi, the code assumes that your structure is numbered in a certain way and that the binder and target are in a certain order. I would check that the outputs of dump_pdb and addResidue are identical at the backbone level. AF2 recycling ignores side chains, so the side chains you specify should not change the result.
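To make that check concrete, here is a minimal sketch (mine, not part of the original reply) that compares the N/CA/C/O coordinates of two PDB files; the file names are placeholders and the parsing assumes standard fixed-column ATOM records.

import numpy as np

BACKBONE = {'N', 'CA', 'C', 'O'}

def backbone_coords(pdb_path):
    # Collect backbone-atom xyz coordinates from fixed-column ATOM records
    coords = []
    with open(pdb_path) as fh:
        for line in fh:
            if line.startswith('ATOM') and line[12:16].strip() in BACKBONE:
                coords.append([float(line[30:38]), float(line[38:46]), float(line[46:54])])
    return np.array(coords)

a = backbone_coords('fullatom_from_dump_pdb.pdb')    # placeholder file names
b = backbone_coords('fullatom_from_addResidue.pdb')
print(a.shape == b.shape and np.allclose(a, b, atol=1e-3))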

Thank you for your response.

The residue numbers have been assigned sequentially, and the chain order has been properly designated.

Upon reviewing the backbone atoms, I found that the xyz coordinates (N, CA, C, O) were consistent in both cases.

I have attached sample PDB files generated from OpenMM and PyRosetta. These are not my intended binder design targets but have been used solely for atom verification purposes.

omm_pyr_PDBs.zip



Code for generating the PDBs:

from pyrosetta import *
from pyrosetta.rosetta import *
init('-in:file:silent_struct_type binary -mute all')

from pdbfixer import PDBFixer
from openmm.app import PDBFile


# OpenMM / PDBFixer: rebuild missing (side-chain) atoms and write a full-atom PDB
def omm_dump_pdb(inp, out):
    fixer = PDBFixer(filename=inp)

    fixer.findMissingResidues()
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()
    fixer.addMissingHydrogens(7.0)

    with open(out, 'w') as fh:
        PDBFile.writeFile(fixer.topology, fixer.positions, fh, keepIds=True)


# PyRosetta: load the backbone-only structure and dump a full-atom PDB
def pyr_dump_pdb(inp, out):
    pose = pose_from_file(inp)
    pose.dump_pdb(out)


inp = '1L58_bb.pdb'
omm_out = '1L58_fullatom_omm.pdb'
pyr_out = '1L58_fullatom_pyr.pdb'

omm_dump_pdb(inp, omm_out)
pyr_dump_pdb(inp, pyr_out)



In my binder design task ...


Thanks for sending these structures, that is helpful. Those structures look correct. Can you send an example of a complex output from writer?

This is an example file that shows significant differences in pAE values between the OpenMM and PyRosetta versions.

I am attaching two processed full-atom files resulting from the above code.

Those both look correct. That's strange. If you want to get to the bottom of this, I would assert that the actual features going into the model are the same in each case. Differences there will point you in the correct direction.
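One way to run that comparison, as a rough sketch (mine, not code from the repo, assuming each pipeline is modified to np.savez() its AF2 input features, e.g. all_atom_positions and the initial guess, right before prediction):

import numpy as np

pyr = dict(np.load('features_pyrosetta.npz'))   # hypothetical dump from the PyRosetta pipeline
omm = dict(np.load('features_openmm.npz'))      # hypothetical dump from the OpenMM pipeline

for key in sorted(set(pyr) | set(omm)):
    if key not in pyr or key not in omm:
        print(f'{key}: present in only one feature set')
    elif pyr[key].shape != omm[key].shape:
        print(f'{key}: shape mismatch {pyr[key].shape} vs {omm[key].shape}')
    elif np.issubdtype(pyr[key].dtype, np.number):
        if not np.allclose(pyr[key], omm[key], atol=1e-3):
            print(f'{key}: max abs difference {np.abs(pyr[key] - omm[key]).max():.4f}')
    elif not np.array_equal(pyr[key], omm[key]):
        print(f'{key}: values differ')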

Upon reviewing the numpy array all_atom_positions, I found discrepancies not in the backbone coordinates but in the side-chain coordinates. Consequently, this variance also influences the initial_guess values.

Initially, I regarded these differences as subtle and overlooked their significance. However, it has become evident that the disparities in side-chain coordinates contribute significantly to the contrasting outcomes.

After loading the np.array of all_atom_positions originating from the PyRosetta version into the OpenMM version of this tool, the pAE values were nearly identical in both cases.
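For reference, the cross-check itself was as simple as the following sketch (assumed workflow, not code from the repo): save the array on the PyRosetta side and load it on the OpenMM side before the AF2 features are assembled.

import numpy as np

# In the PyRosetta-based pipeline, right after all_atom_positions is built:
#     np.save('all_atom_positions_pyr.npy', all_atom_positions)

# In the OpenMM-based pipeline, before the AF2 feature dict is constructed:
all_atom_positions = np.load('all_atom_positions_pyr.npy')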

I'm perplexed that such seemingly minor discrepancies in the side chains (or are they substantial from the model's point of view?) can have such a large influence on the AlphaFold modeling results.

Furthermore, I'm uncertain which of the two sets of pAE values should be trusted. Given that your PyRosetta-based code has been tested across multiple scenarios, would it be more prudent to trust its results?