chaidiscovery/chai-lab

Making CIF metrics usable

Opened this issue ยท 13 comments

We recently switched from PDB > CIF with two main motivations:

  • store metrics (like pLDDT) without abusing b-factor, and hopefully store pairwise interaction metrics as well (PAE etc)
  • better compatibility with current/future software pipelines (but see #49)

We followed recommendation to use modelcif, and specifically _ma_qa_metric, but I don't see a way to color by this field in pymol; and don't see pairwise metrics in pymol either.

I need help figuring out what's the right pipeline for users.

@aozalevsky @brindakv recommendations are very welcome

afaik, direct support of _ma_qa_metric is implemented in Mol* (which you already know since you're using it in your web service) and ChimeraX. In Mol* pLDDT is automatically recognized as one of the Validation attributes and colored according to pLDDT Confidence colorscheme. In ChimeraX you can color a model using the following command:

color by pLDDT_score atoms palette alphafold

where pLDDT comes from a _ma_qa_metric and _score is added later in the code as a label.

All scripts for pLDDT coloring for PyMOL rely on the _atom_site.B_iso_or_equiv (or simply B-factors) field since it's been a common (though hacky) practice. DeepMind is also populating both _ma_qa_metric and _atom_site.B_iso_or_equiv for backwards compatibility. Seems that you also opted for a similar strategy.

As for PyMOL, seems like only the minimal useful information from PDBx/mmCIF and BinaryCIF is extracted. I wonder if @JarrettSJohnson would be willing to chime in. Would be really nice to have at least per-residue (local) scores from _ma_qa_metric supported in all major visualization packages.

As for PAE, it's already supported in the ModelCIF as _ma_qa_metric_local_pairwise, but judging from the repos, none of the aforementioned tools support reading it from the CIF file. ChimeraX already has PAE visualization (as downloaded from AFDB) tools in place. Maybe @tomgoddard can tell if the current _ma_qa_metric metric support can be further extended for reading and visualizing pairwise (PAE) scores.

The low-level support for ModelCIF (including reading/writing metrics) is already implemented in the https://github.com/ihmwg/python-modelcif

Showing _ma_qa_metric_local_pairwise in ChimeraX is not related to the capability added by @e-pettersen in April 2023 to show per-residue attributes from _ma_qa_metric since a residue-pair score would be handled entirely differently than a score on individual residues. But I have used the AlphaFold PAE plot to show residue-residue distances as shown here

 https://rbvi.github.io/chimerax-recipes/rrdist/rrdist.html

That code just rewrote the distances in AlphaFold PAE JSON file format and read the file. I could possibly try some Python code in ChimeraX that reads the _ma_qa_metric_local_pairwise table from a ModelCIF entry and show it with the AlphaFold PAE plot in ChimeraX. Could you point me to a ModelCIF file that has that table? The dictionary description of the table suggests it has all the information needed to display like PAE data.

@arogozhnikov @tomgoddard i took one of the examples from the Chai-1 webserver and generated a CIF file with embedded PAEs.

There are a few comments, though:

  • my example has only protein-protein PAEs. The reason is that models.qa_metric.LocalPairwise is defined on the per-residue level, but as @arogozhnikov pointed out, they use per-atom PAEs for ligands and non-canonical amino acids/nucleotides. I've opened the issue ihmwg/python-modelcif#38

  • there are no dictionary requirements/definitions related to the format of PAEs. Technically, a PAE matrix should be a symmetric square matrix. Still, at the moment, there are no guidelines on whether the full matrix or only the upper triangular part has to be populated. It also seems that only selected scores could populated (as in my example - i ignored protein-ligand PAEs). To continue the discussion I've created ihmwg/ModelCIF#19
    I

The ChimeraX PAE plotting for AlphaFold 2 and AlphaFold 3 assumes that the matrix rows and columns include all residues in order and for AlphaFold 3 all atoms for all non-standard residues. I believe in AlphaFold 3 the PAE files contain residues numbers, chain ids, not sure about atom names, for each PAE row, but ChimeraX is not using those because the ChimeraX code was written for early AlphaFold 2 PAE files that had none of that information labeling the matrix rows. Also the PAE matrix is not symmetric in AlphaFold since the PAE value is defined by aligning the row residue/atom and then giving an error for the placement of the column residue/atom, so it is by definition not symmetric. So ChimeraX has no provision to handle just one triangle of a symmetric matrix. All these ChimeraX PAE limitations could be fixed for ModelCIF. My purpose in this comment is just to explain how ChimeraX PAE plotting is currently based entirely on AlphaFold (and ESMFold) PAE conventions.

First of all let me state that it is great to see the interest in properly storing the quality metrics using ModelCIF. Given my involvement in ModelCIF and ModelArchive (MA), I can hopefully provide some useful input here.

  1. Here are two somewhat representative examples for the use of ma_qa_metric_local_pairwise in MA: ma-bak-cepc-0944, ma-dm-hisrep-003. In all cases in MA, the local pairwise scores are extracted into a separate file to keep the main ModelCIF file at a manageable size. The ModelCIF categories ma_associated_archive_file_details and ma_entry_associated_files cleanly define the location of that extra file (via item file_content == "local pairwise QA scores"). The python-modelcif library can automatically split ModelCIF files with pairwise scores when dumping them (see e.g. the code used to create the ma-dm-hisrep ModelCIF files).
  2. In terms of providing data, mmCIF generally does not guarantee that data is provided for each atom, each residue, each residue-pair or anything else (unless it's mandatory data according to the dictionary definition). Similarly, there are no requirements on the order of the data. So ma_qa_metric_local_pairwise can definitely contain only a subset of the PAE matrix and we had such cases in ModelArchive. The two examples provided above cover both cases with ma-bak-cepc-0944 only storing inter-chain pairwise scores and ma-dm-hisrep-003 storing the full PAE matrix.
  3. In terms of ensuring that ModelCIF files are valid (which is unfortunately not always the case with AlphaFold output), we have a validation tool based on RCSB's CifCheck tool which we highly recommend to use.
  4. In terms of supporting ModelCIF and MA PAE data in ChimeraX I wanted to write a long time ago to suggest this as an improvement but somehow forgot to do it. My bad. I will write to chimerax-users@cgl.ucsf.edu with further info asap (update: link to thread in ChimeraX-users).
  5. To support the new quality estimates introduced with AF3, we will need to extend ModelCIF first (as was already observed in earlier comments). I will make a suggestion for it in https://github.com/ihmwg/ModelCIF/issues (see ihmwg/ModelCIF#21). In a nutshell, I believe that it will require new allowed values in _ma_qa_metric.type and _ma_qa_metric.mode and new categories for per-chain, per-chain-pair, per-atom, and per-atom-pair scores. For the latter, I am uncertain how to best handle per-residue tokens and would be happy for your input. My suggestion would be to simply only use the "token centre atom" (as in 2.6 of Supplementary information for AlphaFold 3; CA for standard amino acids, C1' for standard nucleotides) for per-residue tokens and hence only sparsely populate the per-atom-pair scores table. The alternative would require a mix of per-atom-pair and per-residue-pair data which I think would be more confusing I think.

ChimeraX plots the AlphaFold 3 PAE scores and handles its combination of residue and atom level values (one PAE value per residue for standard amino and nucleic acids, and one PAE value per atom for all other atoms). Specifically it knows that the per-residue score is a residue score, not just a score for the C-alpha atom. If you treat it as a C-alpha atom score you lose some knowledge about the meaning of the score. I agree it is messy to handle though.

I made an example ChimeraX command to plot pairwise residue scores included directly in ModelCIF structure files described here

https://rbvi.github.io/chimerax-recipes/modelcif_pae/modelcif_pae.html

Yeah the "meaning of AlphaFold 3 PAE scores" is a tricky one here. In terms of what is being predicted, it relates only to the CA/C1' atoms for standard amino and nucleic acids. But: I could make the same argument about per-residue pLDDT and PAE in AF2. In terms of interpretation of the score (i.e. making it useful), I of course agree that PAE values for standard amino and nucleic acids relate to the whole residue and not just to CA/C1' atoms.

Anyway I updated my suggested update of ModelCIF (ihmwg/ModelCIF#21) with an idea how one could represent QE metrics which are listed in separate ModelCIF tables. I.e. the same PAE metric would then have values in _ma_qa_metric_per_atom_pairwise and _ma_qa_metric_local_pairwise. From the perspective of the PAE writers and parsers this option would then also mean that data needs to be written in separate tables in separate ways for per-residue-pair and per-atom-pair PAEs.

A simple idea for making the mmCIF tables handle pairwise scores between atoms and residues is to add two atom name fields to the _ma_qa_metric_local_pairwise table

https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Categories/ma_qa_metric_local_pairwise.html

If the atom name was given as "." then that would mean the score applies to the whole residue, while if an atom is named e.g. "CA" then the score is associated with that atom. This allows a pair score to be between an atom and a residue as it is done in AlphaFold 3. I think it will be important to handle AlphaFold 3 PAE scores. They will probably be the most heavily used pairwise scores in the next few years.

If the atom name was given as "." then that would mean the score applies to the whole residue, while if an atom is named e.g. "CA" then the score is associated with that atom. This allows a pair score to be between an atom and a residue as it is done in AlphaFold 3. I think it will be important to handle AlphaFold 3 PAE scores. They will probably be the most heavily used pairwise scores in the next few years.

we have a somewhat similar concept in ihm for crosslinks (which are also, typically, are pairwise restraints) _ihm_cross_link_restraint.model_granularity. Theoretically, similar modifiers could be applied for both ends of a restraint/score. However, maintaining the by-feature level (basically any custom selection other than standard things like an atom, full residue, or a full asym_unit) is tricky and requires multiple additional tables.

Agree with the above. Supporting AF3 PAE will be important and we need to make sure that whatever we implement can handle pairs between single atoms and full residues. I will update the suggestion in ihmwg/ModelCIF#21 accordingly. One worry of adding more columns to ma_qa_metric_local_pairwise is that it makes it more bulky while a pure per-atom table could be linked to atom_site.id instead. Additionally the current per-residue identification would not be suitable to handle QE for water molecules (unless we deviate from PDB standards and give each water molecule a separate label_asym_id) which are relevant for physics-based docking tools.