ihmwg/ModelCIF

Extension of ModelCIF for AF3 quality estimates

Closed this issue · 9 comments

Related to #20 and the issues mentioned there, I suggest extending ModelCIF to capture all new types of quality estimates introduced with AlphaFold 3 (AF3). I also had a look at RoseTTAFold-AllAtom, and the suggestions below would capture anything needed there as well. I believe this should also cover anything needed for chaidiscovery/chai-lab#52. Here are my suggested additions:

  1. Extend _ma_qa_metric.type to include:
    • "pLDDT to polymer" with detailed description "confidence score predicting accuracy according to lDDT with distances from each atom to CA or C1' of nearby polymer residues in [0,100]"
    • "boolean" with detailed description "0 or 1 depending on whether a check passed (1) or not (0)."
  2. Extend _ma_qa_metric.mode to include "per-chain", "per-chain-pairwise", "per-atom" and "per-atom-pairwise" (and yes I know it's a bit unfortunate that we used "local" for "per-residue" but ok...)
  3. New _ma_qa_metric_per_chain same as _ma_qa_metric_local but without label_comp_id and label_seq_id
  4. New _ma_qa_metric_per_chain_pairwise same as _ma_qa_metric_local_pairwise but without label_comp_id* and label_seq_id*
  5. New _ma_qa_metric_per_atom same as _ma_qa_metric_local but using atom_id (linked to _atom_site.id) instead of model_id and label_*
  6. New _ma_qa_metric_per_atom_pairwise same as _ma_qa_metric_local_pairwise but using atom_id_1 and atom_id_2 (linked to _atom_site.id) instead of model_id and label_*

Concretely for AF3 output (e.g. looking at the JSON files in one of their examples) here is how each of the scores would map to a _ma_qa_metric.mode and .type:

  • fraction_disordered: "global", "normalized score"
  • has_clash: "global", "boolean"
  • iptm: "global", "ipTM"
  • ptm: "global", "pTM"
  • ranking_score: "global", "normalized score"
  • chain_ptm: "per-chain", "pTM"
  • chain_iptm: "per-chain", "ipTM"
  • chain_pair_iptm: "per-chain-pairwise", "ipTM"
  • chain_pair_pae_min: "per-chain-pairwise", "PAE"
  • atom_plddts: "per-atom", "pLDDT to polymer"
  • contact_probs: "per-atom-pairwise", "contact probability"
  • pae: "per-atom-pairwise", "PAE"
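
The mapping above could be kept as a small lookup table in a converter script. This is only a sketch: the names `AF3_SCORE_TO_QA_METRIC` and `qa_metric_for` are hypothetical, and the mode and type strings are the ones proposed in this comment, not values that already exist in the ModelCIF dictionary.

```python
# Hypothetical lookup table: AF3 JSON key -> (proposed _ma_qa_metric.mode, _ma_qa_metric.type).
AF3_SCORE_TO_QA_METRIC = {
    "fraction_disordered": ("global", "normalized score"),
    "has_clash": ("global", "boolean"),
    "iptm": ("global", "ipTM"),
    "ptm": ("global", "pTM"),
    "ranking_score": ("global", "normalized score"),
    "chain_ptm": ("per-chain", "pTM"),
    "chain_iptm": ("per-chain", "ipTM"),
    "chain_pair_iptm": ("per-chain-pairwise", "ipTM"),
    "chain_pair_pae_min": ("per-chain-pairwise", "PAE"),
    "atom_plddts": ("per-atom", "pLDDT to polymer"),
    "contact_probs": ("per-atom-pairwise", "contact probability"),
    "pae": ("per-atom-pairwise", "PAE"),
}


def qa_metric_for(af3_key):
    """Return the proposed (mode, type) pair for an AF3 score name."""
    try:
        return AF3_SCORE_TO_QA_METRIC[af3_key]
    except KeyError:
        # Fail loudly for scores that have no proposed mapping yet.
        raise KeyError(f"no ModelCIF mapping proposed for AF3 score {af3_key!r}") from None
```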

Some caveats to consider:

  • contact_probs and pae above are defined per "token" pair, where a token is either a full residue (for standard amino and nucleic acids) or a single atom otherwise. In AF3, the per-residue tokens have a well defined "token centre atom" (CA for standard amino acids, C1' for standard nucleotides) which could be used in per-atom scores but this may be confusing.
  • The "per-chain" scores also apply to non-polymers, so the name may be confusing. Technically "per-asym-id" is more correct, although that may only be understandable to mmCIF experts.
  • For future applications in physics-based docking tools, we need to make sure that local scores can identify individual water molecules. In PDB, waters share a label_asym_id and do not have a label_seq_id; one way to fix this in ModelCIF would be to give each water its own label_asym_id.

Alternative to the above (which simplifies some things and handles the per token scores):

  • Extend _ma_qa_metric_local and _ma_qa_metric_local_pairwise to include label_atom_id (linked to _atom_site.label_atom_id) which can be set to '.' for per-residue scores.
  • One could also handle per-chain scores by allowing label_comp_id and label_seq_id to be set to '.'.
  • With appropriate updates to the category and item descriptions, all types of local scores could be handled by the _ma_qa_metric_local and _ma_qa_metric_local_pairwise tables and no additional tables or _ma_qa_metric.mode values would be necessary.

@brindakv what are your thoughts on this?

Notes from discussions with @benmwebb, @brindakv and @aozalevsky (on Oct. 16):

  • Not good to add link to _atom_site.label_atom_id to _ma_qa_metric_local and _ma_qa_metric_local_pairwise as it overloads the tables and still doesn't enable clean handling of non-polymers (which is critical for AF3)
  • Alternative discarded suggestion was to link to _atom_site.id with a flag for granularity (atom, residue or chain). Pro: easy to use and look at. Con: cannot generalize to other features (e.g. residue ranges, domains, ...) and ambiguous on how to define (e.g. which atom to pick).
  • Preferred solution is to use features as in IHM's _ihm_feature_list

Example AF3 output (cut to only include one model instead of 5): fold_test_fold_job_number_one_cut.zip. Info on content:

  • fold_test_fold_job_number_one_job_request.json is input to AF3 (can be uploaded to the AF-Server)
  • fold_test_fold_job_number_one_model_0.cif is a (not 100% compliant) ModelCIF file. Note that copies of the same molecule (HEM, MG, and NA in this example) are handled with multiple identical molecular entities (instead of a single entity with multiple instances).
  • fold_test_fold_job_number_one_summary_confidences_0.json contains global, per-chain and per-chain-pair scores (see "Summary outputs" in AF-server-FAQ). Note that some values can be "null".
  • fold_test_fold_job_number_one_full_data_0.json contains the per-atom pLDDT and per-token-pair PAE and contact probabilities (see "Full array outputs" in AF-server-FAQ). Tokens are either a full residue (for standard amino and nucleic acids) or a single atom otherwise. Order of values is implicit according to order in atom_site of .cif file.
  • Chains in the model:
    • A: polymer (polypeptide; seq: "PREACHINGS"), residues 1 and 5 modified (HY3, P1L)
    • B: polymer (polypeptide; seq: "REACHER")
    • C: non-polymer (ATP)
    • D: non-polymer (HEM)
    • E: non-polymer (HEM)
    • F: non-polymer (MG)
    • G: non-polymer (MG)
    • H: non-polymer (NA)
    • I: non-polymer (NA)
    • J: non-polymer (NA)
    • K: polymer (polydeoxyribonucleotide; seq: "GATTACA"), residues 1 and 2 modified (6OG, 6MA)
    • L: polymer (polydeoxyribonucleotide; seq: "TGTAATC")
    • M: polymer (polyribonucleotide; seq: "GUAC"), residues 1 and 4 modified (2MG, 5MC)
    • N: branched (NAG-NAG-BMA)
    • O: branched (BMA)
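
To make the implicit ordering of the per-token arrays easier to follow, here is a rough sketch of how tokens could be derived from atoms listed in atom_site order, using the rule stated above (one token per standard residue, one token per atom otherwise). The function name, the input tuple layout and the exact sets of standard residue names are assumptions for illustration and are not taken from the AF3 code.

```python
# Assumption: these are the "standard" residue names; everything else
# (ligands, ions, modified residues, glycans) gets one token per atom.
STANDARD_AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
STANDARD_NUCLEOTIDES = {"A", "C", "G", "U", "DA", "DC", "DG", "DT"}
STANDARD = STANDARD_AMINO_ACIDS | STANDARD_NUCLEOTIDES


def tokens_from_atoms(atoms):
    """Group atoms into tokens.

    atoms: iterable of (atom_site_id, label_asym_id, label_seq_id, label_comp_id)
           in the same order as in atom_site (hypothetical input layout).
    Returns a list of tokens, each a list of atom_site ids.
    """
    tokens = []
    last_residue = None
    for atom_id, asym_id, seq_id, comp_id in atoms:
        residue = (asym_id, seq_id, comp_id)
        if comp_id in STANDARD and residue == last_residue:
            # Further atom of the same standard residue: extend its token.
            tokens[-1].append(atom_id)
        else:
            # First atom of a residue, or any atom of a non-standard component.
            tokens.append([atom_id])
        last_residue = residue
    return tokens


# Made-up toy input: two standard residues followed by two ligand atoms.
example_atoms = [
    (1, "A", 2, "ARG"), (2, "A", 2, "ARG"),
    (3, "A", 3, "GLU"),
    (4, "C", None, "ATP"), (5, "C", None, "ATP"),
]
example_tokens = tokens_from_atoms(example_atoms)
```

With this grouping, row and column i of the pae and contact_probs matrices would refer to the i-th token in the returned list.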

Suggested ModelCIF extension:

  • Extend _ma_qa_metric.type as in first comment
  • Extend _ma_qa_metric.mode to include "per-feature" and "per-feature-pair"
  • New _ma_feature_list exactly like _ihm_feature_list except "branched" added to entity_type and feature_type which should include the following controlled vocabulary:
    • atom: "feature is an atom or a set of atoms for any entity type"
    • residue: "feature is a residue or a set of residues from a polymeric entity"
    • asym_id: "feature is an instance of a molecular entity"
  • New _ma_atom_feature category:
    • Description: "Data items in this category provide the definitions required to select specific atoms independently of entity type."
    • Items:
      • ordinal_id (key): "A unique identifier for the category."
      • feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
      • atom_id (mandatory): "The identifier of the atom. This data item is a pointer to _atom_site.id in the ATOM_SITE category."
  • New _ma_poly_residue_feature category:
    • Description: "Data items in this category provide the definitions required to select specific polymer residues."
    • Items (similar to ma_qa_metric_local):
      • ordinal_id (key): "A unique identifier for the category."
      • feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
      • label_asym_id (mandatory): "The identifier for the asym id of the residue in the structural model. This data item is a pointer to _atom_site.label_asym_id in the ATOM_SITE category."
      • label_comp_id (mandatory): "The component identifier for the residue in the structural model. This data item is a pointer to _atom_site.label_comp_id in the ATOM_SITE category."
      • label_seq_id (mandatory): "The identifier for the sequence index of the residue in the structural model. This data item is a pointer to _atom_site.label_seq_id in the ATOM_SITE category."
  • New _ma_asym_id_feature category:
    • Description: "Data items in this category provide the definitions required to select specific instances of a molecular entity independently of entity type (e.g. a polymer chain or a copy of a non-polymer)."
    • Items (similar to _ma_poly_residue_feature):
      • ordinal_id (key): "A unique identifier for the category."
      • feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
      • label_asym_id (mandatory): "The identifier for the asym id of the residue in the structural model. This data item is a pointer to _atom_site.label_asym_id in the ATOM_SITE category."
  • New _ma_qa_metric_feature category (similar to ma_qa_metric_local):
    • Description: "Data items in this category capture QA metrics calculated per feature (as defined in _ma_feature_list)."
    • Items:
      • ordinal_id (key), metric_id, metric_value, model_id (all mandatory) exactly as in ma_qa_metric_local
      • feature_id (mandatory): "The identifier for the feature, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
  • New _ma_qa_metric_feature_pairwise category (similar to ma_qa_metric_local_pairwise):
    • Description: "Data items in this category capture QA metrics calculated per pair of features (as defined in _ma_feature_list)."
    • Items:
      • ordinal_id (key), metric_id, metric_value, model_id (all mandatory) exactly as in ma_qa_metric_local_pairwise
      • feature_id_1 (mandatory): "The identifier for the first feature in the pair, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
      • feature_id_2 (mandatory): "The identifier for the second feature in the pair, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
  • Note: if it is preferred to use something else instead of "asym_id" in the category name and feature_type, that's also ok...
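
As an illustration of how the proposed categories would fit together, the sketch below builds rows for a per-chain score (such as chain_ptm) as plain Python dicts. Everything here is hypothetical: the function name, the dict layout and the example values are made up, and the item names simply follow the proposal above (with _ma_feature_list assumed to carry feature_id, feature_type and entity_type like _ihm_feature_list).

```python
def per_chain_feature_rows(chains, chain_scores, metric_id, model_id):
    """Build rows for the proposed feature-based categories.

    chains: list of (label_asym_id, entity_type) tuples.
    chain_scores: one value per chain, in the same order; may contain None
                  (the AF3 summary JSON can contain "null").
    Returns a dict mapping category name -> list of row dicts.
    """
    feature_list, asym_features, metric_rows = [], [], []
    for feature_id, ((asym_id, entity_type), score) in enumerate(
        zip(chains, chain_scores), start=1
    ):
        # One feature per instance of a molecular entity.
        feature_list.append(
            {"feature_id": feature_id, "feature_type": "asym_id", "entity_type": entity_type}
        )
        asym_features.append(
            {"ordinal_id": feature_id, "feature_id": feature_id, "label_asym_id": asym_id}
        )
        if score is not None:  # skip missing values instead of writing them
            metric_rows.append(
                {
                    "ordinal_id": len(metric_rows) + 1,
                    "model_id": model_id,
                    "feature_id": feature_id,
                    "metric_id": metric_id,
                    "metric_value": score,
                }
            )
    return {
        "ma_feature_list": feature_list,
        "ma_asym_id_feature": asym_features,
        "ma_qa_metric_feature": metric_rows,
    }


# Made-up example: one polymer chain with a score, one ligand without.
rows = per_chain_feature_rows(
    [("A", "polymer"), ("C", "non-polymer")], [0.5, None], metric_id=1, model_id=1
)
```

Pairwise scores would work the same way, with two feature ids per row in _ma_qa_metric_feature_pairwise.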

@gtauriello, I just wanted to follow up on this. With AF3 code and weights being released and with the recent addition of restraints to Chai-1, we can expect rapid growth in the number of deposited models. Would be nice to have the scores in those models.

I agree. @brindakv was waiting for me to decide on a separate issue that we wanted to address in the same ModelCIF update, and I have now added that here as issue #23. Hence, I think she can now make the updates according to the open issues here.

Afterwards, we can try to suggest changes in alphafold3/model/mmcif_metadata.py to include this (and check if other things are invalid in their files).

@gtauriello please clarify my questions below.

  1. Do we need _ma_poly_residue_feature considering that ma_qa_metric_local sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.
  2. Do we want ma_feature_list.feature_type to support contiguous residue ranges? If yes, then _ma_poly_residue_feature can have begin and end data items for seq_id and comp_id.
  3. What is the use case for ma_qa_metric.type = boolean? Should this be a separate data item elsewhere rather than an enumeration of ma_qa_metric.type?

1. Do we need `_ma_poly_residue_feature` considering that `ma_qa_metric_local` sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.

The main use case for it is to be able to handle pairs between an atom and a residue in ma_qa_metric_feature_pairwise (needed for AF3's PAE matrix). We would not be able to do it in any other way.

2. Do we want `ma_feature_list.feature_type` to support contiguous residue ranges? If yes, then `_ma_poly_residue_feature` can have begin and end data items for `seq_id` and `comp_id`.

This would make the main existing use case in AF3 more verbose than necessary (we need a feature for each polymer residue to handle the PAE matrix) while I currently do not have a use case for contiguous residue ranges. If we need those ranges in the future, I would prefer to have them in a separate table.

3. What is the use case for `ma_qa_metric.type` = `boolean`? Should this be a separate data item elsewhere rather than an enumeration of `ma_qa_metric.type`?

The default ranking score in AF3 is calculated as 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash. I would like to be able to properly store all components of that and has_clash is a boolean pass/fail score (1 = pass, 0 = fail).
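
For reference, the formula above written as a small function. This is just a sketch of the expression as quoted here, not code taken from AF3; the function name is hypothetical, and I am assuming that "disorder" corresponds to the fraction_disordered value in the summary JSON.

```python
def af3_ranking_score(iptm, ptm, fraction_disordered, has_clash):
    """Combine the four global components as in the formula quoted above.

    has_clash is expected to be 0/1 (or False/True).
    """
    return (
        0.8 * iptm
        + 0.2 * ptm
        + 0.5 * fraction_disordered  # assumption: "disorder" = fraction_disordered
        - 100.0 * float(has_clash)
    )
```

Storing all four components as separate global metrics would let anyone recompute ranking_score from the ModelCIF file alone.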

Thanks for clarifying @gtauriello.

The default ranking score in AF3 is calculated as 0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash. I would like to be able to properly store all components of that and has_clash is a boolean pass/fail score (1 = pass, 0 = fail).

Should the enumeration for ma_qa_metric.type be has_clash or boolean?

Never mind. Boolean is good.

@gtauriello I suggest we add enumerations to _ma_associated_archive_file_details.file_content and _ma_entry_associated_files.file_content.

It can be generic (QA metrics) or specific (feature-based QA scores).

For ma_qa_metric.type: yes for boolean as you concluded already.

For file_content: I had not noticed that one but it is an excellent point. I would go for the generic (QA metrics) option and add a note for local pairwise QA scores that this is deprecated in favor of QA metrics.

Thanks @gtauriello. Updates have been committed, please see #25.