Extension of ModelCIF for AF3 quality estimates
Related to #20 and the issues mentioned in there, I would suggest extending ModelCIF to capture all new types of quality estimates introduced with AlphaFold 3 (AF3). I also had a look at RoseTTAFold-AllAtom and the suggestions below would also capture anything needed there. I also believe that this should cover anything needed for chaidiscovery/chai-lab#52. Here are my suggested additions:
- Extend `_ma_qa_metric.type` to include:
  - "pLDDT to polymer" with detailed description "confidence score predicting accuracy according to lDDT with distances from each atom to CA or C1' of nearby polymer residues in [0,100]"
  - "boolean" with detailed description "0 or 1 depending on whether a check passed (1) or not (0)."
- Extend `_ma_qa_metric.mode` to include "per-chain", "per-chain-pairwise", "per-atom" and "per-atom-pairwise" (and yes I know it's a bit unfortunate that we used "local" for "per-residue" but ok...)
- New `_ma_qa_metric_per_chain` same as `_ma_qa_metric_local` but without `label_comp_id` and `label_seq_id`
- New `_ma_qa_metric_per_chain_pairwise` same as `_ma_qa_metric_local_pairwise` but without `label_comp_id*` and `label_seq_id*`
- New `_ma_qa_metric_per_atom` same as `_ma_qa_metric_local` but using atom_id (linked to `_atom_site.id`) instead of `model_id` and `label_*`
- New `_ma_qa_metric_per_atom_pairwise` same as `_ma_qa_metric_local_pairwise` but using atom_id_1 and atom_id_2 (linked to `_atom_site.id`) instead of `model_id` and `label_*`
Concretely for AF3 output (e.g. looking at the JSON files in one of their examples) here is how each of the scores would map to a `_ma_qa_metric.mode` and `.type`:
- `fraction_disordered`: "global", "normalized score"
- `has_clash`: "global", "boolean"
- `iptm`: "global", "ipTM"
- `ptm`: "global", "pTM"
- `ranking_score`: "global", "normalized score"
- `chain_ptm`: "per-chain", "pTM"
- `chain_iptm`: "per-chain", "ipTM"
- `chain_pair_iptm`: "per-chain-pairwise", "ipTM"
- `chain_pair_pae_min`: "per-chain-pairwise", "PAE"
- `atom_plddts`: "per-atom", "pLDDT to polymer"
- `contact_probs`: "per-atom-pairwise", "contact probability"
- `pae`: "per-atom-pairwise", "PAE"
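The mapping above could be written down as a small lookup table, for instance in a converter script. This is only a sketch: the names `AF3_SCORE_MAPPING` and `scores_for_mode` are made up for illustration, and the mode and type strings are the ones proposed in this comment, not values from a released dictionary.

```python
# Sketch only: AF3 JSON key -> (proposed _ma_qa_metric.mode, _ma_qa_metric.type).
AF3_SCORE_MAPPING = {
    "fraction_disordered": ("global", "normalized score"),
    "has_clash": ("global", "boolean"),
    "iptm": ("global", "ipTM"),
    "ptm": ("global", "pTM"),
    "ranking_score": ("global", "normalized score"),
    "chain_ptm": ("per-chain", "pTM"),
    "chain_iptm": ("per-chain", "ipTM"),
    "chain_pair_iptm": ("per-chain-pairwise", "ipTM"),
    "chain_pair_pae_min": ("per-chain-pairwise", "PAE"),
    "atom_plddts": ("per-atom", "pLDDT to polymer"),
    "contact_probs": ("per-atom-pairwise", "contact probability"),
    "pae": ("per-atom-pairwise", "PAE"),
}


def scores_for_mode(mode):
    """Return the AF3 JSON keys mapped to the given mode, sorted by name."""
    return sorted(key for key, (m, _) in AF3_SCORE_MAPPING.items() if m == mode)
```

For example, `scores_for_mode("per-chain")` lists the two per-chain scores, which is a quick way to see how many new tables or modes a given AF3 output file would exercise.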
Some caveats to consider:
- `contact_probs` and `pae` above are defined per "token" pair, where a token is either a full residue (for standard amino and nucleic acids) or a single atom otherwise. In AF3, the per-residue tokens have a well defined "token centre atom" (CA for standard amino acids, C1' for standard nucleotides) which could be used in per-atom scores but this may be confusing.
- The "per-chain" scores also apply to non-polymers, which may make the name confusing. Technically "per-asym-id" is more correct, although that may only be understandable to mmCIF experts.
- For future applications in physics-based docking tools, we need to make sure that local scores can identify water molecules. In PDB those all share `label_asym_id` and do not have a `label_seq_id`; one could fix this in ModelCIF by giving them separate `label_asym_id` values.
Alternative to the above (which simplifies some things and handles the per token scores):
- Extend `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` to include `label_atom_id` (linked to `_atom_site.label_atom_id`) which can be set to '.' for per-residue scores.
- One could also handle per-chain scores by allowing `label_comp_id` and `label_seq_id` to be set to '.'.
- With appropriate updates to the category and item descriptions, all types of local scores could be handled by the `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` tables and no additional tables or `_ma_qa_metric.mode` values would be necessary.
@brindakv what are your thoughts on this?
Notes from discussions with @benmwebb , @brindakv and @aozalevsky (on Oct. 16):
- Not good to add link to `_atom_site.label_atom_id` to `_ma_qa_metric_local` and `_ma_qa_metric_local_pairwise` as it overloads the tables and still doesn't enable clean handling of non-polymers (which is critical for AF3)
- Alternative discarded suggestion was to link to `_atom_site.id` with a flag for granularity (atom, residue or chain). Pro: easy to use and look at. Con: cannot generalize to other features (e.g. residue ranges, domains, ...) and ambiguous on how to define (e.g. which atom to pick).
- Preferred solution is to use features as in IHM's `_ihm_feature_list`
Example AF3 output (cut to only include one model instead of 5): fold_test_fold_job_number_one_cut.zip. Info on content:
- `fold_test_fold_job_number_one_job_request.json` is input to AF3 (can be uploaded to the AF-Server)
- `fold_test_fold_job_number_one_model_0.cif` is a (not 100% compliant) ModelCIF file. Note that copies of the same molecule (HEM, MG, and NA in this example) are handled with multiple identical molecular entities (instead of a single entity with multiple instances).
- `fold_test_fold_job_number_one_summary_confidences_0.json` contains global, per-chain and per-chain-pair scores (see "Summary outputs" in AF-server-FAQ). Note that some values can be "null".
- `fold_test_fold_job_number_one_full_data_0.json` contains the per-atom pLDDT and per-token-pair PAE and contact probabilities (see "Full array outputs" in AF-server-FAQ). Tokens are either a full residue (for standard amino and nucleic acids) or a single atom otherwise. Order of values is implicit according to order in atom_site of .cif file.
- Chains in the model:
  - `A`: polymer (polypeptide; seq: "PREACHINGS"), residues 1 and 5 modified (HY3, P1L)
  - `B`: polymer (polypeptide; seq: "REACHER")
  - `C`: non-polymer (ATP)
  - `D`: non-polymer (HEM)
  - `E`: non-polymer (HEM)
  - `F`: non-polymer (MG)
  - `G`: non-polymer (MG)
  - `H`: non-polymer (NA)
  - `I`: non-polymer (NA)
  - `J`: non-polymer (NA)
  - `K`: polymer (polydeoxyribonucleotide; seq: "GATTACA"), residues 1 and 2 modified (6OG, 6MA)
  - `L`: polymer (polydeoxyribonucleotide; seq: "TGTAATC")
  - `M`: polymer (polyribonucleotide; seq: "GUAC"), residues 1 and 4 modified (2MG, 5MC)
  - `N`: branched (NAG-NAG-BMA)
  - `O`: branched (BMA)
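The token rule described above (one token per standard residue, one token per atom otherwise) determines the size of the per-token-pair matrices. A minimal sketch of that counting rule follows; `count_tokens` is a hypothetical helper, and only chains B and F of the example are used so that no atom counts need to be assumed beyond the single-atom MG ion.

```python
def count_tokens(components):
    """Count tokens for a list of (is_standard_residue, n_atoms) tuples.

    A standard amino acid or nucleotide is one token; every other
    component (modified residue, ligand, ion, sugar) is one token per atom.
    """
    return sum(1 if is_standard else n_atoms for is_standard, n_atoms in components)


# Chain B ("REACHER"): seven standard residues. Their atom counts do not
# affect the token count, so None is used as a placeholder.
chain_b = [(True, None)] * 7
# Chain F: a single MG ion, i.e. a non-polymer with one atom.
chain_f = [(False, 1)]

n_tokens = count_tokens(chain_b + chain_f)
pae_shape = (n_tokens, n_tokens)  # expected shape of a token-pair matrix
```

For the full example, the same rule would be applied to all 15 chains in `atom_site` order, with the modified residues and ligands contributing one token per atom.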
Suggested ModelCIF extension:
- Extend _ma_qa_metric.type as in first comment
- Extend _ma_qa_metric.mode to include "per-feature" and "per-feature-pair"
- New `_ma_feature_list` exactly like `_ihm_feature_list` except "branched" added to `entity_type` and `feature_type` which should include the following controlled vocabulary:
  - atom: "feature is an atom or a set of atoms for any entity type"
  - residue: "feature is a residue or a set of residues from a polymeric entity"
  - asym_id: "feature is an instance of a molecular entity"
- New `_ma_atom_feature` category:
  - Description: "Data items in this category provide the definitions required to select specific atoms independently of entity type."
  - Items:
    - ordinal_id (key): "A unique identifier for the category."
    - feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
    - atom_id (mandatory): "The identifier of the atom. This data item is a pointer to _atom_site.id in the ATOM_SITE category."
- New `_ma_poly_residue_feature` category:
  - Description: "Data items in this category provide the definitions required to select specific polymer residues."
  - Items (similar to ma_qa_metric_local):
    - ordinal_id (key): "A unique identifier for the category."
    - feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
    - label_asym_id (mandatory): "The identifier for the asym id of the residue in the structural model. This data item is a pointer to _atom_site.label_asym_id in the ATOM_SITE category."
    - label_comp_id (mandatory): "The component identifier for the residue in the structural model. This data item is a pointer to _atom_site.label_comp_id in the ATOM_SITE category."
    - label_seq_id (mandatory): "The identifier for the sequence index of the residue in the structural model. This data item is a pointer to _atom_site.label_seq_id in the ATOM_SITE category."
- New `_ma_asym_id_feature` category:
  - Description: "Data items in this category provide the definitions required to select specific instances of a molecular entity independently of entity type (e.g. a polymer chain or a copy of a non-polymer)."
  - Items (similar to _ma_poly_residue_feature):
    - ordinal_id (key): "A unique identifier for the category."
    - feature_id (mandatory): "An identifier for the selected feature. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
    - label_asym_id (mandatory): "The identifier for the asym id of the residue in the structural model. This data item is a pointer to _atom_site.label_asym_id in the ATOM_SITE category."
- New `_ma_qa_metric_feature` category (similar to ma_qa_metric_local):
  - Description: "Data items in this category capture QA metrics calculated per feature (as defined in _ma_feature_list)."
  - Items:
    - ordinal_id (key), metric_id, metric_value, model_id (all mandatory) exactly as in ma_qa_metric_local
    - feature_id (mandatory): "The identifier for the feature, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
- New `_ma_qa_metric_feature_pairwise` category (similar to ma_qa_metric_local_pairwise):
  - Description: "Data items in this category capture QA metrics calculated per pair of features (as defined in _ma_feature_list)."
  - Items:
    - ordinal_id (key), metric_id, metric_value, model_id (all mandatory) exactly as in ma_qa_metric_local_pairwise
    - feature_id_1 (mandatory): "The identifier for the first feature in the pair, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
    - feature_id_2 (mandatory): "The identifier for the second feature in the pair, for which QA metric is provided. This data item is a pointer to _ma_feature_list.feature_id in the MA_FEATURE_LIST category."
- Note: if it is preferred to use something else instead of "asym_id" in the category name and feature_type, that's also ok...
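To make the relationship between the proposed categories concrete, here is a rough in-memory sketch of how a residue feature, an atom feature and one pairwise metric row would reference each other. Everything in it is illustrative: `FeatureTables` is a hypothetical helper, the `_ma_feature_list` columns are assumed to mirror `_ihm_feature_list` (`feature_id`, `feature_type`, `entity_type`), and the atom id, metric id and metric value are made up.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureTables:
    """In-memory stand-in for the proposed feature categories (sketch only)."""

    feature_list: list = field(default_factory=list)          # _ma_feature_list
    atom_feature: list = field(default_factory=list)          # _ma_atom_feature
    poly_residue_feature: list = field(default_factory=list)  # _ma_poly_residue_feature

    def _new_feature(self, feature_type, entity_type):
        # Column names are assumed to mirror _ihm_feature_list.
        feature_id = len(self.feature_list) + 1
        self.feature_list.append(
            {"feature_id": feature_id, "feature_type": feature_type, "entity_type": entity_type}
        )
        return feature_id

    def add_atom(self, atom_id, entity_type):
        """Register a single-atom feature pointing at _atom_site.id."""
        feature_id = self._new_feature("atom", entity_type)
        self.atom_feature.append(
            {"ordinal_id": len(self.atom_feature) + 1, "feature_id": feature_id, "atom_id": atom_id}
        )
        return feature_id

    def add_residue(self, label_asym_id, label_comp_id, label_seq_id):
        """Register a single polymer residue as a feature."""
        feature_id = self._new_feature("residue", "polymer")
        self.poly_residue_feature.append(
            {
                "ordinal_id": len(self.poly_residue_feature) + 1,
                "feature_id": feature_id,
                "label_asym_id": label_asym_id,
                "label_comp_id": label_comp_id,
                "label_seq_id": label_seq_id,
            }
        )
        return feature_id


tables = FeatureTables()
res_b1 = tables.add_residue("B", "ARG", 1)     # first residue of chain B ("REACHER")
mg_atom = tables.add_atom(200, "non-polymer")  # 200 is a made-up _atom_site.id for an MG ion

# One _ma_qa_metric_feature_pairwise row (metric_id and metric_value are made up).
pair_row = {
    "ordinal_id": 1, "model_id": 1, "metric_id": 1,
    "feature_id_1": res_b1, "feature_id_2": mg_atom, "metric_value": 4.2,
}
```

The point of the indirection is that a pairwise row only stores two `feature_id` values, so a residue-to-atom pair needs no special casing.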
@gtauriello, I just wanted to follow up on this. With AF3 code and weights being released and with the recent addition of restraints to Chai-1, we can expect rapid growth in the number of deposited models. Would be nice to have the scores in those models.
I agree. @brindakv was waiting for me to decide on a separate issue that we wanted to address in the same ModelCIF update and now I added that here as issue #23 . Hence, I think that she can now do the updates according to the open issues here.
Afterwards, we can try to suggest changes in alphafold3/model/mmcif_metadata.py to include this (and check if other things are invalid in their files).
@gtauriello please clarify my questions below.
1. Do we need `_ma_poly_residue_feature` considering that `ma_qa_metric_local` sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.
2. Do we want `ma_feature_list.feature_type` to support contiguous residue ranges? If yes, then `_ma_poly_residue_feature` can have begin and end data items for `seq_id` and `comp_id`.
3. What is the use case for `ma_qa_metric.type` = `boolean`? Should this be a separate data item elsewhere rather than an enumeration of `ma_qa_metric.type`?
> 1. Do we need `_ma_poly_residue_feature` considering that `ma_qa_metric_local` sort of already handles it? The difference would be the ability to assign multiple residues to a feature. If this is a use case, then we can add it.

The main use case for it is to be able to handle pairs between an atom and a residue in `ma_qa_metric_feature_pairwise` (needed for AF3's PAE matrix). We would not be able to do it in any other way.
> 2. Do we want `ma_feature_list.feature_type` to support contiguous residue ranges? If yes, then `_ma_poly_residue_feature` can have begin and end data items for `seq_id` and `comp_id`.

This would make the main existing use case in AF3 more verbose than necessary (we need a feature for each polymer residue to handle the PAE matrix) while I currently do not have a use case for contiguous residue ranges. If we need those ranges in the future, I would prefer to have them in a separate table.
> 3. What is the use case for `ma_qa_metric.type` = `boolean`? Should this be a separate data item elsewhere rather than an enumeration of `ma_qa_metric.type`?

The default ranking score in AF3 is calculated as `0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash`. I would like to be able to properly store all components of that and `has_clash` is a boolean pass/fail score (1 = pass, 0 = fail).
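As a worked illustration of the formula quoted above, the following sketch recomputes the ranking score from its components. The function name is hypothetical, the "disorder" term is assumed to correspond to `fraction_disordered` in the summary JSON, and the input values are made up.

```python
def af3_ranking_score(iptm, ptm, fraction_disordered, has_clash):
    """Combine the components as 0.8*ipTM + 0.2*pTM + 0.5*disorder - 100*has_clash.

    has_clash is expected to be 0 or 1, matching the proposed "boolean" type.
    """
    return 0.8 * iptm + 0.2 * ptm + 0.5 * fraction_disordered - 100.0 * has_clash


# Made-up component values, once without and once with the clash flag set.
score_no_clash = af3_ranking_score(iptm=0.5, ptm=0.5, fraction_disordered=0.0, has_clash=0)
score_clash = af3_ranking_score(iptm=0.5, ptm=0.5, fraction_disordered=0.0, has_clash=1)
```

Because the clash term carries a weight of 100, storing it as a 0/1 value alongside the other components is enough to reproduce the score exactly.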
Thanks for clarifying @gtauriello.
> The default ranking score in AF3 is calculated as `0.8 × ipTM + 0.2 × pTM + 0.5 × disorder − 100 × has_clash`. I would like to be able to properly store all components of that and `has_clash` is a boolean pass/fail score (1 = pass, 0 = fail).

Should the enumeration for `ma_qa_metric.type` be `has_clash` or `boolean`?
Never mind. Boolean is good.
@gtauriello I suggest we add enumerations to `_ma_associated_archive_file_details.file_content` and `_ma_entry_associated_files.file_content`.
It can be generic (`QA metrics`) or specific (`feature-based QA scores`).
For `ma_qa_metric.type`: yes for `boolean` as you concluded already.
For `file_content`: I had not noticed that one but it is an excellent point. I would go for the generic (`QA metrics`) option and add a note for `local pairwise QA scores` that this is deprecated in favor of `QA metrics`.
Thanks @gtauriello. Updates have been committed, please see #25.