openkinome/kinodata

Explore kinase mutation annotation in ChEMBL

schallerdavid opened this issue · 1 comments

We need to check how well ChEMBL annotates mutations. Mutant information can be found in the columns Assay Variant Mutation and Assay Variant Accession.

Example data point:

https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/molecule_chembl_id%3A(%22CHEMBL3354189%22)%20AND%20standard_type%3A(%22Ki%22)

For this data point (CHEMBL3354189), the wild-type (wt) target sequence is stored in component_sequences.sequence, following the ChEMBL schema. That is, no matter what is recorded in variant_sequences.mutation, component_sequences.sequence always holds the wt sequence.

The mutated sequence is found in variant_sequences.sequence. However, this field is left blank if the target record is wt, or if the mutation is anything other than a substitution (e.g. a deletion).

The field variant_sequences.mutation seems consistent with variant_sequences.sequence: a mutation appears to be recorded in the former only when it is accompanied by a sequence in the latter. When the target is wt, this field is generally left blank, just like variant_sequences.sequence.
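To look at these fields side by side, a join along the following lines could be used. This is only a minimal sketch against a toy in-memory SQLite database: the table and column names (assays.tid, assays.variant_id, target_components, and so on) are assumed to follow the ChEMBL schema, the assumption that wt assays carry no variant_id is mine, and all rows are invented.

```python
import sqlite3

# Toy in-memory stand-in for a local ChEMBL dump. Table and column names are
# ASSUMED to mirror the ChEMBL schema; every row below is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE component_sequences (component_id INTEGER PRIMARY KEY, sequence TEXT);
CREATE TABLE target_components (tid INTEGER, component_id INTEGER);
CREATE TABLE variant_sequences (variant_id INTEGER PRIMARY KEY, mutation TEXT, sequence TEXT);
CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER, variant_id INTEGER, description TEXT);

INSERT INTO component_sequences VALUES (1, 'MKLTA');
INSERT INTO target_components VALUES (10, 1);
INSERT INTO variant_sequences VALUES (100, 'L3R', 'MKRTA');
INSERT INTO variant_sequences VALUES (101, 'UNDEFINED MUTATION', NULL);
INSERT INTO assays VALUES (1000, 10, NULL, 'wild-type assay');
INSERT INTO assays VALUES (1001, 10, 100, 'substitution mutant assay');
INSERT INTO assays VALUES (1002, 10, 101, 'deletion mutant assay');
""")

# The wt sequence comes from the target's component; the variant (if any)
# is attached with a LEFT JOIN so wt assays are kept with NULL variant fields.
QUERY = """
SELECT a.assay_id,
       vs.mutation,
       cs.sequence AS wt_sequence,
       vs.sequence AS variant_sequence
FROM assays a
JOIN target_components tc ON tc.tid = a.tid
JOIN component_sequences cs ON cs.component_id = tc.component_id
LEFT JOIN variant_sequences vs ON vs.variant_id = a.variant_id
ORDER BY a.assay_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

In this toy data the wt sequence is returned for every assay, while the variant sequence is only present for the substitution mutant.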

CAVEATS:

CHEMBL3354189 was tested against a double mutant of EGFR, recorded as L858R,T790M in variant_sequences.mutation. However, in the mutated sequence in variant_sequences.sequence both substitutions sit at different positions (L865R,T798M), shifted by several residues (7 and 8 respectively, going by the positions quoted here). So while variant_sequences.mutation seems reliable for confirming that mutations are present, it may not always be a reliable indicator of where those mutations occur in the stored sequence.
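One way to detect this kind of numbering mismatch would be to compare the wt and mutated sequences directly and check the result against the annotated label. The sketch below assumes simple substitution labels of the form L858R,T790M and equal-length sequences; the function names and the toy sequences are hypothetical and not taken from ChEMBL.

```python
import re

def parse_mutations(label):
    """Parse a label such as 'L858R,T790M' into (wt_residue, position, mutant_residue) tuples."""
    parsed = []
    for token in label.split(","):
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", token.strip())
        if match is None:
            raise ValueError(f"not a simple substitution: {token!r}")
        parsed.append((match.group(1), int(match.group(2)), match.group(3)))
    return parsed

def observed_substitutions(wt_seq, variant_seq):
    """List substitutions found by comparing two equal-length sequences (1-based positions)."""
    if len(wt_seq) != len(variant_seq):
        raise ValueError("sequences differ in length; not a pure substitution")
    return [
        (wt_res, pos, var_res)
        for pos, (wt_res, var_res) in enumerate(zip(wt_seq, variant_seq), start=1)
        if wt_res != var_res
    ]

def annotation_offsets(label, wt_seq, variant_seq):
    """Map each annotated substitution to (observed position - annotated position).

    The value is None when the residue change is not observed exactly once,
    since the pairing would then be ambiguous or impossible.
    """
    observed = observed_substitutions(wt_seq, variant_seq)
    offsets = {}
    for wt_res, pos, mut_res in parse_mutations(label):
        hits = [p for (w, p, v) in observed if (w, v) == (wt_res, mut_res)]
        offsets[f"{wt_res}{pos}{mut_res}"] = hits[0] - pos if len(hits) == 1 else None
    return offsets

# Toy example: the label says L3R,T6M but the sequences differ at positions 5 and 8.
print(annotation_offsets("L3R,T6M", "MAAALAAT", "MAAARAAM"))
```

A non-zero offset flags records where the label and the stored sequence use different numbering, which is the situation described for the EGFR double mutant.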

With respect to deletion/addition mutations, CHEMBL3354189 has the value UNDEFINED MUTATION in variant_sequences.mutation, and assays.description contains information confirming that it is a deletion mutation. This is not always the case for other data points, so there appears to be no reliable correspondence between assays.description and variant_sequences.mutation that would fully delineate deletion mutations.

So, to put this in pseudocode, when retrieving sequence data for model training:

    if protein sequence == WT:
        select component_sequences.sequence
    elif protein sequence == SUBSTITUTION_MUTATION:
        select variant_sequences.sequence

Here, WT applies when there is a record in component_sequences.sequence only, and SUBSTITUTION_MUTATION applies when there are records in both component_sequences.sequence and variant_sequences.sequence, since ChEMBL seems to require a wt sequence before a mutated sequence can be entered. Other types of mutation could apparently only be determined, and unreliably so, from fields such as assays.description.
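The selection rule could be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than a tested implementation against ChEMBL: the function name and its arguments are hypothetical, blank fields are assumed to arrive as None or empty strings, and returning None for unresolved mutations (e.g. UNDEFINED MUTATION) is a design choice made here, not something ChEMBL prescribes.

```python
from typing import Optional

def select_training_sequence(
    wt_sequence: Optional[str],
    variant_mutation: Optional[str],
    variant_sequence: Optional[str],
) -> Optional[str]:
    """Return the sequence to use for model training, or None if unresolved.

    wt_sequence      -> component_sequences.sequence
    variant_mutation -> variant_sequences.mutation
    variant_sequence -> variant_sequences.sequence
    """
    has_wt = bool(wt_sequence and wt_sequence.strip())
    has_variant_seq = bool(variant_sequence and variant_sequence.strip())
    has_mutation = bool(variant_mutation and variant_mutation.strip())

    if not has_wt:
        # A mutated sequence is assumed to exist only alongside a wt sequence.
        return None
    if has_variant_seq:
        # SUBSTITUTION_MUTATION: both sequences are present.
        return variant_sequence
    if has_mutation:
        # A mutation is annotated (e.g. 'UNDEFINED MUTATION') but no mutated
        # sequence is stored, so the true sequence cannot be recovered.
        return None
    # WT: only the component sequence is present.
    return wt_sequence

print(select_training_sequence("MKLTA", None, None))
print(select_training_sequence("MKLTA", "L3R", "MKRTA"))
print(select_training_sequence("MKLTA", "UNDEFINED MUTATION", None))
```

Returning None for the unresolved cases keeps deletion or insertion mutants out of the training set instead of silently assigning them the wt sequence.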