Explore kinase mutation annotation in ChEMBL
schallerdavid opened this issue · 1 comments
We need to check how well ChEMBL annotates mutations. Mutant information can be found in the columns Assay Variant Mutation and Assay Variant Accession.
Example data point:
With respect to this data point (CHEMBL3354189), wild-type (wt) target sequences are stored, according to the ChEMBL schema, in component_sequences.sequence; i.e., no matter what is stored in variant_sequences.mutation, the wt sequence is repeated in component_sequences.sequence.
The mutated sequence is found in variant_sequences.sequence. However, if the assayed target is the wt protein, or if the mutation is of any type other than a substitution (e.g. a deletion), this field is left blank.
The field variant_sequences.mutation seems consistent with variant_sequences.sequence, as there only seems to be a mutation recorded in the former if it is accompanied by a sequence in the latter. When the assayed target is the wt protein, whose sequence is in component_sequences.sequence, this field, like variant_sequences.sequence, is generally left blank.
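To eyeball the three fields side by side, something like the following could be used. It is only a sketch against a toy in-memory SQLite database: the join keys (assays.variant_id, assays.tid, target_components) are my assumption about how these tables link and should be checked against the actual ChEMBL release, and the sequences are made up.

```python
import sqlite3

# Toy stand-in for the relevant ChEMBL tables. The real tables have many more
# columns, and the join keys used here are assumptions to verify.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE component_sequences (component_id INTEGER PRIMARY KEY, sequence TEXT);
CREATE TABLE target_components (tid INTEGER, component_id INTEGER);
CREATE TABLE variant_sequences (variant_id INTEGER PRIMARY KEY, mutation TEXT, sequence TEXT);
CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER, variant_id INTEGER, description TEXT);

-- Made-up data: one target, assayed as wt, as a substitution mutant,
-- and as a mutant without a stored sequence.
INSERT INTO component_sequences VALUES (1, 'MKTLLV');
INSERT INTO target_components VALUES (10, 1);
INSERT INTO variant_sequences VALUES (100, 'L4R', 'MKTRLV'), (101, 'UNDEFINED MUTATION', NULL);
INSERT INTO assays VALUES (1, 10, NULL, 'wt assay'), (2, 10, 100, 'L4R mutant'), (3, 10, 101, 'deletion mutant');
""")

QUERY = """
SELECT a.assay_id,
       vs.mutation,
       vs.sequence AS variant_sequence,
       cs.sequence AS wt_sequence
FROM assays a
JOIN target_components tc ON tc.tid = a.tid
JOIN component_sequences cs ON cs.component_id = tc.component_id
LEFT JOIN variant_sequences vs ON vs.variant_id = a.variant_id
ORDER BY a.assay_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN keeps assays without a variant record, so the wt sequence shows up on every row while the two variant columns are empty for the wt assay.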
CAVEATS:
CHEMBL3354189 was tested against a double mutant (L858R,T790M) of EGFR, as recorded in variant_sequences.mutation. However, both substitutions sit at different positions in the mutated sequence in variant_sequences.sequence (L865R,T798M), shifted by roughly 8 residues in each case. So while this field may reliably confirm that mutations are present, it may not reliably indicate where those mutations occur in all cases.
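One way to quantify this kind of shift would be to diff the wt and mutated sequences and compare the observed positions with the labelled ones. The helpers below are hypothetical names of mine, assume labels of the form L858R,T790M, and only handle equal-length sequences (i.e. pure substitutions); the example sequences are invented.

```python
import re

SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitutions(label):
    """Parse a label such as 'L858R,T790M' into (wt, position, mutant) tuples."""
    parsed = []
    for token in label.split(","):
        match = SUBSTITUTION.match(token.strip())
        if match is None:
            raise ValueError(f"not a substitution: {token!r}")
        parsed.append((match.group(1), int(match.group(2)), match.group(3)))
    return parsed


def observed_substitutions(wt_seq, variant_seq):
    """List (wt, 1-based position, mutant) wherever two equal-length sequences differ."""
    if len(wt_seq) != len(variant_seq):
        raise ValueError("sequences differ in length; not a pure substitution")
    return [
        (w, i, v)
        for i, (w, v) in enumerate(zip(wt_seq, variant_seq), start=1)
        if w != v
    ]


def position_offsets(label, wt_seq, variant_seq):
    """Map each labelled substitution to (observed - labelled) position.

    The value is None when no unique difference with the same residue pair
    is found in the sequences.
    """
    observed = observed_substitutions(wt_seq, variant_seq)
    offsets = {}
    for wt_res, pos, mut_res in parse_substitutions(label):
        hits = [i for w, i, v in observed if (w, v) == (wt_res, mut_res)]
        key = f"{wt_res}{pos}{mut_res}"
        offsets[key] = hits[0] - pos if len(hits) == 1 else None
    return offsets


# Invented example: the label numbering lags the sequence by two residues.
print(position_offsets("L3R,T7M", "MAAALAAATA", "MAAARAAAMA"))
```

A constant non-zero offset across all labelled substitutions would point to a numbering difference rather than a wrong annotation; differing or missing offsets would need a closer look.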
With respect to deletion/addition mutations: for CHEMBL3354189, there is a value of UNDEFINED MUTATION in variant_sequences.mutation, and assays.description contains information confirming that it is a deletion mutation. This is not always the case for other data points, so there appears to be no reliable correspondence between assays.description and variant_sequences.mutation that would fully delineate deletion mutations.
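To at least separate the cases that can be told apart from variant_sequences.mutation alone, the label could be bucketed roughly as below. This is a sketch under assumptions: the function name and category names are mine, substitution labels are assumed to look like L858R,T790M, and anything unrecognised falls into "other" rather than being guessed at.

```python
import re

# One or more comma-separated substitutions such as 'L858R' or 'L858R,T790M'.
_SUBSTITUTION_LIST = re.compile(r"^[A-Z]\d+[A-Z](,\s*[A-Z]\d+[A-Z])*$")


def classify_mutation_label(label):
    """Bucket a variant_sequences.mutation value into a coarse category."""
    if label is None or not label.strip():
        return "none"          # no mutation recorded
    text = label.strip()
    if text.upper() == "UNDEFINED MUTATION":
        return "undefined"     # mutant, but type and position are unknown
    if _SUBSTITUTION_LIST.match(text):
        return "substitution"
    return "other"             # unrecognised format; needs manual inspection


for value in ("L858R,T790M", "UNDEFINED MUTATION", None):
    print(value, "->", classify_mutation_label(value))
```

Records in the "undefined" bucket are the ones where assays.description would have to be consulted, with the reliability caveat above.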
So, to put this in pseudocode, when retrieving sequence data for model training:
if protein sequence == WT:
    select component_sequences.sequence
elif protein sequence == SUBSTITUTION_MUTATION:
    select variant_sequences.sequence
Here, WT is inferred when there is a record in component_sequences.sequence but none in variant_sequences.sequence, and SUBSTITUTION_MUTATION when there is a record in both component_sequences.sequence and variant_sequences.sequence, as it seems ChEMBL only allows a mutated sequence to be entered once a wt sequence has been entered. Other types of mutation could, it seems, only be determined unreliably from, say, assays.description.
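The selection rule above could look roughly like this in Python. It is a sketch, not a tested pipeline: the function name and the returned category strings are hypothetical, and the extra "unresolved_mutation" branch is my addition so that records with a mutation label but no mutated sequence (e.g. UNDEFINED MUTATION) are not silently treated as wt.

```python
from typing import Optional, Tuple


def select_training_sequence(
    wt_sequence: Optional[str],
    variant_sequence: Optional[str],
    mutation: Optional[str],
) -> Tuple[Optional[str], str]:
    """Pick the sequence to train on, plus a category describing the choice.

    wt_sequence      -> component_sequences.sequence
    variant_sequence -> variant_sequences.sequence
    mutation         -> variant_sequences.mutation
    """
    has_variant = bool(variant_sequence and variant_sequence.strip())
    has_label = bool(mutation and mutation.strip())

    if not wt_sequence:
        return None, "missing_wt"
    if has_variant:
        # Both sequences present: treated as a substitution mutant.
        return variant_sequence, "substitution_mutation"
    if has_label:
        # Mutant without a stored sequence: do not fall back to wt.
        return None, "unresolved_mutation"
    return wt_sequence, "wt"


print(select_training_sequence("MKTLLV", None, None))
print(select_training_sequence("MKTLLV", "MKTRLV", "L4R"))
print(select_training_sequence("MKTLLV", None, "UNDEFINED MUTATION"))
```

Returning the category alongside the sequence makes it easy to count how many data points end up unresolved before deciding whether to drop them.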