hkmztrk/DeepDTA

Regarding data leakage caused by amino acid sequences in the data set

Issue closed · 1 comment

In the data/davis/Proteins.txt file, several mutants of the EGFR gene (such as G719C, G719S, L747E749del) appear to have the same amino acid sequence. This raises a concern for me: why are these mutations not reflected in the amino acid sequences? Furthermore, these samples all have the same affinity in the dataset. When I randomly hold out 20% of the data as the test set, about 25% of the test samples have already appeared in the training data. I am worried that these identical sequences create a risk of data leakage for the model.
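A minimal sketch of how the overlap could be checked and avoided, under some assumptions: the protein entries are taken to be available as a plain name-to-sequence mapping (how the file is actually loaded is not shown here), the function names `group_by_sequence`, `leaked_fraction`, and `split_by_sequence` are hypothetical helpers and not part of DeepDTA, and the sequences and ligand ID below are toy placeholders, not real EGFR data.

```python
import random
from collections import defaultdict

def group_by_sequence(proteins):
    """Map each distinct sequence to the list of protein names that share it."""
    groups = defaultdict(list)
    for name, seq in proteins.items():
        groups[seq].append(name)
    return dict(groups)

def leaked_fraction(train_pairs, test_pairs):
    """Fraction of test (sequence, ligand) pairs that also occur in training."""
    seen = set(train_pairs)
    if not test_pairs:
        return 0.0
    return sum(pair in seen for pair in test_pairs) / len(test_pairs)

def split_by_sequence(proteins, test_fraction=0.2, seed=0):
    """Assign whole sequence groups to train or test so no sequence spans both."""
    groups = list(group_by_sequence(proteins).values())
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test = [name for group in groups[:n_test] for name in group]
    train = [name for group in groups[n_test:] for name in group]
    return train, test

# Toy placeholder data (assumption): three EGFR entries sharing one sequence.
proteins = {
    "EGFR": "MRPS",
    "EGFR(G719C)": "MRPS",
    "EGFR(G719S)": "MRPS",
    "ABL1": "MLEI",
    "AAK1": "MKKF",
}

# Names that share a sequence with at least one other entry.
duplicates = {s: n for s, n in group_by_sequence(proteins).items() if len(n) > 1}

# Group-wise split: every copy of a sequence lands on the same side.
train_names, test_names = split_by_sequence(proteins)
```

Splitting by sequence group holds out whole proteins, which is stricter than a random split over drug-target pairs. If that is too strict for the intended evaluation, an alternative is to deduplicate identical (sequence, ligand) pairs before splitting randomly.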

Hello, I see that this issue was marked as "completed", but there is no reply. What was the response?
Thank you