BenevolentAI/MolBERT

Dataset size and creation


Hi, first of all, congrats on your article and the NeurIPS workshop.

I have a few questions:

  1. Regarding fine-tuning: do you update the pre-trained encoder, or do you freeze it?
  2. You say that any molecule with an ECFP4 similarity higher than 0.323 to 10 drugs was discarded. I assume this was done for generalisation. However, what type of similarity did you use (Tanimoto, Dice, etc.), and why 0.323? Also, have you performed any clustering based on similarity for the final dataset to ensure that the parsed chemical space is balanced?

Hey, thanks!

  1. Where we note that we are doing fine-tuning, we fine-tune the whole pre-trained encoder without freezing any layers (see the short sketch after this list).
  2. I think there is some confusion here: 0.323 is the performance of the best model and does not denote similarity, so I'm not sure I understand the question. Feel free to clarify! Thanks
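
For illustration, here is a minimal PyTorch sketch of that distinction (toy module names, not the actual MolBERT classes):

```python
import torch
import torch.nn as nn

# Toy stand-ins: `encoder` plays the role of the pre-trained MolBERT
# encoder and `head` a downstream prediction head.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(64, 1)

# Full fine-tuning, as in the paper: nothing is frozen, so the optimizer
# updates the encoder jointly with the task head.
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=3e-5
)

# Freezing the encoder (which we do NOT do) would instead look like:
# for p in encoder.parameters():
#     p.requires_grad = False
# optimizer = torch.optim.Adam(head.parameters(), lr=3e-5)
```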

Hi, thank you for your fast answer. Sorry for the confusion; I will try to explain what I mean.
As input for your model you used the dataset published here:
"To generate the final dataset for the benchmarks, ChEMBL is post-processed by

  1. removal of salts.
  2. charge neutralization.
  3. removal of molecules with SMILES strings longer than 100 characters.
  4. removal of molecules containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I.
  5. removal of molecules with a larger ECFP4 similarity than 0.323 compared to a holdout set consisting of 10 marketed drugs (celecoxib, aripiprazole, cobimetinib, osimertinib, troglitazone, ranolazine, thiothixene, albuterol, fexofenadine, mestranol). This allows us to define similarity benchmarks for targets that are not part of the training set."

My question was referring to list item number 5 in the Data Set Generation. I assumed that, for each molecule in your dataset, you computed the similarity to those 10 drugs, and if the similarity was higher than 0.323 the molecule was discarded. I was curious how you selected this cutoff and what type of similarity was used.
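
Concretely, this is the procedure I had in mind (a rough RDKit sketch; the Tanimoto metric and the 2048-bit Morgan/ECFP4 settings are my assumptions, and the SMILES list is just a placeholder for the 10 drugs):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Placeholder: the real holdout set would contain the 10 marketed drugs.
holdout_smiles = ["CC(=O)Oc1ccccc1C(=O)O"]
holdout_fps = [
    AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius=2, nBits=2048)
    for s in holdout_smiles
]

def keep(smiles: str, threshold: float = 0.323) -> bool:
    """Keep a molecule only if it is not too similar to any holdout drug."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    # ECFP4 corresponds to a Morgan fingerprint with radius 2.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    return all(DataStructs.TanimotoSimilarity(fp, h) <= threshold for h in holdout_fps)
```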

As a follow-up question: for your pre-trained model you set max_seq_length (SMILES length) to 128, but in some tests you set it to 512. If I want to use your pre-trained model (max_seq_length = 128) to embed SMILES longer than 128 characters, can I simply change the max_seq_length argument, or would that embedding be incorrect?

Hi @LivC182, thanks for your interest in our work. Regarding threshold selection for the similarity filtering in the GuacaMol training dataset, I can point you to reference (86) in the GuacaMol paper, which is this blog post: http://rdkit.blogspot.com/2013/10/fingerprint-thresholds.html. I believe Tanimoto similarity was used (admittedly, the relevant figure from the blog seems to have been transcribed as 0.323 instead of 0.321). This is in line with other suggested Tanimoto thresholds for ECFP4 fingerprints (e.g. here).

If you have any follow-up questions regarding the training dataset, it might be worth asking in the GuacaMol repo (apologies for the slow response here).

Regarding max_seq_length: we use relative positional encodings as described in Transformer-XL, which allows MolBERT to process sequences of arbitrary length at inference time despite being trained with a fixed sequence length. The caveat is that MolBERT has not been trained on longer SMILES examples, so we cannot guarantee that the model generalises to longer SMILES; this would require further investigation. I would also be interested in your experience if you do try it out.
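
To make the intuition concrete, here is a minimal sketch of the idea (not the exact MolBERT implementation): the sinusoidal relative encodings are a deterministic function of token distance rather than a learned table of absolute positions, so they can be generated on the fly for any sequence length.

```python
import torch

def relative_position_encodings(seq_len: int, d_model: int) -> torch.Tensor:
    """Transformer-XL style sinusoidal encodings over relative distances."""
    # Relative distances, from (seq_len - 1) down to 0.
    positions = torch.arange(seq_len - 1, -1, -1.0)
    inv_freq = 1.0 / (10000 ** (torch.arange(0.0, d_model, 2.0) / d_model))
    angles = positions[:, None] * inv_freq[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)  # (seq_len, d_model)

# Trained with max_seq_length=128, yet encodings for a 300-token SMILES
# are still well-defined at inference time:
print(relative_position_encodings(300, 768).shape)  # torch.Size([300, 768])
```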