mol2vec
h4RkhW8t53e opened this issue · 6 comments
Hi,
Thank you for your effort and for making your code available!
I was wondering if it would be possible to add mol2vec embeddings (https://github.com/samoturk/mol2vec)?
Best regards
Thank you, @h4RkhW8t53e! That's a great suggestion! Any other open models you'd recommend looking into?
Thank you for your quick reply! Listed below are models that also look interesting, but I have not had a chance to test them:
https://github.com/HIPS/neural-fingerprint
https://github.com/Laboratoire-de-Chemoinformatique/VQGAE
I tried mol2vec with faiss, but would like to retrain it with a lower latent dimension (the provided model uses 300, see
https://github.com/samoturk/mol2vec/tree/master/examples/models).
It would be very useful to have an option in usearch_molecules for fingerprints with float/int components, so that NN-derived fingerprints can be used.
Many thanks.
Thanks for the references @h4RkhW8t53e! I have a couple of big ongoing commitments right now, but would love to get back to this a bit later. In the meantime, please share any new findings and models!
On a related note, are you familiar with any high-quality human-curated datasets of "similar molecules"? Even small ones would help. I am looking for some "ground truth" to evaluate search quality using different fingerprints/embeddings.
Thank you, please let me know of any updates.
Regarding datasets of similar molecules, ChEMBL ( https://www.ebi.ac.uk/chembl/ ) is a manually curated database of bioactive molecules with structure-activity relationship (SAR) data for different targets. Molecules within a target-related set tend to be similar, so those sets could serve as a rough ground truth. Hope this is useful to you.
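To make the idea of target-related sets as "ground truth" concrete, here is a minimal sketch of one possible metric: the mean fraction of each query's top-k neighbors that share at least one target with it. Everything in it is an assumption for illustration: the function name `precision_at_k`, the molecule IDs, and the target labels are hypothetical placeholders, not real ChEMBL identifiers or part of any library.

```python
from typing import Dict, List, Set


def precision_at_k(
    neighbors: Dict[str, List[str]],
    target_sets: Dict[str, Set[str]],
    k: int,
) -> float:
    """Mean fraction of top-k neighbors sharing a target with the query."""
    # Invert the mapping: molecule -> set of targets it is annotated with.
    mol_targets: Dict[str, Set[str]] = {}
    for target, mols in target_sets.items():
        for mol in mols:
            mol_targets.setdefault(mol, set()).add(target)

    scores = []
    for query, ranked in neighbors.items():
        query_targets = mol_targets.get(query, set())
        top = ranked[:k]
        if not query_targets or not top:
            continue  # skip queries without annotations or without results
        hits = sum(1 for n in top if mol_targets.get(n, set()) & query_targets)
        scores.append(hits / len(top))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical placeholder data (not real ChEMBL records).
target_sets = {
    "target_1": {"mol_a", "mol_b", "mol_c"},
    "target_2": {"mol_d", "mol_e"},
}
# Ranked neighbor lists, as any search index might return them.
neighbors = {
    "mol_a": ["mol_b", "mol_d", "mol_c"],
    "mol_d": ["mol_e", "mol_a", "mol_b"],
}

score_at_2 = precision_at_k(neighbors, target_sets, k=2)
print(score_at_2)
```

Running the same function on neighbor lists produced by different fingerprints or embeddings would give one comparable number per representation. Sharing a target is only a proxy for similarity, so the score is best read as relative, not absolute.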
A small update: fingerprints as float/int vectors in usearch_molecules seem to work okay with trivial modifications.
So far, I have only tested this on 10k PubChem compounds ...
I also found it useful to keep compound IDs (CIDs) in addition to SMILES, for keeping track of compounds.
Best
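For readers following along, the combination described above (float-valued fingerprints, keyed by compound ID, with SMILES kept alongside) can be sketched as follows. This is a brute-force stand-in in plain Python, not the usearch_molecules or faiss API; the helper names, the IDs, and the three-dimensional vectors are hypothetical placeholders, and the IDs are not real PubChem CIDs.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two float vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    embeddings: Dict[int, Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Exhaustively score every vector and return the k best (id, score) pairs."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical records: (placeholder ID, SMILES, toy embedding).
records = [
    (101, "CCO", [1.0, 0.0, 0.0]),
    (102, "CCN", [0.9, 0.1, 0.0]),
    (103, "c1ccccc1", [0.0, 0.0, 1.0]),
]

# Keep SMILES and vectors in separate maps that share the same integer key,
# so search results can be traced back to a compound.
smiles_by_cid = {cid: smiles for cid, smiles, _ in records}
embeddings = {cid: vec for cid, _, vec in records}

hits = top_k([1.0, 0.0, 0.0], embeddings, k=2)
for cid, score in hits:
    print(cid, smiles_by_cid[cid], round(score, 4))
```

Using the compound ID as the integer key means the index only needs to store vectors, while SMILES and any other metadata live in a side table. An approximate index would replace the exhaustive loop in `top_k` at larger scale.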
Hi, some experiments with mol2vec code:
https://github.com/h4RkhW8t53e/usearch-molecules/tree/mol2vec