Morgan fingerprint Collision ATtAcker (MorCAta)
This script generates fingerprint-level collisions for morgan fingerprints. The input to the script is an SDF with a certain amount of molecules in there, the output is an SDF with RDKit parsable molecules in there that have an identical fingerprint despite being chemically unrelated and nonsensical. By making use of nonsensical atoms that have a single bit fingerprint, a matching fingerprint can be built up unconnected atom per unconnected atom.
python morcata.py -i molecules.sdf -o output.sdf -l 2048 -r 2
The command used to generate a fingerprint matched molecule for every molecule in molecules.sdf
, using a morgan fingerprint of radius 2
(ECFP4-like) and length of 2048
, and save the results in the sdf file output.sdf
. You'll need to have a somewhat recent RDKit installed for this to work.
No, this method is NOT the same as inverting fingerprints to their parent structure, although if you want to do this check out my repo for "ECFPInvert" here: https://github.com/dehaenw/ECFPinvert, alternatively checkout NeuralDecipher and MolForge for deep learning based approaches to the same problem.
There will be a faster and refactored version of ECFPInvert available fairly soon.
This is a proof of concept to show a hash collision attack (treating the entire fingerprint as the hash for a molecule) is viable and fast for the frequently used morgan fingerprint. For any practical purpose you will probably not want to do it like this.
Tested on drug bank: 0.486 ms/molecule/cpu core. The code is not optimized for speed an could be significantly sped up, as it is just based on N dictionary lookups per molecule, with N the total amount of on-bits in the fingerprint.
No, although collisions are known for InChI keys, collisions are computationally more demanding to generate and rely on more brute force methods like enumerating diastereomers.