kmansouri/OPERA

Storage of PaDEL and CDK descriptors and fingerprints in an internal SQL DB for re-use

tsufz opened this issue · 4 comments

tsufz commented

Hi Kamel,
I have a suggestion for improving OPERA. Its main bottleneck is the slow calculation of PaDEL descriptors and fingerprints. Since the descriptors and fingerprints of a given structure do not change, I suggest adding an internal database that stores them together with the SMILES, InChIKey, or other identifiers characterizing the structure. For compounds already in the database, the stored values are used, and only new compounds are calculated. This is also good for the climate, as unnecessary computing is avoided.

Best
Tobias

tsufz commented

I know there is the PaDEL.csv file that stores the descriptors, but its key is the molecule ID, which may change. I suggest keying on the InChIKey or another structure-based ID instead, which OPERA can re-use without much attention or curation effort from users. This would really help!
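To make the idea concrete, here is a minimal sketch of such a local cache, assuming SQLite as the store and the InChIKey as the lookup key. Everything in it is hypothetical and not part of OPERA: the table layout, the names `open_cache` and `get_descriptors`, and the `compute` callback, which stands in for whatever actually runs the PaDEL or CDK calculation. It also assumes the caller already has an InChIKey for each structure.

```python
import json
import sqlite3
from typing import Callable, Dict, Iterable, Tuple

Descriptors = Dict[str, float]


def open_cache(path: str = "descriptor_cache.sqlite") -> sqlite3.Connection:
    """Open (or create) a local cache with one row per structure and descriptor set."""
    conn = sqlite3.connect(path)
    # 'kind' separates descriptor sets (e.g. 'padel', 'cdk', 'padel_fp');
    # 'payload' holds the descriptor values serialized as JSON.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS descriptors (
               inchikey TEXT NOT NULL,
               kind     TEXT NOT NULL,
               smiles   TEXT,
               payload  TEXT NOT NULL,
               PRIMARY KEY (inchikey, kind)
           )"""
    )
    return conn


def get_descriptors(
    conn: sqlite3.Connection,
    compounds: Iterable[Tuple[str, str]],
    kind: str,
    compute: Callable[[str], Descriptors],
) -> Dict[str, Descriptors]:
    """Return {inchikey: descriptors}, computing only structures not yet cached.

    `compounds` yields (inchikey, smiles) pairs; `compute` is a hypothetical
    stand-in for the expensive descriptor calculation.
    """
    results: Dict[str, Descriptors] = {}
    for inchikey, smiles in compounds:
        row = conn.execute(
            "SELECT payload FROM descriptors WHERE inchikey = ? AND kind = ?",
            (inchikey, kind),
        ).fetchone()
        if row is not None:
            results[inchikey] = json.loads(row[0])  # cache hit: re-use stored values
            continue
        values = compute(smiles)  # cache miss: run the expensive calculation once
        conn.execute(
            "INSERT INTO descriptors (inchikey, kind, smiles, payload) "
            "VALUES (?, ?, ?, ?)",
            (inchikey, kind, smiles, json.dumps(values)),
        )
        results[inchikey] = values
    conn.commit()
    return results
```

Calling `get_descriptors` a second time with an overlapping compound list would invoke `compute` only for the structures that are new. A real implementation would probably also store the descriptor software version alongside each row, since cached values are only re-usable as long as the calculation itself is unchanged.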

tsufz commented

In addition, where are the CDK descriptors stored, and how can I re-use the CDK descriptors and the PaDEL fingerprints? Is that also done with the -d switch?

kmansouri commented

Hi, thanks for the suggestion. However, the problem with saving the descriptors within OPERA is space. GitHub has a limit on the package size, and we're already saving the structures and their IDs. This way you can generate predictions from just a .txt input file with IDs such as CAS, DTXSID or InChIKey. Adding the descriptors won't be possible with the current GitHub size limit.
That said, we have already thought about saving as much computing time as possible. That's why we provide the option to keep the full descriptor files and re-use them to calculate other models.
This is possible for PaDEL, CDK and the fingerprints. See the help file or --h for more info.

tsufz commented

@kmansouri,
thanks for your reply. I think there was a misunderstanding. I was suggesting storage in a local database on the user's computer, not shipping descriptors for the whole CompTox / PubChem chemical space with OPERA...

It should need little user interaction and work as an enhancement of the current solution. In other words, an automated DB that stores everything and re-uses it without user interaction, based on

By the way, a full DB of CompTox / PubChem descriptors would be great... Hmm, we could start an initiative in the NORMAN Association to compute a DB of descriptors (PaDEL, CDK 2/3, and others) and provide it for download on Zenodo or elsewhere in SQL / txt formats. We could start, for example, with the Suspect List Exchange compounds. What do you think?

Best
Tobias