Restructure repo to avoid need for LFS; drop LFS

Question

Closed this issue 2 years ago · 5 comments

We'd like to drop LFS if possible to make this repo friendlier to developers. This would likely entail removing e.g. GROMACS-specific files and sticking with PDB and SDF files for system representation.

Answer 1 · 2022-05-10T16:28:20.000Z

In terms of size, these are the things we would want to keep:
data//01_protein/crd (non-solvated protein PDBs + cofactor PDBs): 8.5 MB
data//02_ligands/*/crd (SDF ligand files): 5.2 MB

Answer 2 · 2022-05-24T16:22:47.000Z

A quick question here: how would we feel about concatenating the ligand SDFs into a single file per system? It would make life a lot easier than reading a bunch of files individually, but maybe there's an alternative use case that would prefer not having everything in one file?
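
For concreteness, here is a minimal sketch of what the concatenation could look like. It assumes each per-ligand file is a plain-text SDF whose records end with a `$$$$` line; the function name `concatenate_sdfs` and its arguments are hypothetical and not part of this repo.

```python
from pathlib import Path

SDF_DELIMITER = "$$$$"  # standard SDF record terminator


def concatenate_sdfs(sdf_paths, output_path):
    """Join several SDF files into one multi-record SDF.

    Returns the number of input files that were written out.
    (Hypothetical helper, for illustration only.)
    """
    chunks = []
    for path in sdf_paths:
        # Trailing whitespace is dropped so the delimiter check is reliable.
        text = Path(path).read_text().rstrip()
        if not text:
            continue  # skip empty files rather than emit an empty record
        # Add the terminator if the file stopped at "M  END".
        if not text.endswith(SDF_DELIMITER):
            text += "\n" + SDF_DELIMITER
        chunks.append(text)
    Path(output_path).write_text("\n".join(chunks) + "\n")
    return len(chunks)
```

Only the end of each file is stripped, so the title line at the start of each record stays in place even when it is blank.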

Answer 3 · 2022-05-24T22:03:40.000Z

I'd recommend keeping them as totally separate files. This will improve interoperability and reliability.

Multi-structure SDFs are a little risky, since there's ambiguity about whether the structures are conformers of the same molecule or totally separate molecules. Different tools make different assumptions here - OpenEye has a whole thing about which behaviors are triggered in which cases and OpenFF just says "multi-conformer SDFs don't exist, they're always separate molecules". I forgot what RDKit does, but my point is that different tools will have different behaviors when confronted with a multi-molecule SDF and it will eventually affect software reliability.

Answer 4 · 2022-05-31T01:02:46.000Z

We'll handle this in #52. I've already dropped the LFS filters there in 7718225, which will keep files from being handled by LFS once we merge that PR. We'll remove the GROMACS-specific files and simplify the file structure for targets+ligands in that PR as well.

Answer 5 · 2022-09-01T19:15:45.000Z

All: Apologies for missing this conversation.

Are there any concrete examples where concatenated SDF files present significant, existential risk to real free energy pipelines? It seems like we could easily address this if needed by delivering multi-molecule SDF files while also providing a very simple script to break up ligands into different files.

This would have a huge advantage over maintaining literally thousands of individual SDF files on GitHub, since nearly all tools are happy to process these files together. Processing a bunch of separate SDF files for anything, even visualization, is just a huge pain.
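
A splitting script of the kind described here could look roughly like the sketch below. It assumes records are separated by the standard `$$$$` line and does no chemistry-aware parsing; the function name `split_sdf`, the `prefix` argument, and the numbered output file names are all hypothetical.

```python
from pathlib import Path


def split_sdf(multi_sdf_path, output_dir, prefix="ligand"):
    """Write each record of a multi-record SDF to its own file.

    Returns the list of paths written, in record order.
    (Hypothetical helper, for illustration only.)
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    record = []

    def flush():
        # Files are numbered by position because titles may be blank or repeated.
        out_path = output_dir / f"{prefix}_{len(written) + 1:04d}.sdf"
        out_path.write_text("".join(record))
        written.append(out_path)
        record.clear()

    with open(multi_sdf_path) as handle:
        for line in handle:
            record.append(line)
            if line.strip() == "$$$$":
                flush()

    # Keep a final record that lacks its terminator rather than dropping it.
    if any(line.strip() for line in record):
        if not record[-1].endswith("\n"):
            record[-1] += "\n"
        record.append("$$$$\n")
        flush()

    return written
```

Calling `split_sdf("ligands.sdf", "split_ligands")` (both names are placeholders) would write `ligand_0001.sdf`, `ligand_0002.sdf`, and so on into `split_ligands`. Whether the records are separate molecules or conformers is not decided here; each record simply becomes one file.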