Discussion: JSON format with MSA and template features troublesome in practice

Question

Closed this issue 20 days ago · 10 comments

At first, we really liked the idea of having MSA and template features together in the same JSON—it’s super convenient and allows a lot of flexibility. But in practice, it’s been a bit tricky. The format mixes input data with configuration, which can make things harder than expected. We often end up editing config files a lot—adding sequences, tweaking parameters like copies or ligands, and making all sorts of changes.

This means we often end up with multiple JSON files in the working folder, each potentially dozens of MB, which quickly eats up space. On top of that, editing these files remotely over a network is frustrating because loading and saving take ages at that file size.

Just wanted to share this in case it’s helpful. If it’s not feasible to change, that’s totally fine—we’re thinking about trying an alternative with AlphaPulldown, maybe splitting JSONs to keep per-sequence features separate from the AF3-style config. But please let me know if you’d consider such a change so we know whether to wait before implementing our own solution! 😊

Answer 1 · 2024-11-29T15:03:44.000Z

Yes, good point. While most of the JSON is human readable, human editable and -- most importantly -- small, the MSA and template fields are not.

Would it help if unpairedMsa, pairedMsa, and templates/mmcif fields allowed file paths in them and AlphaFold would read those paths and fill in the data from them when creating the folding_input.Input object?

I'd prefer not to introduce an alternative format (https://xkcd.com/927/ 🙂) and would rather work together to enable something that works for you in the standard AlphaFold 3 input format.

Answer 2 · 2024-11-29T15:16:10.000Z

File paths for the a3m and cif files would be truly awesome.

That would be so much better than having a line with 1e8 characters (my longest MSA line so far)...

Answer 3 · 2024-11-29T15:29:12.000Z

Would it be possible to keep both formats? And automatically identify whether it's a path to file or a file content?

Answer 4 · 2024-11-29T15:46:17.000Z

Would it be possible to keep both formats? And automatically identify whether it's a path to file or a file content?

Of course -- we won't regress on the existing format!

I was thinking of something like introducing optional new fields: unpairedMsaPath, pairedMsaPath and templates/mmcifPath.
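
To make the proposal concrete, here is a minimal sketch of how an existing large input JSON could be split into a small config plus side files, using the field names mentioned above. The helper name `externalize_features` and the side-file naming are hypothetical, and the key layout (`sequences`, `protein`/`rna`, `id`, `templates`) is assumed from the AlphaFold 3 input format. The version bump reflects the note later in this thread that the format version went from 1 to 2 for these keys.

```python
import json
import pathlib

# Inline key -> path key, using the field names discussed in this thread.
_MSA_KEYS = {"unpairedMsa": "unpairedMsaPath", "pairedMsa": "pairedMsaPath"}


def externalize_features(json_path, out_json_path):
    """Move inline MSA/template text into side files next to out_json_path.

    Hypothetical helper: writes the slimmed-down JSON and returns it as a dict.
    """
    json_path = pathlib.Path(json_path)
    out_json_path = pathlib.Path(out_json_path)
    out_dir = out_json_path.parent
    data = json.loads(json_path.read_text())

    for entry in data.get("sequences", []):
        for kind in ("protein", "rna"):
            chain = entry.get(kind)
            if chain is None:
                continue
            # Assumption: "id" is either one string or a list of strings.
            raw_id = chain["id"]
            chain_id = raw_id if isinstance(raw_id, str) else "".join(raw_id)

            for inline_key, path_key in _MSA_KEYS.items():
                text = chain.get(inline_key)
                if not text:
                    # Missing, null, or empty values are left untouched.
                    continue
                file_name = f"{chain_id}_{inline_key}.a3m"
                (out_dir / file_name).write_text(text)
                del chain[inline_key]
                # Stored relative to the output JSON.
                chain[path_key] = file_name

            for i, template in enumerate(chain.get("templates") or []):
                if template.get("mmcif"):
                    file_name = f"{chain_id}_template{i}.cif"
                    (out_dir / file_name).write_text(template.pop("mmcif"))
                    template["mmcifPath"] = file_name

    # Per the later note in this thread, the path keys belong to version 2.
    data["version"] = max(data.get("version", 1), 2)
    out_json_path.write_text(json.dumps(data, indent=2))
    return data
```

After running something like this, the JSON that gets edited by hand stays small, and the bulky per-sequence features live in files that are written once.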

Answer 5 · 2024-11-29T16:58:41.000Z

I was thinking of something like introducing optional new fields: unpairedMsaPath, pairedMsaPath and templates/mmcifPath.

Thanks @Augustin-Zidek! Could these optionally support compressed MSA and template files, gz and xz? :-)

Answer 6 · 2024-11-29T17:08:03.000Z

Thanks @Augustin-Zidek! Could these optionally support compressed MSA and template files, gz and xz? :-)

gzip sure, it is supported by the Python standard library. Although I wish people stopped using gz and migrated to zstd, it is strictly Pareto better! (See e.g. https://github.com/facebook/zstd/blob/dev/doc/images/CSpeed2.png and https://github.com/facebook/zstd/blob/dev/doc/images/DSpeed3.png)
xz would mean introducing a new third-party dependency (which is somewhat scary given the recent xz backdoor), so I am not very keen.
I would be happy to support zstd, since we already have it as a third-party dependency.

Answer 7 · 2024-11-29T18:15:55.000Z

Thanks a lot @Augustin-Zidek, these features would be extremely useful indeed! I think by .xz @jkosinski means lzma, which I think is part of the Python standard library. Should be safe to use?
We chose this scheme for compressing AF2 features (we generated them for all model proteomes) because it showed the best compression rate (even better than zstd), but it is also quite slow. So, it could be that zstd is the better option.

Answer 8 · 2024-11-29T18:21:06.000Z

I found that it is possible to directly read the strings contained in a .a3m and .cif file and append them to the .json in the AF3 input after some string manipulation. At least I found that to be working without any issue when I was folding the spike of HA with known MSA.

Google Colab
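
A minimal sketch of that inline approach, assuming the input has already been loaded into a Python dict. The helper name `embed_msa` is hypothetical, and the key layout (`sequences`, `protein`, `id`, `unpairedMsa`) is assumed from the AlphaFold 3 input format. Going through the `json` module means newlines and quotes in the a3m text are escaped automatically, so no manual string manipulation should be needed.

```python
import json
import pathlib


def embed_msa(input_json, chain_id, a3m_path):
    """Return a copy of the input dict with the a3m text stored inline.

    Hypothetical helper: only `unpairedMsa` is set here. Whether other
    fields (pairedMsa, templates) must be set alongside it is something
    to check against the input documentation.
    """
    data = json.loads(json.dumps(input_json))  # cheap deep copy
    msa_text = pathlib.Path(a3m_path).read_text()

    for entry in data["sequences"]:
        protein = entry.get("protein")
        if protein is None:
            continue
        # Assumption: "id" is either one string or a list of strings.
        raw_id = protein["id"]
        ids = raw_id if isinstance(raw_id, list) else [raw_id]
        if chain_id in ids:
            protein["unpairedMsa"] = msa_text
            return data

    raise KeyError(f"no protein chain with id {chain_id!r}")
```

Calling `json.dumps(embed_msa(config, "A", "chain_a.a3m"))` then yields the single-line JSON string with the MSA embedded, which is exactly the form that becomes unwieldy for large alignments.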

Answer 9 · 2024-12-03T13:18:08.000Z

This feature has been implemented in 6bba345.

Notes:

New fields unpairedMsaPath (for proteins and RNA), pairedMsaPath (for proteins), mmcifPath (for protein templates) introduced.
The AlphaFold input format version has been bumped from 1 to 2 since the input JSON can now have new keys.
Plain text or gzip/xz/zstd compression supported. The compression format is auto-detected based on the magic number in the file header (i.e. the extension doesn't matter).
Paths are either absolute, or relative to the input JSON path.
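
The last two notes can be illustrated with a short sketch. This is not the code from the commit; it only shows the general technique of sniffing the magic number and resolving a path against the input JSON location, under stated assumptions. All function names are hypothetical. gzip and xz are handled with the Python standard library; zstd is only detected here, since decompressing it needs a zstd binding that this sketch does not assume.

```python
import gzip
import lzma
import pathlib

# Well-known magic numbers at the start of each compressed format.
_GZIP_MAGIC = b"\x1f\x8b"
_XZ_MAGIC = b"\xfd\x37\x7a\x58\x5a\x00"
_ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def detect_compression(header: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    if header.startswith(_GZIP_MAGIC):
        return "gzip"
    if header.startswith(_XZ_MAGIC):
        return "xz"
    if header.startswith(_ZSTD_MAGIC):
        return "zstd"
    return "plain"


def resolve_path(path, json_path) -> pathlib.Path:
    """Absolute paths are kept; relative ones are taken from the JSON's folder."""
    path = pathlib.Path(path)
    if path.is_absolute():
        return path
    return pathlib.Path(json_path).parent / path


def read_text_auto(path, json_path) -> str:
    """Read an MSA or mmCIF file, decompressing it if needed."""
    raw = resolve_path(path, json_path).read_bytes()
    kind = detect_compression(raw[:6])
    if kind == "gzip":
        raw = gzip.decompress(raw)
    elif kind == "xz":
        raw = lzma.decompress(raw)
    elif kind == "zstd":
        # Assumption: left out on purpose; a zstd binding would go here.
        raise NotImplementedError("zstd needs a zstd binding")
    return raw.decode("utf-8")
```

Because detection looks at the file content, a gzip-compressed MSA named `chain_a.a3m` would be read the same way as one named `chain_a.a3m.gz`.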

Feel free to re-open if you hit any issues with this.

Thanks again for suggesting this feature. Happy folding!

Answer 10 · 2024-12-06T14:53:25.000Z

A further improvement landed in b1e3a8a -- the sequences are now correctly deduplicated when the output JSON is produced. This saves a lot of space for complexes with a large number of homomeric chains.
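
As an illustration of why deduplication helps for homomers, here is a minimal sketch that merges chains identical in everything except their ID into a single entry with a list of IDs. It is not the logic from the commit. The helper name `dedupe_chains` is hypothetical, and it assumes the input format accepts `id` as a list of strings for identical copies.

```python
import json


def dedupe_chains(sequences):
    """Merge entries that differ only in "id" into one entry with a list of ids.

    Hypothetical helper: first-seen order of the distinct chains is kept.
    """
    merged = {}
    for entry in sequences:
        # Each entry is assumed to have exactly one key, e.g. "protein".
        kind, chain = next(iter(entry.items()))
        raw_id = chain["id"]
        ids = raw_id if isinstance(raw_id, list) else [raw_id]
        rest = {k: v for k, v in chain.items() if k != "id"}
        # Everything except the id (sequence, MSA, templates) forms the key.
        key = (kind, json.dumps(rest, sort_keys=True))
        if key in merged:
            merged[key][kind]["id"].extend(ids)
        else:
            merged[key] = {kind: {"id": list(ids), **rest}}
    return list(merged.values())
```

With a 12-mer of one protein, the MSA would then be stored once instead of twelve times.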