Merck/deepbgc

Retrain on MIBiG 2.0

Opened this issue · 11 comments

Our current model is trained on MIBiG 1.4. We can retrain (and validate) the model on MIBiG 2.0.

We can also demonstrate how our current model performs on the newly added BGCs.

When you do this, would you mind providing a walkthrough? I'm finding the current documentation on training a bit confusing with regard to preparing the TSVs and JSONs and choosing a model.

(I ask because I'd like to experiment with modifying the MIBiG or other BGC data to eliminate regulatory/transport/integration Pfams to make a biosynthesis-focused training set.)
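As a rough illustration of that idea, here is a minimal sketch that drops rows for unwanted Pfams from a Pfam TSV before training. It assumes the TSV has one row per Pfam domain hit and a `pfam_id` column; the column names, the `drop_pfams` helper, and the exclusion list are all hypothetical choices for this example, and the IDs should be checked against Pfam (and against any version suffix in your files) before use.

```python
import csv
import io

# Illustrative exclusion list (assumption): replace with the regulatory,
# transport, and integration Pfams you actually want to remove.
EXCLUDED_PFAMS = {"PF00005", "PF00440", "PF00589"}


def drop_pfams(tsv_text, excluded=EXCLUDED_PFAMS, column="pfam_id"):
    """Return TSV text without the rows whose Pfam ID is in `excluded`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if row[column] not in excluded:
            writer.writerow(row)
    return out.getvalue()


# Tiny made-up example; the column layout is an assumption.
example = (
    "sequence_id\tprotein_id\tpfam_id\tin_cluster\n"
    "BGC0000001\tp1\tPF00109\t1\n"
    "BGC0000001\tp2\tPF00005\t1\n"
    "BGC0000001\tp3\tPF00668\t1\n"
)
filtered = drop_pfams(example)
print(filtered)
```

The same filter would need to be applied to both the positive and the negative training TSVs, and to any sequences the retrained model is later run on, so that the model sees the same kind of input at training and prediction time.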

That's a good point, and I will keep it in mind. Perhaps I could publish a short Medium article with a walkthrough. I think the biggest challenge here is finding a way to validate the results: if I train the model on the full MIBiG 2.0, there's nothing left to validate it on :)
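One generic way to keep something aside for validation, not something proposed in this thread, is to split the BGC accessions into folds and hold one fold out per training run. The sketch below only produces the splits; the `kfold_by_cluster` name and the accession format are assumptions, and it does not address the harder problem of similar BGCs landing in both the training and held-out sets.

```python
import random


def kfold_by_cluster(cluster_ids, k=5, seed=0):
    """Yield (train_ids, held_out_ids) pairs, splitting by whole BGC.

    Splitting by BGC accession (rather than by row) keeps all domains of
    one cluster on the same side of the split.
    """
    ids = sorted(set(cluster_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for fold in folds:
        held_out = set(fold)
        train = [c for c in ids if c not in held_out]
        yield train, sorted(held_out)


# Made-up accessions for illustration only.
ids = ["BGC%07d" % i for i in range(1, 11)]
for train, held_out in kfold_by_cluster(ids, k=5):
    print(len(train), "train /", len(held_out), "held out:", held_out)
```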

I also heard that Pfam 33.0 should be coming any day now, so it would be great to retrain a new pfam2vec version as well.

It might also be interesting to create a reduced Pfam database containing only the Pfams relevant for our model, which might also significantly reduce the hmmscan run time. Keep in mind, though, that the "non-BGC" Pfams are also important for the model to be able to distinguish positives from negatives. But if you just plan on using a stricter definition of what a BGC region is, you can definitely do that just by retraining on whatever regions you consider "positive" (they don't even need to be BGCs; they can be any kind of region of interest).
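To make the reduced-database idea concrete, here is a small sketch that collects the Pfam IDs seen in the training data, counting positives and negatives together so that the "non-BGC" Pfams are kept. The `(pfam_id, in_cluster)` input shape, the `pfam_vocabulary` name, and the `min_count` threshold are assumptions for this example; the resulting ID list could then be used to subset the Pfam HMM file with a tool such as HMMER's `hmmfetch`.

```python
from collections import Counter


def pfam_vocabulary(rows, min_count=1):
    """Collect Pfam IDs to keep in a reduced database.

    rows: iterable of (pfam_id, in_cluster) pairs, where in_cluster is
    truthy for domains inside a positive region.
    Returns (keep, negative_only): all IDs seen at least `min_count`
    times, and the subset that never occurs in a positive region.
    """
    pos, neg = Counter(), Counter()
    for pfam_id, in_cluster in rows:
        (pos if in_cluster else neg)[pfam_id] += 1
    total = pos + neg
    keep = {p for p, n in total.items() if n >= min_count}
    negative_only = {p for p in keep if p not in pos}
    return keep, negative_only


# Made-up rows for illustration only.
rows = [
    ("PF00109", 1), ("PF00668", 1), ("PF00005", 1),
    ("PF00005", 0), ("PF00072", 0), ("PF00072", 0),
]
keep, negative_only = pfam_vocabulary(rows)
print(sorted(keep))
print(sorted(negative_only))
```

Dropping the `negative_only` set would shrink the database further, but it would also remove exactly the signal the model uses to recognize negatives, so it is reported separately rather than discarded.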

@danudwary Also please let me know if there's anything specific that you want me to elaborate on in the training documentation.

A Medium post would be fantastic. If you need a tester or proof-reader, I'm happy to volunteer.

If you need something to validate with, I have well over 100,000 unpublished (manuscript in review) BGCs derived from bacterial and archaeal MAGs, as predicted by antiSMASH. Well over 95% are not directly homologous to anything in MIBiG. They aren't experimentally validated, granted, so it depends on how strict you'd want the validation criteria to be, I suppose.

Hi, can you please tell me where to download the MIBiG database? What is the exact location?

Hi, has the current version (0.1.29) been updated to MIBiG 2.0? I cannot find the MIBiG information for my DeepBGC installation.

Hi @ZhangDengwei, DeepBGC has not been retrained to include MIBiG 2.0 yet. The tricky part is not training a new model but validating that it works, using an independent set of sequences. We might be able to find time for it in the future, but there's no timeline yet. In case you want to get involved, please let me know.

One positive thing is that, based on some preliminary tests we made, the current DeepBGC version seems to detect the new MIBiG 2.0 BGCs very well, even though they were not part of the training set.
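For anyone who wants to run a similar check on their own data, here is a minimal sketch of one way to score it: count a known BGC as detected when a single predicted region on the same sequence covers at least a chosen fraction of it. This is not the test procedure used above; the `detection_recall` name, the `(sequence_id, start, end)` tuples, and the 50% default threshold are assumptions for illustration.

```python
def detection_recall(known, predicted, min_coverage=0.5):
    """Fraction of known regions covered by a predicted region.

    known, predicted: lists of (sequence_id, start, end) with end > start.
    A known region counts as detected if one predicted region on the same
    sequence overlaps at least `min_coverage` of its length.
    """
    detected = 0
    for seq_id, start, end in known:
        length = end - start
        # Largest overlap (in bp) with any prediction on the same sequence.
        best = max(
            (
                max(0, min(end, p_end) - max(start, p_start))
                for p_seq, p_start, p_end in predicted
                if p_seq == seq_id
            ),
            default=0,
        )
        if length > 0 and best / length >= min_coverage:
            detected += 1
    return detected / len(known) if known else 0.0


# Made-up coordinates for illustration only.
known = [("c1", 100, 200), ("c1", 500, 600), ("c2", 0, 100)]
predicted = [("c1", 120, 220), ("c2", 90, 300)]
print(detection_recall(known, predicted))
```

Recall alone says nothing about false positives, so a fuller evaluation would also need sequences where no BGC is expected.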

Just so you know: the MIBiG devs recently held an "annotatathon" for MIBiG 3.0. Several hundred new BGCs are being added. The data isn't released yet, but the sequences have been gathered. You might reach out to Marnix Medema and at least ask for a timeline, if not early access to the new BGCs for training and validation. I think the community is very interested in seeing DeepBGC improve, so I'm sure they'd be willing to help!