songlab-cal/tape

Creating new embedding with a portion of data

SukruHan opened this issue · 2 comments

Hi,
First of all, thanks for this great effort!
I need to create BERT models with new configurations and train them from scratch on a reduced amount of data and, if possible, a reduced number of protein types.
What kind of path should I follow?

My initial problem is that I couldn't find a way to train a BERT model from scratch to create new embeddings. (I am not talking about fine-tuning the parameters.)
I have checked issue #89 and learned about model modification there.
But I think I need more guidance on this.

Thanks, Best Regards

rmrao commented

Since this question gets asked a lot, and since TAPE's training machinery is a bit old and not really maintained, I went ahead and wrote a tutorial on how to train a language model using fairseq, which is Facebook AI's sequence modeling framework. It's very simple, and all you need is a FASTA file.

Here's the Colab with the tutorial. It's meant to get you started, although you can actually train the model in the Colab just fine.
https://colab.research.google.com/drive/1JrKtL7bHTSyYYRvqfQhezy0qqiZkuNEb?usp=sharing
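
Since the only input is a FASTA file, training on a reduced amount of data comes down to writing a smaller FASTA file first. Below is a minimal sketch of one way to do that with plain Python. It is not taken from the tutorial: the function names `read_fasta` and `subsample_fasta`, the file names in the usage note, and the choice of uniform random sampling are all assumptions made for illustration.

```python
import random
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file (hypothetical helper)."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample_fasta(in_path: str, out_path: str, fraction: float, seed: int = 0) -> int:
    """Write a uniform random subset of records to out_path; return how many were kept.

    Assumption: the whole file fits in memory. Uniform sampling is only one
    possible choice; filtering by protein family would need header parsing
    that depends on the data source.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    records = list(read_fasta(in_path))
    n_keep = max(1, int(len(records) * fraction)) if records else 0
    kept = rng.sample(records, n_keep)
    with open(out_path, "w") as out:
        for header, seq in kept:
            out.write(f">{header}\n{seq}\n")
    return n_keep
```

For example, `subsample_fasta("all_proteins.fasta", "subset.fasta", 0.1)` would keep roughly 10% of the records (both file names are placeholders), and the resulting file could then be used wherever the full FASTA file would have been.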

Hope this helps!

Hi,
Thanks for the quick answer and guidance.
Best Regards,