mheinzinger/SeqVec

Question regarding training the ELMo model

Closed this issue · 9 comments

What is the format of the data while training the protein embedding model (ELMo)? It would be helpful if you can share a snapshot of that.

Training ELMo on protein sequences uses the same input format as in NLP. There, the training corpus usually holds one sentence per line, with words separated by white-space. In our case, we treated every protein sequence as a single sentence and each amino acid as a word. An example would look like this:
P R O T E I N
S E Q W E N C E

@mheinzinger I would love to reproduce the trained model. Can you please share the code for training SeqVec? Thanks!

We used the official TensorFlow implementation of ELMo and changed the input to protein sequences:
https://github.com/allenai/bilm-tf
All you have to do is create an input corpus of protein sequences (one sequence per line, amino acids separated by white-space).
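For readers starting from FASTA files, the corpus format described above can be produced with a few lines of Python. This is only a minimal sketch and not the script used for SeqVec: the helper names (`read_fasta`, `to_corpus_line`, `fasta_to_corpus`) are hypothetical, and upper-casing the residues is an assumption rather than something stated in this thread.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def to_corpus_line(sequence: str) -> str:
    """One protein = one 'sentence'; each amino acid = one 'word'."""
    # Upper-casing is an assumption made for this sketch.
    return " ".join(sequence.upper())


def fasta_to_corpus(lines: Iterable[str]) -> List[str]:
    """Convert FASTA lines into white-space separated corpus lines."""
    return [to_corpus_line(seq) for _, seq in read_fasta(lines)]


# Toy input mirroring the example from this thread.
example = [">seq1", "PROT", "EIN", ">seq2", "SEQWENCE"]
corpus = fasta_to_corpus(example)
print("\n".join(corpus))
```

Each element of `corpus` is one line of the training file, so writing the list out joined by newlines gives a file in the format shown earlier in this thread.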

@mheinzinger Thanks for your clarification about the input corpus (one sequence per line)! I noticed that pre-training ELMo on protein sequences requires me to prepare multiple .txt input files in a training folder. Just to confirm: how many sequences (i.e. how many lines) did you include in a single .txt input file?

Yes, depending on your hardware setup, it might be beneficial to have multiple splits (especially in a multi-GPU setting; this was important when we trained ProtTrans). However, I do not want to tell you something wrong, and unfortunately I do not remember exactly how many files we created for SeqVec/ELMo. I should also say that most other parameters (learning rate, batch size, num_layers, n_hidden, corpus etc.) will have much more impact on model performance. In fact, I think splitting the training data into multiple chunks mostly affects efficiency, and the choice should not affect your final model performance. So depending on your setup you might squeeze out a few percent more throughput by tuning this parameter, but I doubt that it will change the quality of the resulting model.
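Splitting a corpus into several files, as discussed above, could be sketched as follows. The number of sequences per file used for SeqVec is not known from this thread, so `lines_per_shard` is a free parameter here; the function name `write_shards` and the `shard_00000.txt` naming scheme are hypothetical choices for illustration only.

```python
from pathlib import Path
from typing import List, Sequence


def write_shards(corpus_lines: Sequence[str], out_dir: str,
                 lines_per_shard: int) -> List[Path]:
    """Write corpus lines into consecutive .txt files of a fixed size."""
    if lines_per_shard < 1:
        raise ValueError("lines_per_shard must be >= 1")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for start in range(0, len(corpus_lines), lines_per_shard):
        chunk = corpus_lines[start:start + lines_per_shard]
        # Hypothetical naming scheme: shard_00000.txt, shard_00001.txt, ...
        path = out / f"shard_{len(paths):05d}.txt"
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_shards(lines, "train", 100000)` would place files of at most 100,000 sequences each into a `train` folder. The value 100,000 is arbitrary; as noted above, the shard size is expected to influence throughput rather than the quality of the trained model.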

Thanks for your quick response! It makes sense to me now

@mheinzinger Since you mentioned your 'ProtTrans', very impressive work btw, love the idea!!

So I trained a small BERT model on protein sequences using Google's official implementation (https://github.com/google-research/bert), and I'd love to fine-tune it starting from your pre-trained weights to see what happens. However, I can only find your PyTorch model (.bin) on Hugging Face. Could you kindly share the TensorFlow checkpoints for your pre-trained models?

Thanks in advance for your help!

I recovered the TensorFlow checkpoints for ProtBERT-UniRef100 (not ProtBERT-BFD, sorry; I could not find the corresponding files for BFD). You can download the ProtBERT-UniRef100 TF checkpoints here: https://rostlab.org/~deepppi/protbert_u100.tar.gz

I am not able to download the models from this link. Could you please let me know where I can get them?

The ELMo model trained on UniRef50 (=SeqVec) is available at: SeqVec-model

The checkpoint for the pre-trained model is available at: SeqVec-checkpoint