PaccMann/paccmann_gp

Protein sequence is passed inside a list to the language model, resulting in empty tokenization

YoelShoshan opened this issue · 2 comments

In https://github.com/PaccMann/paccmann_gp/blob/main/paccmann_gp/affinity_minimization.py#L50

[self.protein] (a one-element list) is passed to the language model's sequence_to_token_indexes(), which expects a string.
As a result, the entire protein sequence is treated as a single token to encode, so the tokenization comes out essentially empty for every protein.
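
To illustrate the failure mode, here is a minimal sketch using a toy tokenizer. The function `toy_sequence_to_token_indexes` and the vocabulary `TOY_VOCAB` are hypothetical stand-ins, not pytoda's actual implementation; the sketch assumes the tokenizer iterates over its input and skips elements it does not find in the vocabulary.

```python
from typing import Dict, List, Sequence

# Hypothetical toy vocabulary of amino-acid letters (illustration only).
TOY_VOCAB: Dict[str, int] = {"A": 1, "C": 2, "D": 3, "E": 4}


def toy_sequence_to_token_indexes(sequence: Sequence[str]) -> List[int]:
    """Toy stand-in for a tokenizer: iterate over the input and map each
    element to its index, skipping anything not in the vocabulary."""
    return [TOY_VOCAB[token] for token in sequence if token in TOY_VOCAB]


protein = "ACDE"

# Passing the string iterates over its characters.
from_string = toy_sequence_to_token_indexes(protein)

# Passing a one-element list iterates over a single element (the whole
# sequence), which is not in the vocabulary.
from_list = toy_sequence_to_token_indexes([protein])

print(from_string)  # [1, 2, 3, 4]
print(from_list)    # []
```

Under these assumptions, the string input yields one index per residue, while the list input yields nothing because the full sequence is looked up as one unknown token.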

Fix suggestions:

  1. Pass self.protein instead of [self.protein].
  2. A stricter type check in the pytoda library might be useful.
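
A stricter type check along the lines of the second suggestion could look roughly like the sketch below. The function name `strict_sequence_to_token_indexes` and its `vocab` parameter are hypothetical and only illustrate the idea; this is not how pytoda is actually written.

```python
from typing import Dict, List


def strict_sequence_to_token_indexes(sequence: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical tokenizer that rejects non-string input up front."""
    if not isinstance(sequence, str):
        # Fail loudly instead of silently producing an empty tokenization.
        raise TypeError(
            f"expected a str sequence, got {type(sequence).__name__}"
        )
    # Map each character to its index, skipping unknown characters.
    return [vocab[token] for token in sequence if token in vocab]
```

With such a check, calling the function with ["ACDE"] would raise a TypeError at the call site rather than returning an empty result that goes unnoticed downstream.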

Closed in #5.