Protein sequence is passed inside a list to the language model, resulting in empty tokenization
YoelShoshan opened this issue · 2 comments
YoelShoshan commented
In https://github.com/PaccMann/paccmann_gp/blob/main/paccmann_gp/affinity_minimization.py#L50
[self.protein] is passed to the language model's sequence_to_token_indexes(), which expects a string.
As a result, the entire protein is treated as a single token to encode, which yields an essentially empty tokenization for every protein.
Suggested fixes:
- Pass self.protein instead of [self.protein].
- A stricter type check in the pytoda library might also be useful.
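
To illustrate the failure mode and both suggestions, here is a minimal sketch. ToyProteinLanguage is a hypothetical stand-in, not the pytoda class: its vocabulary, its behavior of skipping unknown tokens, and the strict_sequence_to_token_indexes helper are all assumptions made for this example. The real implementation may handle unknown tokens differently.

```python
from typing import Dict, List


class ToyProteinLanguage:
    """Hypothetical stand-in for a protein language; NOT the pytoda class."""

    def __init__(self, alphabet: str = "ACDEFGHIKLMNPQRSTVWY") -> None:
        # Assumption: one token per amino acid, index 0 reserved for padding.
        self.token_to_index: Dict[str, int] = {
            token: index for index, token in enumerate(alphabet, start=1)
        }

    def sequence_to_token_indexes(self, sequence: str) -> List[int]:
        # Iterates over the input and (by assumption) skips unknown tokens.
        # A list containing one string yields that whole string as one token.
        return [
            self.token_to_index[token]
            for token in sequence
            if token in self.token_to_index
        ]

    def strict_sequence_to_token_indexes(self, sequence: str) -> List[int]:
        # Hypothetical stricter variant: reject non-string input up front.
        if not isinstance(sequence, str):
            raise TypeError(
                f"expected a str sequence, got {type(sequence).__name__}"
            )
        return self.sequence_to_token_indexes(sequence)


language = ToyProteinLanguage()
protein = "MKV"

correct = language.sequence_to_token_indexes(protein)    # string input
buggy = language.sequence_to_token_indexes([protein])    # list input

print(correct)  # [11, 9, 18]
print(buggy)    # []
```

With the strict variant, language.strict_sequence_to_token_indexes([protein]) raises a TypeError instead of silently returning an empty list, which would have surfaced this bug at the call site.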
drugilsberg commented
Thanks @YoelShoshan, I'll open a PR soon with the fix, as reported here: https://github.com/PaccMann/paccmann_kinase_binding_residues/blob/5c11b34934a9160201de5beb19e8a6117c0ffbb7/pkbr/generative.py#L109
drugilsberg commented
Closed in #5.