protein sequence being passed inside a list to the language model resulting in empty tokenization

Question


YoelShoshan opened this issue 3 years ago · 2 comments

In https://github.com/PaccMann/paccmann_gp/blob/main/paccmann_gp/affinity_minimization.py#L50, [self.protein] (a one-element list) is passed to the language model's sequence_to_token_indexes(), which expects a string.
As a result, the entire protein sequence is treated as a single token to encode, so the tokenization is essentially empty for every protein.
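To illustrate the failure mode, here is a minimal sketch with a toy tokenizer. The vocabulary, the function name toy_sequence_to_token_indexes, and the choice to map unknown tokens to index 0 are assumptions made for illustration; this is not the pytoda implementation, which may handle unknown tokens differently.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; NOT the real pytoda protein language.
TOKEN_TO_INDEX: Dict[str, int] = {"<UNK>": 0, "M": 1, "K": 2, "V": 3, "L": 4}


def toy_sequence_to_token_indexes(sequence) -> List[int]:
    """Map each element of `sequence` to an index (unknown elements map to <UNK>)."""
    unk = TOKEN_TO_INDEX["<UNK>"]
    return [TOKEN_TO_INDEX.get(token, unk) for token in sequence]


protein = "MKVL"

# Iterating over a string yields one residue at a time.
as_string = toy_sequence_to_token_indexes(protein)

# Iterating over a one-element list yields the whole sequence as one "token",
# which is not in the vocabulary.
as_list = toy_sequence_to_token_indexes([protein])

print(as_string)  # four indexes, one per residue
print(as_list)    # a single unknown-token index
```

Under these assumptions, the list call produces one unknown token no matter how long the protein is, which matches the "essentially empty" tokenization described above.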

Fix suggestions:

self.protein should be passed instead of [self.protein].
A stricter type check in the pytoda library might also be useful.
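The second suggestion could look roughly like the sketch below. The function name strict_sequence_to_token_indexes and the toy vocabulary are hypothetical and are not part of pytoda; the only point is that an isinstance guard turns the silent mis-tokenization into an immediate error.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; NOT the real pytoda protein language.
VOCAB: Dict[str, int] = {"<UNK>": 0, "M": 1, "K": 2, "V": 3, "L": 4}


def strict_sequence_to_token_indexes(sequence: str) -> List[int]:
    """Tokenize a protein string, rejecting any non-string input up front."""
    if not isinstance(sequence, str):
        raise TypeError(
            f"expected a protein sequence as str, got {type(sequence).__name__}"
        )
    return [VOCAB.get(residue, VOCAB["<UNK>"]) for residue in sequence]


# A plain string is accepted.
indexes = strict_sequence_to_token_indexes("MKVL")

# A list wrapping the string is rejected instead of being silently mis-tokenized.
try:
    strict_sequence_to_token_indexes(["MKVL"])
    list_rejected = False
except TypeError:
    list_rejected = True
```

With a guard like this, the call at affinity_minimization.py#L50 would have failed loudly rather than producing a near-empty encoding.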

Answer 1 · 2022-02-21T09:28:23.000Z

Thanks @YoelShoshan, I'll open a PR soon with the fix, as reported here: https://github.com/PaccMann/paccmann_kinase_binding_residues/blob/5c11b34934a9160201de5beb19e8a6117c0ffbb7/pkbr/generative.py#L109

Answer 2 · 2022-02-21T09:39:27.000Z

Closed in #5.