Running the predictor with incomplete data
As mentioned in your paper, Tulip is supposed to be able to provide predictions when some information is not available, such as the CDR3 alpha chain. I have investigated the code, but cannot see how to indicate to the model that some of the inputs are unknown. I noticed in `tokenizer.json` that there is a special token which seems to be reserved for "unknown" sequences (`<UNK>`); however, I couldn't see how to include this in the inputs. Is the addition of this token all that's required for the model to process incomplete data?
Other things I have tried include (using the provided test input data, `VDJ_test_2.csv`):
- Deleting the column in the `.csv` (error because the column is no longer available)
- Making each entry in a column blank (provides an error because it expects a non-empty string)
- Making an entry `"<UNK>"`, but this didn't seem to do anything significant
Would it be possible for you to direct me, or provide a code sample utilising this functionality of the model?
One final thing I wanted to check is whether the model accepts only a specific set of characters for the various chains. I've tried numbers, special tokens, and the usual alphabet denoting proteins, and all of these have appeared to work. Should this be the case?
Let me know if you need any further information.
The missing token is `<MIS>`. Replace the missing CDR with this token.
EDIT: GitHub would not print the missing token in my response. It is `< MIS >` without the spaces.
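
For anyone landing here later, a minimal sketch of that replacement, using only the Python standard library. It fills blank entries in the chosen columns with `<MIS>` before the file is passed to the predictor. The column names (`CDR3a`, `CDR3b`, `peptide`), the placeholder sequences, and the helper `mask_missing` are assumptions made for illustration and are not taken from the Tulip code; check the header of `VDJ_test_2.csv` for the real column names.

```python
import csv
import io

MISSING_TOKEN = "<MIS>"  # the token named in the reply above


def mask_missing(rows, columns, token=MISSING_TOKEN):
    """Return copies of `rows` with blank entries in `columns` replaced by `token`.

    Hypothetical helper, not part of Tulip.
    """
    masked = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for col in columns:
            value = new_row.get(col)
            if value is None or not value.strip():
                new_row[col] = token
        masked.append(new_row)
    return masked


# Toy data with assumed column names and placeholder sequences.
# The second row has no alpha-chain CDR3.
example_csv = (
    "CDR3a,CDR3b,peptide\n"
    "CAVSDF,CASSLF,GILGFV\n"
    ",CASSQF,NLVPMV\n"
)

rows = list(csv.DictReader(io.StringIO(example_csv)))
masked_rows = mask_missing(rows, columns=["CDR3a", "CDR3b"])

# Write the masked table back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(masked_rows)
print(out.getvalue())
```

To mask a whole column (for example, to predict without any alpha-chain information), set every entry in that column to `<MIS>` rather than leaving it blank or deleting the column, since both of those were reported above to raise errors.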