barthelemymp/TULIP-TCR

Running the predictor with incomplete data


As mentioned in your paper, TULIP is supposed to be able to provide predictions when some information is unavailable, such as the CDR3 alpha chain. I have looked through the code but cannot see how to tell the model that some of the inputs are unknown. I noticed that tokenizer.json contains a special token that seems to be reserved for "unknown" sequences (<UNK>), but I couldn't see how to include it in the inputs. Is adding this token all that's required for the model to process incomplete data?

Other things I have tried, using the provided test input data (VDJ_test_2.csv):

  • Deleting the column from the .csv (raises an error because the column is no longer available)
  • Making each entry in a column blank (raises an error because a non-empty string is expected)
  • Setting an entry to "<UNK>", which didn't seem to have any significant effect

Could you point me in the right direction, or provide a code sample that uses this functionality of the model?

One final thing I wanted to check is whether the model only accepts and recognises a specific set of characters for the various chains. I've tried numbers, written-out special tokens, and the usual amino-acid letters, and all of these appear to work. Should this be the case?

Let me know if you need any further information.

The missing token is <MIS>. Replace the missing CDR with this token.
EDIT: GitHub does not render the missing token in my response. It is "< MIS >" written without the spaces.
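
For anyone landing here later, a minimal sketch of that substitution, done on the input .csv before it is handed to the predictor. It only uses Python's standard csv module and does not call any TULIP code. The column names (CDR3a, CDR3b, peptide) and the sample rows are assumptions for illustration, so check them against the header of your own file; fill_missing and fill_missing_csv are hypothetical helper names, not part of the repository.

```python
import csv
import io

# Token named in the maintainer's reply above.
MISSING_TOKEN = "<MIS>"


def fill_missing(rows, columns, token=MISSING_TOKEN):
    """Return copies of `rows` with blank or absent entries in `columns` set to `token`."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            value = new_row.get(col)
            # Treat None, "" and whitespace-only strings as missing.
            if value is None or not value.strip():
                new_row[col] = token
        filled.append(new_row)
    return filled


def fill_missing_csv(src, dst, columns, token=MISSING_TOKEN):
    """Copy a CSV from file object `src` to `dst`, filling missing entries in `columns`."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or [])
    # If a column was deleted from the file entirely, add it back.
    for col in columns:
        if col not in fieldnames:
            fieldnames.append(col)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(fill_missing(reader, columns, token))


# Illustrative data only: assumed column names, second row has no alpha chain.
sample = (
    "CDR3a,CDR3b,peptide\n"
    "CAVRDSNYQLIW,CASSLGQAYEQYF,GILGFVFTL\n"
    ",CASSIRSSYEQYF,GILGFVFTL\n"
)
out = io.StringIO()
fill_missing_csv(io.StringIO(sample), out, columns=["CDR3a"])
print(out.getvalue())
```

With real files, pass open file objects instead of the io.StringIO buffers (open them with newline="" as the csv documentation recommends). The token has to be the entire field value, written exactly as <MIS> with no spaces.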