Everything is a Token
Closed this issue · 2 comments
Oufattole commented
As opposed to triplet embeddings, we should try the everything-is-a-token approach used in past works: CEHR-BERT, ETHOS.
For example, imagine a patient has a time series of two observations: a potassium lab in quantile 9/10, and one day later a creatinine lab in quantile 2/10.
- We could define this as three tokens:
  - quantile 9/10 potassium lab
  - 1-day time gap
  - quantile 2/10 creatinine lab
- We could also define this as five tokens:
  - potassium lab
  - quantile 9/10
  - 1-day time gap
  - creatinine lab
  - quantile 2/10
Let's support both!
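Here's a minimal sketch of the two schemes applied to the example above. It assumes quantile-binned values and illustrative string token names (`LAB//POTASSIUM`, `TIME//1d`, `Q9`, ...); the names and the `Measurement` container are hypothetical, not an existing vocabulary or API.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    code: str          # e.g. "LAB//POTASSIUM"
    quantile: int      # value quantile, 1-10
    gap_days: float    # days since the previous measurement (0 for the first)


def tokenize_fused(measurements: list[Measurement]) -> list[str]:
    """Scheme 1: code and quantile fused into one token; time gaps get their own token."""
    tokens = []
    for m in measurements:
        if m.gap_days > 0:
            tokens.append(f"TIME//{int(m.gap_days)}d")
        tokens.append(f"{m.code}//Q{m.quantile}")
    return tokens


def tokenize_split(measurements: list[Measurement]) -> list[str]:
    """Scheme 2: code and quantile as separate tokens; time gaps get their own token."""
    tokens = []
    for m in measurements:
        if m.gap_days > 0:
            tokens.append(f"TIME//{int(m.gap_days)}d")
        tokens.extend([m.code, f"Q{m.quantile}"])
    return tokens


patient = [
    Measurement("LAB//POTASSIUM", 9, 0.0),
    Measurement("LAB//CREATININE", 2, 1.0),
]

print(tokenize_fused(patient))
# ['LAB//POTASSIUM//Q9', 'TIME//1d', 'LAB//CREATININE//Q2']      -> 3 tokens
print(tokenize_split(patient))
# ['LAB//POTASSIUM', 'Q9', 'TIME//1d', 'LAB//CREATININE', 'Q2']  -> 5 tokens
```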
Oufattole commented
mmcdermott commented
Most interesting to me:
- Replicating the strategies used in the literature, since something like this approach is pretty common there.
- Are numeric values better used as continuous inputs or as categorical modifiers? (A related but independent question: are values better embedded in a code-specific manner, e.g., the code is "LAB//HR//Q5", or a code-independent manner, e.g., the sequence is "LAB//HR", "Q5"?) Both value-handling options are sketched after this list.
- Is a longer, ~per-measurement sequence better than a shorter, per-event sequence?
- Is temporal information useful (and if so, how)? The first two options below are sketched after this list.
  - As a Temporal Position Embedding (TPE) over measurement/event embeddings (this is different from, and may be better or worse than, ordinal position embeddings (PEs)).
  - As a time-interval token.
  - Not used.
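On the continuous-vs-categorical question, here is a rough sketch of the two value-handling options in PyTorch. The module names (`ContinuousValueEmbedder`, `CategoricalValueEmbedder`) and shapes are assumptions for illustration, not code from this repo.

```python
import torch
import torch.nn as nn


class ContinuousValueEmbedder(nn.Module):
    """Continuous option: embed the code, then add a learned projection of the raw value."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.code_emb = nn.Embedding(vocab_size, dim)
        self.value_proj = nn.Linear(1, dim)

    def forward(self, code_ids: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # code_ids: (batch, seq) integer code IDs; values: (batch, seq) normalized numeric values.
        return self.code_emb(code_ids) + self.value_proj(values.unsqueeze(-1))


class CategoricalValueEmbedder(nn.Module):
    """Categorical option: the value is just another token (a quantile)."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, code_ids: torch.Tensor, quantile_ids: torch.Tensor) -> torch.Tensor:
        # Code-independent variant: "LAB//HR" and "Q5" get separate embeddings that are summed
        # (or kept as two positions in the sequence). A code-specific variant would instead
        # look up a single fused "LAB//HR//Q5" vocabulary entry.
        return self.emb(code_ids) + self.emb(quantile_ids)
```

The trade-off as I see it: code-specific entries let the model learn per-code value semantics directly, at the cost of a much larger vocabulary; code-independent entries keep the vocabulary small but force the model to compose code and value.

And a rough sketch of the first two temporal options, again with assumed names and bucket boundaries (nothing here is taken from CEHR-BERT or ETHOS):

```python
import math

import torch


# Option 1: the time gap becomes a discrete token, bucketed on a rough log scale
# (bucket boundaries are illustrative only).
def time_interval_token(delta_days: float) -> str:
    for upper, tok in [(1 / 24, "TIME//<1h"), (1, "TIME//1h-1d"), (7, "TIME//1d-1w"),
                       (30, "TIME//1w-1mo"), (365, "TIME//1mo-1y")]:
        if delta_days < upper:
            return tok
    return "TIME//>1y"


# Option 2: a sinusoidal Temporal Position Embedding computed from continuous time
# (e.g., days since the start of the record), added to token embeddings where an
# ordinal position embedding would normally go.
def temporal_position_embedding(times_days: torch.Tensor, dim: int) -> torch.Tensor:
    # times_days: (batch, seq) float times; returns (batch, seq, dim), dim must be even.
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    angles = times_days.unsqueeze(-1) * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```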
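Happy to benchmark all three temporal settings (TPE, interval token, none) against each other once the tokenizers above are in place.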