Oufattole/meds-torch

Everything is a Token

Closed this issue · 2 comments

As opposed to triplet embeddings, we should try the "everything is a token" approach used in past work: CEHR-BERT, ETHOS.

For example, imagine a patient has a time series of two observations: a potassium lab in quantile 9/10, and one day later a creatinine lab in quantile 2/10.

  • We could define this as three tokens:
  1. quantile 9/10 potassium lab
  2. 1-day time gap
  3. quantile 2/10 creatinine lab
  • We could also define this as five tokens:
  1. potassium lab
  2. quantile 9/10
  3. 1-day time gap
  4. creatinine lab
  5. quantile 2/10

Let's support both!
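To make the two schemes concrete, here is a minimal sketch of both tokenizations for the example patient above. The function names and token string formats (`TIME//1D`, `Q9`, `//`-joined codes) are illustrative, not the meds-torch API.

```python
def tokenize_fused(events):
    """One token per measurement: code and quantile fused (3-token scheme)."""
    tokens = []
    for code, quantile, gap_days in events:
        if gap_days:  # emit a time-gap token between non-simultaneous events
            tokens.append(f"TIME//{gap_days}D")
        tokens.append(f"{code}//Q{quantile}")
    return tokens

def tokenize_split(events):
    """Separate tokens for the code and its quantile modifier (5-token scheme)."""
    tokens = []
    for code, quantile, gap_days in events:
        if gap_days:
            tokens.append(f"TIME//{gap_days}D")
        tokens.extend([code, f"Q{quantile}"])
    return tokens

# Example patient: potassium lab in quantile 9/10, then a creatinine lab
# in quantile 2/10 one day later.
events = [("LAB//POTASSIUM", 9, 0), ("LAB//CREATININE", 2, 1)]
print(tokenize_fused(events))  # 3 tokens
print(tokenize_split(events))  # 5 tokens
```

Supporting both is then just a config switch over which tokenizer builds the sequence.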

There's a nice figure in the ETHOS paper of this:

[figure: ETHOS tokenization example, including the 13 time-interval tokens]
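A small sketch of how a discrete time-token vocabulary like ETHOS's 13 time tokens can be built: bucketize the inter-event gap into one of 13 bins. The bin edges below are made up for illustration; ETHOS defines its own 13-token time vocabulary.

```python
import bisect

# Illustrative bin edges in minutes (5 min ... 1 year); 12 edges -> 13 bins.
EDGES_MIN = [5, 15, 30, 60, 120, 360, 720, 1440, 2880, 10080, 43200, 525600]
TIME_TOKENS = [f"TIME//BIN_{i}" for i in range(len(EDGES_MIN) + 1)]  # 13 tokens

def time_token(gap_minutes: float) -> str:
    """Map an inter-event time gap to one of 13 discrete time tokens."""
    return TIME_TOKENS[bisect.bisect_right(EDGES_MIN, gap_minutes)]

print(time_token(1440))  # a one-day gap
```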

Most interesting to me:

  1. Replicating the strategies used in the literature, since something like this strategy is pretty common.
  2. Are numeric values better used as continuous inputs or as categorical modifiers? (A related but independent question: are values better embedded in a code-specific manner (e.g., the code is "LAB//HR//Q5") or a code-independent one (e.g., the sequence is "LAB//HR", "Q5")?)
  3. Is a longer, ~per-measurement sequence better than a shorter, per-event sequence?
  4. Is temporal information useful (and if so how)?
    - As a Temporal Position Embedding (TPE) over measurement/event embeddings (this differs from, and may be better or worse than, ordinal position embeddings (PEs)).
    - As a time-interval token
    - Not used
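On question 2's parenthetical, the vocabulary tradeoff can be made concrete: code-specific quantiles multiply the vocabulary by the number of quantile bins but let each code learn its own value embeddings, while code-independent quantiles share a handful of "Q*" tokens across all codes. The codes below are illustrative.

```python
# Hypothetical lab codes and 10 quantile bins.
codes = ["LAB//HR", "LAB//POTASSIUM", "LAB//CREATININE"]
quantiles = [f"Q{i}" for i in range(1, 11)]

# Code-specific: every (code, quantile) pair is its own vocabulary entry.
code_specific = [f"{c}//{q}" for c in codes for q in quantiles]

# Code-independent: codes and quantile modifiers are separate entries.
code_independent = codes + quantiles

print(len(code_specific), len(code_independent))  # 30 vs 13 vocab entries
```

With thousands of real codes the code-specific vocabulary grows roughly 10x, which is the cost of per-code value semantics.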
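For question 4's TPE option, one common construction (a hedged sketch, not the meds-torch implementation) is the standard sinusoidal formula indexed by continuous time (e.g., minutes since admission) rather than ordinal token position:

```python
import math

def sinusoidal_embedding(t: float, dim: int = 8) -> list[float]:
    """Sinusoidal embedding of a scalar position/time t (dim must be even)."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.extend([math.sin(t * freq), math.cos(t * freq)])
    return emb

ordinal_pe = sinusoidal_embedding(3)        # ordinal PE: 4th token in sequence
temporal_pe = sinusoidal_embedding(1440.0)  # TPE: token observed 1 day in, in minutes
```

The same formula serves both variants; the experiment is only in what scalar you feed it.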