Oufattole/meds-torch

Everything is a Token

Closed this issue · 2 comments

As opposed to triplet embeddings, we should try the "everything is a token" approach used in past work: CEHR-BERT, ETHOS.

For example, imagine a patient has a time series of two observations: a potassium lab in quantile 9/10, and one day later a creatinine lab in quantile 2/10.

  • We could define this as three tokens:
  1. quantile 9/10 potassium lab
  2. 1-day time gap
  3. quantile 2/10 creatinine lab
  • We could also define this as five tokens:
  1. potassium lab
  2. quantile 9/10
  3. 1-day time gap
  4. creatinine lab
  5. quantile 2/10

Let's support both!
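To make the two schemes concrete, here is a minimal sketch of both tokenizations for the example patient above. The function names and token string formats (`TIME//1D`, `Q9`, `//`-joined codes) are illustrative, not the meds-torch API.

```python
def tokenize_fused(events):
    """One token per measurement: code and quantile fused (3-token scheme)."""
    tokens = []
    for code, quantile, gap_days in events:
        if gap_days:  # emit a time-gap token between non-simultaneous events
            tokens.append(f"TIME//{gap_days}D")
        tokens.append(f"{code}//Q{quantile}")
    return tokens

def tokenize_split(events):
    """Separate tokens for the code and its quantile modifier (5-token scheme)."""
    tokens = []
    for code, quantile, gap_days in events:
        if gap_days:
            tokens.append(f"TIME//{gap_days}D")
        tokens.extend([code, f"Q{quantile}"])
    return tokens

# Example patient: potassium lab in quantile 9/10, then a creatinine lab
# in quantile 2/10 one day later.
events = [("LAB//POTASSIUM", 9, 0), ("LAB//CREATININE", 2, 1)]
print(tokenize_fused(events))  # 3 tokens
print(tokenize_split(events))  # 5 tokens
```

Supporting both is then just a config switch over which tokenizer builds the sequence.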

There's a nice figure in the ETHOS paper of this:

[figure: ETHOS tokenization example, including the 13 time-interval tokens]
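A small sketch of how a discrete time-token vocabulary like ETHOS's 13 time tokens can be built: bucketize the inter-event gap into one of 13 bins. The bin edges below are made up for illustration; ETHOS defines its own 13-token time vocabulary.

```python
import bisect

# Illustrative bin edges in minutes (5 min ... 1 year); 12 edges -> 13 bins.
EDGES_MIN = [5, 15, 30, 60, 120, 360, 720, 1440, 2880, 10080, 43200, 525600]
TIME_TOKENS = [f"TIME//BIN_{i}" for i in range(len(EDGES_MIN) + 1)]  # 13 tokens

def time_token(gap_minutes: float) -> str:
    """Map an inter-event time gap to one of 13 discrete time tokens."""
    return TIME_TOKENS[bisect.bisect_right(EDGES_MIN, gap_minutes)]

print(time_token(1440))  # a one-day gap
```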

Most interesting to me:

  1. Replicating the strategies used in the literature, since something like this strategy is pretty common.
  2. Are numeric values better used as continuous inputs or as categorical modifiers? (A related but independent question: are values better embedded in a code-specific manner (e.g., the code is "LAB//HR//Q5") or a code-independent one (e.g., the sequence is "LAB//HR", "Q5")?)
  3. Is a longer, ~per-measurement sequence better than a shorter, per-event sequence?
  4. Is temporal information useful (and if so how)?
    - As a Temporal Position Embedding (TPE) over measurement/event embeddings (this differs from, and may be better or worse than, ordinal position embeddings (PEs)).
    - As a time-interval token
    - Not used
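On question 2's parenthetical, the vocabulary tradeoff can be made concrete: code-specific quantiles multiply the vocabulary by the number of quantile bins but let each code learn its own value embeddings, while code-independent quantiles share a handful of "Q*" tokens across all codes. The codes below are illustrative.

```python
# Hypothetical lab codes and 10 quantile bins.
codes = ["LAB//HR", "LAB//POTASSIUM", "LAB//CREATININE"]
quantiles = [f"Q{i}" for i in range(1, 11)]

# Code-specific: every (code, quantile) pair is its own vocabulary entry.
code_specific = [f"{c}//{q}" for c in codes for q in quantiles]

# Code-independent: codes and quantile modifiers are separate entries.
code_independent = codes + quantiles

print(len(code_specific), len(code_independent))  # 30 vs 13 vocab entries
```

With thousands of real codes the code-specific vocabulary grows roughly 10x, which is the cost of per-code value semantics.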
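For question 4's TPE option, one common construction (a hedged sketch, not the meds-torch implementation) is the standard sinusoidal formula indexed by continuous time (e.g., minutes since admission) rather than ordinal token position:

```python
import math

def sinusoidal_embedding(t: float, dim: int = 8) -> list[float]:
    """Sinusoidal embedding of a scalar position/time t (dim must be even)."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.extend([math.sin(t * freq), math.cos(t * freq)])
    return emb

ordinal_pe = sinusoidal_embedding(3)        # ordinal PE: 4th token in sequence
temporal_pe = sinusoidal_embedding(1440.0)  # TPE: token observed 1 day in, in minutes
```

The same formula serves both variants; the experiment is only in what scalar you feed it.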