
Dataset API and Configuration

mmcdermott opened this issue · 7 comments

For simplicity, we will always assume that the data is three-dimensional, with the dimensions being:
(subject_id, event, measurement)
Optionally, there can be a fourth dimension holding tokenized text, a sequence of ECG data, or some other modality. For multimodal data, we assume that the max_seq_length for each modality is enforced by an earlier stage in meds_transform. Later stages can always randomly subselect the extra modalities so that a batch fits on the GPU (so the model learns that more recent observations are useful and older ones less so), on the assumption that at inference time we can use a very low batch_size and use all observations of the extra modalities.
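To make the subselection idea concrete, here is a minimal sketch, not the meds-torch implementation. It assumes the extra-modality observations for one subject arrive as a time-ordered list; the function name `subselect_modality` and its parameters are hypothetical. It samples uniformly and keeps chronological order; any recency weighting would be a further design choice.

```python
import random


def subselect_modality(items, max_items, seed=None):
    """Randomly keep at most `max_items` observations, preserving time order.

    `items` is assumed to be a time-ordered list of extra-modality
    observations (e.g. tokenized notes or ECG segments) for one subject.
    """
    if max_items < 0:
        raise ValueError("max_items must be non-negative")
    if len(items) <= max_items:
        # Nothing to drop, e.g. at inference with a very low batch_size.
        return list(items)
    rng = random.Random(seed)
    # Sample positions without replacement, then sort to restore time order.
    keep = sorted(rng.sample(range(len(items)), max_items))
    return [items[i] for i in keep]


# Hypothetical usage: ten observations, keep four of them for training.
observations = list(range(10))
kept = subselect_modality(observations, max_items=4, seed=0)
```

At inference time, passing a `max_items` at least as large as the number of observations returns everything unchanged.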

  • #56
    The input_encoder should handle converting from codes to the code text. Specifically, it should embed a lookup table (from code to token sequence).
  • #61
  • #62
  • #63
  • #68
  • #74
  • #89
  • #81
  • #92
  • remove the duplicate return out
  • seedable_mixin should only decorate once
  • #70
  • #86
  • #75
  • #87
  • #94
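For the code-to-token-sequence lookup table mentioned under #56, a rough sketch of what such a table could look like is below. This is only an illustration under assumptions: codes are non-negative integer indices, token sequences are padded or truncated to a fixed length, and `PAD_ID`, `build_code_token_table`, and `lookup` are hypothetical names that do not exist in meds-torch.

```python
PAD_ID = 0  # assumed padding token id, not taken from the repository


def build_code_token_table(code_to_tokens, max_len):
    """Build a dense table where row `code` holds that code's token ids.

    Sequences are truncated to `max_len` and right-padded with PAD_ID.
    Codes with no entry get an all-padding row.
    """
    vocab_size = max(code_to_tokens) + 1
    table = [[PAD_ID] * max_len for _ in range(vocab_size)]
    for code, tokens in code_to_tokens.items():
        trimmed = list(tokens)[:max_len]
        table[code][: len(trimmed)] = trimmed
    return table


def lookup(table, codes):
    """Map a sequence of codes to their padded token sequences."""
    return [table[code] for code in codes]


# Hypothetical vocabulary: code index -> token ids of the code's text.
code_to_tokens = {1: [11, 12], 2: [21, 22, 23, 24]}
table = build_code_token_table(code_to_tokens, max_len=3)
batch_tokens = lookup(table, [2, 1, 1])
```

In a real input_encoder the dense table would presumably be stored as a tensor so the lookup is a single indexing operation, but the shape logic is the same.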

This can be removed: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L660

Instead, this can be handled in PyTorch Lightning (or any other training loop) by stopping the dataloader after a given number of batches, or even by setting the length manually. With the set stats gone, this won't add any bias.
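As a sketch of stopping after a fixed number of batches outside the dataset class: the helper below wraps any iterable of batches, and `limited_batches` is a hypothetical name. In PyTorch Lightning, the `limit_train_batches` argument of `Trainer` serves the same purpose, so a helper like this would only be needed in a hand-written loop.

```python
from itertools import islice


def limited_batches(dataloader, max_batches):
    """Yield at most `max_batches` batches from any iterable of batches."""
    return islice(dataloader, max_batches)


# A stand-in for a real dataloader: any iterable of batches works.
fake_dataloader = [[0, 1], [2, 3], [4, 5], [6, 7]]
seen = list(limited_batches(fake_dataloader, 2))
```

Because the cutoff lives in the loop rather than in the dataset, the dataset itself needs no subsampling logic.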

This can be deleted as well; it should happen in a pre-step for MEDS-transforms: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L625

Delete this: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L588
The label schema should cover this; if it does not, we should make it so.

This can go, because it is all binary classification and in the label schema: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L392