Dataset API and Configuration
For simplicity, we will always assume that data is 3 dimensional with the dimensions being:
(subject_id, event, measurement)
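As an illustration, a toy batch with this three-dimensional layout might look like the following (the sizes and the nested-list representation are hypothetical, chosen just to show the axis order):

```python
# Toy batch: 2 subjects, 3 events per subject, 4 measurements per event
# (real data would be a padded tensor, but the axis order is the same).
n_subjects, n_events, n_measurements = 2, 3, 4

batch = [
    [[0.0] * n_measurements for _ in range(n_events)]
    for _ in range(n_subjects)
]

# Axis 0: subject_id, axis 1: event, axis 2: measurement.
assert len(batch) == n_subjects
assert len(batch[0]) == n_events
assert len(batch[0][0]) == n_measurements
```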
Optionally, there can be a fourth dimension holding tokenized text, a sequence of ECG data, or some other modality. For multimodal data, we assume the per-modality max_seq_length is enforced by a previous stage in MEDS-transforms. Later stages can always randomly subselect extra-modality observations so the batch fits on the GPU (so the model learns that more recent observations are useful and ones further back are less useful) -- assuming that at inference we can use a very small batch size and include all observations of the extra modalities.
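The random subselection described above could be sketched like this (the function name, the `(event_index, payload)` representation, and the train/inference switch are all hypothetical, not part of meds-torch):

```python
import random

def subselect_modality(observations, max_obs, training=True):
    """Keep at most `max_obs` extra-modality observations during training
    so the batch fits on the GPU; keep everything at inference, where a
    very small batch size makes the full sequence affordable.

    `observations` is a time-ordered list of (event_index, payload) pairs.
    """
    if not training or len(observations) <= max_obs:
        return observations
    kept = random.sample(observations, max_obs)
    # Restore temporal order so position still reflects recency.
    return sorted(kept, key=lambda pair: pair[0])
```

Because the subselection is uniform at random over a time-ordered sequence, older observations are dropped as often as recent ones, which is what lets the model learn how much each region of the history actually helps.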
I expect this function is not generally needed and can be deleted: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L699
This can be removed: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L660
Instead, this can be done within PyTorch Lightning (or similar frameworks) by stopping the dataloader after a given number of batches (or even by just setting the dataset length manually). With the set stats gone, it won't add any bias.
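A minimal sketch of "stop the dataloader after N batches", framework-agnostic via itertools.islice (the helper name is hypothetical; with Lightning specifically, the equivalent is the Trainer's `limit_train_batches` argument):

```python
from itertools import islice

def take_batches(dataloader, max_batches):
    """Yield at most `max_batches` batches from any iterable dataloader.

    This keeps batch-limiting outside the dataset itself, so the dataset
    needs no internal sampling stats. With PyTorch Lightning the same
    effect comes from Trainer(limit_train_batches=max_batches).
    """
    yield from islice(dataloader, max_batches)
```

Usage: wrap any dataloader, e.g. `for batch in take_batches(train_loader, 100): ...`, and the epoch ends after 100 batches regardless of the dataset's reported length.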
Can delete this as well; it should happen in a pre-step via MEDS-transforms: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L625
Delete this: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L588
The label schema should cover this, and if it doesn't, we should make it so.
This can be simplified, but the following needs to be kept: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L469
This can go, because it is all binary classification and is covered by the label schema: https://github.com/Oufattole/meds-torch/blob/main/src/meds_torch/data/components/pytorch_dataset.py#L392