Misalignment Between Static and Event Sequence DataFrames
I've noticed an issue during the tokenization stage on my dataset. Specifically, the static DataFrame for shard 0 (${cohort_dir}/tokenization/schemas/train/0.parquet) and the event sequence DataFrame for the same shard (${cohort_dir}/tokenization/event_seqs/train/0.parquet) have different shapes.
These DataFrames are supposed to be aligned, meaning each index in the static DataFrame should correspond directly to an index in the event sequence DataFrame. However, the shape mismatch suggests that this may not be the case.
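One way to make the alignment claim above concrete is to compare the per-row subject IDs of the two DataFrames. The sketch below is a minimal illustration in plain Python, not code from MEDS_transforms or meds-torch: the function name `compare_alignment` and the idea that each DataFrame exposes a subject ID per row are assumptions, and the ID lists are toy values rather than data from the actual shard.

```python
def compare_alignment(static_ids, event_seq_ids):
    """Report how two row-ordered ID lists differ.

    Both arguments are hypothetical: the per-row subject IDs taken from the
    static and event sequence DataFrames, in row order.
    """
    static_set, event_set = set(static_ids), set(event_seq_ids)
    return {
        "same_length": len(static_ids) == len(event_seq_ids),
        "only_in_static": sorted(static_set - event_set),
        "only_in_event_seqs": sorted(event_set - static_set),
        # Row i of one frame must describe the same subject as row i of the other.
        "aligned": list(static_ids) == list(event_seq_ids),
    }


# Toy example: subject 2 appears in the event sequences but not in the statics.
report = compare_alignment(static_ids=[1, 3], event_seq_ids=[1, 2, 3])
print(report)
```

With real data, the two ID lists would come from the corresponding column of each parquet file; any non-empty "only_in_..." entry points at the subjects responsible for the shape mismatch.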
I'm planning to reproduce this issue on a dummy dataset to see if it persists. However, I'm curious whether this behavior is expected, as it impacts the assumptions made in the PyTorch dataset class in meds-torch. This class assumes alignment between the static DataFrames and the joint nested ragged tensors, which, as far as I understand, are derived from the event sequence DataFrames.
Could this be a bug, or is there an intended reason for the discrepancy in shapes?
I suspect this happens for patients who don't have any static data, but that's just a guess. I'm looking now.
Yep -- it is patients who don't have static data, I'm almost certain. This join here uses an "inner" join, when it should probably use a full outer join instead: https://github.com/mmcdermott/MEDS_transforms/blob/main/src/MEDS_transforms/transforms/tokenization.py#L161
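To illustrate why the join type matters here, the following is a minimal sketch in plain Python. It does not use the MEDS_transforms code or its DataFrame library; the table names, the `code` and `n_events` fields, and the two helper functions are all hypothetical stand-ins chosen only to show how an inner join silently drops a patient with no static data while a full outer join keeps them.

```python
# Toy per-subject tables keyed by a hypothetical subject ID.
# Subject 2 has events but no static measurements.
static_by_subject = {1: {"code": ["EYE_COLOR"]}, 3: {"code": ["HEIGHT"]}}
schema_by_subject = {1: {"n_events": 4}, 2: {"n_events": 2}, 3: {"n_events": 5}}


def inner_join(left, right):
    """Keep only keys present in both tables."""
    return {k: {**left[k], **right[k]} for k in left if k in right}


def full_outer_join(left, right, left_fill, right_fill):
    """Keep every key from either table, filling the missing side."""
    keys = sorted(set(left) | set(right))
    return {k: {**left.get(k, left_fill), **right.get(k, right_fill)} for k in keys}


inner = inner_join(static_by_subject, schema_by_subject)
outer = full_outer_join(
    static_by_subject,
    schema_by_subject,
    left_fill={"code": None},
    right_fill={"n_events": None},
)

print(sorted(inner))  # subject 2 is dropped
print(sorted(outer))  # subject 2 is kept, with null static data
```

In this toy setup the inner join yields two rows while the event data covers three subjects, which is the same kind of row-count mismatch described in the issue. The full outer join keeps one row per subject, at the cost of null static fields that downstream code has to handle.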
You should be able to validate that this causes an issue by adding some patients with no static data to (1) the doctest for the function linked above and (2) the single-stage integration test here: https://github.com/mmcdermott/MEDS_transforms/blob/main/tests/MEDS_Transforms/test_tokenization.py