intermittent issue with `LabeledPatientTask.add_event` when calling `femr.models.transformer.compute_features`
duncanmcelfresh opened this issue · 1 comment
duncanmcelfresh commented
Describe the bug
`add_event` fails with an assertion error ("We have labels that appear to be before birth?") when calling `femr.models.transformer.compute_features` using labels created for all patients + events in a dataset.
Steps to reproduce the bug
I cannot share the dataset. The events in the dataset include all of the dates in the test labels.
The following code snippet raises the assertion error:
import datetime
import os

import femr.models.transformer

# test_dataset and TARGET_DIR are defined elsewhere; the dataset cannot be shared.
test_labels = [
{'patient_id': 1, 'prediction_time': datetime.datetime(1956, 8, 20, 0, 0), 'boolean_value': False},
{'patient_id': 1, 'prediction_time': datetime.datetime(1994, 3, 23, 0, 0), 'boolean_value': False},
{'patient_id': 1, 'prediction_time': datetime.datetime(2019, 1, 11, 0, 0), 'boolean_value': False},
]
features = femr.models.transformer.compute_features(test_dataset, os.path.join(TARGET_DIR, "motor_model"), test_labels, num_proc=4, tokens_per_batch=128, ontology=None)
However, if we change the third label's prediction time to an earlier date, the assertion error is not raised. The following code snippet runs without error:
import datetime
import os

import femr.models.transformer

# Same as above except for the third label, which is moved from 2019 to 1997.
test_labels = [
{'patient_id': 1, 'prediction_time': datetime.datetime(1956, 8, 20, 0, 0), 'boolean_value': False},
{'patient_id': 1, 'prediction_time': datetime.datetime(1994, 3, 23, 0, 0), 'boolean_value': False},
{'patient_id': 1, 'prediction_time': datetime.datetime(1997, 1, 11, 0, 0), 'boolean_value': False},
]
features = femr.models.transformer.compute_features(test_dataset, os.path.join(TARGET_DIR, "motor_model"), test_labels, num_proc=4, tokens_per_batch=128, ontology=None)
Expected results
The code above should create the `features` object without raising an assertion error.
Actual results
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/datasets/builder.py", line 1726, in _prepare_split_single
for key, record in generator:
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/datasets/packaged_modules/generator/generator.py", line 30, in _generate_examples
for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/femr/models/processor.py", line 204, in _batch_generator
creator.add_patient(dataset[patient_index.item()], offset, length)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/femr/models/processor.py", line 127, in add_patient
num_added = self.task.add_event(last_time, event["time"], features)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/femr/models/tasks.py", line 101, in add_event
assert is_valid, (
AssertionError: We have labels that appear to be before birth? 1 {'prediction_time': datetime.datetime(1956, 8, 20, 0, 0), 'boolean_value': False} 2017-05-15 00:00:00 2017-05-24 00:00:00
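The assertion message prints the 1956 label next to two 2017 timestamps. These look like the `last_time` and `event["time"]` arguments passed to `add_event` in the traceback, though that reading is an assumption and has not been checked against the femr source. One way to rule out a problem in the labels themselves, without sharing the dataset, is to check them against each patient's earliest event before calling `compute_features`. The sketch below is plain Python and not part of femr: `find_suspect_labels` and the `first_event_time` mapping are hypothetical names, and the 1950 date in the usage lines is a placeholder rather than a value from the real dataset.

```python
import datetime


def find_suspect_labels(labels, first_event_time):
    """Return (label, reason) pairs for labels that look inconsistent.

    labels: list of dicts with 'patient_id' and 'prediction_time' keys.
    first_event_time: hypothetical dict mapping patient_id to the time of
        that patient's earliest event (built separately from the dataset).
    """
    suspects = []
    last_seen = {}  # latest prediction_time seen so far, per patient
    for label in labels:
        pid = label["patient_id"]
        when = label["prediction_time"]
        first = first_event_time.get(pid)
        if first is None:
            suspects.append((label, "patient has no events"))
        elif when < first:
            suspects.append((label, "before first event"))
        elif pid in last_seen and when < last_seen[pid]:
            suspects.append((label, "not sorted by prediction_time"))
        last_seen[pid] = max(when, last_seen.get(pid, when))
    return suspects


test_labels = [
    {"patient_id": 1, "prediction_time": datetime.datetime(1956, 8, 20), "boolean_value": False},
    {"patient_id": 1, "prediction_time": datetime.datetime(1994, 3, 23), "boolean_value": False},
    {"patient_id": 1, "prediction_time": datetime.datetime(2019, 1, 11), "boolean_value": False},
]

# Placeholder first-event time for patient 1; replace with the real value.
first_event_time = {1: datetime.datetime(1950, 1, 1)}
suspects = find_suspect_labels(test_labels, first_event_time)
```

If `suspects` comes back empty for the real first-event times, the labels are sorted and none precede the patient's first event, which would point at how labels are matched to patients or events during batching rather than at the labels.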
Environment info
- `datasets` version: 2.16.1
- Platform: Linux-5.15.0-1053-azure-x86_64-with-glibc2.35
- Python version: 3.10.12
- `huggingface_hub` version: 0.20.2
- PyArrow version: 14.0.2
- Pandas version: 1.5.3
- `fsspec` version: 2023.6.0
EthanSteinberg commented
This was discussed offline and appears to be fixed now.