som-shahlab/femr

intermittent issue with `LabeledPatientTask.add_event` when calling `femr.models.transformer.compute_features`

duncanmcelfresh opened this issue · 1 comment

Describe the bug

`LabeledPatientTask.add_event` intermittently fails with an assertion error ("We have labels that appear to be before birth?") when `femr.models.transformer.compute_features` is called with labels created for all patients and events in a dataset.

Steps to reproduce the bug


I cannot share the dataset. The events in the dataset include all of the dates in the test labels.
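The claim above can be sanity-checked without sharing the data. The following is a minimal sketch, not part of femr: `labels_outside_event_range` is a hypothetical helper, and it assumes each patient is a dict with a `patient_id` and a list of `events` that each carry a `time` (the traceback in this report only confirms that events are indexed by `"time"`). The toy patient and labels are made up for illustration.

```python
import datetime


def labels_outside_event_range(patient, labels):
    """Return this patient's labels whose prediction_time falls outside
    the span between the patient's first and last event.

    Hypothetical helper; the patient layout is an assumption.
    """
    own_labels = [lab for lab in labels if lab["patient_id"] == patient["patient_id"]]
    times = [event["time"] for event in patient["events"]]
    if not times:
        # No events at all: no label can be covered.
        return own_labels
    first, last = min(times), max(times)
    return [lab for lab in own_labels if not (first <= lab["prediction_time"] <= last)]


# Made-up patient and labels, for illustration only.
toy_patient = {
    "patient_id": 1,
    "events": [
        {"time": datetime.datetime(1990, 1, 1)},
        {"time": datetime.datetime(2000, 1, 1)},
    ],
}
toy_labels = [
    {"patient_id": 1, "prediction_time": datetime.datetime(1985, 1, 1), "boolean_value": False},
    {"patient_id": 1, "prediction_time": datetime.datetime(1995, 1, 1), "boolean_value": False},
    {"patient_id": 2, "prediction_time": datetime.datetime(1980, 1, 1), "boolean_value": False},
]

flagged = labels_outside_event_range(toy_patient, toy_labels)
for lab in flagged:
    print(lab["patient_id"], lab["prediction_time"])
```

If a check like this returns nothing for every patient in the real dataset, the labels are consistent with the events, which would point to the way labels are matched to events during batching rather than to the labels themselves.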

The following code snippet raises the assertion error:

import datetime
import os

import femr.models.transformer

# test_dataset and TARGET_DIR are defined elsewhere (the dataset cannot be shared).
test_labels = [
    {'patient_id': 1, 'prediction_time': datetime.datetime(1956, 8, 20, 0, 0), 'boolean_value': False},
    {'patient_id': 1, 'prediction_time': datetime.datetime(1994, 3, 23, 0, 0), 'boolean_value': False},
    {'patient_id': 1, 'prediction_time': datetime.datetime(2019, 1, 11, 0, 0), 'boolean_value': False},
]

features = femr.models.transformer.compute_features(
    test_dataset, os.path.join(TARGET_DIR, "motor_model"), test_labels,
    num_proc=4, tokens_per_batch=128, ontology=None,
)

However, if the third label's prediction time is changed to an earlier date (1997 instead of 2019), the assertion error is not raised. The following code snippet runs without error:

import datetime
import os

import femr.models.transformer

# test_dataset and TARGET_DIR are defined elsewhere (the dataset cannot be shared).
test_labels = [
    {'patient_id': 1, 'prediction_time': datetime.datetime(1956, 8, 20, 0, 0), 'boolean_value': False},
    {'patient_id': 1, 'prediction_time': datetime.datetime(1994, 3, 23, 0, 0), 'boolean_value': False},
    {'patient_id': 1, 'prediction_time': datetime.datetime(1997, 1, 11, 0, 0), 'boolean_value': False},
]

features = femr.models.transformer.compute_features(
    test_dataset, os.path.join(TARGET_DIR, "motor_model"), test_labels,
    num_proc=4, tokens_per_batch=128, ontology=None,
)

Expected results

The code above should create the `features` object without raising an assertion error.

Actual results

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/datasets/builder.py", line 1726, in _prepare_split_single
    for key, record in generator:
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/datasets/packaged_modules/generator/generator.py", line 30, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/femr/models/processor.py", line 204, in _batch_generator
    creator.add_patient(dataset[patient_index.item()], offset, length)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/femr/models/processor.py", line 127, in add_patient
    num_added = self.task.add_event(last_time, event["time"], features)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-dc9f0838-a900-4b8d-b5e3-8aacc4a713f0/lib/python3.10/site-packages/femr/models/tasks.py", line 101, in add_event
    assert is_valid, (
AssertionError: We have labels that appear to be before birth? 1 {'prediction_time': datetime.datetime(1956, 8, 20, 0, 0), 'boolean_value': False} 2017-05-15 00:00:00 2017-05-24 00:00:00

Environment info

  • datasets version: 2.16.1
  • Platform: Linux-5.15.0-1053-azure-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.20.2
  • PyArrow version: 14.0.2
  • Pandas version: 1.5.3
  • fsspec version: 2023.6.0

This was discussed offline and appears to be fixed now.