som-shahlab/femr

possible device issue with `transformer.compute_features`

duncanmcelfresh opened this issue

Describe the bug

When running transformer.compute_features with device='cuda', an error occurs: it appears we are trying to run NumPy/CPU operations on a tensor that is still on the GPU (this error: https://discuss.pytorch.org/t/how-to-fix-cant-convert-cuda-0-device-type-tensor-to-numpy-use-tensor-cpu-to-copy-the-tensor-to-host-memory-first/159656).

Note: This error does not occur with device='cpu'.
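
For reference, the underlying failure mode can be reproduced outside of femr with a minimal sketch (assuming a machine with a CUDA device): converting a GPU tensor with np.array goes through Tensor.__array__ and fails exactly as in the traceback below.

import numpy as np
import torch

# Minimal sketch of the underlying failure (assumes a CUDA device is available).
t = torch.arange(4, device='cuda')

try:
    np.array(t)  # calls Tensor.__array__ -> self.numpy(), which fails for CUDA tensors
except TypeError as e:
    print(e)  # "can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() ..."

np.array(t.cpu())  # works once the tensor is copied to host memory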


Steps to reproduce the bug

I do not yet have a reproducible example. Assume you have a MEDS-format dataset, a MOTOR model at model_path, and a set of MEDS labels.

import femr.models.transformer

features = femr.models.transformer.compute_features(
    dataset=dataset,
    model_path=model_path,
    labels=labels,
    num_proc=4,
    tokens_per_batch=1024,
    device='cuda',
    ontology=None,
)

Expected results

Function returns without error.

Actual results

Traceback:

...
---> 13 features = femr.models.transformer.compute_features(
     14     dataset=dataset,
     15     # dataset=dataset,
     16     model_path=model_path, 
     17     labels=labels, 
     18     num_proc=4, 
     19     tokens_per_batch=1024, 
     20     device='cuda',
     21     ontology=None
     22     )

File /.../lib/python3.10/site-packages/femr/models/transformer.py:409, in compute_features(dataset, model_path, labels, num_proc, tokens_per_batch, device, ontology)
    406 all_representations = []
    408 for batch in batches:
--> 409     batch = processor.collate([batch])["batch"]
    410     with torch.no_grad():
    411         patient_ids, feature_times, representations = model(batch)

File /.../lib/python3.10/site-packages/femr/models/processor.py:313, in FEMRBatchProcessor.collate(self, batches)
    311 def collate(self, batches: List[Mapping[str, Any]]) -> Mapping[str, Any]:
    312     assert len(batches) == 1, "Can only have one batch when collating"
--> 313     return {"batch": _add_dimension(self.creator.cleanup_batch(batches[0]))}

File /.../lib/python3.10/site-packages/femr/models/processor.py:258, in BatchCreator.cleanup_batch(self, batch)
    253 def cleanup_batch(self, batch: Dict[str, Any]) -> Dict[str, Any]:
    254     """Clean a batch, applying final processing.
    255 
    256     This is necessary as some tasks use sparse matrices that need to be postprocessed."""
--> 258     batch["transformer"]["patient_lengths"] = np.array(batch["transformer"]["patient_lengths"])
    259     assert isinstance(batch["transformer"]["patient_lengths"], np.ndarray)
    261     # BUG: possible issue: is "task" not getting passed here, even if self.task is not None?

File /databricks/python/lib/python3.10/site-packages/torch/_tensor.py:970, in Tensor.__array__(self, dtype)
    968     return handle_torch_function(Tensor.__array__, (self,), self, dtype=dtype)
    969 if dtype is None:
--> 970     return self.numpy()
    971 else:
    972     return self.numpy().astype(dtype, copy=False)
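
The traceback points at BatchCreator.cleanup_batch calling np.array on patient_lengths while it is still a CUDA tensor. For reference only, a possible workaround (not necessarily the fix that landed; _to_host is a hypothetical helper) would be to copy the value to host memory before the conversion:

import numpy as np
import torch

def _to_host(value):
    # Hypothetical helper: accept either a (possibly CUDA) tensor or a plain list
    # and always return a numpy array on the CPU.
    if isinstance(value, torch.Tensor):
        return value.detach().cpu().numpy()
    return np.array(value)

# Sketch of the change in BatchCreator.cleanup_batch:
# batch["transformer"]["patient_lengths"] = _to_host(batch["transformer"]["patient_lengths"])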

Environment info

This occurs in a Databricks 14.3 LTS ML environment, on an Azure Standard_NC24ads_A100_v4 instance.

  • datasets version: 2.15.0
  • Platform: Linux-5.15.0-1056-azure-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.19.4
  • PyArrow version: 8.0.0
  • Pandas version: 1.5.3
  • fsspec version: 2023.6.0

Note: this occurs on a version of the femrv2_develop branch.

I believe Michael just fixed this. Closing.