possible device issue with `transformer.compute_features`
duncanmcelfresh opened this issue
Describe the bug

When running `transformer.compute_features` with `device='cuda'`, an error occurs where it appears we are trying to run numpy/CPU operations on a tensor that lives on the GPU (this error: https://discuss.pytorch.org/t/how-to-fix-cant-convert-cuda-0-device-type-tensor-to-numpy-use-tensor-cpu-to-copy-the-tensor-to-host-memory-first/159656).

Note: this error does not occur with `device='cpu'`.
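The underlying failure is generic PyTorch behavior and can be reproduced in isolation. A minimal sketch, assuming a CUDA-capable machine (no femr involved):

```python
import numpy as np
import torch

t = torch.arange(4, device="cuda")  # tensor lives in GPU memory

# Raises: TypeError: can't convert cuda:0 device type tensor to numpy.
#         Use Tensor.cpu() to copy the tensor to host memory first.
np.array(t)

# Works: numpy can only read host (CPU) memory.
np.array(t.cpu())
```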
Steps to reproduce the bug
I do not yet have a reproducible example. Assume you have a MEDS-format dataset, a MOTOR model at `model_path`, and a set of MEDS labels.
```python
import femr.models.transformer

features = femr.models.transformer.compute_features(
    dataset=dataset,
    model_path=model_path,
    labels=labels,
    num_proc=4,
    tokens_per_batch=1024,
    device='cuda',
    ontology=None,
)
```
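As noted above, the same call appears to complete when run on CPU, which may serve as a (much slower) workaround until the device handling is fixed:

```python
# Identical call with device='cpu'; this path does not hit the
# numpy-on-CUDA-tensor error, at the cost of slower inference.
features = femr.models.transformer.compute_features(
    dataset=dataset,
    model_path=model_path,
    labels=labels,
    num_proc=4,
    tokens_per_batch=1024,
    device='cpu',
    ontology=None,
)
```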
Expected results
Function returns without error.
Actual results
Traceback:
```
...
---> 13 features = femr.models.transformer.compute_features(
     14     dataset=dataset,
     15     # dataset=dataset,
     16     model_path=model_path,
     17     labels=labels,
     18     num_proc=4,
     19     tokens_per_batch=1024,
     20     device='cuda',
     21     ontology=None
     22 )

File /.../lib/python3.10/site-packages/femr/models/transformer.py:409, in compute_features(dataset, model_path, labels, num_proc, tokens_per_batch, device, ontology)
    406 all_representations = []
    408 for batch in batches:
--> 409     batch = processor.collate([batch])["batch"]
    410     with torch.no_grad():
    411         patient_ids, feature_times, representations = model(batch)

File /.../lib/python3.10/site-packages/femr/models/processor.py:313, in FEMRBatchProcessor.collate(self, batches)
    311 def collate(self, batches: List[Mapping[str, Any]]) -> Mapping[str, Any]:
    312     assert len(batches) == 1, "Can only have one batch when collating"
--> 313     return {"batch": _add_dimension(self.creator.cleanup_batch(batches[0]))}

File /.../lib/python3.10/site-packages/femr/models/processor.py:258, in BatchCreator.cleanup_batch(self, batch)
    253 def cleanup_batch(self, batch: Dict[str, Any]) -> Dict[str, Any]:
    254     """Clean a batch, applying final processing.
    255
    256     This is necessary as some tasks use sparse matrices that need to be postprocessed."""
--> 258     batch["transformer"]["patient_lengths"] = np.array(batch["transformer"]["patient_lengths"])
    259     assert isinstance(batch["transformer"]["patient_lengths"], np.ndarray)
    261     # BUG: possible issue: is "task" not getting passed here, even if self.task is not None?

File /databricks/python/lib/python3.10/site-packages/torch/_tensor.py:970, in Tensor.__array__(self, dtype)
    968     return handle_torch_function(Tensor.__array__, (self,), self, dtype=dtype)
    969 if dtype is None:
--> 970     return self.numpy()
    971 else:
    972     return self.numpy().astype(dtype, copy=False)

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
```
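Reading the traceback, `batch["transformer"]["patient_lengths"]` appears to already be a CUDA tensor by the time `cleanup_batch` calls `np.array` on it. A minimal sketch of the kind of guard that would avoid this (`to_numpy` is a hypothetical helper, not femr API):

```python
import numpy as np
import torch

def to_numpy(value):
    # Hypothetical helper: numpy cannot consume CUDA tensors directly,
    # so GPU tensors must be detached and copied to host memory first.
    if isinstance(value, torch.Tensor):
        return value.detach().cpu().numpy()
    return np.array(value)

# The failing line in BatchCreator.cleanup_batch could then become:
# batch["transformer"]["patient_lengths"] = to_numpy(batch["transformer"]["patient_lengths"])
```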
Environment info
This occurs in a Databricks 14.3 LTS ML environment, on an Azure Standard_NC24ads_A100_v4 instance.
- `datasets` version: 2.15.0
- Platform: Linux-5.15.0-1056-azure-x86_64-with-glibc2.35
- Python version: 3.10.12
- `huggingface_hub` version: 0.19.4
- PyArrow version: 8.0.0
- Pandas version: 1.5.3
- `fsspec` version: 2023.6.0
Note: this occurs on a version of the `femrv2_develop` branch.
I believe Michael just fixed this. Closing.