[Speedster] With Hugging Face notebook code on nebulydocker/nebullvm container: RuntimeError: Expected all tensors to be on the same device
trent-s opened this issue · 5 comments
Hi! Thank you for your continued work with this project! I would like to report a possible TensorFlow GPU configuration issue with the documented nebulydocker/nebullvm container that appears to prevent notebook code from running.
I am trying to use code in the Hugging Face notebook found at
https://github.com/nebuly-ai/nebuly/blob/main/optimization/speedster/notebooks/huggingface/Accelerate_Hugging_Face_PyTorch_BERT_with_Speedster.ipynb
and am running in the current nebulydocker/nebullvm Docker container documented at
https://docs.nebuly.com/Speedster/installation/#optional-download-docker-images-with-frameworks-and-optimizers
Here is the exact Python code I am trying to run (essentially the notebook code with a couple of diagnostic lines added):
#!/usr/bin/python
import os
import torch
from transformers import BertTokenizer, BertModel
import random
from speedster import optimize_model

tensorrt_path = "/usr/local/lib/python3.8/dist-packages/tensorrt"
if os.path.exists(tensorrt_path):
    os.environ['LD_LIBRARY_PATH'] += f":{tensorrt_path}"
else:
    print("Unable to find TensorRT path. ONNXRuntime won't use TensorrtExecutionProvider.")

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', torchscript=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

sentences = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]

len_dataset = 100
texts = []
for _ in range(len_dataset):
    n_times = random.randint(1, 30)
    texts.append(" ".join(random.choice(sentences) for _ in range(n_times)))

encoded_inputs = [tokenizer(text, return_tensors="pt") for text in texts]

dynamic_info = {
    "inputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
    ],
    "outputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch'},
    ]
}

optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["onnx_tensor_rt", "onnx_tvm", "onnxruntime", "tensor_rt", "tvm"],
    device=str(device),
    dynamic_info=dynamic_info,
)

print("Type of optimized model: " + str(type(optimized_model)) + " on device: " + str(optimized_model.device))

encoded_inputs = [tokenizer(text, return_tensors="pt").to(device) for text in texts]

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

print(final_out)
Just in case it is useful, starting up the container looks like this:
$ docker run -ti --rm -v ~/data:/data -v ~/src:/src --gpus=all nebulydocker/nebullvm:latest
=====================
== NVIDIA TensorRT ==
=====================
NVIDIA Release 23.03 (build 54538654)
NVIDIA TensorRT Version 8.5.3
Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
https://developer.nvidia.com/tensorrt
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh
To install the open-source samples corresponding to this TensorRT release version
run /opt/tensorrt/install_opensource.sh. To build the open source parsers,
plugins, and samples for current top-of-tree on master or a different branch,
run /opt/tensorrt/install_opensource.sh -b <branch>
See https://github.com/NVIDIA/TensorRT for more information.
And this is the output that I get when running the above code:
2023-06-21 07:44:32.387780: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-21 07:44:32.437353: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-21 07:44:34.329062: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2023-06-21 07:44:42 | INFO | Running Speedster on GPU:0
2023-06-21 07:44:46 | INFO | Benchmark performance of original model
2023-06-21 07:44:47 | INFO | Original model latency: 0.011019186973571777 sec/iter
============= Diagnostic Run torch.onnx.export version 2.0.0+cu118 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
2023-06-21 07:44:53 | INFO | [1/2] Running PyTorch Optimization Pipeline
2023-06-21 07:44:53 | INFO | Optimizing with PytorchBackendCompiler and q_type: None.
2023-06-21 07:44:54 | WARNING | Unable to trace model with torch.fx
2023-06-21 07:46:04 | INFO | Optimized model latency: 0.007783412933349609 sec/iter
2023-06-21 07:46:04 | INFO | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-06-21 07:46:04 | WARNING | Unable to trace model with torch.fx
2023-06-21 07:47:44 | INFO | Optimized model latency: 0.007919073104858398 sec/iter
2023-06-21 07:47:44 | INFO | [2/2] Running ONNX Optimization Pipeline
[Speedster results on Tesla V100-PCIE-32GB]
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric ┃ Original Model ┃ Optimized Model ┃ Improvement ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━┫
┃ backend ┃ PYTORCH ┃ TorchScript ┃ ┃
┃ latency ┃ 0.0110 sec/batch ┃ 0.0078 sec/batch ┃ 1.42x ┃
┃ throughput ┃ 90.75 data/sec ┃ 128.48 data/sec ┃ 1.42x ┃
┃ model size ┃ 438.03 MB ┃ 438.35 MB ┃ 0% ┃
┃ metric drop ┃ ┃ 0 ┃ ┃
┃ techniques ┃ ┃ fp32 ┃ ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━┛
Max speed-up with your input parameters is 1.42x. If you want to get a faster optimized model, see the following link for some suggestions: https://docs.nebuly.com/Speedster/advanced_options/#acceleration-suggestions
Type of optimized model: <class 'nebullvm.operations.inference_learners.huggingface.HuggingFaceInferenceLearner'> on device: None
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /src/./sample.py:68 in <module> │
│ │
│ 65 # Warmup for 30 iterations │
│ 66 for encoded_input in encoded_inputs[:30]: │
│ 67 │ with torch.no_grad(): │
│ ❱ 68 │ │ final_out = model(**encoded_input) │
│ 69 │
│ 70 print (final_out) │
│ 71 │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py:1013 in forward │
│ │
│ 1010 │ │ # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x s │
│ 1011 │ │ head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) │
│ 1012 │ │ │
│ ❱ 1013 │ │ embedding_output = self.embeddings( │
│ 1014 │ │ │ input_ids=input_ids, │
│ 1015 │ │ │ position_ids=position_ids, │
│ 1016 │ │ │ token_type_ids=token_type_ids, │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py:230 in forward │
│ │
│ 227 │ │ │ │ token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self. │
│ 228 │ │ │
│ 229 │ │ if inputs_embeds is None: │
│ ❱ 230 │ │ │ inputs_embeds = self.word_embeddings(input_ids) │
│ 231 │ │ token_type_embeddings = self.token_type_embeddings(token_type_ids) │
│ 232 │ │ │
│ 233 │ │ embeddings = inputs_embeds + token_type_embeddings │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py:162 in forward │
│ │
│ 159 │ │ │ │ self.weight[self.padding_idx].fill_(0) │
│ 160 │ │
│ 161 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 162 │ │ return F.embedding( │
│ 163 │ │ │ input, self.weight, self.padding_idx, self.max_norm, │
│ 164 │ │ │ self.norm_type, self.scale_grad_by_freq, self.sparse) │
│ 165 │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:2210 in embedding │
│ │
│ 2207 │ │ # torch.embedding_renorm_ │
│ 2208 │ │ # remove once script supports set_grad_enabled │
│ 2209 │ │ _no_grad_embedding_renorm_(weight, input, max_norm, norm_type) │
│ ❱ 2210 │ return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) │
│ 2211 │
│ 2212 │
│ 2213 def embedding_bag( │
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Attempting to call the model appears to cause the final RuntimeError:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
This seems like it may be related to optimized_model.device being None.
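In case it helps, here is a minimal sketch of the device check I can run right before the warmup loop (hypothetical diagnostic code, not from the notebook):

# Hypothetical diagnostic: report where the weights and inputs actually live.
print("model weights on:", next(model.parameters()).device)
for name, tensor in encoded_inputs[0].items():
    print(f"input '{name}' on:", tensor.device)
print("optimized_model.device:", optimized_model.device)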
Just FYI, GPU seems to be accessible on this container:
# nvidia-smi
Wed Jun 21 09:05:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-32GB Off| 00000000:AF:00.0 Off | 0 |
| N/A 33C P0 23W / 250W| 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE-32GB Off| 00000000:D8:00.0 Off | 0 |
| N/A 32C P0 24W / 250W| 5MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
# python -c "import torch; print(torch.cuda.is_available())"
True
Thank you for looking at this.
Thank you very much for taking a look at this. That is a good point. The "cannot dlopen some GPU libraries" message sounds serious.
I have a question about the workaround you suggested. I tried to call optimized_model.to(device) to force the model onto the GPU, but as the following output shows, there is no .to() method:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/./hGPT2gpu.py:67 in <module> │
│ │
│ 64 │
│ 65 print ("Type of optimized model: "+str(type(optimized_model)) + " on device: "+str(optim │
│ 66 print("moving model to gpu") │
│ ❱ 67 optimized_model.to(device) │
│ 68 print ("Type of optimized model: "+str(type(optimized_model)) + " on device: "+str(optim │
│ 69 │
│ 70 # print (dir(optimized_model)) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'HuggingFaceInferenceLearner' object has no attribute 'to'
Is there another way to move the model to cuda? Thanks!
It's an InferenceLearner object.
I am not exactly sure how to move it, but a higher-level approach would be to get the underlying model out of the inference learner and move that to the GPU, something like the sketch below.
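A minimal sketch of that idea, assuming the learner keeps the compiled model as an ordinary torch.nn.Module attribute (the internal attribute layout of HuggingFaceInferenceLearner is an assumption here, so the code just probes for it):

import torch

# Hypothetical sketch: probe the inference learner for any torch.nn.Module
# it stores and move that module to the target device.
for attr_name, attr_value in vars(optimized_model).items():
    if isinstance(attr_value, torch.nn.Module):
        print(f"moving '{attr_name}' to {device}")
        attr_value.to(device)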
Thanks! That sounds like a good suggestion. I will try that!
Seems related to pytorch/pytorch#72175; the solution is to first export to ONNX on CPU, then optimize it on the GPU, roughly along the lines of the sketch below.
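A minimal sketch of that workaround, assuming a plain torch.onnx.export on CPU followed by ONNX Runtime with the CUDA execution provider; the output path "bert.onnx", the opset version, and the output names are arbitrary choices here, not values taken from the notebook:

import torch
import onnxruntime as ort

# Export on CPU to avoid the device mismatch during tracing.
model_cpu = model.to("cpu").eval()
sample = {k: v.to("cpu") for k, v in encoded_inputs[0].items()}
torch.onnx.export(
    model_cpu,
    (sample["input_ids"], sample["attention_mask"], sample["token_type_ids"]),
    "bert.onnx",  # arbitrary output path
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "num_tokens"},
        "attention_mask": {0: "batch", 1: "num_tokens"},
        "token_type_ids": {0: "batch", 1: "num_tokens"},
        "last_hidden_state": {0: "batch", 1: "num_tokens"},
        "pooler_output": {0: "batch"},
    },
    opset_version=14,
)

# Then run (or further optimize) the exported model on the GPU.
session = ort.InferenceSession(
    "bert.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
outputs = session.run(None, {k: v.numpy() for k, v in sample.items()})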