[Bug]: Error when starting the Triton server
Mrzhiyao opened this issue · 11 comments
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
root@85d70c862b32:/opt/tritonserver# tritonserver --model-repository `pwd`/models
W1109 05:31:06.568839 124 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I1109 05:31:06.568981 124 cuda_memory_manager.cc:115] CUDA memory pool disabled
I1109 05:31:06.569292 124 tritonserver.cc:2176]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.24.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tens |
| | or_data statistics trace |
| model_repository_path[0] | /opt/tritonserver/models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I1109 05:31:06.569348 124 server.cc:257] No server context available. Exiting immediately.
error: creating server: Internal - failed to stat file /opt/tritonserver/models
Expected Behavior
I am following the official documentation to deploy the Triton server and use Towhee to speed up encoding.
I got the error above at the "Start the Triton server" step, after entering the server container.
However, I can use Towhee for encoding in my local environment when I don't go through the Triton server. Is the CUDA driver/runtime version mismatch mentioned in the error message the reason it fails to start? How can I proceed?
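For context, the two symptoms in the log point at different problems: the "CUDA driver version is insufficient" warning (together with nvidia-smi being unavailable inside the container, shown further below) usually means the container was started without GPU access, i.e. without --gpus, while "failed to stat file /opt/tritonserver/models" means the path passed to --model-repository does not exist inside the container. A minimal diagnostic sketch, using the paths from the log above:

# Run inside the Triton container:
nvidia-smi                       # missing or failing if the container was started without --gpus
ls -l /opt/tritonserver/models   # the --model-repository path must exist inside the container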
Steps To Reproduce
1. Build Image

from towhee import pipe, ops, AutoConfig
import numpy as np

p = (
    pipe.input('text')
        .map('text', 'vec',
             ops.sentence_embedding.sbert(model_name='paraphrase-multilingual-mpnet-base-v2'),
             config=AutoConfig.TritonGPUConfig())
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .output('vec')
)

import towhee

towhee.build_docker_image(
    dc_pipeline=p,
    image_name='clip:v1',
    cuda_version='11.7',  # '117dev' for developer
    format_priority=['onnx'],
    parallelism=4,
    inference_server='triton'
)
2. Create models

import towhee
from towhee import pipe, ops, AutoConfig
import numpy as np

p = (
    pipe.input('text')
        .map('text', 'vec',
             ops.sentence_embedding.sbert(model_name='paraphrase-multilingual-mpnet-base-v2'),
             config=AutoConfig.TritonGPUConfig())
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .output('vec')
)

towhee.build_pipeline_model(
    dc_pipeline=p,
    model_root='models',
    format_priority=['onnx'],
    parallelism=4,
    server='triton'
)
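The step that follows either of these is starting the Triton server; a sketch of how each variant is typically launched, where the flags, image tag, and mount path are assumptions to adapt to your setup:

# Option 1 (after build_docker_image): run the image built by Towhee.
docker run -td --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 clip:v1

# Option 2 (after build_pipeline_model): serve the generated ./models directory
# with the stock NGC Triton image (the 22.07 release shown in the logs).
docker run --gpus=all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $(pwd)/models:/models \
    nvcr.io/nvidia/tritonserver:22.07-py3 \
    tritonserver --model-repository=/models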
Environment
- Towhee version: 1.1.2
- OS: Ubuntu
- GPU: NVIDIA GeForce RTX 3090
- Triton Server container: 22.07 (Triton 2.24.0)
- CUDA: 11.7
- CUDA driver: 535.129.03
(base) eg@eg-HP-Z8-G4-Workstation:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
(base) eg@eg-HP-Z8-G4-Workstation:~$ nvidia-smi
Thu Nov 9 14:03:48 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
root@85d70c862b32:/opt/tritonserver# nvcc -v
nvcc fatal : No input files specified; use option --help for more information
root@85d70c862b32:/opt/tritonserver# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
root@85d70c862b32:/opt/tritonserver# nvidia-smi
bash: nvidia-smi: command not found
Anything else?
No response
This problem was solved after I restarted the container, but a new error occurred when executing the program.
Traceback (most recent call last):
File "/home/eg/PycharmProjects/Towhee/triton_endcod.py", line 8, in
res = client(data)
File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/towhee/serve/triton/pipeline_client.py", line 81, in call
return self._loop.run_until_complete(self._call(inputs))[0]
File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/towhee/serve/triton/pipeline_client.py", line 68, in _call
response = await self._client.infer(self._model_name, inputs)
File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/tritonclient/http/aio/init.py", line 757, in infer
response = await self._post(
File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/tritonclient/http/aio/init.py", line 209, in _post
res = await self._stub.post(
File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/client.py", line 586, in _request
await resp.start(conn)
File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 920, in start
self._continue = None
File "/home/eg/anaconda3/envs/towhee38/lib/python3.8/site-packages/aiohttp/helpers.py", line 725, in exit
raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError
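For reference, the client in triton_endcod.py presumably looks something like the sketch below (the import path and constructor arguments are assumptions based on the traceback and the Towhee docs; the URL must use whatever host port the container's port 8000 is mapped to). A timeout here usually means the client points at the wrong port, or the first request arrives while the models are still loading.

# A sketch of the client script; verify the import path against your Towhee version.
from towhee import triton_client

# Port 8000 inside the container is mapped to a host port (8010 later in this thread);
# the client URL must match it.
client = triton_client.Client(url='localhost:8010')

data = 'hello towhee'
res = client(data)   # the call that raised asyncio.TimeoutError above
print(res)
client.close()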
It seems that access to the Triton server timed out. Are there any logs on the server?
The docker logs show:
NVIDIA Release 22.07 (build 41737377)
Triton Server Version 2.24.0
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
I1109 06:53:09.532688 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6a4e000000' with size 268435456
I1109 06:53:09.533016 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1109 06:53:09.536004 1 model_repository_manager.cc:1206] loading: pipeline:1
I1109 06:53:09.536049 1 model_repository_manager.cc:1206] loading: sentence-embedding.sbert-0:1
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:11.225232 1 onnxruntime.cc:2458] TRITONBACKEND_Initialize: onnxruntime
I1109 06:53:11.225295 1 onnxruntime.cc:2468] Triton TRITONBACKEND API version: 1.10
I1109 06:53:11.225317 1 onnxruntime.cc:2474] 'onnxruntime' TRITONBACKEND API version: 1.10
I1109 06:53:11.225331 1 onnxruntime.cc:2504] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I1109 06:53:11.259270 1 onnxruntime.cc:2560] TRITONBACKEND_ModelInitialize: sentence-embedding.sbert-0 (version 1)
W1109 06:53:14.630221 1 onnxruntime.cc:787] autofilled max_batch_size to 4 for model 'sentence-embedding.sbert-0' since batching is supporrted but no max_batch_size is specified in model configuration. Must specify max_batch_size to utilize autofill with a larger max batch size
I1109 06:53:14.685000 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_0 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:17.996107 1 onnxruntime.cc:2603] TRITONBACKEND_ModelInstanceInitialize: sentence-embedding.sbert-0_0 (GPU device 0)
I1109 06:53:20.312004 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_1 (CPU device 0)
I1109 06:53:20.312255 1 model_repository_manager.cc:1352] successfully loaded 'sentence-embedding.sbert-0' version 1
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:23.568245 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_2 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:26.839855 1 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: pipeline_0_3 (CPU device 0)
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (2.0.7) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
I1109 06:53:30.081773 1 model_repository_manager.cc:1352] successfully loaded 'pipeline' version 1
I1109 06:53:30.082043 1 server.cc:559]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1109 06:53:30.082215 1 server.cc:586]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/b |
| | | ackends","default-max-batch-size":"4"}} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/b |
| | | ackends","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
I1109 06:53:30.082348 1 server.cc:629]
+----------------------------+---------+--------+
| Model | Version | Status |
+----------------------------+---------+--------+
| pipeline | 1 | READY |
| sentence-embedding.sbert-0 | 1 | READY |
+----------------------------+---------+--------+
I1109 06:53:30.135753 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I1109 06:53:30.136027 1 tritonserver.cc:2176]
I1109 06:53:30.137643 1 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I1109 06:53:30.137940 1 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I1109 06:53:30.179419 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
Check that the server is available: curl http://0.0.0.0:8000/v2/models/stats
I mapped the server to local port 8010, so I get the result below. What could be the cause of the error in this case? Thank you for your help.
(base) eg@eg-HP-Z8-G4-Workstation:~$ curl http://0.0.0.0:8010/v2/models/stats
{"model_stats":[{"name":"pipeline","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"batch_stats":[]},{"name":"sentence-embedding.sbert-0","version":"1","last_inference":0,"inference_count":0,"execution_count":0,"inference_stats":{"success":{"count":0,"ns":0},"fail":{"count":0,"ns":0},"queue":{"count":0,"ns":0},"compute_input":{"count":0,"ns":0},"compute_infer":{"count":0,"ns":0},"compute_output":{"count":0,"ns":0},"cache_hit":{"count":0,"ns":0},"cache_miss":{"count":0,"ns":0}},"batch_stats":[]}]}
Try ops.sentence_embedding.transformers; sbert has some bugs. This pipeline works fine.
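A sketch of the suggested change, i.e. the same pipeline with the sbert operator swapped for ops.sentence_embedding.transformers (the model name below is an assumption; use the Hugging Face identifier of the model you need):

from towhee import pipe, ops, AutoConfig
import numpy as np

p = (
    pipe.input('text')
        .map('text', 'vec',
             # assumed model identifier; pick the transformers model you actually need
             ops.sentence_embedding.transformers(model_name='sentence-transformers/paraphrase-multilingual-mpnet-base-v2'),
             config=AutoConfig.TritonGPUConfig())
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .output('vec')
)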
Thank you for your help; I think my problem has been resolved. One more question: which parameters can I tune to further speed up encoding when accelerating model inference through the Triton server?
It is possible to optimize performance by adjusting parameters such as the number of instances and batch size. For more information, please refer to the Triton documentation: https://github.com/triton-inference-server/server
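On the Towhee side, these knobs are set when the pipeline is built, via AutoConfig.TritonGPUConfig, and passed as config= in the .map() calls before rebuilding the image or models. A sketch with the parameter names as given in the Towhee documentation (verify them against your installed version):

from towhee import AutoConfig

# Assumed parameter names; check AutoConfig.TritonGPUConfig in your Towhee version.
config = AutoConfig.TritonGPUConfig(
    device_ids=[0],                  # GPUs to place model instances on
    num_instances_per_device=3,      # more instances handle more concurrent requests
    max_batch_size=128,              # upper bound on batch size
    batch_latency_micros=100000,     # how long the server may wait to form a batch
    preferred_batch_size=[8, 16],    # batch sizes the scheduler tries to build
)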
Thank you very much for your help. I think my problem has been resolved.