How to serve with 2 model folders?
SidneyLann opened this issue · 17 comments
You can have multiple models in the same folder:
├── model-root
│ ├── Model1
│ │ ├── model.py
│ ├── Model2
│ │ ├── model.py
And you can start djl-serving:
djl-serving -m model-root
If you want to use a workflow to chain two models, see this example: https://github.com/deepjavalibrary/djl-demo/tree/master/djl-serving
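Once the server is up, each sub-folder is served as its own model. A minimal sketch, assuming the model names default to the folder names (Model1, Model2) and using djl-serving's per-model predictions endpoint:

import requests

# Invoke each model registered from the model-root folder above.
for name in ("Model1", "Model2"):
    resp = requests.post(
        f"http://127.0.0.1:8080/predictions/{name}",
        json={"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}},
    )
    print(name, resp.status_code, resp.text)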
You should be using the Python engine. Can you share your serving.properties?
engine=PyTorch
option.enable_lora=true
blockFactory=ai.djl.nn.IdentityBlockFactory
option.hasParameter=false
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4
job_queue_size=10
log_request_metric=true
metrics_aggregation=1
You should use our lmi container: deepjavalibrary/djl-serving:0.29.0-lmi
docker run -it --gpus all -v model_en:/opt/ml/model -p 8080:8080 deepjavalibrary/djl-serving:0.29.0-lmi
curl -X POST http://127.0.0.1:8080/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}}'
serving.properties is optional if you use the container, but you should use engine=Python:
engine=Python
option.enable_lora=true
minWorkers=1
maxWorkers=1
job_queue_size=10
log_request_metric=true
metrics_aggregation=100
curl -X POST http://127.0.0.1:8080/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}}'
{
"code":424,
"message":"prediction failure",
"error":"Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)"
}
Why can't the serving program automatically select the device?
The model loaded successfully; however, it failed during inference.
- Can you try our container? Our container has all the necessary Python packages installed.
- Are you able to run the model in plain Python? (See the sketch after this list.)
- Are you using an open-source model? Can you share the Hugging Face model id?
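For the second point, a minimal pure-Python check, assuming a standard transformers checkpoint (substitute your own model folder): load it with device_map="auto" and print hf_device_map; any submodules mapped to "cpu" or "disk" were offloaded, which is what the "two devices" error points to.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "model_en"  # path to your local model folder
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,
    device_map="auto",  # lets accelerate place layers on GPU and spill the rest to CPU
)
# Entries mapped to "cpu" or "disk" mean the model does not fully fit on the GPU,
# the usual cause of the "Expected all tensors to be on the same device" error.
print(getattr(model, "hf_device_map", "single device"))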
We don't have a python_executor configuration; if you want to use a different Python, you need to set the environment variable PYTHON_EXECUTABLE.
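For example, a minimal launcher sketch, assuming the interpreter lives at /usr/prg/python/3102/bin/python3.10 (the site-packages prefix in the traceback below suggests that path; substitute your own):

import os
import subprocess

# PYTHON_EXECUTABLE tells djl-serving which Python interpreter to start workers with.
env = dict(os.environ, PYTHON_EXECUTABLE="/usr/prg/python/3102/bin/python3.10")  # assumed path
subprocess.run(
    ["./app/serving-djl/bin/serving", "-m", "/home/sidney/app/idea/model_root/model_en"],
    env=env,
    check=True,
)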
[sidney@tech68 ~]$ ./app/serving-djl/bin/serving -m /home/sidney/app/idea/model_root/model_en
INFO ModelServer Starting model server ...
INFO ModelServer Starting djl-serving: 0.29.0 ...
INFO ModelServer
Model server home: /home/sidney/app/serving-djl
Current directory: /home/sidney
Temp directory: /tmp
Command line: -Dlog4j.configurationFile=/home/sidney/app/serving-djl/conf/log4j2.xml
Number of CPUs: 16
CUDA version: 123 / 61
Number of GPUs: 1
Max heap size: 7920
Config file: /home/sidney/app/serving-djl/conf/config.properties
Inference address: http://127.0.0.1:8090
Management address: http://127.0.0.1:8090
Default job_queue_size: 1000
Default batch_size: 1
Default max_batch_delay: 100
Default max_idle_time: 60
Model Store: /home/sidney/app/idea/model_root/model_en
Initial Models: /home/sidney/app/idea/model_root/model_en
Netty threads: 0
Maximum Request Size: 67108864
Environment variables:
PYTHON_EXECUTABLE: python310
OMP_NUM_THREADS: 1
INFO FolderScanPluginManager scanning for plugins...
INFO FolderScanPluginManager scanning in plug-in folder :/home/sidney/app/serving-djl/plugins
INFO PropertyFilePluginMetaDataReader Plugin found: plugin-management/jar:file:/home/sidney/app/serving-djl/plugins/plugin-management-plugin-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: console/jar:file:/home/sidney/app/serving-djl/plugins/management-console-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: cache-engines/jar:file:/home/sidney/app/serving-djl/plugins/cache-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: static-file-plugin/jar:file:/home/sidney/app/serving-djl/plugins/static-file-plugin-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: kserve/jar:file:/home/sidney/app/serving-djl/plugins/kserve-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: secure-mode/jar:file:/home/sidney/app/serving-djl/plugins/secure-mode-0.29.0.jar!/META-INF/plugin.definition
INFO FolderScanPluginManager Loading plugin: {console/jar:file:/home/sidney/app/serving-djl/plugins/management-console-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin console changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {plugin-management/jar:file:/home/sidney/app/serving-djl/plugins/plugin-management-plugin-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin plugin-management changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {static-file-plugin/jar:file:/home/sidney/app/serving-djl/plugins/static-file-plugin-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin static-file-plugin changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {cache-engines/jar:file:/home/sidney/app/serving-djl/plugins/cache-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin cache-engines changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {secure-mode/jar:file:/home/sidney/app/serving-djl/plugins/secure-mode-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin secure-mode changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {kserve/jar:file:/home/sidney/app/serving-djl/plugins/kserve-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin kserve changed state to INITIALIZED
INFO PluginMetaData plugin console changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin plugin-management changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin static-file-plugin changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin cache-engines changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin secure-mode changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin kserve changed state to ACTIVE reason: plugin ready
INFO FolderScanPluginManager 6 plug-ins found and loaded.
INFO ModelServer Initializing model: /home/sidney/app/idea/model_root/model_en
INFO LmiUtils Detected mpi_mode: null, rolling_batch: disable, tensor_parallel_degree 1, for modelType: llama
INFO ModelInfo M-0001: Apply per model settings:
job_queue_size: 10
max_dynamic_batch_size: 1
max_batch_delay: 100
max_idle_time: 60
load_on_devices: *
engine: Python
mpi_mode: null
option.entryPoint: null
maxWorkers: 1
log_request_metric: true
option.tensor_parallel_degree: 1
option.max_rolling_batch_size: 32
option.pipeline_parallel_degree: 1
option.enable_lora: true
minWorkers: 1
metrics_aggregation: 100
option.rolling_batch: disable
INFO Platform Found matching platform from: jar:file:/home/sidney/app/serving-djl/lib/python-0.29.0.jar!/native/lib/python.properties
INFO ModelManager Loading model on Python:[0]
INFO WorkerPool loading model model_en (M-0001, PENDING) on gpu(0) ...
INFO ModelInfo M-0001: Available CPU memory: 28964 MB, required: 0 MB, reserved: 500 MB
INFO ModelInfo M-0001: Available GPU memory: 11018 MB, required: 0 MB, reserved: 500 MB
INFO ModelInfo Loading model model_en M-0001 on gpu(0)
INFO WorkerPool scaling up min workers by 1 (from 0 to 1) workers. Total range is min 1 to max 1
INFO PyProcess Start process: 19000 - retry: 0
INFO Connection Set CUDA_VISIBLE_DEVICES=0
INFO PyProcess W-3502398-model_en-stdout: 3502398 - djl_python_engine started with args: ['--sock-type', 'unix', '--sock-name', '/tmp/djl_sock.19000', '--model-dir', '/home/sidney/app/idea/model_root/model_en', '--entry-point', '', '--device-id', '0', '--cluster-size', '1', '--recommended-entry-point', 'djl_python.huggingface']
WARN PyProcess W-3502398-model_en-stderr: 2024-08-02 05:29:36.998369: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
WARN PyProcess W-3502398-model_en-stderr: 2024-08-02 05:29:36.998398: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARN PyProcess W-3502398-model_en-stderr: 2024-08-02 05:29:36.999092: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARN PyProcess W-3502398-model_en-stderr: 2024-08-02 05:29:37.526235: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
INFO PyProcess W-3502398-model_en-stdout: Python engine started.
INFO PyProcess W-3502398-model_en-stdout: Using 1 gpus collectively.
WARN PyProcess W-3502398-model_en-stderr: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO PyProcess W-3502398-model_en-stdout: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
WARN PyProcess W-3502398-model_en-stderr:
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 25%|██▌ | 1/4 [00:58<02:56, 58.98s/it]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 50%|█████ | 2/4 [01:33<01:28, 44.33s/it]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 75%|███████▌ | 3/4 [02:08<00:40, 40.27s/it]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 100%|██████████| 4/4 [02:15<00:00, 27.22s/it]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 100%|██████████| 4/4 [02:15<00:00, 33.93s/it]
INFO PyProcess W-3502398-model_en-stdout: Some parameters are on the meta device because they were offloaded to the cpu.
INFO PyProcess W-3502398-model_en-stdout: image_placeholder_token is not explicitly set. It is highly recommended to explicitlyset the image_placeholder_token as it differs between models, and is not easy to infer from the model or tokenizer
INFO PyProcess W-3502398-model_en-stdout: could not infer image token from the model artifacts. Using as default.
INFO PyProcess Model [model_en] initialized.
INFO WorkerThread Starting worker thread WT-0001 for model model_en (M-0001, READY) on device gpu(0)
INFO ModelServer Initialize BOTH server with: EpollServerSocketChannel.
INFO PyProcess W-3502398-model_en-stdout: Registering adapter model_gen from /home/sidney/app/idea/model_root/model_en/adapters/model_gen
INFO ModelServer BOTH API bind to: http://127.0.0.1:8090
INFO PyProcess W-3502398-model_en-stdout: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
INFO PyProcess W-3502398-model_en-stdout: Failed invoke service.invoke_handler()
INFO PyProcess W-3502398-model_en-stdout: Traceback (most recent call last):
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python_engine.py", line 154, in run_server
INFO PyProcess W-3502398-model_en-stdout: outputs = self.service.invoke_handler(function_name, inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/service_loader.py", line 30, in invoke_handler
INFO PyProcess W-3502398-model_en-stdout: return getattr(self.module, function_name)(inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 507, in register_adapter
INFO PyProcess W-3502398-model_en-stdout: _service.model = PeftModel.from_pretrained(_service.model,
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/peft/peft_model.py", line 356, in from_pretrained
INFO PyProcess W-3502398-model_en-stdout: model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/peft/peft_model.py", line 760, in load_adapter
INFO PyProcess W-3502398-model_en-stdout: dispatch_model(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/accelerate/big_modeling.py", line 376, in dispatch_model
INFO PyProcess W-3502398-model_en-stdout: raise ValueError(
INFO PyProcess W-3502398-model_en-stdout: ValueError: We need an offload_dir to dispatch this model according to this device_map, the following submodules need to be offloaded: base_model.model.model.norm, base_model.model.lm_head, base_model.model.model.layers.
WARN PyProcess W-3502398-model_en-stderr: Setting pad_token_id to eos_token_id:128001 for open-end generation.
INFO PyProcess W-3502398-model_en-stdout: Failed invoke service.invoke_handler()
INFO PyProcess W-3502398-model_en-stdout: Traceback (most recent call last):
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python_engine.py", line 154, in run_server
INFO PyProcess W-3502398-model_en-stdout: outputs = self.service.invoke_handler(function_name, inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/service_loader.py", line 30, in invoke_handler
INFO PyProcess W-3502398-model_en-stdout: return getattr(self.module, function_name)(inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 544, in handle
INFO PyProcess W-3502398-model_en-stdout: return _service.inference(inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 230, in inference
INFO PyProcess W-3502398-model_en-stdout: return self._dynamic_batch_inference(parsed_input.batch, errors,
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 245, in _dynamic_batch_inference
INFO PyProcess W-3502398-model_en-stdout: prediction = self.hf_pipeline(input_data, **parameters)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 414, in wrapped_pipeline
INFO PyProcess W-3502398-model_en-stdout: output_tokens = model.generate(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
INFO PyProcess W-3502398-model_en-stdout: return func(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/generation/utils.py", line 1914, in generate
INFO PyProcess W-3502398-model_en-stdout: result = self._sample(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/generation/utils.py", line 2651, in _sample
INFO PyProcess W-3502398-model_en-stdout: outputs = self(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
INFO PyProcess W-3502398-model_en-stdout: return self._call_impl(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
INFO PyProcess W-3502398-model_en-stdout: return forward_call(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1174, in forward
INFO PyProcess W-3502398-model_en-stdout: outputs = self.model(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
INFO PyProcess W-3502398-model_en-stdout: return self._call_impl(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
INFO PyProcess W-3502398-model_en-stdout: return forward_call(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 978, in forward
INFO PyProcess W-3502398-model_en-stdout: layer_outputs = decoder_layer(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
INFO PyProcess W-3502398-model_en-stdout: return self._call_impl(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
INFO PyProcess W-3502398-model_en-stdout: return forward_call(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 715, in forward
INFO PyProcess W-3502398-model_en-stdout: hidden_states = self.input_layernorm(hidden_states)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
INFO PyProcess W-3502398-model_en-stdout: return self._call_impl(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
INFO PyProcess W-3502398-model_en-stdout: return forward_call(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 88, in forward
INFO PyProcess W-3502398-model_en-stdout: return self.weight * hidden_states.to(input_dtype)
INFO PyProcess W-3502398-model_en-stdout: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
./app/serving-djl/bin/serving -m /home/sidney/app/idea/model_root/model_en
curl -X POST http://127.0.0.1:8090/invocations \
-H "Content-Type: application/json" \
-d '{"inputs": ["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAt date 20230601<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow is the weather<|eot_id|>"], "parameters": {"max_new_tokens": 255}}'
{
"code":424,
"message":"prediction failure",
"error":"Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"
}
The same error occurs for both Python 3.9 and 3.10 when running djl-serving for the llama3 model, which is in local folders.
You still have this error during inference:
ValueError: We need an offload_dir to dispatch this model according to this device_map, the following submodules need to be offloaded: base_model.model.model.norm,
Does your model work using pure Python? Does the peft library version support your model?
# Load and run the fine-tuned model with Unsloth (FastLanguageModel comes from the unsloth package;
# config holds the local paths and settings).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config.get("model_config").get("finetuned_model"),
    max_seq_length=config.get("model_config").get("max_seq_length"),
    dtype=config.get("model_config").get("dtype"),
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
)
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    [
        "<|start_header_id|>system<|end_header_id|>\n\nAt date 20240801<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThis is the question: Can you provide an overview of the lung's squamous cell carcinoma?<|eot_id|>"
    ],
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=2048, use_cache=True)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(outputs[0])
I can run the pure Python code successfully.
The "ValueError: We need an offload_dir to dispatch this model according to this device_map, the following submodules need to be offloaded: ..." error occurs at serving-server start time, not at inference time.
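For reference, the traceback shows the error comes from adapter registration: when the base model is loaded with device_map="auto" and some layers end up on CPU, accelerate needs an offload directory before it can dispatch the adapter. Whether djl-serving's handler exposes an option for this is not confirmed here; in pure Python the equivalent registration would look roughly like the following sketch (paths taken from the logs above; the offload_folder keyword is what peft and transformers pass through to accelerate).

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

model_dir = "/home/sidney/app/idea/model_root/model_en"
adapter_dir = f"{model_dir}/adapters/model_gen"  # adapter path from the registration log line

base_model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="/tmp/offload",  # directory for layers that do not fit on the GPU
)
model = PeftModel.from_pretrained(
    base_model,
    adapter_dir,
    adapter_name="model_gen",
    offload_folder="/tmp/offload",  # the offload_dir the ValueError is asking for
)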