deepjavalibrary/djl

How to serve with 2 model folders?

SidneyLann opened this issue · 17 comments

I have two model folders for Llama 3: one is the original model and the other is the fine-tuned one. How do I configure djl-serving to use both model folders?

You can have multiple models in the same folder:

├── model-root
│   ├── Model1
│   │   ├── model.py
│   ├── Model2
│   │   ├── model.py

And you can start djl-serving:

djl-serving -m model-root
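
For reference, each ModelN folder's model.py follows the djl_python convention of a handle(inputs) entry point; below is a minimal sketch, where load_model and the echo response are hypothetical placeholders for your own model code:

from djl_python import Input, Output

_model = None  # loaded lazily on the first request


def load_model(properties):
    # Placeholder: load your real model here (e.g. with transformers);
    # 'properties' carries the key/value pairs from serving.properties.
    return object()


def handle(inputs):
    global _model
    if _model is None:
        _model = load_model(inputs.get_properties())
    if inputs.is_empty():
        return None  # warm-up request, no payload to process
    data = inputs.get_as_json()
    output = Output()
    output.add_as_json({"echo": data})  # replace with real inference
    return output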

If you want to use a workflow to chain two models, see this example: https://github.com/deepjavalibrary/djl-demo/tree/master/djl-serving
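
A rough sketch of a workflow definition in that style (the exact schema is documented in the linked demo; the model names and paths here are illustrative):

{
  "name": "two-model-chain",
  "version": "0.1",
  "models": {
    "base": "model-root/Model1",
    "finetuned": "model-root/Model2"
  },
  "workflow": {
    "baseOut": ["base", "in"],
    "out": ["finetuned", "baseOut"]
  }
}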


Error: Adapters are only currently supported for Python models
The Llama 3 model and its LoRA fine-tuned model should be Python models, right? Or is a PyTorch model not a Python model? How do I solve this issue?

You should be using the Python engine. Can you share your serving.properties?

engine=PyTorch
option.enable_lora=true
blockFactory=ai.djl.nn.IdentityBlockFactory
option.hasParameter=false
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4
job_queue_size=10
log_request_metric=true
metrics_aggregation=1

You should use our LMI container: deepjavalibrary/djl-serving:0.29.0-lmi

docker run -it --gpus all -v model_en:/opt/ml/model -p 8080:8080 deepjavalibrary/djl-serving:0.29.0-lmi

curl -X POST http://127.0.0.1:8080/invocations \
    -H "Content-Type: application/json" \
    -d '{"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}}'

serving.properties is optional if you use the container, but you should use engine=Python:

engine=Python
option.enable_lora=true
minWorkers=1
maxWorkers=1
job_queue_size=10
log_request_metric=true
metrics_aggregation=100
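
For context, with option.enable_lora=true, djl-serving registers LoRA adapters found under an adapters/ subfolder of the model directory (this matches the "Registering adapter model_gen from .../adapters/model_gen" line in the log later in this thread); an assumed layout with standard PEFT adapter files:

├── model_en
│   ├── serving.properties
│   ├── config.json, *.safetensors, tokenizer files (base model)
│   ├── adapters
│   │   ├── model_gen
│   │   │   ├── adapter_config.json
│   │   │   ├── adapter_model.safetensors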

Why is the ValueError shown at INFO level? Did the serving start successfully?

curl -X POST http://127.0.0.1:8080/invocations \
    -H "Content-Type: application/json" \
    -d '{"inputs": ["How is the weather"], "parameters": {"max_new_tokens": 25}}'

{
"code":424,
"message":"prediction failure",
"error":"Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)"
}
Why can't the serving program select the device automatically?

The model loaded successfully; however, it failed during inference. (A likely cause and a quick standalone check are sketched after the list below.)

  1. Can you try our container? Our container has all the necessary Python packages installed.
  2. Are you able to run the model in Python?
  3. Are you using an open-source model? Can you share the Hugging Face model ID?
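
For context, this device error usually means accelerate split the model: with roughly 11 GB of GPU memory (per the log later in this thread), part of an 8B model in fp16 gets offloaded to CPU while the rest stays on cuda:0. A quick standalone check is sketched below; the explicit device_map is an assumption and requires that the model (quantized if necessary) fits on one GPU:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/sidney/app/idea/model_root/model_en"  # path from this thread
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map={"": 0},  # pin every submodule to cuda:0 so nothing lands on CPU
)
inputs = tokenizer("How is the weather", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=25)
print(tokenizer.decode(out[0], skip_special_tokens=True))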

I showed the environment info in my previous posts. Can I set the Python executable in config.properties?


We don't have a python_executor configuration. If you want to use a different Python, you need to set the environment variable PYTHON_EXECUTABLE.
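
For example (the interpreter path here is illustrative; point it at your own installation):

PYTHON_EXECUTABLE=/usr/bin/python3.10 djl-serving -m model-root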

[sidney@tech68 ~]$ ./app/serving-djl/bin/serving -m /home/sidney/app/idea/model_root/model_en
INFO ModelServer Starting model server ...
INFO ModelServer Starting djl-serving: 0.29.0 ...
INFO ModelServer
Model server home: /home/sidney/app/serving-djl
Current directory: /home/sidney
Temp directory: /tmp
Command line: -Dlog4j.configurationFile=/home/sidney/app/serving-djl/conf/log4j2.xml
Number of CPUs: 16
CUDA version: 123 / 61
Number of GPUs: 1
Max heap size: 7920
Config file: /home/sidney/app/serving-djl/conf/config.properties
Inference address: http://127.0.0.1:8090
Management address: http://127.0.0.1:8090
Default job_queue_size: 1000
Default batch_size: 1
Default max_batch_delay: 100
Default max_idle_time: 60
Model Store: /home/sidney/app/idea/model_root/model_en
Initial Models: /home/sidney/app/idea/model_root/model_en
Netty threads: 0
Maximum Request Size: 67108864
Environment variables:
PYTHON_EXECUTABLE: python310
OMP_NUM_THREADS: 1
INFO FolderScanPluginManager scanning for plugins...
INFO FolderScanPluginManager scanning in plug-in folder :/home/sidney/app/serving-djl/plugins
INFO PropertyFilePluginMetaDataReader Plugin found: plugin-management/jar:file:/home/sidney/app/serving-djl/plugins/plugin-management-plugin-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: console/jar:file:/home/sidney/app/serving-djl/plugins/management-console-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: cache-engines/jar:file:/home/sidney/app/serving-djl/plugins/cache-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: static-file-plugin/jar:file:/home/sidney/app/serving-djl/plugins/static-file-plugin-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: kserve/jar:file:/home/sidney/app/serving-djl/plugins/kserve-0.29.0.jar!/META-INF/plugin.definition
INFO PropertyFilePluginMetaDataReader Plugin found: secure-mode/jar:file:/home/sidney/app/serving-djl/plugins/secure-mode-0.29.0.jar!/META-INF/plugin.definition
INFO FolderScanPluginManager Loading plugin: {console/jar:file:/home/sidney/app/serving-djl/plugins/management-console-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin console changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {plugin-management/jar:file:/home/sidney/app/serving-djl/plugins/plugin-management-plugin-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin plugin-management changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {static-file-plugin/jar:file:/home/sidney/app/serving-djl/plugins/static-file-plugin-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin static-file-plugin changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {cache-engines/jar:file:/home/sidney/app/serving-djl/plugins/cache-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin cache-engines changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {secure-mode/jar:file:/home/sidney/app/serving-djl/plugins/secure-mode-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin secure-mode changed state to INITIALIZED
INFO FolderScanPluginManager Loading plugin: {kserve/jar:file:/home/sidney/app/serving-djl/plugins/kserve-0.29.0.jar!/META-INF/plugin.definition}
INFO PluginMetaData plugin kserve changed state to INITIALIZED
INFO PluginMetaData plugin console changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin plugin-management changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin static-file-plugin changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin cache-engines changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin secure-mode changed state to ACTIVE reason: plugin ready
INFO PluginMetaData plugin kserve changed state to ACTIVE reason: plugin ready
INFO FolderScanPluginManager 6 plug-ins found and loaded.
INFO ModelServer Initializing model: /home/sidney/app/idea/model_root/model_en
INFO LmiUtils Detected mpi_mode: null, rolling_batch: disable, tensor_parallel_degree 1, for modelType: llama
INFO ModelInfo M-0001: Apply per model settings:
job_queue_size: 10
max_dynamic_batch_size: 1
max_batch_delay: 100
max_idle_time: 60
load_on_devices: *
engine: Python
mpi_mode: null
option.entryPoint: null
maxWorkers: 1
log_request_metric: true
option.tensor_parallel_degree: 1
option.max_rolling_batch_size: 32
option.pipeline_parallel_degree: 1
option.enable_lora: true
minWorkers: 1
metrics_aggregation: 100
option.rolling_batch: disable
INFO Platform Found matching platform from: jar:file:/home/sidney/app/serving-djl/lib/python-0.29.0.jar!/native/lib/python.properties
INFO ModelManager Loading model on Python:[0]
INFO WorkerPool loading model model_en (M-0001, PENDING) on gpu(0) ...
INFO ModelInfo M-0001: Available CPU memory: 28964 MB, required: 0 MB, reserved: 500 MB
INFO ModelInfo M-0001: Available GPU memory: 11018 MB, required: 0 MB, reserved: 500 MB
INFO ModelInfo Loading model model_en M-0001 on gpu(0)
INFO WorkerPool scaling up min workers by 1 (from 0 to 1) workers. Total range is min 1 to max 1
INFO PyProcess Start process: 19000 - retry: 0
INFO Connection Set CUDA_VISIBLE_DEVICES=0
INFO PyProcess W-3502398-model_en-stdout: 3502398 - djl_python_engine started with args: ['--sock-type', 'unix', '--sock-name', '/tmp/djl_sock.19000', '--model-dir', '/home/sidney/app/idea/model_root/model_en', '--entry-point', '', '--device-id', '0', '--cluster-size', '1', '--recommended-entry-point', 'djl_python.huggingface']
WARN PyProcess W-3502398-model_en-stderr: 2024-08-02 05:29:36.998369: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
WARN PyProcess W-3502398-model_en-stderr: 2024-08-02 05:29:36.998398: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARN PyProcess W-3502398-model_en-stderr: 2024-08-02 05:29:36.999092: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARN PyProcess W-3502398-model_en-stderr: 2024-08-02 05:29:37.526235: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
INFO PyProcess W-3502398-model_en-stdout: Python engine started.
INFO PyProcess W-3502398-model_en-stdout: Using 1 gpus collectively.
WARN PyProcess W-3502398-model_en-stderr: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO PyProcess W-3502398-model_en-stdout: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
WARN PyProcess W-3502398-model_en-stderr:
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 25%|██▌ | 1/4 [00:58<02:56, 58.98s/it]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 50%|█████ | 2/4 [01:33<01:28, 44.33s/it]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 75%|███████▌ | 3/4 [02:08<00:40, 40.27s/it]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 100%|██████████| 4/4 [02:15<00:00, 27.22s/it]
WARN PyProcess W-3502398-model_en-stderr: Loading checkpoint shards: 100%|██████████| 4/4 [02:15<00:00, 33.93s/it]
INFO PyProcess W-3502398-model_en-stdout: Some parameters are on the meta device device because they were offloaded to the cpu.
INFO PyProcess W-3502398-model_en-stdout: image_placeholder_token is not explicitly set. It is highly recommended to explicitlyset the image_placeholder_token as it differs between models, and is not easy to infer from the model or tokenizer
INFO PyProcess W-3502398-model_en-stdout: could not infer image token from the model artifacts. Using as default.
INFO PyProcess Model [model_en] initialized.
INFO WorkerThread Starting worker thread WT-0001 for model model_en (M-0001, READY) on device gpu(0)
INFO ModelServer Initialize BOTH server with: EpollServerSocketChannel.
INFO PyProcess W-3502398-model_en-stdout: Registering adapter model_gen from /home/sidney/app/idea/model_root/model_en/adapters/model_gen
INFO ModelServer BOTH API bind to: http://127.0.0.1:8090
INFO PyProcess W-3502398-model_en-stdout: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
INFO PyProcess W-3502398-model_en-stdout: Failed invoke service.invoke_handler()
INFO PyProcess W-3502398-model_en-stdout: Traceback (most recent call last):
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python_engine.py", line 154, in run_server
INFO PyProcess W-3502398-model_en-stdout: outputs = self.service.invoke_handler(function_name, inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/service_loader.py", line 30, in invoke_handler
INFO PyProcess W-3502398-model_en-stdout: return getattr(self.module, function_name)(inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 507, in register_adapter
INFO PyProcess W-3502398-model_en-stdout: _service.model = PeftModel.from_pretrained(_service.model,
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/peft/peft_model.py", line 356, in from_pretrained
INFO PyProcess W-3502398-model_en-stdout: model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/peft/peft_model.py", line 760, in load_adapter
INFO PyProcess W-3502398-model_en-stdout: dispatch_model(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/accelerate/big_modeling.py", line 376, in dispatch_model
INFO PyProcess W-3502398-model_en-stdout: raise ValueError(
INFO PyProcess W-3502398-model_en-stdout: ValueError: We need an offload_dir to dispatch this model according to this device_map, the following submodules need to be offloaded: base_model.model.model.norm, base_model.model.lm_head, base_model.model.model.layers.
WARN PyProcess W-3502398-model_en-stderr: Setting pad_token_id to eos_token_id:128001 for open-end generation.
INFO PyProcess W-3502398-model_en-stdout: Failed invoke service.invoke_handler()
INFO PyProcess W-3502398-model_en-stdout: Traceback (most recent call last):
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python_engine.py", line 154, in run_server
INFO PyProcess W-3502398-model_en-stdout: outputs = self.service.invoke_handler(function_name, inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/service_loader.py", line 30, in invoke_handler
INFO PyProcess W-3502398-model_en-stdout: return getattr(self.module, function_name)(inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 544, in handle
INFO PyProcess W-3502398-model_en-stdout: return _service.inference(inputs)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 230, in inference
INFO PyProcess W-3502398-model_en-stdout: return self._dynamic_batch_inference(parsed_input.batch, errors,
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 245, in _dynamic_batch_inference
INFO PyProcess W-3502398-model_en-stdout: prediction = self.hf_pipeline(input_data, **parameters)
INFO PyProcess W-3502398-model_en-stdout: File "/home/sidney/.djl.ai/python/0.29.0/djl_python/huggingface.py", line 414, in wrapped_pipeline
INFO PyProcess W-3502398-model_en-stdout: output_tokens = model.generate(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
INFO PyProcess W-3502398-model_en-stdout: return func(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/generation/utils.py", line 1914, in generate
INFO PyProcess W-3502398-model_en-stdout: result = self._sample(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/generation/utils.py", line 2651, in _sample
INFO PyProcess W-3502398-model_en-stdout: outputs = self(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
INFO PyProcess W-3502398-model_en-stdout: return self._call_impl(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
INFO PyProcess W-3502398-model_en-stdout: return forward_call(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1174, in forward
INFO PyProcess W-3502398-model_en-stdout: outputs = self.model(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
INFO PyProcess W-3502398-model_en-stdout: return self._call_impl(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
INFO PyProcess W-3502398-model_en-stdout: return forward_call(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 978, in forward
INFO PyProcess W-3502398-model_en-stdout: layer_outputs = decoder_layer(
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
INFO PyProcess W-3502398-model_en-stdout: return self._call_impl(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
INFO PyProcess W-3502398-model_en-stdout: return forward_call(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 715, in forward
INFO PyProcess W-3502398-model_en-stdout: hidden_states = self.input_layernorm(hidden_states)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
INFO PyProcess W-3502398-model_en-stdout: return self._call_impl(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
INFO PyProcess W-3502398-model_en-stdout: return forward_call(*args, **kwargs)
INFO PyProcess W-3502398-model_en-stdout: File "/usr/prg/python/3102/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 88, in forward
INFO PyProcess W-3502398-model_en-stdout: return self.weight * hidden_states.to(input_dtype)
INFO PyProcess W-3502398-model_en-stdout: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

./app/serving-djl/bin/serving -m /home/sidney/app/idea/model_root/model_en

curl -X POST http://127.0.0.1:8090/invocations \
    -H "Content-Type: application/json" \
    -d '{"inputs": ["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAt date 20230601<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow is the weather<|eot_id|>"], "parameters": {"max_new_tokens": 255}}'

{
"code":424,
"message":"prediction failure",
"error":"Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"
}

I get the same error with both Python 3.9 and Python 3.10 when running djl-serving for the Llama 3 model from local folders.

You still have this error during inference:

ValueError: We need an offload_dir to dispatch this model according to this device_map, the following submodules need to be offloaded: base_model.model.model.norm,

Does your model work in pure Python? Does your peft library version support your model?

# Pure-python test with unsloth; 'config' is my own configuration dict
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config.get("model_config").get("finetuned_model"),
    max_seq_length=config.get("model_config").get("max_seq_length"),
    dtype=config.get("model_config").get("dtype"),
    load_in_4bit=config.get("model_config").get("load_in_4bit"),
)

FastLanguageModel.for_inference(model)  # switch the model into inference mode

inputs = tokenizer(
    [
        "<|start_header_id|>system<|end_header_id|>\n\nAt date 20240801<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThis is the question: Can you provide an overview of the lung's squamous cell carcinoma?<|eot_id|>"
    ],
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=2048, use_cache=True)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(outputs[0])


I can run it in pure Python successfully.

ValueError: We need an offload_dir to dispatch this model according to this device_map, the following submodules need to be offloaded: …

This error occurs at serving server start time, not at inference time.
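
For context, the adapter registration in djl-serving goes through PeftModel.from_pretrained (see the traceback above), and accelerate's dispatch_model only demands an offload_dir when part of the base model lives on CPU. Below is a standalone sketch of that same load path, under the assumption that 4-bit quantization lets the base model fit entirely on the 11 GB GPU (paths are taken from the log above; adjust to your setup):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Quantize to 4-bit so the 8B base model fits on a single ~11 GB GPU,
# leaving no submodule offloaded to CPU.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
base = AutoModelForCausalLM.from_pretrained(
    "/home/sidney/app/idea/model_root/model_en",
    quantization_config=bnb,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(
    base, "/home/sidney/app/idea/model_root/model_en/adapters/model_gen"
)
print(type(model).__name__)  # PeftModel if the adapter attached cleanly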