Continuous Batching does not work with newest transformers version
udaij12 opened this issue · 0 comments
udaij12 commented
🐛 Describe the bug
The newest version of transformers (v4.41.0) causes the continuous batching pytests to fail.
Error logs
2024-05-20T22:21:22,515 [INFO ] W-9000-streaming_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1716243682515
2024-05-20T22:21:22,516 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - Backend received inference at: 1716243682
2024-05-20T22:21:22,869 [WARN ] W-9000-streaming_handler_1.0-stderr MODEL_LOG - Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
2024-05-20T22:21:22,870 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - Invoking custom service failed.
2024-05-20T22:21:22,870 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-05-20T22:21:22,870 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/ts/service.py", line 134, in predict
2024-05-20T22:21:22,870 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - ret = self._entry_point(input_batch, self.context)
2024-05-20T22:21:22,870 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/ts/torch_handler/base_handler.py", line 431, in handle
2024-05-20T22:21:22,871 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - output = self.inference(data_preprocess)
2024-05-20T22:21:22,871 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/tmp/models/9e429ad562f84475ad7b4600b1b092d2/stream_handler.py", line 78, in inference
2024-05-20T22:21:22,871 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - results[req_id] = self._run_prefill(req_id)
2024-05-20T22:21:22,871 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-05-20T22:21:22,871 [INFO ] W-9000-streaming_handler_1.0 ACCESS_LOG - /127.0.0.1:35514 "POST /predictions/streaming_handler HTTP/1.1" 503 459
2024-05-20T22:21:22,871 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - return func(*args, **kwargs)
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/tmp/models/9e429ad562f84475ad7b4600b1b092d2/stream_handler.py", line 98, in _run_prefill
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0 TS_METRICS - Requests5XX.Count:1.0|#Level:Host|#hostname:fv-az1271-393,timestamp:1716243682
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - output = self.model.generate(
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-05-20T22:21:22,872 [DEBUG] W-9000-streaming_handler_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 100288593, Inference time ns: 457870956
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - return func(*args, **kwargs)
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 355
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/transformers/generation/utils.py", line 1736, in generate
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:3.0|#Level:Host|#hostname:fv-az1271-393,timestamp:1716243682
2024-05-20T22:21:22,872 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - result = self._sample(
2024-05-20T22:21:22,873 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/transformers/generation/utils.py", line 2368, in _sample
2024-05-20T22:21:22,873 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs)
2024-05-20T22:21:22,873 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/transformers/generation/utils.py", line 1321, in _get_initial_cache_position
2024-05-20T22:21:22,873 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - past_length = model_kwargs["past_key_values"][0][0].shape[2]
2024-05-20T22:21:22,873 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_LOG - TypeError: 'NoneType' object is not subscriptable
2024-05-20T22:21:22,881 [INFO ] epollEventLoopGroup-3-3 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:streaming_handler,model_version:default|#hostname:fv-az1271-393,timestamp:1716243682
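The traceback shows that in transformers 4.41.0, `_get_initial_cache_position` subscripts `model_kwargs["past_key_values"]` without a `None` check, so a handler that passes `past_key_values=None` into `generate` now raises `TypeError`. A minimal sketch of a defensive workaround on the handler side (the helper name `build_generate_kwargs` is hypothetical, not part of the TorchServe handler API) is to only forward the key when it is actually set:

```python
def build_generate_kwargs(input_ids, past_key_values=None):
    # Hypothetical helper: forward past_key_values to model.generate()
    # only when a cache exists, so transformers' _get_initial_cache_position
    # never ends up subscripting None.
    kwargs = {"input_ids": input_ids}
    if past_key_values is not None:
        kwargs["past_key_values"] = past_key_values
    return kwargs
```

This keeps the prefill path working on 4.41.0 without touching the decode path, which already carries a real cache.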
Installation instructions
Noticed on the TorchServe binary regression CIs, both CPU and GPU.
Model Packaging
Noticed on the TorchServe binary regression CIs, both CPU and GPU.
config.properties
No response
Versions
Noticed on the TorchServe binary regression CIs, both CPU and GPU.
Repro instructions
Example: search for test_echo_stream_inference in the following CI runs:
https://github.com/pytorch/serve/actions/runs/9165678469/job/25199613724
https://github.com/pytorch/serve/actions/runs/9145615468/job/25144773494
Possible Solution
A temporary solution is to pin transformers to 4.40.2.
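The pin can be applied directly via pip (or the equivalent line in a requirements file), assuming 4.40.2 is the last release before the cache-position change:

```shell
# Pin transformers below 4.41.0 until the continuous-batching handler
# is updated for the new initial-cache-position logic.
pip install "transformers==4.40.2"
```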