Concurrent inference with 2k-length prompts on Llama-2-7B crashes the TGI service
yao531441 opened this issue · 6 comments
System Info
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
docker run -p 18080:80 --runtime=habana \
  -v /data/huggingface/hub:/data \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HUGGING_FACE_HUB_TOKEN=<your_hf_token> \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
  -e PREFILL_BATCH_BUCKET_SIZE=2 \
  -e BATCH_BUCKET_SIZE=32 \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
  -e ENABLE_HPU_GRAPH=true \
  -e LIMIT_HPU_GRAPH=true \
  -e USE_FLASH_ATTENTION=true \
  -e FLASH_ATTENTION_RECOMPUTE=true \
  --cap-add=sys_nice --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.4 \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --max-input-length 2048 --max-total-tokens 4096 \
  --max-batch-prefill-tokens 2048 --max-batch-total-tokens 65536 \
  --max-waiting-tokens 7 --waiting-served-ratio 1.2 \
  --max-concurrent-requests 64
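To drive the concurrent load against this server, a client along the following lines can be used. This is a minimal sketch: the prompt content and concurrency level are assumptions chosen to match `--max-concurrent-requests 64` and the 2k-token prompt length above; the host/port matches the `-p 18080:80` mapping.

```python
# Hypothetical load generator: sends 64 concurrent ~2k-token prompts
# to TGI's /generate endpoint. Adjust URL, prompt, and concurrency.
import concurrent.futures

import requests

URL = "http://localhost:18080/generate"  # matches -p 18080:80 above
PROMPT = "word " * 2000  # rough stand-in for a ~2k-token prompt

def send_request(_: int) -> int:
    payload = {
        "inputs": PROMPT,
        "parameters": {"max_new_tokens": 500},  # same value seen in the error log
    }
    resp = requests.post(URL, json=payload, timeout=300)
    return resp.status_code

# 64 concurrent requests, matching --max-concurrent-requests 64
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    print(list(pool.map(send_request, range(64))))
```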
Error log
2024-08-30T02:09:44.146922Z INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="52.558739096s" validation_time="2.976352ms" queue_time="23.703336184s" inference_time="28.852426791s" time_per_token="57.704853ms" seed="None"}: text_generation_router::server: router/src/server.rs:513: Success
2024-08-30T02:09:44.877111Z INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="52.558665697s" validation_time="1.514834ms" queue_time="23.709660453s" inference_time="28.847490833s" time_per_token="57.694981ms" seed="None"}: text_generation_router::server: router/src/server.rs:513: Success
2024-08-30T02:09:45.863818Z ERROR text_generation_launcher: Method Decode encountered an error.
Traceback (most recent call last):
File "/usr/local/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
return _main(
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 137, in serve
server.serve(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 256, in serve
asyncio.run(
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept
return await response
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 154, in Decode
generations, next_batch, timings = self.model.generate_token(batches)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 997, in generate_token
batch.logits = self.forward(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 870, in forward
return self.model.forward(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward
return wrapped_hpugraph_forward(
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 643, in wrapped_hpugraph_forward
cached.graph.replayV3(input_tensor_list, cached.asynchronous)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 76, in replayV3
_hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::1073741824 (1024)MB
[Rank:0] Habana exception raised from get_pointer at device_memory.cpp:1078
2024-08-30T02:09:46.039747Z ERROR batch{batch_size=16}:decode:decode{size=16}:decode{size=16}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
2024-08-30T02:09:47.968375Z ERROR batch{batch_size=16}:decode:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
2024-08-30T02:09:47.968553Z ERROR batch{batch_size=16}:decode:clear_cache{batch_id=Some(72)}:clear_cache{batch_id=Some(72)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.968584Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968613Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968632Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968649Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
batch{batch_size=1}:prefill:clear_cache{batch_id=Some(74)}:clear_cache{batch_id=Some(74)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969441Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969517Z ERROR batch{batch_size=1}:prefill:prefill{id=75 size=1}:prefill{id=75 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969560Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(75)}:clear_cache{batch_id=Some(75)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969575Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
[... the same prefill / clear_cache / generate_stream "Connection refused" error triple repeats for batch ids 76 through 81; elided for brevity ...]
2024-08-30T02:09:48.000537Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 192
CPU RAM : 2113389016 KB
------------------------------------------------------------------------------
Exception ignored in: <function Server.__del__ at 0x7f611e95c790>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/grpc/aio/_server.py", line 194, in __del__
cygrpc.schedule_coro_threadsafe(
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/usr/lib/python3.10/asyncio/base_events.py", line 436, in create_task
self._check_closed()
File "/usr/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
future: <Task finished name='HandleExceptions[/generate.v2.TextGenerationService/Decode]' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept
return await response
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 154, in Decode
generations, next_batch, timings = self.model.generate_token(batches)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 997, in generate_token
batch.logits = self.forward(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 870, in forward
return self.model.forward(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward
return wrapped_hpugraph_forward(
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 643, in wrapped_hpugraph_forward
cached.graph.replayV3(input_tensor_list, cached.asynchronous)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 76, in replayV3
_hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::1073741824 (1024)MB
[Rank:0] Habana exception raised from get_pointer at device_memory.cpp:1078
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
return _main(
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 137, in serve
server.serve(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 256, in serve
asyncio.run(
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 831, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 33, in intercept
exit(1)
File "/usr/lib/python3.10/_sitebuiltins.py", line 26, in __call__
raise SystemExit(code)
SystemExit: 1 rank=0
2024-08-30T02:09:48.006377Z ERROR batch{batch_size=1}:prefill:prefill{id=82 size=1}:prefill{id=82 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.006461Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(82)}:clear_cache{batch_id=Some(82)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.006484Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062231Z ERROR batch{batch_size=1}:prefill:prefill{id=118 size=1}:prefill{id=118 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062267Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(118)}:clear_cache{batch_id=Some(118)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062276Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
[... identical error triples repeat for batch ids 119 through 122; elided for brevity ...]
2024-08-30T02:09:48.157110Z INFO text_generation_launcher: webserver terminated
2024-08-30T02:09:48.157132Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
Expected behavior
The TGI server should stay up and return correct results for all concurrent requests.
@yuanwu2017 is looking into it.
I can reproduce this issue. It is an OOM (out-of-memory) issue. Debugging is in progress.
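A rough back-of-envelope estimate, assuming bf16 and the standard Llama-2-7B dimensions (32 layers, hidden size 4096; these figures are not from the thread), shows how the launch flags translate into device-memory pressure:

```python
# Back-of-envelope KV-cache estimate for Llama-2-7B (assumed dims:
# 32 layers, hidden size 4096, bf16 = 2 bytes per element).
layers, hidden, bytes_per_elem = 32, 4096, 2

# K and V each store `hidden` elements per token per layer.
kv_bytes_per_token = 2 * layers * hidden * bytes_per_elem   # 512 KiB/token
max_batch_total_tokens = 65536                              # from the launch flags

kv_cache_gib = kv_bytes_per_token * max_batch_total_tokens / 2**30
weights_gib = 7e9 * bytes_per_elem / 2**30                  # bf16 weights

print(f"KV cache at full batch: {kv_cache_gib:.0f} GiB")    # ~32 GiB
print(f"Model weights (bf16):   {weights_gib:.0f} GiB")     # ~13 GiB
```

On top of these figures, HPU graph capture (`ENABLE_HPU_GRAPH=true`) reserves additional device memory per bucketed batch/sequence shape, which is consistent with the failed 1073741824-byte (1024 MB) allocation in the log once concurrent 2k-token prompts fill the batch.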