Concurrent inference with 2k-length prompts on Llama-2-7B crashes the TGI service
yao531441 opened this issue · 6 comments
System Info
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
docker run -p 18080:80 --runtime=habana \
  -v /data/huggingface/hub:/data \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HUGGING_FACE_HUB_TOKEN=<your_hf_token> \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
  -e PREFILL_BATCH_BUCKET_SIZE=2 \
  -e BATCH_BUCKET_SIZE=32 \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
  -e ENABLE_HPU_GRAPH=true \
  -e LIMIT_HPU_GRAPH=true \
  -e USE_FLASH_ATTENTION=true \
  -e FLASH_ATTENTION_RECOMPUTE=true \
  --cap-add=sys_nice --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.4 \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --max-input-length 2048 --max-total-tokens 4096 \
  --max-batch-prefill-tokens 2048 --max-batch-total-tokens 65536 \
  --max-waiting-tokens 7 --waiting-served-ratio 1.2 \
  --max-concurrent-requests 64
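To drive the concurrent load against this server, a client along the following lines can be used. This is a minimal sketch: the prompt content and concurrency level are assumptions chosen to match `--max-concurrent-requests 64` and the 2k-token prompt length above; the host/port matches the `-p 18080:80` mapping.

```python
# Hypothetical load generator: sends 64 concurrent ~2k-token prompts
# to TGI's /generate endpoint. Adjust URL, prompt, and concurrency.
import concurrent.futures

import requests

URL = "http://localhost:18080/generate"  # matches -p 18080:80 above
PROMPT = "word " * 2000  # rough stand-in for a ~2k-token prompt

def send_request(_: int) -> int:
    payload = {
        "inputs": PROMPT,
        "parameters": {"max_new_tokens": 500},  # same value seen in the error log
    }
    resp = requests.post(URL, json=payload, timeout=300)
    return resp.status_code

# 64 concurrent requests, matching --max-concurrent-requests 64
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    print(list(pool.map(send_request, range(64))))
```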
Error log
2024-08-30T02:09:44.146922Z INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="52.558739096s" validation_time="2.976352ms" queue_time="23.703336184s" inference_time="28.852426791s" time_per_token="57.704853ms" seed="None"}: text_generation_router::server: router/src/server.rs:513: Success
2024-08-30T02:09:44.877111Z INFO generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="52.558665697s" validation_time="1.514834ms" queue_time="23.709660453s" inference_time="28.847490833s" time_per_token="57.694981ms" seed="None"}: text_generation_router::server: router/src/server.rs:513: Success
2024-08-30T02:09:45.863818Z ERROR text_generation_launcher: Method Decode encountered an error.
Traceback (most recent call last):
File "/usr/local/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
return _main(
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 137, in serve
server.serve(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 256, in serve
asyncio.run(
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept
return await response
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 154, in Decode
generations, next_batch, timings = self.model.generate_token(batches)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 997, in generate_token
batch.logits = self.forward(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 870, in forward
return self.model.forward(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward
return wrapped_hpugraph_forward(
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 643, in wrapped_hpugraph_forward
cached.graph.replayV3(input_tensor_list, cached.asynchronous)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 76, in replayV3
_hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::1073741824 (1024)MB
[Rank:0] Habana exception raised from get_pointer at device_memory.cpp:1078
2024-08-30T02:09:46.039747Z ERROR batch{batch_size=16}:decode:decode{size=16}:decode{size=16}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
2024-08-30T02:09:47.968375Z ERROR batch{batch_size=16}:decode:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
2024-08-30T02:09:47.968553Z ERROR batch{batch_size=16}:decode:clear_cache{batch_id=Some(72)}:clear_cache{batch_id=Some(72)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.968584Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968613Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968632Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
2024-08-30T02:09:47.968649Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: CANCELLED
batch{batch_size=1}:prefill:clear_cache{batch_id=Some(74)}:clear_cache{batch_id=Some(74)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969441Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969517Z ERROR batch{batch_size=1}:prefill:prefill{id=75 size=1}:prefill{id=75 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969560Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(75)}:clear_cache{batch_id=Some(75)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:47.969575Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
[... the same prefill / clear_cache / generate_stream "Connection refused" error triple repeats for batch ids 76 through 81; elided for brevity ...]
2024-08-30T02:09:48.000537Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 192
CPU RAM : 2113389016 KB
------------------------------------------------------------------------------
Exception ignored in: <function Server.__del__ at 0x7f611e95c790>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/grpc/aio/_server.py", line 194, in __del__
cygrpc.schedule_coro_threadsafe(
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/usr/lib/python3.10/asyncio/base_events.py", line 436, in create_task
self._check_closed()
File "/usr/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
Task exception was never retrieved
future: <Task finished name='HandleExceptions[/generate.v2.TextGenerationService/Decode]' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept
return await response
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 154, in Decode
generations, next_batch, timings = self.model.generate_token(batches)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 997, in generate_token
batch.logits = self.forward(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 870, in forward
return self.model.forward(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward
return wrapped_hpugraph_forward(
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 643, in wrapped_hpugraph_forward
cached.graph.replayV3(input_tensor_list, cached.asynchronous)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 76, in replayV3
_hpu_C.replayV3(self.hpu_graph, tlistI, asynchronous)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Launch thread...
Check $HABANA_LOGS/ for details[Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::1073741824 (1024)MB
[Rank:0] Habana exception raised from get_pointer at device_memory.cpp:1078
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
return _main(
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 137, in serve
server.serve(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 256, in serve
asyncio.run(
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 831, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 33, in intercept
exit(1)
File "/usr/lib/python3.10/_sitebuiltins.py", line 26, in __call__
raise SystemExit(code)
SystemExit: 1 rank=0
2024-08-30T02:09:48.006377Z ERROR batch{batch_size=1}:prefill:prefill{id=82 size=1}:prefill{id=82 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.006461Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(82)}:clear_cache{batch_id=Some(82)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.006484Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062231Z ERROR batch{batch_size=1}:prefill:prefill{id=118 size=1}:prefill{id=118 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062267Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(118)}:clear_cache{batch_id=Some(118)}: text_generation_client: router/client/src/lib.rs:33: Server error: error trying to connect: Connection refused (os error 111)
2024-08-30T02:09:48.062276Z ERROR generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:async_stream:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:876: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
[... identical error triples repeat for batch ids 119 through 122; elided for brevity ...]
2024-08-30T02:09:48.157110Z INFO text_generation_launcher: webserver terminated
2024-08-30T02:09:48.157132Z INFO text_generation_launcher: Shutting down shards
Error: ShardFailed
Expected behavior
The TGI server should stay up and return correct results for all concurrent requests.
@yuanwu2017 is looking into it.
I can reproduce this issue. It is an OOM (out-of-memory) issue. Debugging is in progress.
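A rough back-of-envelope estimate, assuming bf16 and the standard Llama-2-7B dimensions (32 layers, hidden size 4096; these figures are not from the thread), shows how the launch flags translate into device-memory pressure:

```python
# Back-of-envelope KV-cache estimate for Llama-2-7B (assumed dims:
# 32 layers, hidden size 4096, bf16 = 2 bytes per element).
layers, hidden, bytes_per_elem = 32, 4096, 2

# K and V each store `hidden` elements per token per layer.
kv_bytes_per_token = 2 * layers * hidden * bytes_per_elem   # 512 KiB/token
max_batch_total_tokens = 65536                              # from the launch flags

kv_cache_gib = kv_bytes_per_token * max_batch_total_tokens / 2**30
weights_gib = 7e9 * bytes_per_elem / 2**30                  # bf16 weights

print(f"KV cache at full batch: {kv_cache_gib:.0f} GiB")    # ~32 GiB
print(f"Model weights (bf16):   {weights_gib:.0f} GiB")     # ~13 GiB
```

On top of these figures, HPU graph capture (`ENABLE_HPU_GRAPH=true`) reserves additional device memory per bucketed batch/sequence shape, which is consistent with the failed 1073741824-byte (1024 MB) allocation in the log once concurrent 2k-token prompts fill the batch.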