aws-neuron/transformers-neuronx

Llama2 inference overhead time way too long

Closed this issue · 6 comments


Long context inference overhead with Llama NeuronX

I followed the Llama NeuronX tutorial to host Llama2 on Amazon EC2 with NeuronX and TorchServe. The model works well, achieving 50+ tokens/sec as expected.

Issue

However, for my use case the input contexts are 500-3000 tokens. With an example 3000-token context, there is a 10-30 second overhead before the first token is generated; after the first token, generation runs at the expected 50 tok/sec.
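
Roughly how I am measuring this (a minimal sketch; the prompt text, model path, and batch size are placeholders, and timing sample() for a single new token is used to approximate the prefill cost):

```python
# Rough sketch of how the overhead is measured (paths and prompt are placeholders).
# Timing sample() for one new token approximates the prefill cost; timing 80 new
# tokens shows the steady-state decode speed.
import time

import torch
from transformers import LlamaTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

tokenizer = LlamaTokenizer.from_pretrained("llama-2-13b-split")   # placeholder path
neuron_model = LlamaForSampling.from_pretrained(
    "llama-2-13b-split", batch_size=1, tp_degree=6, amp="bf16",
    n_positions=4096,  # large enough for the long prompt plus generated tokens
)
neuron_model.to_neuron()

input_ids = tokenizer("<~3000-token context here>", return_tensors="pt").input_ids

with torch.inference_mode():
    start = time.time()
    neuron_model.sample(input_ids, sequence_length=input_ids.shape[-1] + 1, top_k=50)
    print(f"time to first token: {time.time() - start:.1f}s")

    start = time.time()
    neuron_model.sample(input_ids, sequence_length=input_ids.shape[-1] + 80, top_k=50)
    print(f"80 new tokens: {time.time() - start:.1f}s")
```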

Attempted fixes

I have tried the following to resolve the long context overhead:

  • Adjusted TorchServe config values for maxWorkers, maxBatchDelay, and batchSize (see model-config.yaml below) - no improvement
  • Increased the max_length parameter to support longer sequences - no improvement
  • Tried different micro_batch_size and parallelism values - no improvement
  • Updated all NeuronX libraries to the latest versions (see pip list below)

model-config.yaml

minWorkers: 2
maxWorkers: 8  #did not help
maxBatchDelay: 20
responseTimeout: 1080
batchSize: 4 #did not help

handler:
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 6
    max_length: 100

#did not help either
# micro_batching:
#     micro_batch_size: 8
#     parallelism:
#         preprocess: 4
#         inference: 1
#         postprocess: 4

pip list

torch 1.13.1+cpu
torch-model-archiver 0.9.0b20231026
torch-neuronx 1.13.1.1.12.1
torch-workflow-archiver 0.2.11b20231026
torch-xla 1.13.1+torchneuronc
transformers-neuronx 0.8.268
Logs after starting TorchServe with:
torchserve --ncs --start --model-store model_store --ts-config config.properties --models llama-2-13b
(aws_neuron_venv_pytorch) ubuntu@ip-10-72-158-249:~/serve/examples/large_models/inferentia2/llama2$ WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-12-04T23:54:37,499 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2023-12-04T23:54:37,501 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-12-04T23:54:37,545 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
2023-12-04T23:54:37,683 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.9.0
TS Home: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages
Current directory: /home/ubuntu/serve/examples/large_models/inferentia2/llama2
Temp directory: /tmp
Metrics config path: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 96
Max heap size: 30688 M
Python executable: /opt/aws_neuron_venv_pytorch/bin/python
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Initial Models: llama-2-13b
Log dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Metrics dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 96
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: log
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Model config: N/A
2023-12-04T23:54:37,689 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2023-12-04T23:54:37,703 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: llama-2-13b
2023-12-04T23:54:37,709 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5
2023-12-04T23:54:37,710 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5/llama-2-13b
2023-12-04T23:54:37,718 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model llama-2-13b
2023-12-04T23:54:37,719 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model llama-2-13b
2023-12-04T23:54:48,067 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model llama-2-13b loaded.
2023-12-04T23:54:48,067 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: llama-2-13b, count: 2
2023-12-04T23:54:48,074 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,074 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9001, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,075 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2023-12-04T23:54:48,272 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40732955932617|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85419082641602|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:364036.0625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:12472.20703125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:3.9|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=492260
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2023-12-04T23:54:48,779 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9001, pid=492261
2023-12-04T23:54:48,780 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9001
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492261
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,787 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,787 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492260
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,788 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,788 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,790 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2023-12-04T23:54:48,790 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2023-12-04T23:54:48,797 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9001.
2023-12-04T23:54:48,797 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2023-12-04T23:54:48,799 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,799 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,833 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,833 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,997 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,000 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,523 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,532 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,543 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,544 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,545 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,555 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:58,772 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:54:58,789 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40731430053711|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85420608520508|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:342452.0390625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:34056.08203125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.6|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:56:01,531 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:01,537 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 72704
2023-12-04T23:56:01,538 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:73466.0|#WorkerName:W-9000-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:01,539 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:02,630 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:02,632 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 73799
2023-12-04T23:56:02,633 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:74560.0|#WorkerName:W-9001-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:02,634 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40730667114258|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.8542137145996|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:330775.37890625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:45732.69140625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:12.7|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208


... some time later, when I call the API:


2023-12-05T00:00:48,437 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,458 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1701734448458
2023-12-05T00:00:48,461 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Backend received inference at: 1701734448
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Preprocessing
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - received req=At the far end of town where the Gricklegrass grows and the wind smells slowandsour when it blows and no
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - birds ever sing excepting old crows is the Street of the Lifted Lorax
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And deep in the Gricklegrass some people say if you look deep enough you can still see today where the
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lorax once stood just as long as it could before somebody lifted the Lorax away
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - What was the Lorax Any why was it there And why was it lifted and taken somewhere from the far end of
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - town where the Gricklegrass grows The old Onceler still lives here
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Ask him he knows
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - You wont see the Onceler Dont knock at his door He stays in his Lerkim on top of his store He stays in his
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lerkim cold under the floor where he makes his own clothes out of miffmuffered moof And on special dank
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - midnights in August he peeks out of the shutters and sometimes he speaks and tells how the Lorax was lifted
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - away Hell tell you perhaps if youre willing to pay
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - On the end of a rope he lets down a tin pail and you have to toss in fifteen cents and a nail and the shell of a
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - greatgreatgreat grandfather snail
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then he pulls up the pail makes a most careful count to see if youve paid him the proper amount Then he
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hides what you paid him away in his Snuvv his secret strange hole in his gruvvulous glove Then he grunts I
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - will call you by WhispermaPhone for the secrets I tell you are for your ears alone
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - SLUPP Down slupps the WhispermaPhone to your ear and the old Oncelers whispers are not very clear
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - since they have to come down through a snergelly hose and he sounds as if he had smallish bees up his nose
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Now Ill tell you he says with his teeth sounding gray how the Lorax got lifted and taken away It all started
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - way back such a long long time back
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Way back in the days when the grass was still green and the pond was still wet and the clouds were still clean
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - and the song of the SwomeeSwans rang out in space one morning I came to this glorious place And I first
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - saw the trees The Truffula Trees The brightcolored tufts of the Truffula Trees Mile after mile in the fresh
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - morning breeze
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And under the trees I saw Brown Barbaloots frisking about in their Barbaloot suits as the played in the
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - shade and ate Truffula Fruits From the rippulous pond came the comfortable sound of the HummingFish
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - humming while splashing around
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But those trees Those trees Those Truffula Trees All my life Id been searching for trees such as these The
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - touch of their tufts was much softer than silk And they had the sweet smell of fresh butterfly milk 
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I felt a great leaping of joy in my heart I knew just what Id do I unloaded my cart In no time at all I had built
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - a small shop Then I chopped down a Truffula Tree with one chop And with great skillful skill and with great
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - speedy speed I took the soft tuft And I knitted a Thneed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - The instant Id finished I heard a gaZump I looked I saw something pop out of the stump of the tree Id
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chopped down It was sort of a man Describe himThats hard I dont know if I can He was shortish and
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - oldish and brownish and mossy And he spoke with a voice that was sharpish and bossy
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Mister He said with a sawdusty sneeze I am the Lorax I speak for the trees I speak for the trees for the trees
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - have no tongues And Im asking you sir at the top of my lungs he was very upset as he shouted and puffed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Whats that THING youve made out of my Truffula tuft
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Look Lorax I said Theres no cause for alarm I chopped just one tree I am doing no harm Im being quite
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - useful This thing is a Thneed A Thneeds a FineSomethingThatAllPeopleNeed Its a shirt Its a sock Its a
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - glove Its a hat But it has other uses Yes far beyond that You can use it for carpets For pillows For sheets
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Or curtains Or covers for bicycle seats The Lorax said Sir You are crazy with greed There is no one on earth
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - who would buy that fool Thneed
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the very next minute I proved he was wrong For just at that minute a chap came along and he thought
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - that the Thneed I had knitted was great He happily bought it for three ninetyeight I laughed at the Lorax You
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - poor stupid guy You never can tell what some people will buy
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I repeat cried the Lorax I speak for the trees
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Im busy I told him Shut up if you please I rushed cross the room and in no time at all built a radiophone I
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - put in a quick call I called all my brothers and uncles and aunts and I said listen here Heres a wonderful
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chance for the whole Onceler Family to get mighty rich Get over here fast Take the road to North Nitch Turn
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - left at Weehawken Sharp right at South Stitch
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And in no time at all in the factory I built the whole Onceler Family was working full tilt We were all knitting
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds just as busy as bees to the sound of the chopping of Truffula Trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then Oh Baby Oh How my business did grow Now chopping one tree at a time was too slow So I quickly
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - invented my SuperAxeHacker which whacked off four Truffula Trees at one smacker We were making
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds four times as fast as before And that Lorax He didnt show up any more
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the next week he knocked on my new office door He snapped Im the Lorax who speaks for the trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - which you seem to be chopping as fast as you please But Im also in charge of the Brown Barbaloots who
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - played in the shade in their Barbaloot suits and happily lived eating Truffula Fruits NOWthanks to your
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hacking my trees to the ground theres not enough Truffula Fruit to go round
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And my poor Barbaloots are all getting the crummies because they have gas and no food in their tummies
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - They loved living here But I cant let them stay Theyll have to find food And I hope that they may Good luck
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - boys he cried And he sent them away
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I the Onceler felt sad as I watched them all go BUT business is business And business must grow
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - regardless of crummies in tummies you know
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I meant no harm I most truly did not But I had to grow bigger So bigger I got I biggered my factory I
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - biggered my roads I biggered my wagons I biggered the loads of the Thneeds I shipped out I was shipping
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - them forth to the South To the East To the West To the North I went right on biggeringselling more
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds And I biggered my money which everyone needs 
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 3
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - This story is about 
2023-12-05T00:00:48,508 [INFO ] W-9000-llama-2-13b_1.0 ACCESS_LOG - /127.0.0.1:50848 "POST /predictions/llama-2-13b HTTP/1.1" 200 73
2023-12-05T00:00:48,510 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,511 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,523 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,590 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,658 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,725 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,793 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,860 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,928 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,995 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,063 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,130 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false

.....

2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,609 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Inferance
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,1701734472,beab1a87-913c-4302-9548-c25943c30243, pattern=[METRICS]
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2.4171749336E7|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:20370.777|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.job.Job - Waiting time ns: 20370777, Backend time ns: 24152110030
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - QueueTime.Milliseconds:20.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 24125
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:27.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_METRICS - HandlerTime.ms:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,requestID:beab1a87-913c-4302-9548-c25943c30243,timestamp:1701734472
  • Tested manual inference mode from the tutorial - same overhead issue

Ask

Is there something I'm missing in the config or in my use of Llama NeuronX that would remove the long-context overhead? I would like sub-second first-token latency for 500-3000 token contexts.

The alternative is to deploy with SageMaker, but I don't have that set up, because we want to rewrite inference.py to extract logits and constrain Llama to constrained generation.

Let me know if any other details would be helpful in troubleshooting this. Thanks!

Also, let me know if this is out of scope for this repo.

Hi enochlev. This is not out of scope for this repo :-) Are you seeing the 10-30 second overhead consistently for your 3000-token context, or only when running a single inference? Sometimes there is a warm-up period for model execution.
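
(A quick way to check: time the same request several times back to back; a rough sketch below, where the payload file is a placeholder for the long context.)

```python
# Sketch: time the same request several times to separate a one-off warm-up
# from a consistent per-request overhead. The payload file is a placeholder.
import time

import requests

url = "http://127.0.0.1:8080/predictions/llama-2-13b"
payload = open("long_context.txt", "rb").read()

for i in range(5):
    start = time.time()
    response = requests.post(url, data=payload)
    print(f"request {i}: {time.time() - start:.1f}s (status {response.status_code})")
```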

I appreciate your time and your response.

It is not a warm-up delay; the overhead is consistent on every inference, not just the first.

After further testing, I have found that this only happens when I run the model through TorchServe, via:

torchserve --ncs --start --model-store model_store --ts-config config.properties --models llama-2-13b

When I run model inference manually, I do not see any delay:

import time

import torch
from transformers_neuronx.llama.model import LlamaForSampling

neuron_model = LlamaForSampling.from_pretrained('model_store/llama-2-13b/llama-2-13b-split/', batch_size=2, tp_degree=6, amp='bf16')
neuron_model.load('model_store/llama-2-13b/neuron_artifacts')
neuron_model.to_neuron()

# input_ids: tokenized prompt tensor, e.g. from LlamaTokenizer on the same long context
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=len(input_ids[0]) + 80, top_k=50)
    elapsed = time.time() - start

print(f'generated sequences {generated_sequences} in {elapsed} seconds')

I even modified the entire inf2_handler.py to work outside the TorchServe API, like this:

```python
import logging
import os
from abc import ABC
from threading import Thread

import torch_neuronx
from transformers import AutoConfig, LlamaTokenizer
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter
from transformers_neuronx.llama.model import LlamaForSampling

from ts.handler_utils.hf_batch_streamer import TextIteratorStreamerBatch
from ts.handler_utils.micro_batching import MicroBatching
from ts.protocol.otf_message_handler import send_intermediate_predict_response
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class LLMHandler(BaseHandler, ABC):
    
    #Transformers handler class for text completion streaming on Inferentia2
    def __init__(self):
        super().__init__()
        self.initialized = False
        self.max_length = None
        self.tokenizer = None
        self.output_streamer = None
        # enable micro batching
        self.handle = MicroBatching(self)

    def initialize(self):
        model_dir = 'model_store/llama-2-13b'
        model_checkpoint_path = 'model_store/llama-2-13b/llama-2-13b-split/'
        os.environ["NEURONX_CACHE"] = "on"
        os.environ["NEURONX_DUMP_TO"] = f"{model_dir}/neuron_cache"
        os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"

        # micro batching initialization
        micro_batching_parallelism = True
        if micro_batching_parallelism:
            logger.info(
                f"Setting micro batching parallelism  from model_config_yaml: {micro_batching_parallelism}"
            )
            self.handle.parallelism = {'preprocess': 2, 'inference': 1, 'postprocess': 2}

        micro_batch_size = 2
        logger.info(f"Setting micro batching size: {micro_batch_size}")
        self.handle.micro_batch_size = micro_batch_size

        # settings for model compilation and loading
        amp = "bf16"
        tp_degree = 6
        self.max_length = 50

        # allocate "tp_degree" number of neuron cores to the worker process
        os.environ["NEURON_RT_NUM_CORES"] = str(tp_degree)
        try:
            num_neuron_cores_available = (
                torch_neuronx.xla_impl.data_parallel.device_count()
            )
            assert num_neuron_cores_available >= int(tp_degree)
        except (RuntimeError, AssertionError) as error:
            logger.error(
                "Required number of neuron cores for tp_degree "
                + str(tp_degree)
                + " are not available: "
                + str(error)
            )

            raise error

        self.tokenizer = LlamaTokenizer.from_pretrained(model_checkpoint_path)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = LlamaForSampling.from_pretrained(
            model_checkpoint_path,
            batch_size=self.handle.micro_batch_size,
            amp=amp,
            tp_degree=tp_degree,
        )
        logger.info("Starting to compile the model")
        print("loading artifacts.....................")
        self.model.load(f"{model_dir}/neuron_artifacts")
        self.model.to_neuron()
        logger.info("Model has been successfully compiled")
        model_config = AutoConfig.from_pretrained(model_checkpoint_path)
        self.model = HuggingFaceGenerationModelAdapter(model_config, self.model)
        self.output_streamer = TextIteratorStreamerBatch(
            self.tokenizer,
            batch_size=self.handle.micro_batch_size,
            skip_special_tokens=True,
        )

        self.initialized = True

    def preprocess(self, input_text):

        # Ensure the compiled model can handle the input received
        if len(input_text) > self.handle.micro_batch_size:
            raise ValueError(
                f"Model is compiled for batch size {self.handle.micro_batch_size} but received input of size {len(input_text)}"
            )

        # Pad input to match compiled model batch size
        input_text.extend([""] * (self.handle.micro_batch_size - len(input_text)))

        return self.tokenizer(input_text, return_tensors="pt", padding=True)

    def inference(self, tokenized_input):
        print("Inferance")
        generation_kwargs = dict(
            input_ids =tokenized_input,
            streamer=self.output_streamer,
            max_new_tokens=self.max_length,
        )
        self.model.reset_generation()
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        #micro_batch_idx = self.handle.get_micro_batch_idx()
        #micro_batch_req_id_map = self.get_micro_batch_req_id_map(micro_batch_idx)
        for new_text in self.output_streamer:
            logger.debug("send response stream")
            print(new_text)
            # send_intermediate_predict_response(
            #     new_text[: len(micro_batch_req_id_map)],
            #     micro_batch_req_id_map,
            #     "Intermediate Prediction success",
            #     200,
            #     self.context,
            # )

        thread.join()

        #return [""] * len(micro_batch_req_id_map)

    def postprocess(self, inference_output):
        return inference_output

    def get_micro_batch_req_id_map(self, micro_batch_idx: int):
        start_idx = micro_batch_idx * self.handle.micro_batch_size
        micro_batch_req_id_map = {
            index: self.context.request_ids[batch_index]
            for index, batch_index in enumerate(
                range(start_idx, start_idx + self.handle.micro_batch_size)
            )
            if batch_index in self.context.request_ids
        }

        return micro_batch_req_id_map
```

and there was no delay before generation started.

I guess I can modify this and build my own API wrapper around it, but it would be nice if I didn't have to rewrite and maintain API code.
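
If I do wrap it myself, a minimal sketch of what I have in mind (untested; FastAPI is just one option, and the paths/parameters mirror the snippet above):

```python
# Minimal sketch of a standalone HTTP wrapper around the Neuron model
# (hypothetical; not the TorchServe handler). Load and compile once at
# startup, then serve /generate.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import LlamaTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

app = FastAPI()

tokenizer = LlamaTokenizer.from_pretrained("model_store/llama-2-13b/llama-2-13b-split/")
neuron_model = LlamaForSampling.from_pretrained(
    "model_store/llama-2-13b/llama-2-13b-split/", batch_size=1, tp_degree=6, amp="bf16"
)
neuron_model.load("model_store/llama-2-13b/neuron_artifacts")
neuron_model.to_neuron()


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 80


@app.post("/generate")
def generate(req: GenerateRequest):
    input_ids = tokenizer(req.prompt, return_tensors="pt").input_ids
    with torch.inference_mode():
        output = neuron_model.sample(
            input_ids,
            sequence_length=input_ids.shape[-1] + req.max_new_tokens,
            top_k=50,
        )
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}
```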

Hi enochlev.
You are likely seeing a significant amount of overhead because the context encoding (prefill) step is not being performed in parallel and is instead falling back to populating the KV cache token by token. When initializing the model, you have the option to provide context length estimates for the number of tokens that you expect to see as inputs. By default this is set to half the size of n_positions.
You can set this field to a single integer or to a list of integers to create models for different input size estimates:

LlamaForSampling.from_pretrained(... context_length_estimate=3000, n_positions=4000)

# Alternative
LlamaForSampling.from_pretrained(... context_length_estimate=[100, 1500, 3000], n_positions=4000)

At inference time, the model will select the context-length-estimate network that fits all of the input tokens. If the input is longer than the largest estimate, the remaining tokens fall back to token-by-token prefill (this is the behavior you are currently seeing).
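
For a 500-3000 token workload, putting that together looks roughly like this (a sketch; the bucket values, n_positions, and path are illustrative):

```python
# Sketch: compile with context-length buckets that cover the expected input
# range so the prompt is prefilled in one parallel pass rather than
# token by token. Bucket values, n_positions, and the path are illustrative.
from transformers_neuronx.llama.model import LlamaForSampling

neuron_model = LlamaForSampling.from_pretrained(
    "model_store/llama-2-13b/llama-2-13b-split/",
    batch_size=2,
    tp_degree=6,
    amp="bf16",
    n_positions=4096,                                  # max prompt + generated tokens
    context_length_estimate=[512, 1024, 2048, 3072],   # buckets covering 500-3000 token prompts
)
neuron_model.to_neuron()
```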

That fixed it

That is the exact behavior I was observing. The input length (not input + output) had to be no larger than the biggest value in context_length_estimate for generation to start instantly.
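
In other words (rough check; the names refer to the values used in the snippets above):

```python
# Only the prompt length needs to fit under the largest context bucket;
# the generated tokens just need to fit within n_positions.
assert input_ids.shape[-1] <= max(context_length_estimate)    # parallel prefill
assert input_ids.shape[-1] + max_new_tokens <= n_positions    # room to generate
```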

enochlev - glad to hear this resolved your issue; closing this ticket for now.