dottxt-ai/outlines

VLLM Integration + Outlines fails on CFG

w013nad opened this issue · 0 comments

Describe the issue as clearly as possible:

With the recent updates to CFG support, I wanted to test out the integration with vLLM.

I pulled the latest git repo of outlines as of today (9/5/2024) and ran `pip install .` in a Docker image with vllm==0.6.0.

I'm not sure of the current status of the vLLM integration, but long story short, a single guided_grammar request crashes the whole server.

Steps/code to reproduce the bug:

from openai import OpenAI
import lark
base_url = 'http://10.72.5.190:15001/v1'  # vLLM OpenAI-compatible endpoint

client = OpenAI(
    api_key="EMPTY",
    base_url=base_url,
)
model_name = client.models.list().data[0].id # Grab the model name from the API
grammar_string = r"""
    start: sentence
    %import common.WS
    sentence: noun WS verb WS noun        -> simple

    noun: /[A-Za-z]+/  # match one or more letters (a general noun)
    verb: /[A-Za-z]+/  # match one or more letters (a general verb)

    
    # %ignore WS
"""
parser = lark.Lark(grammar_string)
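# Sanity check (added for illustration): the grammar builds and parses a bare
# "noun verb noun" sentence on its own, so failures below should point at the
# server side rather than at the grammar itself.
print(parser.parse("dog runs field").pretty())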

test_sentences = [
    "The dog ran quickly accross the field.",
    "The duck goes to the park.",
    "The chicken crosses the road.",
    "The cat eats the food.",
    "The dog runs around the corner.",
    "The baby laughs at the clown.",
    "The teacher writes on the board.",
    "The student reads the book.",
    "The car drives down the street.",
    "The flowers bloom in the garden.",
    "The musician plays the guitar.",
    "The athlete wins the game.",
    "The tourist visits the museum.",
    "The chef cooks the meal.",
    "The doctor examines the patient.",
    "The engineer builds the bridge.",
]

for sentence in test_sentences:
    prompt = f"""
Convert the following sentence into one that follows this grammar:
"noun verb noun"

Sentence: "{sentence}"

Only return the transformed sentence with no explanations.
"""
    messages = [{"role": "user", "content": prompt}]

    output = client.chat.completions.create(
        model=model_name,  # Model name to use
        messages=messages,  # Chat history
        max_tokens=50,
        extra_body={
            'guided_grammar': grammar_string,
        },
    )

    print('With Guided Decoding')
    print(output.choices[0].message.content)
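For reference, a control like the sketch below (not part of the original run) would exercise the same request path with vLLM's guided_regex option instead of guided_grammar; if it succeeds, the crash is specific to the CFG path. The regex is an illustrative stand-in for "noun verb noun".

# Hypothetical control request: same server and endpoint, but using
# vLLM's guided_regex option instead of guided_grammar.
output = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Write a noun-verb-noun sentence."}],
    max_tokens=50,
    extra_body={
        'guided_regex': r"[A-Za-z]+ [A-Za-z]+ [A-Za-z]+",  # illustrative pattern
    },
)
print(output.choices[0].message.content)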

Expected result:

Dog ran field.
Duck goes park.
Chicken crosses road.
....

Error message:

INFO 09-05 18:03:53 async_llm_engine.py:206] Added request chat-e6a37c28c21f43f6904a809b60aadcf5.
ERROR 09-05 18:03:53 async_llm_engine.py:63] Engine background task failed
ERROR 09-05 18:03:53 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-05 18:03:53 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-05 18:03:53 async_llm_engine.py:63]     result = task.result()
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-05 18:03:53 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-05 18:03:53 async_llm_engine.py:63]     output = await self.model_executor.execute_model_async(
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-05 18:03:53 async_llm_engine.py:63]     return await self._driver_execute_model_async(execute_model_req)
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
ERROR 09-05 18:03:53 async_llm_engine.py:63]     return await self.driver_exec_model(execute_model_req)
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 09-05 18:03:53 async_llm_engine.py:63]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-05 18:03:53 async_llm_engine.py:63]     output = self.model_runner.execute_model(
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-05 18:03:53 async_llm_engine.py:63]     return func(*args, **kwargs)
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1483, in execute_model
ERROR 09-05 18:03:53 async_llm_engine.py:63]     logits = self.model.compute_logits(hidden_or_intermediate_states,
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 438, in compute_logits
ERROR 09-05 18:03:53 async_llm_engine.py:63]     logits = self.logits_processor(self.lm_head, hidden_states,
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-05 18:03:53 async_llm_engine.py:63]     return self._call_impl(*args, **kwargs)
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-05 18:03:53 async_llm_engine.py:63]     return forward_call(*args, **kwargs)
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/logits_processor.py", line 72, in forward
ERROR 09-05 18:03:53 async_llm_engine.py:63]     logits = _apply_logits_processors(logits, sampling_metadata)
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/logits_processor.py", line 142, in _apply_logits_processors
ERROR 09-05 18:03:53 async_llm_engine.py:63]     logits_row = logits_processor(past_tokens_ids,
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 67, in __call__
ERROR 09-05 18:03:53 async_llm_engine.py:63]     instruction = self._guide.get_next_instruction(
ERROR 09-05 18:03:53 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/guide.py", line 362, in get_next_instruction
ERROR 09-05 18:03:53 async_llm_engine.py:63]     if state.parser_state is None:
ERROR 09-05 18:03:53 async_llm_engine.py:63] AttributeError: 'int' object has no attribute 'parser_state'
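From the traceback, this looks like a state-representation mismatch: vLLM's outlines_logits_processors appears to seed per-sequence guide state as a plain int (the old FSM-state convention), while the rewritten CFG guide's get_next_instruction dereferences a parser_state attribute on whatever state it receives. The sketch below is a minimal stand-in that reproduces the same failure mode; CFGGuideSketch is hypothetical, not outlines' actual class.

# Minimal stand-in, not outlines' real code: an int state reaches a guide
# that expects a state object carrying a .parser_state attribute.
class CFGGuideSketch:
    def get_next_instruction(self, state):
        # Mirrors the check at outlines/fsm/guide.py line 362 in the traceback.
        if state.parser_state is None:
            return None

guide = CFGGuideSketch()
guide.get_next_instruction(0)  # AttributeError: 'int' object has no attribute 'parser_state'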

Outlines/Python version information:

```
root@4da9526c2038:/home/ndurkee# python3 -c "from outlines import _version; print(_version.version)"
0.0.47.dev69+g72377db.d20240906
root@4da9526c2038:/home/ndurkee# python3 -c "import sys; print('Python', sys.version)"
Python 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0]
root@4da9526c2038:/home/ndurkee# pip freeze
accelerate==0.34.0
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
async-timeout==4.0.3
attrs==24.2.0
certifi==2019.11.28
chardet==3.0.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
datasets==2.21.0
dbus-python==1.2.16
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
distro-info==0.23+ubuntu1.1
exceptiongroup==1.2.2
fastapi==0.112.2
filelock==3.15.4
flashinfer @ https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp310-cp310-linux_x86_64.whl#sha256=d7605fbe3f14ef7f36e702f627c1f06e5a32495b5ebfe34313c3fb15f3e4eb06
frozenlist==1.4.1
fsspec==2024.6.1
gguf==0.9.1
h11==0.14.0
hf_transfer==0.1.8
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.2
huggingface-hub==0.24.6
idna==2.8
importlib_metadata==8.4.0
interegular==0.3.3
Jinja2==3.1.4
jiter==0.5.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
lark==1.2.2
llvmlite==0.43.0
lm-format-enforcer==0.10.6
MarkupSafe==2.1.5
mistral_common==1.3.4
modelscope==1.17.1
mpmath==1.3.0
msgpack==1.0.8
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.3
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
openai==1.43.0
outlines @ file:///home/ndurkee/outlines.zip#sha256=bd3a3782c596af0c846f17541bc91ad36b06b47022155baad8dc0009cba98131
packaging==24.1
pandas==2.2.2
partial-json-parser==0.2.1.1.post4
pillow==10.4.0
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.20.0
protobuf==5.28.0
psutil==6.0.0
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==17.0.0
pycountry==24.6.1
pydantic==2.8.2
pydantic_core==2.20.1
PyGObject==3.36.0
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.2
pyzmq==26.2.0
ray==2.35.0
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
requests-unixsocket==0.2.0
rpds-py==0.20.0
safetensors==0.4.4
sentencepiece==0.2.0
six==1.14.0
sniffio==1.3.1
starlette==0.38.4
sympy==1.13.2
tiktoken==0.7.0
tokenizers==0.19.1
torch==2.4.0
torchvision==0.19.0
tqdm==4.66.5
transformers==4.44.2
triton==3.0.0
typing_extensions==4.12.2
tzdata==2024.1
unattended-upgrades==0.1
urllib3==2.2.2
uvicorn==0.30.6
uvloop==0.20.0
vllm @ file:///vllm-workspace/dist/vllm-0.6.0-cp38-abi3-linux_x86_64.whl#sha256=7544ea84033999d0093a8829f7d84d556a47f41f2eb7ff579478458bfc08f2c7
vllm-flash-attn==2.6.1
watchfiles==0.24.0
websockets==13.0.1
xformers==0.0.27.post2
xxhash==3.5.0
yarl==1.9.9
zipp==3.20.1
```

Context for the issue:

We're trying to get language models to output sentences in a particular format while still using our main production API with vLLM. It's cost-prohibitive to host multiple APIs with different LLMs, so it's better if we can do everything through the same API.