deep-diver/LLM-As-Chatbot

query regarding MPT, RedPajama and Falcon models

GeorvityLabs opened this issue · 17 comments

Hey @deep-diver,

is it possible to load
mpt-7b-chat
redpajama-7b-chat
falcon-7b-instruct

in 8-bit?

Have you tried loading these models in 8-bit?
If so, how did you do it?

Are they supported for 8-bit inference using bitsandbytes?
If so, could you share an example implementation/configuration for loading these models in 8-bit?

I think they are supported.
I have only tested the MPT and Falcon models.

Just uncheck the Multi GPU option; the model will then be loaded with load_in_8bit=True.
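Under the hood that option just toggles the load_in_8bit flag, roughly as follows (a minimal sketch with a hypothetical helper name, not the repo's exact code; it assumes transformers, accelerate, and bitsandbytes are installed):

import torch
from transformers import AutoModelForCausalLM

def load_base_model(model_name: str, multi_gpu: bool):
    # Hypothetical helper for illustration: 8-bit loading is enabled
    # whenever the Multi GPU option is off.
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit=not multi_gpu,   # 8-bit weights via bitsandbytes
        torch_dtype=torch.bfloat16,
        device_map="auto",            # requires accelerate
        trust_remote_code=True,
    )

# e.g. load_base_model("tiiuae/falcon-7b-instruct", multi_gpu=False)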

@deep-diver,
how can I update the following code to enable 8-bit inference?

model_name = "tiiuae/falcon-7b-instruct"
print(f"Starting to load the model into memory")

m = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map = "auto"
)

tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

I did try adding load_in_8bit=True, but it just throws some errors.

What kind of error?

@deep-diver this is the error that I'm getting:

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Exception in thread Thread-6 (generate_and_signal_complete):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/kaggle/working/falcon-chat/app.py", line 102, in generate_and_signal_complete
    m.generate(**generate_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1565, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2648, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

My project works with the following code:

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

def load_model(base, finetuned, multi_gpu, force_download_ckpt):
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.padding_side = "left"

    model = AutoModelForCausalLM.from_pretrained(
        base,
        load_in_8bit=False if multi_gpu else True,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True
    )    

    return model, tokenizer

Here, I found that torch_dtype should be set to torch.bfloat16. Also, please refer to the GenerationConfig that I set here.
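For illustration, a GenerationConfig set up along those lines might look like this (the numbers are placeholders only; the actual values live in the project's config files and may differ):

from transformers import GenerationConfig

# model and tokenizer as returned by load_model() above
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,       # placeholder sampling values
    top_p=0.9,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # Falcon defines no pad token by default
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(output[0], skip_special_tokens=True))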

@deep-diver, I made a Kaggle notebook running on a P100.
It has the same issue, even though I followed your configuration.

I hope you can check my notebook and let me know why I'm getting the error:
https://www.kaggle.com/johnnycole5/falcon-7b-inference

This is the code I used:

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

import transformers

base = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    base,
    load_in_8bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

sequences = pipeline(
   "Draft an apology email to a customer who experienced a delay in their order and provide reassurance that the issue has been resolved",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

In Colab I needed to completely restart the kernel, not just reset the runtime.

@deep-diver, I also have a request for you.
Could you create a Colab notebook in which we can run LLM As Chatbot (all the models that can run on the free GPU)?

It would be great if we had a separate project containing the models that can be run on the free Colab GPU, including loading in 8-bit.

I'll try resetting the runtime after installing everything using pip and see if it works.

I'm getting the same error as below, even after restarting the kernel.
[Screenshot from 2023-05-31 14-19-42]

What PyTorch version do you use?

@deep-diver I use torch 2.0.1.

Also, I just cloned the LLM-As-Chatbot repo, and it downloads successfully.
But when I try to run the model, I get the following error:

File "/kaggle/working/LLM-As-Chatbot/chats/falcon.py", line 87, in chat_stream
    for ppmanager, uis in text_stream(ppm, streamer):
  File "/kaggle/working/LLM-As-Chatbot/chats/falcon.py", line 31, in text_stream
    for new_text in streamer:
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/streamers.py", line 223, in __next__
    value = self.text_queue.get(timeout=self.timeout)
  File "/opt/conda/lib/python3.10/queue.py", line 179, in get
    raise Empty
_queue.Empty
^C
Keyboard interruption in main thread... closing server.
Killing tunnel 0.0.0.0:6006 <> https://3cbb19d345b6ffb752.gradio.live/

Do you have a Colab notebook link for LLM-As-Chatbot? @deep-diver

@deep-diver this is the error I'm getting when running Falcon 7B in LLM As Chatbot:

/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1255: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Exception in thread Thread-6 (generate):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1568, in generate
    return self.sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2656, in sample
    raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
ValueError: If `eos_token_id` is defined, make sure that `pad_token_id` is defined.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/gradio/routes.py", line 422, in run_predict
    output = await app.get_blocks().process_api(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1067, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/opt/conda/lib/python3.10/site-packages/gradio/utils.py", line 336, in async_iteration
    return await iterator.__anext__()
  File "/opt/conda/lib/python3.10/site-packages/gradio/utils.py", line 329, in __anext__
    return await anyio.to_thread.run_sync(
  File "/opt/conda/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/opt/conda/lib/python3.10/site-packages/gradio/utils.py", line 312, in run_sync_iterator_async
    return next(iterator)
  File "/kaggle/working/LLM-As-Chatbot/chats/central.py", line 184, in chat_stream
    for idx, x in enumerate(cs):
  File "/kaggle/working/LLM-As-Chatbot/chats/falcon.py", line 87, in chat_stream
    for ppmanager, uis in text_stream(ppm, streamer):
  File "/kaggle/working/LLM-As-Chatbot/chats/falcon.py", line 31, in text_stream
    for new_text in streamer:
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/streamers.py", line 223, in __next__
    value = self.text_queue.get(timeout=self.timeout)
  File "/opt/conda/lib/python3.10/queue.py", line 179, in get
    raise Empty
_queue.Empty

Everything seems to be working fine for me, even on a Kaggle notebook with a P100 (see the link below). However, LLM As Chatbot would not work on Colab, since the temporary connection between Gradio and Colab is not stable.

https://www.kaggle.com/code/tomcspark/falcon-7b-inference

_queue.Empty normally happens if the GenerationConfig is not set appropriately. Choose the right one by matching the model name against the corresponding YAML config. I will update the project later to select the right one automatically.
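For context, the chat modules stream tokens by running generate() on a worker thread and reading from a TextIteratorStreamer. A rough sketch of that pattern (illustrative only, not the repo's exact code) shows why an exception on the generation side, such as the pad_token_id ValueError above, surfaces as _queue.Empty on the reader side:

from threading import Thread
from transformers import TextIteratorStreamer

# model and tokenizer loaded as in the earlier snippets (8-bit, bfloat16)
streamer = TextIteratorStreamer(
    tokenizer, timeout=20.0, skip_prompt=True, skip_special_tokens=True
)
inputs = tokenizer("Write a short greeting.", return_tensors="pt").to(model.device)

generate_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=128,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # avoids the ValueError above for Falcon
)

# If generate() raises inside this thread, the thread dies and the loop
# below eventually times out with _queue.Empty.
thread = Thread(target=model.generate, kwargs=generate_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()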