/Alpaca-LoRA-Serve

Demonstrate LLaMA as a Service

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

UPDATE

  • Internet search support: you can enable internet search in both the Gradio application and the Discord bot. For Gradio, there is an internet mode option in the control panel. For Discord, you need to specify the --internet option in your prompt. In both cases, you need a Serper API key, which you can get from serper.dev. Signing up gives you 2,500 free Google searches, which is sufficient for a long-term test.
  • Discord bot support: you can serve any model from the model zoo as a Discord bot. See the Instructions section below for how to do this.

💬🚀 LLM as a Chatbot Service

The purpose of this repository is to let people use many open-source, instruction-following fine-tuned LLMs as a chatbot service. Because different models behave differently and require differently formatted prompts, I made a very simple library, Ping Pong, for model-agnostic conversation and context management.

I also made GradioChat, a UI that looks similar to HuggingChat but is built entirely in Gradio. Those two projects are fully integrated to power this project.

Easiest way to try out ( ✅ Gradio, 🚧 Discord Bot )

Jarvislabs.ai

This project has become one of the default frameworks at jarvislabs.ai. Jarvislabs.ai is one of the cloud GPU VM providers with the cheapest GPU prices. Furthermore, the weights of the supported popular open-source LLMs are pre-downloaded, so you don't need to waste money and time waiting for hundreds of GBs to download before trying out a collection of LLMs. In less than 10 minutes, you can try out any model.

  • For further instructions on how to run the Gradio application, please follow the official documentation on the llmchat framework.

dstack

dstack is an open-source tool that lets you run LLM-based apps in a cloud of your choice with a single command. dstack supports AWS, GCP, Azure, Lambda Cloud, etc.

Use the gradio.dstack.yml and discord.dstack.yml configurations to run the Gradio app and Discord bot via dstack.

Instructions

Standalone Gradio app

  1. Prerequisites

    Note that the code only works with Python >= 3.9 and gradio >= 3.32.0.

    $ conda create -n llm-serve python=3.9
    $ conda activate llm-serve
  2. Install dependencies.

    $ cd LLM-As-Chatbot
    $ pip install -r requirements.txt
  3. Run the Gradio application.

    The Gradio application has no required parameters. However, a few details are worth noting. When --local-files-only is set, the application won't try to look up the Hugging Face Hub (remote). Instead, it will only use files that are already downloaded and cached.

    Hugging Face libraries store downloaded content under ~/.cache by default, and this application assumes so. However, if you downloaded the weights to a different location for some reason, you can set the HF_HOME environment variable. Find more about the environment variables here.

    To use the internet search capability, you need a Serper API key. You can set it manually in the control panel or on the CLI. When the Serper API key is specified on the CLI, it is injected into the corresponding UI control. If you don't have a key yet, please get one from serper.dev. Signing up gives you 2,500 free Google searches, which is sufficient for a long-term test.

    $ python app.py --root-path "" \
                    --local-files-only \
                    --share \
                    --debug \
                    --serper-api-key "YOUR SERPER API KEY"
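The cache lookup described above can be sketched in Python. This is a minimal illustration, not code from this repository: `resolve_hf_home` is a hypothetical helper that follows the documented precedence of the Hugging Face libraries (HF_HOME first, then XDG_CACHE_HOME, then ~/.cache), so you can check where --local-files-only would expect the weights to be.

```python
import os
from pathlib import Path


def resolve_hf_home(env=None):
    """Return the directory Hugging Face libraries are expected to use as their home.

    Hypothetical helper for illustration. Precedence assumed here:
    HF_HOME, then $XDG_CACHE_HOME/huggingface, then ~/.cache/huggingface.
    """
    env = os.environ if env is None else env
    if env.get("HF_HOME"):
        # An explicit HF_HOME always wins.
        return Path(env["HF_HOME"]).expanduser()
    # Fall back to the generic cache root, defaulting to ~/.cache.
    cache_root = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(cache_root).expanduser() / "huggingface"


if __name__ == "__main__":
    print(f"Weights are expected under: {resolve_hf_home()}")
```

If the weights live elsewhere, exporting HF_HOME in the shell before launching app.py should make this helper and the Hugging Face libraries agree on the location.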

Discord Bot

  1. Prerequisites

    Note that the code only works with Python >= 3.9.

    $ conda create -n llm-serve python=3.9
    $ conda activate llm-serve
  2. Install dependencies.

    $ cd LLM-As-Chatbot
    $ pip install -r requirements.txt
  3. Run the Discord bot application. Choose one of the modes in --mode-[cpu|mps|8bit|4bit|full-gpu]. full-gpu is chosen by default (full means half; consider this a typo to be fixed later).

    --token is a required parameter, and you can get it from the Discord Developer Portal. If you have not set up a Discord bot in the Discord Developer Portal yet, please follow the How to Create a Discord Bot Account section of the freeCodeCamp tutorial to get the token.

    --model-name is a required parameter; you can browse the list of supported models in model_cards.json.

    --max-workers determines how many requests are handled concurrently. It simply sets the max_workers value of the ThreadPoolExecutor.

    When --local-files-only is set, the application won't try to look up the Hugging Face Hub (remote). Instead, it will only use files that are already downloaded and cached.

    To use the internet search capability, you need a Serper API key. If you don't have one yet, please get one from serper.dev. Signing up gives you 2,500 free Google searches, which is sufficient for a long-term test. Once you have the Serper API key, you can specify it with the --serper-api-key option.

    • Hugging Face libraries store downloaded content under ~/.cache by default, and this application assumes so. However, if you downloaded the weights to a different location for some reason, you can set the HF_HOME environment variable. Find more about the environment variables here.
    $ python discord_app.py --token "DISCORD BOT TOKEN" \
                            --model-name "alpaca-lora-7b" \
                            --max-workers 1 \
                            --mode-[cpu|mps|8bit|4bit|full-gpu] \
                            --local-files-only \
                            --serper-api-key "YOUR SERPER API KEY"
  4. Supported Discord Bot commands

    There are no slash commands. The only way to interact with the deployed Discord bot is to mention it. However, you can pass some special strings while mentioning the bot.

    • @bot_name help: displays a simple help message.
    • @bot_name model-info: displays the information of the currently selected (deployed) model from model_cards.json.
    • @bot_name default-params: displays the default parameters used in the model's generate method. That is the GenerationConfig, which holds parameters such as temperature, top_p, and so on.
    • @bot_name user message --max-new-tokens 512 --temperature 0.9 --top-p 0.75 --do_sample --max-windows 5 --internet: all parameters except max-windows dynamically control the text generation behaviour, as in GenerationConfig. max-windows determines how many past conversations to look up as a reference. The default value is 3, but you can increase it as the conversation grows longer. --internet will try to answer your prompt by aggregating information scraped from Google search. To use the --internet option, you need to specify --serper-api-key when starting the program.
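To make the last command form concrete, here is a minimal sketch of how such a message could be split into a prompt and its options. It is an assumption for illustration, not the bot's actual parser: `build_option_parser` and `parse_mention` are hypothetical names, the input is assumed to have the @bot_name mention already stripped, and everything before the first `--` token is treated as the user message.

```python
import argparse


def build_option_parser(default_max_windows=3):
    """Build a parser for the per-message flags listed above (hypothetical)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--max-new-tokens", type=int, default=None)
    parser.add_argument("--temperature", type=float, default=None)
    parser.add_argument("--top-p", type=float, default=None)
    parser.add_argument("--do_sample", action="store_true")
    parser.add_argument("--max-windows", type=int, default=default_max_windows)
    parser.add_argument("--internet", action="store_true")
    return parser


def parse_mention(text):
    """Split a message into (prompt, options dict).

    Tokens before the first "--flag" form the prompt; the rest are options.
    Unknown flags are ignored rather than rejected.
    """
    tokens = text.split()
    first_flag = next(
        (i for i, tok in enumerate(tokens) if tok.startswith("--")),
        len(tokens),
    )
    prompt = " ".join(tokens[:first_flag])
    options, _unknown = build_option_parser().parse_known_args(tokens[first_flag:])
    return prompt, vars(options)


if __name__ == "__main__":
    prompt, options = parse_mention(
        "tell me a joke --max-new-tokens 512 --temperature 0.9 --max-windows 5 --internet"
    )
    print(prompt)
    print(options)
```

A limitation of this sketch is that a user message that itself contains a token starting with `--` would be cut short at that token.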

Context management

Different models might have different strategies for managing context, so if you want to know the exact strategy applied to each model, take a look at the chats directory. However, here are the basic ideas I came up with initially. I found that long prompts eventually slow down generation a lot, so I thought prompts should be kept as short and concise as possible. In the previous version, I accumulated all the past conversations, and that didn't go well.

  • In every turn of the conversation, the past N conversations are kept. Think of N as a hyper-parameter. As an experiment, only the past 2-3 conversations are currently kept for all models.
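The sliding-window idea can be sketched as follows. This is an illustrative assumption rather than the repository's implementation: `build_prompt` is a hypothetical function, and the "User:" / "Assistant:" layout is a placeholder, since the real prompt format differs per model (see the chats directory).

```python
def build_prompt(history, user_message, max_windows=3):
    """Build a prompt from only the most recent `max_windows` exchanges.

    history: list of (user_text, bot_text) tuples, oldest first.
    The "User:"/"Assistant:" layout is a placeholder format.
    """
    # Guard against max_windows == 0, because history[-0:] is the whole list.
    recent = history[-max_windows:] if max_windows > 0 else []
    lines = []
    for user_text, bot_text in recent:
        lines.append(f"User: {user_text}")
        lines.append(f"Assistant: {bot_text}")
    # The new message goes last, with an open slot for the model's reply.
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)


if __name__ == "__main__":
    past = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
    print(build_prompt(past, "q5", max_windows=2))
```

Because older exchanges are dropped instead of accumulated, the prompt length stays bounded by N regardless of how long the conversation runs.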

Currently supported models

Check out the list of models.

Todos

  • Gradio components to control the configurations of the generation
  • Multiple conversation management
  • Internet search capability (by integrating ChromaDB, intfloat/e5-large-v2)
  • Implement server only option w/ FastAPI

Acknowledgements

  • I am thankful to Jarvislabs.ai, who generously provided free GPU resources to experiment with Alpaca-LoRA deployment and share it with the community to try out.
  • I am thankful to AI Network, who generously provided an A100 (40G) x 8 DGX workstation for fine-tuning and serving the models.