/FastChat-TPU

An open platform for serving, and evaluating large language models.

Primary LanguagePythonApache License 2.0Apache-2.0

FastChat-TPU

Original repository address FastChat

demo

Install

Install Package

pip3 install --upgrade pip  # enable PEP 660 support
pip3 install -e ".[model_worker,webui]"

Serving with Web GUI

To serve using the web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the webserver and model workers. You can learn more about the architecture here.

Here are the commands to follow in your terminal:

Launch the controller

python3 -m fastchat.serve.controller

This controller manages the distributed workers.

Launch the model worker(s)

python3 -m fastchat.serve.model_worker --model-path ./bmodel/chatglm3-6b --device tpu --dev_id 0

Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller .

To ensure that your model worker is connected to your controller properly, send a test message using the following command:

python3 -m fastchat.serve.test_message --model-path ./bmodel/chatglm3-6b --model-name chatglm3-6b --device tpu --dev_id 0

You will see a short output.

Launch the Gradio web server

python3 -m fastchat.serve.gradio_web_server

This is the user interface that users will interact with.

By following these steps, you will be able to serve your models using the web UI. You can open your browser and chat with a model now. If the models do not show up, try to reboot the gradio web server.

(Optional): Advanced Features, Scalability, Third Party UI

  • You can register multiple model workers to a single controller, which can be used for serving a single model with higher throughput or serving multiple models at the same time. When doing so, please allocate different GPUs and ports for different model workers.
# worker 0
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5 --controller http://localhost:21001 --port 31000 --worker http://localhost:31000
# worker 1
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.model_worker --model-path lmsys/fastchat-t5-3b-v1.0 --controller http://localhost:21001 --port 31001 --worker http://localhost:31001
  • You can also launch a multi-tab gradio server, which includes the Chatbot Arena tabs.
python3 -m fastchat.serve.gradio_web_server_multi
  • The default model worker based on huggingface/transformers has great compatibility but can be slow. If you want high-throughput batched serving, you can try vLLM integration.
  • If you want to host it on your own UI or third party UI, see Third Party UI.

API

OpenAI-Compatible RESTful APIs & SDK

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. The FastChat server is compatible with both openai-python library and cURL commands. The REST API is capable of being executed from Google Colab free tier, as demonstrated in the FastChat_API_GoogleColab.ipynb notebook, available in our repository. See docs/openai_api.md.

Hugging Face Generation APIs

See fastchat/serve/huggingface_api.py.

LangChain Integration

See docs/langchain_integration.