AI Agent with Vicuna
This repository is a playground that makes it easy to use Zero Shot / Few Shot prompts based on the ReAct framework, with LLama based models, such as Vicuna, with the langchain framework.
Disclaimer: This may not be the most effective way to install, but it's how I've done. It's possible that the installation process is not working as expected, and you might have to install additional requirements. If that's the case, feel free to open a PR to fix it or an issue about it.
First, install the NVIDIA toolkit: https://developer.nvidia.com/cuda-11-8-0-download-archive
In my case, I installed using the deb repository, which downloaded a CUDA toolkit 12.1
If you don't have a GPU, you should skip this, of course.
Run, in your bash, run:
chmod +x ./install_on_virtualenv_and_pip.sh
./install_on_virtualenv_and_pip.sh`
If you're not running bash, and you don't have a virtualenv, or created it another way, you can execute the commands, adapting to your OS/shell:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip3 install -r requirements.txt
Just note that might clobber your main Python installation, if possible, use a virtualenv!
For the quantized models, you also need git lfs installed: https://git-lfs.com/
This is useful when:
This is the recommended approach for most users. You should only use option 2 if you need prompt logging.
With this approach you can load quantized models (see how to run the newest models as described here: paolorechia#24)
You can also change any of the available parameters:
def default_parameters():
return {
"max_new_tokens": 250,
"do_sample": True,
"temperature": 0.001,
"top_p": 0.1,
"typical_p": 1,
"repetition_penalty": 1.2,
"top_k": 1,
"min_length": 32,
"no_repeat_ngram_size": 0,
"num_beams": 1,
"penalty_alpha": 0,
"length_penalty": 1,
"early_stopping": False,
"seed": -1,
"add_bos_token": True,
"truncation_length": 2048,
"ban_eos_token": False,
"skip_special_tokens": True,
"stopping_strings": [STOP_TOKEN + ":"],
}
If you define your own parameters, you should pass them in the LLM builder function:
def build_text_generation_web_ui_client_llm(
prompt_url="http://localhost:5000/api/v1/generate", parameters=None # override these parameters
):
if parameters is None:
parameters = default_parameters()
return HTTPBaseLLM(
prompt_url=prompt_url,
parameters=parameters,
stop_parameter_name="stopping_strings",
response_extractor=response_extractor,
)
Also make sure to update the response_extractor
if you modify the stopping_strings
parameter, or else things will break in unexpected ways.
Steps to install:
- Use https://github.com/oobabooga/text-generation-webui as the backend.
- Install the text-generation-webui as instructed in the repository README,
- Download a model and start the server / UI.
- In the UI, go to Interface Mode -> Available Extensions -> api (tick on this one). Click on Apply and restart the interface.
Some code samples will not work out of the box with this option. To use the Text Generation WebUI you should use the correct LLM client:
from langchain_app.models.text_generation_web_ui import build_text_generation_web_ui_client_llm
llm = build_text_generation_web_ui_client_llm()
You can see Chuck Norris example using it here: langchain_app/agents/chuck_norris_test_web_generation_textui.py
To execute it and test it:
python3 -m langchain_app.agents.chuck_norris_test_web_generation_textui
Update 13.05.2023: I don't recommend this option at the moment, as it seems there are some open bugs with the quantized version, and I'm not planning to fix them anytime soon. Please use the text generation ui instead if you want to use quantized models. Open bugs:
This option only makes sense if you want to use my server prompt logging feature to generate datasets. Currently only working with HF models (at most 8 bits quantization).
If you have the virtualenv, start it by running: source learn-langchain/bin/activate
Then:
uvicorn servers.vicuna_server:app
If you want to load a different model than the default:
export USE_13B_MODEL=true && export USE_4BIT=true && uvicorn servers.vicuna_server:app
If you've somehow managed to install everything on Windows (congrats!), please feel free to contribute to extend this README.md. So far we know that you need to follow this change:
Your command to use a different model returns an error if you try and run it on windows:
export USE_13B_MODEL=true && export USE_4BIT=true && uvicorn servers.vicuna_server:app
The export and && isn't supported. I found that you can use the set command and format it like this:
set USE_13B_MODEL=true; set USE_4BIT=true;uvicorn servers.vicuna_server:app
Thanks to @unoriginalscreenname for sharing
When you run it for the first time, the server might throw you an error that the model is not found. You should follow the instruction, for instance, cloning:
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g
Make sure you have installed git lfs
first. You can also run this command beforehand, if you know the version you want to use.
This repository has again been reorganized, adding support for 4 bit models (from gpqt_for_llama: https://github.com/qwopqwop200/GPTQ-for-LLaMa)
You can now change the server behavior by setting environment variables:
class Config:
def __init__(self) -> None:
self.base_model_size = "13b" if os.getenv("USE_13B_MODEL") else "7b"
self.use_for_4bit = True if os.getenv("USE_FOR_4BIT") else False
self.use_fine_tuned_lora = True if os.getenv("USE_FINE_TUNED_LORA") else False
self.lora_weights = os.getenv("LORA_WEIGHTS")
self.device = "cpu" if os.getenv("USE_CPU") else "cuda"
self.model_path = os.getenv("MODEL_PATH", "")
self.model_checkpoint = os.getenv("MODEL_CHECKPOINT", "")
Some options are incompatible with each other, the code does not check for all possibilities.
This repository's web server support the following models:
- Vicuna 7b unquantized, HF format (16-bits) - this is the default (https://huggingface.co/eachadea/vicuna-7b-1.1)
- Vicuna 7b LoRA fine-tune (8-bits)
- Vicuna 7b GPQT 4-bit group-size 128
- Vicuna 13b unquantized, HF format (16-bits)
- Vicuna 13b GPQT 4-bit group-size 128
Note: The coding prompts are currently unreliable, and what works with one model might not work with another. Changing the model parameters also greatly affect the output. For instance, increasing repetition penalty to 1.3 seems to improve the quality of the output code, when using the WizardLM model.
Using this to start server:
export USE_4BIT=true
export USE_13B_MODEL=true
uvicorn servers.vicuna_server:app
You can get this output:
(learn-langchain) paolo@paolo-MS-7D08:~/learn-langchain$ python3 -m langchain_app.agents.cat_joke
> Entering new AgentExecutor chain...
I must use the Python REPL to write a script that generates cat jokes and saves them to a CSV file called 'catjokes.csv'.
Action: Python REPL
Action Input:
import csv # line 1
jokes = ["Why did the cat go to the vet?", "Because it was meowt!"] # line 2
with open("catjokes.csv", "w", newline="") as csvfile: # line 3
writer = csv.writer(csvfile) # line 4
writer.writerow(jokes) # line 5
Observation:
Thought:I now know the answer
Final Answer: I have executed the task successfully.
Final Answer: I have executed the task successfully.
> Finished chain.
(learn-langchain) paolo@paolo-MS-7D08:~/learn-langchain$ cat catjokes.csv
Why did the cat go to the vet?,Because it was meowt!
Note: The original prompt from the article is not working anymore, as I might have modified something in the process.
The model that seems to perform better at this sort of task is the new WizardLM: https://huggingface.co/TheBloke/wizardLM-7B-HF
Server start. Modify the file 'run_server.sh' with this:
export MODEL_PATH=TheBloke/wizardLM-7B-HF
uvicorn servers.vicuna_server:app
> Entering new AgentExecutor chain...
I should use requests library to fetch the data from the API
Action: Python REPL
Action Input:
import requests
Observation:
Thought:I should define the endpoint URL
Action: Python REPL
Action Input:
url = 'https://api.chucknorris.io/jokes/random'
Observation:
Thought:I should send a GET request to the endpoint
Action: Python REPL
Action Input:
response = requests.get(url)
Observation:
Thought:I should check if the response status code is 200
Action: Python REPL
Action Input:
if response.status_code == 200:
# extract the joke from the response
data = response.json()
joke = data['value']
print(joke)
else:
print('Error fetching joke')
Observation: Chuck Norris once ran a one-minute mile. He did it dragging an 18-wheeler while running in a field of wet cement.
Thought:I have successfully fetched and printed the joke.
Final Answer: Chuck Norris once ran a one-minute mile. He did it dragging an 18-wheeler while running in a field of wet cement.
> Finished chain.
(learn-langchain) paolo@paolo-MS-7D08:~/learn-langchain$
python -m langchain_app.agents.answer_about_germany
Sample output:
Type your question: Where is Germany Located?
> Entering new AgentExecutor chain...
I should always think about what to do
Action: Search
Action Input: Germany
Observation: [(Document(page_content="'''Germany''',{{efn|{{lang-de|Deutschland}}, {{IPA-de|ˈdɔʏtʃlant|pron|De-Deutschland.ogg}}}} officially the '''Federal Republic of Germany''',{{efn|{{Lang-de|Bundesrepublik Deutschland}}, {{IPA-de|ˈbʊndəsʁepuˌbliːk ˈdɔʏtʃlant|pron|De-Bundesrepublik Deutschland.ogg}}<ref>{{cite book|title=Duden, Aussprachewörterbuch|publisher=Dudenverlag|year=2005|isbn=978-3-411-04066-7|editor-last=Mangold|editor-first=Max|edition=6th|pages=271, 53f|language=de}}</ref>}} is a country in [[Central Europe]]. It is the [[List of European countries by population|second-most populous country]] in Europe after [[Russia]], and the most populous [[member state of the European Union]]. Germany is situated between the [[Baltic Sea|Baltic]] and [[North Sea|North]] seas to the north, and the [[Alps]] to the south. Its 16 [[States of Germany|constituent states]] are bordered by [[Denmark]] to the north, [[Poland]] and the [[Czech Republic]] to the east, [[Austria]] and [[Switzerland]] to the south, and [[France]], [[Luxembourg]], [[Belgium]], and the [[Netherlands]] to the west. The nation's capital and [[List of cities in Germany by population|most populous city]] is [[Berlin]] and its main financial centre is [[Frankfurt]]; the largest urban area is the [[Ruhr]].", metadata={'source': '1'}), 0.8264833092689514)]
Thought:I now know the final answer
Final Answer: Germany is a country located in Central Europe, bordered by the Baltic and North seas to the north, the Alps to the south, and bordered by Denmark, Poland, the Czech Republic, Austria, Switzerland, France, Luxembourg, Belgium, and the Netherlands to the west. Its capital and most populous city is Berlin and its main financial center is Frankfurt. The largest urban area is the Ruhr.
Nothign is guaranteed to work here.
This is an experimental task executor I've been developing at: https://github.com/paolorechia/code-it
You can try it using this example:
python -m langchain_app.agents.coder_plot_chart_executor_test
I'll explain this better at some point :)
Install matplotlib
and run:
python -m langchain_app.agents.coder_plot_chart
If all works well, the bot will create a example_chart.png
file and a persistent_source.py
with the source code generated.
https://medium.com/@paolorechia/creating-my-first-ai-agent-with-vicuna-and-langchain-376ed77160e3
https://medium.com/@paolorechia/fine-tuning-my-first-wizardlm-lora-ca75aa35363d
https://gist.github.com/paolorechia/0b8b5e08b38040e7ec10eef237caf3a5
- The quantized models are a direct copy of this repository: https://github.com/qwopqwop200/GPTQ-for-LLaMa
- The inference code is slightly modified version of FastChat: https://github.com/lm-sys/FastChat
- Thanks to oobagooba's contributors (https://github.com/oobabooga/text-generation-webui) for creating an easy to use backend.
- All model creators / community that are training and sharing models