This repo aims to help you spin up resources and get started with open source LLMs.
It contains:
- Infra to provision a GPU instance on GCP, as many LLMs are too big to run on a CPU
- Instructions to install basic requirements
- Kickstarter notebook for loading and running open source LLMs
- Streamlit app to allow you to chat with your LLM!
- Login to GCP:
  ```
  gcloud auth login
  ```
- Select and set your GCP project:
  ```
  gcloud config set project {project}
  ```
  - You can run the following to find your personal project ID:
    ```
    gcloud projects list
    ```
  - If you don't have a project, you can create a new one:
    ```
    gcloud projects create {project}
    ```
- Open the `/infra` directory
- Within the `terraform.tfvars` file, set the following variables:
  - `author`: Your name
  - `project`: Your GCP project ID, i.e. `{project}` from above
- Optionally set other variables (see `variables.tf`) in the `terraform.tfvars` file
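For reference, a filled-in `terraform.tfvars` could look something like this (placeholder values; the exact names of any optional variables are defined in `variables.tf`):

```hcl
# Illustrative placeholder values
author  = "your-name"
project = "your-gcp-project-id"
```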
- Initialize terraform:
  ```
  terraform init
  ```
- Check the plan with:
  ```
  terraform plan
  ```
- Spin up the infra:
  ```
  terraform apply
  ```
- A GPU instance should now be ready for you. You should be able to see it here: https://console.cloud.google.com/compute/instances?project={project}
Note: The instance we provision is an `n1-standard-8` with an `NVIDIA Tesla V100` GPU. This costs about €2.30 per hour. It's configured to stop after 1 hour of idle time, but please be mindful of the costs, and still shut it down manually if you're not using it. You can change the compute type in `terraform.tfvars`, and find more pricing info in Google Cloud's GPU pricing documentation.
- Configure ssh for the compute instance:
  ```
  gcloud compute config-ssh
  ```
  Note: It will take a few minutes before the instance is ready to accept ssh connections. Wait a bit and try again if the next step doesn't work.
- Connect to the instance as prompted:
  ```
  ssh {author}-instance.{zone}.{project}
  ```
- In VSCode, you can also connect to the instance by:
  - Opening the command palette (Command+Shift+P)
  - Selecting "Remote-SSH: Connect to Host..."
  - Selecting your compute instance: `{author}-instance.{zone}.{project}`
- Clone this repo on the instance:
  ```
  git clone https://github.com/godatadriven/openllm-starter.git
  ```
  And open the project folder:
  ```
  cd openllm-starter
  ```
- Create a new conda environment:
  ```
  conda create --name {env_name} python=3.10
  ```
- And activate it:
  ```
  conda activate {env_name}
  ```
- Install the requirements:
  ```
  pip install -r requirements.txt
  ```
  Note: There are probably more elegant ways to manage the dependencies, but I've struggled to get conda/poetry to work smoothly with GPUs, so for now this is the easiest way I've found.
- Install some dependencies, depending on which editor you use:

  a. VSCode
  ```
  code --install-extension ms-python.python
  code --install-extension ms-toolsai.jupyter
  code --install-extension github.copilot  # optional
  ```

  b. Jupyter
  ```
  pip install notebook
  python -m ipykernel install --user --name {env_name}  # register this environment's kernel with jupyter
  ```
To inspect the Streamlit app (and Jupyter) in a browser on your local machine, we need to forward the relevant ports from the VM. To do so, run the following on your local machine:

```
gcloud compute ssh {author}-instance --project {project} --zone {zone} -- -L 8501:localhost:8501 -L 8888:localhost:8888
```

You can find the author, project, and zone in `terraform.tfvars` or `variables.tf`.
- (If you're working with Jupyter) Launch Jupyter with:
  ```
  jupyter notebook
  ```
- Open `llms.ipynb`
- Select the `{env_name}` kernel
- Have fun!
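To give an idea of what the notebook does, the core pattern for loading and running an open source LLM looks roughly like this (a minimal sketch using Hugging Face `transformers`; the model name and parameters are illustrative, not necessarily what `llms.ipynb` uses):

```python
from transformers import pipeline

# Load a small open source model onto the GPU (device=0 assumes CUDA is available).
# "gpt2" is just an illustrative placeholder for an open source LLM.
generator = pipeline("text-generation", model="gpt2", device=0)

result = generator("Open source LLMs are", max_new_tokens=50)
print(result[0]["generated_text"])
```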
- Make sure you've forwarded the ports (see above)
- Run the app on the instance:
  ```
  streamlit run app.py
  ```
- Open the app in your browser: http://localhost:8501
- Change the `load_model` and `predict` functions (and more) and have fun!
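For orientation, a hypothetical minimal version of such an app is sketched below (the real `app.py` will differ; the model choice and parameters are placeholders):

```python
import streamlit as st
from transformers import pipeline


@st.cache_resource  # cache so the model is only loaded once per session
def load_model():
    # Placeholder model; swap in whichever open source LLM you like
    return pipeline("text-generation", model="gpt2", device=0)


def predict(model, prompt: str) -> str:
    return model(prompt, max_new_tokens=100)[0]["generated_text"]


st.title("Chat with your LLM")
prompt = st.text_input("Your prompt")
if prompt:
    st.write(predict(load_model(), prompt))
```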
- The instance is configured to stop after 1 hour of inactivity, but you can also stop it manually here: https://console.cloud.google.com/compute/instances?project={project}
- When you want to clean up fully, run (on your local machine):
  ```
  terraform destroy
  ```
  Note: When deleting the instance with `terraform destroy`, your data will be lost, so make sure to save your work before doing so.
While it is becoming increasingly feasible to run your own small (~1 billion parameter) models as described above, more and more tutorials are also becoming available for running and deploying the biggest models of them all.

One example of such a model is falcon-40b, which is among the top 10 performing models on the Open LLM Leaderboard at the time of writing.

To deploy this model in your own (AWS) infrastructure, we recommend following this tutorial: falcon-40b-accelerate.ipynb

Provided you have the required quota for provisioning a g5.12xlarge machine (which you can request if needed), you can deploy this model within 30 minutes by simply running the notebook in a SageMaker instance.
Some useful things to keep in mind:
- deployment takes about 15-30 minutes
- inference takes about 35 seconds, which is not yet ideal
- the g5.12xlarge costs about €5.60 per hour, so make sure to clean up afterwards
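To give a sense of what the deployment boils down to, here is a condensed sketch using the SageMaker Python SDK (this is not the tutorial's exact code; the container, model ID, and settings are assumptions, so follow the linked notebook for the authoritative steps):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes you're inside a SageMaker notebook

# Hugging Face LLM inference container (version resolution is left to the SDK)
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "tiiuae/falcon-40b",  # model from the tutorial
        "SM_NUM_GPUS": "4",                  # the g5.12xlarge has 4 GPUs
    },
)

# Deploying the endpoint is the 15-30 minute step
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)

print(predictor.predict({"inputs": "What is a large language model?"}))

# Clean up the endpoint when you're done; the instance bills per hour
predictor.delete_endpoint()
```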