A demo of how LLM performance is impacted by the use of GPUs. Based on an example from modal-examples.
The project was developed using PDM, which is the recommended way to get the project set up. See below for how to install PDM.
Make sure you have Python 3.9 installed.
```shell
git clone git@github.com:murilo-cunha/latency.git
cd latency/
pdm install
```
PDM will take care of installing the dependencies and creating the virtual environment.
Alternatively, we've exported a `requirements.txt` file for convenience. Using Python's built-in `venv`:
```shell
git clone git@github.com:murilo-cunha/latency.git
cd latency/
python --version  # make sure you are using Python 3.9
python -m venv .venv
source .venv/bin/activate  # alternatively, `.venv\Scripts\activate.bat` if you're on Windows
pip install -r requirements.txt
```
If you have set up using PDM:
```shell
pdm run local   # local execution
pdm run remote  # cloud/remote execution
```
If using `venv`:
```shell
source .venv/bin/activate    # alternatively, `.venv\Scripts\activate.bat` if you're on Windows
python scripts/local.py      # local execution
modal run scripts/remote.py  # cloud/remote execution
```
The easiest way to install PDM is with pipx. See the pipx docs for more info about pipx, how to install it, and other installation methods.
```shell
pipx install pdm==2.5.3
```
```
.
├── .gitignore
├── .models/
├── .pre-commit-config.yaml
├── README.md
├── common/
│   ├── __init__.py
│   └── utils.py
├── pdm.lock
├── pyproject.toml
└── scripts/
    ├── local.py
    └── remote.py
```
Directories:

- `.models/` - an (empty) directory to hold LLM models
- `common/` - a local Python package with the common building blocks used for both local and remote execution
- `scripts/` - entrypoint scripts; the demos are run from these
Files:
.gitignore
- list of patterns to not be committed to the repo.pre-commit-config.yaml
- list of pre-commit hooksREADME.md
- general project informationpdm.lock
- PDM's lockfile with dependency tree and versionspyproject.toml
- definition of project; includes dependencies, configuration and scripts
The demo uses Modal for GPU access. Modal is a simple cloud service for creating serverless applications.
Signing up for Modal is as easy as linking your GitHub account. After a few hours you should receive a notification that your account has been created. Once the account is created, create a token using:

```shell
modal token new
```
Modal is free to sign up for, with no credit card required. Once signed up, it offers a free tier of $30/month. The current month's usage is transparently shown under Settings > Usage and Billing.
It also provides a Python-only API to define your dependencies, including container dependencies, replacing the need for Dockerfiles, etc. It offers different kinds of serverless applications, including functions, schedulers, and REST APIs. Check out their docs and examples to see more.
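For illustration, here is a minimal sketch of how container dependencies and a GPU function can be declared in Python with Modal. The image contents, GPU type, and names below are assumptions for this sketch, not taken from this repo, and older Modal versions use `modal.Stub` instead of `modal.App`:

```python
import modal

# Container dependencies are declared in Python, replacing a Dockerfile.
# The packages here are illustrative assumptions, not this repo's actual deps.
image = modal.Image.debian_slim().pip_install("transformers", "torch")

app = modal.App("latency-sketch", image=image)  # `modal.Stub` on older Modal versions


@app.function(gpu="T4")  # request a GPU for this serverless function
def generate(prompt: str) -> str:
    # Model loading and inference would go here.
    ...
```

Running `modal run` on a script like this would execute the function in a container built from that image, with the requested GPU attached.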
If you've run the demo and would like to get your hands dirty, I'd recommend:
- [ ] Go over Modal's starting examples to better understand how it works
- [ ] Modify the cloud demo to create a FastAPI REST endpoint for a true "latency in model serving" experience (via a REST API)
  - See the original example for more info
  - See the web endpoints documentation
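As a starting point for that exercise, a hedged sketch of what a Modal web endpoint can look like. The decorator name has changed across Modal versions (e.g. `modal.web_endpoint` vs. `modal.fastapi_endpoint`), so check the web endpoints docs; the app name and function body are placeholders:

```python
import modal

app = modal.App("llm-endpoint-sketch")


@app.function()
@modal.web_endpoint(method="POST")  # decorator name varies by Modal version; see their docs
def complete(item: dict) -> dict:
    # A real implementation would run the LLM on item["prompt"] here.
    return {"prompt": item["prompt"], "completion": "..."}
```

Deploying an app like this gives you a public URL that accepts POST requests with a JSON body.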
You should be able to call your REST application by running:
```shell
# specify your endpoint in $MODEL_APP_ENDPOINT
curl $MODEL_APP_ENDPOINT \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Generate a list of 20 great names for sentient cheesecakes that teach SQL",
    "stream": false,
    "max_tokens": 64
  }'
```
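The same call can also be made from Python using only the standard library, e.g. with `urllib`. The default URL below is a placeholder; substitute your deployed endpoint:

```python
import json
import os
import urllib.request

# Placeholder URL; set MODEL_APP_ENDPOINT to your deployed endpoint.
ENDPOINT = os.environ.get("MODEL_APP_ENDPOINT", "https://example.modal.run")

payload = {
    "prompt": "Generate a list of 20 great names for sentient cheesecakes that teach SQL",
    "stream": False,
    "max_tokens": 64,
}


def call_endpoint(url: str, body: dict) -> dict:
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    print(call_endpoint(ENDPOINT, payload))
```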