Run large language models on AWS. The setup is as follows:
- CDK scripts to provision the AWS infrastructure
- The application itself is two-tiered (see the sketch below):
  - A small server that manages requests and dispatches tasks to a machine with a GPU
  - A worker that performs the actual model predictions, intended to run on spot instances
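
Conceptually, the server tier just accepts a prompt and hands it off to the worker tier. The following is a minimal sketch of that idea, not the actual implementation: it assumes a FastAPI-style app and uses an in-memory queue as a stand-in for the real server-to-worker transport; all names (`CompletionRequest`, `task_queue`) are hypothetical.

```python
# Illustrative sketch only: a FastAPI-style request tier that queues
# prompts for a worker. The in-memory queue stands in for whatever
# transport the real system uses between server and GPU worker.
import queue

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
task_queue: "queue.Queue[str]" = queue.Queue()  # hypothetical server->worker channel


class CompletionRequest(BaseModel):
    prompt: str


@app.post("/completions")
def create_completion(req: CompletionRequest):
    # Hand the prompt off to the worker tier rather than running the
    # model here; the server stays small and GPU-free.
    task_queue.put(req.prompt)
    return {"status": "queued"}
```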
To run a local server you will need Docker Compose installed. Then run:

```sh
make run-server
```
Once the server is up, you can submit a sample prompt:

```sh
curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"prompt": "hello, world!"}' \
  localhost:8000/completions
```
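
The equivalent request from Python, using the `requests` library (assuming it is installed; the response shape is whatever the server returns):

```python
# Send the same sample prompt as the curl example above.
import requests

resp = requests.post(
    "http://localhost:8000/completions",
    json={"prompt": "hello, world!"},
)
print(resp.status_code, resp.text)
```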
You can also run

```sh
make run-server-chaotic
```
This will continually ping the server while randomly restarting the worker in the background.
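
In essence, the chaotic mode is a ping loop combined with random worker restarts. Here is a rough Python sketch of that idea, not the Makefile target's actual implementation; it assumes the worker runs as a Docker Compose service named `worker` (hypothetical) and that the server is reachable at `localhost:8000` as above.

```python
# Sketch of a chaos loop: keep pinging the server while occasionally
# restarting the worker to exercise recovery. Stop with Ctrl-C.
import random
import subprocess
import time

import requests

while True:
    try:
        r = requests.post(
            "http://localhost:8000/completions",
            json={"prompt": "hello, world!"},
            timeout=5,
        )
        print("ping:", r.status_code)
    except requests.RequestException as exc:
        print("ping failed:", exc)
    # Roughly one restart in five iterations; the service name is assumed.
    if random.random() < 0.2:
        subprocess.run(["docker", "compose", "restart", "worker"], check=False)
    time.sleep(1)
```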
You can run lint checks with:

```sh
make lint
```