This repo automates the setup of Ray on Snowpark Container Services (SPCS). It also deploys the Vicuna 13B model, which has a context length of 16K tokens, as a Ray Serve API on SPCS. Inference with this model needs more GPU memory than easily available GPU infrastructure provides, so this setup spreads the work across 8 GPUs. Those 8 GPUs do not come from a single expensive GPU_10 node. In fact, two smaller GPU_7 nodes, each with 4 GPUs, form a Ray cluster connected to a single GPU_3 Ray head node. This way, we can scale out with smaller GPU instances whenever we have high GPU memory needs, while staying cost-efficient. This distributed setup costs half as much as a single GPU_10.
In the GIF below, the 6340-word prompt from `sample_prompt.txt` is fed into the Ray Serve API deployed on SPCS, with 8 GPUs working in parallel. As a reminder, this model accepts up to 16K tokens, and this setup processes the prompt on multiple GPUs in parallel.
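As a rough sanity check that a 6340-word prompt fits in a 16K-token context, the sketch below estimates the token count from the word count. The 1.3 tokens-per-word ratio is an assumed rule of thumb, not the output of the Vicuna tokenizer, and "16K" is read here as 16,384 tokens; treat the result as an order-of-magnitude estimate only.

```python
CONTEXT_LIMIT_TOKENS = 16 * 1024  # "16K" read as 16,384 tokens (assumption)
TOKENS_PER_WORD = 1.3             # rough rule of thumb, NOT the Vicuna tokenizer


def estimate_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Approximate the token count of a prompt from its word count."""
    return round(word_count * tokens_per_word)


prompt_tokens = estimate_tokens(6340)  # word count quoted for sample_prompt.txt
fits = prompt_tokens <= CONTEXT_LIMIT_TOKENS
print(f"~{prompt_tokens} tokens, fits in context: {fits}")
```

For an exact count, the prompt would need to be run through the model's own tokenizer.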
The following screenshot shows a static image of the GPUs in action.
The Ray Serve API on SPCS is shown below.
The Ray open source community and the managed Ray offering, Anyscale, have published a plethora of blog posts on why Ray makes sense for distributing workloads. The following table highlights a few of the posts that made me fall in love with Ray, and motivated me to bring Ray into SPCS.
Area | Topic | Context |
---|---|---|
Data Engineering | Modin with Ray | On Oct 23, Snowflake announced the intent to acquire Ponder, which will boost Python capabilities in the Data Cloud. As mentioned in the Snowflake blog, Ponder maintains Modin, a widely-used open-source library for scalable Pandas operations. Modin is able to leverage Ray on top of SPCS to distribute pandas-based operations. For more details on this topic, see here. I tested open source Modin on Ray within SPCS and saw nice performance improvements compared to pandas, but that’s a blog topic for another day. |
AI / ML | Deep Learning batch inference | In this post from the Anyscale team, deep learning batch inference was benchmarked on Ray vs Spark (Databricks runtime v12.0, with Apache Spark 3.3.1). Ray outperformed Spark by 2x in a Spark single cluster setup, and 3.2x in a Spark multi-cluster setup. |
AI / ML | Distributed Model Training and Hyperparameter Tuning | Ray enables data scientists to perform distributed model training and hyperparameter tuning. For more details, see here and here. |
LLM | Serving LLM APIs on Ray Serve | This post shows how LLM inference can be 23x faster, with reduced p50 latency, by combining continuous batching in Ray Serve with vLLM. |
LLM | Parallel fine-tuning of LLMs | With Ray TorchTrainer, LLM practitioners can perform parallel fine-tuning (full-parameter or LoRA) on open source LLMs. See here for an example where Llama-2 fine-tuning (7B, 13B or 70B) is demonstrated using TorchTrainer and the DeepSpeed ZeRO-3 strategy. |
General | Observability: Ray Dashboards | Ray provides a very nice dashboard for different types of views, such as monitoring resource utilization, monitoring job status, logs, and error messages for tasks and actors, as well as monitoring Ray Serve applications. For more details, see here. |
Use cases | Companies moving to Ray | There are public sessions that talk about how Amazon performed an exabyte-scale migration from Spark to Ray, and how Instacart is building their ML platform as well as scaling ML fulfillment on Ray |
Once this setup is complete, we can interact with the Vicuna 13B (16K) model in one of two ways.
- Streamlit in SPCS app: Go through the instructions below to set it up. This Streamlit app also features a Streamlit feedback component so that users can provide feedback on the LLM output. The feedback is stored in a Snowflake table, and the results can be seen in the Model Monitoring table within the Streamlit UI.
- Alternatively, you can interact with the model through a notebook within the SPCS setup. Refer to `ui/notebooks/test_local_vicuna13b_16ktokens_chat.ipynb`.
- Docker
- SnowSQL. Installation instructions are here: https://docs.snowflake.com/en/user-guide/snowsql-install-config. After installation, check that you can run `snowsql -v` in a new terminal. If that command doesn't work, the terminal cannot find the installed snowsql. In that case, add an alias to snowsql in `~/.bash_profile` and run `source ~/.bash_profile` before going ahead with the steps below.
- Access to Snowpark Container Services in Private Preview. Note that you must have the ability to create one GPU_3 compute pool with 1 node and one GPU_7 compute pool with 2 nodes.
Note that this setup has been tested on MacOS Ventura 13.6.
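The prerequisite check above can also be scripted. The following is a minimal sketch, not part of the repo: `missing_tools` is a hypothetical helper that only looks for executables on `PATH`. A shell alias defined in `~/.bash_profile` is invisible to it, so a tool reachable only through an alias will be reported as missing.

```python
import shutil


def missing_tools(tools=("docker", "snowsql")):
    """Return the names in `tools` that cannot be found as executables on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("docker and snowsql are both on PATH")
```

Running it in a fresh terminal mirrors the `snowsql -v` check: an empty result means both tools resolve without relying on shell-specific configuration.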
- Execute these statements in Snowsight or the Visual Studio Code extension for Snowflake. Change to your database and schema but don't change anything else.

      create database if not exists MYDB;
      use database MYDB;
      create schema if not exists vicuna13bonrayserve;
      use schema vicuna13bonrayserve;
      create stage if not exists SPEC_STAGE;
      create image repository if not exists LLM_REPO;
      SHOW IMAGE REPOSITORIES IN SCHEMA;

  Note down the result of the last statement. Using the accountadmin role, execute the following statement:

      CREATE SECURITY INTEGRATION snowservices_ingress_oauth TYPE=oauth OAUTH_CLIENT=snowservices_ingress ENABLED=true;

- Set up snowsql and give the connection a name. In my case, I added the following block to `~/.snowsql/config`, with the connection name `fcto`.

      [connections.fcto]
      accountname = XXX
      username = XXX
      password = XXX
      warehouse = XXX
      dbname = XXX
      schemaname = XXX
      rolename = XXX

  NOTE: DO NOT USE credentials for the `ACCOUNTADMIN`, `SECURITYADMIN`, and `ORGADMIN` roles. As noted here, Snowpark Container Services use Snowflake OAuth for ingress. In the current implementation, privileged roles (including ACCOUNTADMIN, SECURITYADMIN, and ORGADMIN) cannot be used with OAuth.
- Update `REGISTRY_URL_BASE` in `bin/do_login.sh`. Once updated, run `sh bin/do_login.sh` to log in to Docker.
- Update the following variables in `configure_project.sh`.

      #these variables definitely need to be changed
      repository_url="myaccount.registry.snowflakecomputing.com/mydb/vicuna13bonrayserve/llm_repo"
      database="mydb"
      schema="vicuna13bonrayserve"
      spec_stage="spec_stage"
      hf_token="X"
      snowsql_connection_name=fcto
      #these variables are good enough for the Vicuna model on Ray Serve in SPCS. No need to change
      num_ray_workers=0
      num_additional_special_ray_workers_for_ray_serve=2
      ray_head_node_type=GPU_3
      ray_worker_node_type=NA
      special_ray_worker_for_ray_serve_node_type=GPU_7
      default_compute_pool_keep_alive_secs=120
      ray_head_compute_pool_name=VICUNA13B_RAY_HEAD_POOL
      ray_worker_compute_pool_name=NA
      rayserve_compute_pool_name=VICUNA13B_RAY_SERVE_POOL
      streamlit_feedback_table_name=ST_FEEDBACK
      job_manifest_file=ray_serve_vllm_vicuna13b_manifest_v27.yaml

- Make `configure_project.sh` executable by running `chmod +x configure_project.sh`.
- There are seven options for running `configure_project.sh`: `action=update_variables`, `action=reset_variables`, `action=deploy_all`, `action=drop_all`, `action=drop_services`, `action=deploy_streamlit` and `action=get_endpoints`. Follow this sequence:
  - Replace placeholder values in all the spec files, config files and makefiles by running `./configure_project.sh --action=update_variables`. Just FYI: you can also reset the variables to placeholder values in all the spec files, config files and makefiles by running `./configure_project.sh --action=reset_variables`.

    Note that `./configure_project.sh --action=update_variables` must be run before deploying.
  - To deploy everything, including the compute pools, services, volumes, jobs and the Streamlit feedback table, run `./configure_project.sh --action=deploy_all`. After starting up Ray, the code deploys the Vicuna 13B model on Ray Serve. This is a synchronous job, so the `configure_project.sh` execution will appear to hang when it actually has not (it takes around 10 minutes to fully deploy the Vicuna 13B model on Ray Serve). To see its status, either:
    - Navigate to Snowsight Query History and look for the query `EXECUTE SERVICE IN COMPUTE POOL VICUNA13B_RAY_HEAD_POOL`
    - Or execute the commands in `get_job_status.sql`.
  - Once the script execution finishes, it prints the URLs for the Ray head node (Ray dashboard, Jupyter notebook, Grafana, Prometheus and Ray Serve API) as well as the URL for the Streamlit app. Browse to those URLs. These URLs are public but authenticated by the user's Snowflake username/password. Alternatively, you can get these URLs by running the following in a Snowflake worksheet: `call get_service_public_endpoints('<your_database>', '<your_schema>', 'streamlit');` for the Streamlit endpoint and `call get_service_public_endpoints('<your_database>', '<your_schema>', 'spcs_ray_custom_head_service');` for the Ray dashboard/Jupyter notebook/Grafana/Prometheus/Ray Serve API endpoints.
    - To access the Ray Serve API inside the SPCS Ray cluster, open the `notebook` URL from the terminal output in a browser and upload the notebook `ui/notebooks/test_local_vicuna13b_16ktokens_chat.ipynb` to the `/home/snowflake` location within Jupyter. The default Jupyter password is admin.
    - To access the Streamlit app, open the `streamlit` URL from the terminal output in a browser and interact with the model directly. This Streamlit app does not have any existing prompt.
    - The Grafana dashboard will be available at `<https://GRAFANA_PUBLIC_URI>/d/rayDefaultDashboard/?var-datasource=Prometheus`. The default username/password for Grafana is admin/admin. The first time you open that URL, you will see an error saying you need the dashboard:read permission. Log in on the right with admin/admin, and you will then be able to see the dashboard. See that dashboard below.
  - To tear down everything, including the compute pools, services, volumes, and the Streamlit feedback table, run `./configure_project.sh --action=drop_all`.
  - Alternatively, to tear down just the services while keeping the compute pools intact, run `./configure_project.sh --action=drop_services`. Note that the compute pools will then be suspended after 2 minutes, which is the configured time after which a compute pool shuts down automatically when no service is active on it.
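The action sequence in the steps above could be driven from Python, for example to catch a mistyped action name before anything is deployed. This is only a sketch: `build_command` and `DEPLOY_SEQUENCE` are hypothetical names introduced here, the list of actions is copied from this README, and the script itself is not inspected or run.

```python
# Actions this README lists for configure_project.sh.
VALID_ACTIONS = (
    "update_variables",
    "reset_variables",
    "deploy_all",
    "drop_all",
    "drop_services",
    "deploy_streamlit",
    "get_endpoints",
)


def build_command(action: str, script: str = "./configure_project.sh") -> list:
    """Return the argv for one configure_project.sh action, rejecting unknown names."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action {action!r}; expected one of {VALID_ACTIONS}")
    return [script, f"--action={action}"]


# Order described in the README: fill in placeholders first, then deploy.
DEPLOY_SEQUENCE = [build_command(a) for a in ("update_variables", "deploy_all")]

if __name__ == "__main__":
    for argv in DEPLOY_SEQUENCE:
        # Printing only; pass argv to subprocess.run(argv, check=True) to execute.
        print(" ".join(argv))
```

Keeping the commands as argument lists rather than shell strings avoids quoting issues if they are later handed to `subprocess.run`.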
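The warning in the setup steps about privileged roles can be checked before deploying. The sketch below parses snowsql-style config text with the standard `configparser` module and reports whether the named connection uses one of the three roles this README says cannot be used with OAuth. `connection_role_is_allowed` is a hypothetical helper, and `SPCS_ROLE` in the sample is a made-up role name used only for illustration.

```python
import configparser

# Roles this README says cannot be used with Snowflake OAuth ingress.
PRIVILEGED_ROLES = {"ACCOUNTADMIN", "SECURITYADMIN", "ORGADMIN"}


def connection_role_is_allowed(config_text: str, connection_name: str) -> bool:
    """Return True if the connection's rolename is not a privileged role.

    Raises configparser.NoSectionError / NoOptionError if the connection
    or its rolename entry is missing.
    """
    # interpolation=None so that '%' characters in passwords are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    role = parser.get(f"connections.{connection_name}", "rolename")
    return role.strip().strip('"').upper() not in PRIVILEGED_ROLES


sample = """
[connections.fcto]
accountname = XXX
username = XXX
rolename = SPCS_ROLE
"""
print(connection_role_is_allowed(sample, "fcto"))
```

To check a real setup, the text of `~/.snowsql/config` can be read with `pathlib.Path.home() / ".snowsql" / "config"` and passed in place of `sample`.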
Here are a few things that might happen during this setup. If they do, please follow the solutions below. In case your topic is not covered, see the next section on where to get help.
- `No space left on device`
  - The Docker images in this repo are nvidia/cuda:11.8.0-devel-ubuntu22.04 images, which are quite large. If a "no space left on device" error occurs (another indication is a failure at `RUN pip3 install -r usecase_requirements.txt` during docker build), you must increase the available disk space in Docker Desktop. In Docker Desktop, go to Settings -> Resources -> Virtual Disk Limit.
- `Requested number of nodes X exceeds the node limit for the account`
  - This might indicate that your account doesn't have enough GPU_3 and GPU_7 nodes assigned. You just need 1 GPU_3 compute pool and 1 GPU_7 compute pool with 2 instances. Please reach out to me for help and I will get in touch with the responsible folks.
- `Could not connect to Snowflake backend, aborting`
  - This means that you might be having internet connectivity issues, or snowsql is not properly configured.
- `Failed reading *.yaml in stage`
  - This might mean that the Docker images were not built and pushed correctly. Please retry by running `./configure_project.sh --action=deploy_all`.
- `SQL compilation error: An active warehouse is required for creating Python Stored Procedure`
  - This might mean that the warehouse name specified in the snowsql config connection is incorrect, or that the user does not have a default warehouse assigned. Please configure snowsql properly or provide a default warehouse for the Snowflake user.
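The troubleshooting list above can be turned into a small lookup for scanning captured terminal output. This is an illustrative sketch only: `KNOWN_ISSUES` and `diagnose` are hypothetical names, the matching is plain case-insensitive substring search, and the hints are condensed from the entries in this README.

```python
# Error fragment -> hint, both condensed from the troubleshooting list in this README.
KNOWN_ISSUES = {
    "No space left on device":
        "Increase Docker Desktop's Virtual Disk Limit (Settings -> Resources).",
    "exceeds the node limit for the account":
        "The account may lack GPU_3/GPU_7 capacity (1 GPU_3 pool, 1 GPU_7 pool with 2 instances).",
    "Could not connect to Snowflake backend":
        "Check internet connectivity and the snowsql configuration.",
    "Failed reading":
        "Images may not have been built and pushed; rerun ./configure_project.sh --action=deploy_all.",
    "An active warehouse is required":
        "Fix the warehouse in the snowsql connection or assign the user a default warehouse.",
}


def diagnose(log_text: str) -> list:
    """Return the hints whose error fragment appears in `log_text` (case-insensitive)."""
    lowered = log_text.lower()
    return [hint for fragment, hint in KNOWN_ISSUES.items() if fragment.lower() in lowered]


if __name__ == "__main__":
    for hint in diagnose("ERROR: write /var/lib/docker: no space left on device"):
        print(hint)
```

Output that matches none of the fragments returns an empty list, which is the cue to use the contact details in the next section.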
This repo automates the Ray on SPCS setup. If you come across any issues, please reach out to puneet.lakhanpal@snowflake.com. I would love to hear any feedback on how this experience can be further improved. I will be pushing this repo to Git, so please create an issue there if you come across any problems.