MosaicML makes fine-tuning and pretraining LLMs much easier and more scalable than using open-source libraries on Databricks clusters. This mini-workshop will walk through the process of fine-tuning a foundation model in MosaicML, including the following steps:
- Set up a development environment to use the Databricks-MosaicML integration
- Acquire source data for continued pretraining and fine-tuning tasks
- Use PySpark to process the raw source data into a format amenable to MosaicML
- Use the MosaicML CLI/SDK to carry out the fine-tuning tasks
- Review the model training metrics and sample prompts in MLflow, and the training logs in MCLI
- Provision a model serving endpoint from the custom model that was registered to Unity Catalog
- Evaluate model performance on new prompts
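Before diving in, it helps to see the data shape the data prep step is working toward. The sketch below is a minimal, plain-Python stand-in for the PySpark transformation: instruction fine-tuning data lands as JSONL with `prompt`/`response` keys (per the MosaicML fine-tuning docs; the example records here are invented for illustration).

```python
# Minimal sketch of the JSONL format the MosaicML fine-tuning API accepts
# for instruction fine-tuning: one JSON object per line, with "prompt" and
# "response" keys. (Check the current docs for the exact schema.)
import json

def to_finetune_jsonl(records, path):
    """Write (prompt, response) pairs as one JSON object per line."""
    with open(path, "w") as f:
        for prompt, response in records:
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")

# Illustrative SEC-flavored examples (invented for this sketch)
records = [
    ("What form does a company file for its annual report?", "Form 10-K."),
    ("What form reports quarterly results?", "Form 10-Q."),
]
to_finetune_jsonl(records, "train.jsonl")

# Round-trip to confirm the file parses line by line
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(rows[0]["response"])  # Form 10-K.
```

In the actual workshop, the same shape is produced at scale by the PySpark job in the `feature_transforms` notebook and written to S3 rather than a local file.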
This demo adapts an existing MosaicML end-to-end demo to leverage much of the integrated Databricks-MosaicML stack as of this writing (December 2023).
We recommend completing all of the prerequisites before attempting to run any of the notebooks. Any key resources created in the prerequisites and needed in the workshop will be referenced in the `config` notebook. We will identify them as we go.
If you don't already have an account, request access to the MosaicML playground & fine-tuning API through go/getfinetuning. More details are available in the fine-tuning launch email sent to bricksters.
- Go to the MosaicML console and create an API key
- Create Databricks secrets in e2-dogfood. In the `config` notebook, use this secret scope and key to populate the `mcli_secret_scope` and `mcli_api_key` values
- (Optional, recommended) If you prefer to use a terminal or IDE, follow the Getting Started docs to set up mcli locally
Follow the guidance in the Field Eng Cloud Resources Guide. Because we need to create IAM Roles, we will operate within the aws-sandbox-field-eng AWS account, which provides temporary resources only. Follow the guide's instructions to log into this account and create the following:
- An S3 bucket (`s3_bucket` config value)
- Folders in the bucket to hold training data and model checkpoints (`s3_folder_continued_pretrain_train`, `s3_folder_continued_pretrain_validation`, `s3_folder_checkpoints_cpt`, `s3_folder_checkpoints_ift` config values). The values do not need to be updated from the defaults; just make sure they match the paths of the folders you actually create.
- An IAM User. This user needs full S3 permissions on the bucket you created
- An AWS Access Key for the IAM User. Record these values; you will need them in two places in the demo:
- Set up authentication from MosaicML to your S3 bucket by following the steps in the MosaicML S3 docs. This is most easily done from your local terminal, where you hopefully set up mcli in the previous step.
- Your Databricks cluster will need access to S3, so be sure to add them as secrets to e2-dogfood (the `aws_secret_scope`, `aws_access_key` and `aws_secret_access_key` config values)
MosaicML will need access to a Databricks workspace to log metrics and model checkpoints.
- Create a PAT in e2-dogfood
- Create a Databricks secret in mcli using this PAT
We recommend creating a UC schema (`uc_schema` config) for this workshop. It will hold the training data tables (`uc_table` prefix config) and the registered, fine-tuned LLM. The fine-tuning API logs the final model checkpoint directly to the UC model registry in the `transformers` MLflow model flavor, so it can easily be deployed for optimized model serving.
We recommend creating an MLflow experiment for each combination of training data and model training objective. In this case, that's an experiment for the continued pretraining step (`mlflow_experiment_name_cpt` config) and an experiment for the instruction fine-tuning step (`mlflow_experiment_name_ift` config). The fine-tuning API will log the training run configuration, metrics, and (optionally) sample prompt generations directly to MLflow.
The heavy lifting of LLM fine-tuning will be handled by MosaicML's serverless compute service, MCloud. On the Databricks side, you just need a modest compute cluster to run data prep workloads, and to interact with the model serving endpoint. For this, a small cluster running MLR 13.3 LTS or later will suffice. Only CPUs are needed.
When you provision the cluster, be sure to create environment variables in the cluster configs that give the `boto3` client secure access to the S3 credentials you established previously.
AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}
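To see how those references resolve: the `{{secrets/<scope>/<key>}}` syntax tells Databricks to substitute the secret value into the environment variable at cluster launch, and `boto3` then picks up `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` from the environment automatically, so no credentials appear in notebook code. A tiny sketch of building those references (the scope name `scope` here is a placeholder for your actual `aws_secret_scope` value):

```python
def secret_ref(scope: str, key: str) -> str:
    """Build a Databricks {{secrets/<scope>/<key>}} env-var reference.

    Databricks resolves this template against the secret scope when the
    cluster starts, so the raw value never appears in the config.
    """
    return f"{{{{secrets/{scope}/{key}}}}}"

# Placeholder scope name; substitute your aws_secret_scope config value.
print("AWS_SECRET_ACCESS_KEY=" + secret_ref("scope", "aws_secret_access_key"))
print("AWS_ACCESS_KEY_ID=" + secret_ref("scope", "aws_access_key_id"))
```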
- Complete all the prerequisites, populating the config notebook as you go
- Run the `feature_transforms` notebook
- Run the `finetune` notebook
- (Future) Run the `deployment` notebook
- (Optional) Review the yaml files for continued pretraining and instruction fine-tuning
- (Optional) Compare sample prompts in the AI Playground between the custom model you've deployed and llama2-70b-chat for SEC-related questions
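For orientation before reviewing those yaml files, here is an illustrative sketch of the general shape of an mcli fine-tuning config. All field names and values below are assumptions based on the MosaicML fine-tuning API docs as of this writing; defer to the actual yaml files shipped with the demo.

```yaml
# Illustrative only -- field names are assumptions; use the demo's own yaml files.
model: mosaicml/mpt-7b                        # base model (placeholder)
task_type: INSTRUCTION_FINETUNE               # vs. CONTINUED_PRETRAIN for the CPT step
train_data_path: s3://<s3_bucket>/ift/train/  # placeholder path into your bucket
save_folder: s3://<s3_bucket>/checkpoints/ift/
training_duration: 1ep                        # e.g. one epoch
```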