MosaicML makes fine-tuning and pretraining LLMs much easier and more scalable than using open-source libraries on Databricks clusters. This mini-workshop will walk through the process of fine-tuning a foundation model in MosaicML, including the following steps:
- Set up a development environment to use the Databricks-MosaicML integration
- Acquire source data for continued pretraining and fine-tuning tasks
- Use PySpark to process the raw source data into a format amenable to MosaicML
- Use the MosaicML CLI/SDK to carry out the fine-tuning tasks
- Review the model training metrics and sample prompts in MLflow, and the training logs in MCLI
- Provision a model serving endpoint from the custom model that was registered to Unity Catalog
- Evaluate model performance on new prompts
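Before diving in, it helps to see the data shape the data prep step is working toward. The sketch below is a minimal, plain-Python stand-in for the PySpark transformation: instruction fine-tuning data lands as JSONL with `prompt`/`response` keys (per the MosaicML fine-tuning docs; the example records here are invented for illustration).

```python
# Minimal sketch of the JSONL format the MosaicML fine-tuning API accepts
# for instruction fine-tuning: one JSON object per line, with "prompt" and
# "response" keys. (Check the current docs for the exact schema.)
import json

def to_finetune_jsonl(records, path):
    """Write (prompt, response) pairs as one JSON object per line."""
    with open(path, "w") as f:
        for prompt, response in records:
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")

# Illustrative SEC-flavored examples (invented for this sketch)
records = [
    ("What form does a company file for its annual report?", "Form 10-K."),
    ("What form reports quarterly results?", "Form 10-Q."),
]
to_finetune_jsonl(records, "train.jsonl")

# Round-trip to confirm the file parses line by line
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(rows[0]["response"])  # Form 10-K.
```

In the actual workshop, the same shape is produced at scale by the PySpark job in the `feature_transforms` notebook and written to S3 rather than a local file.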
This demo adapts an existing MosaicML end-to-end demo to leverage much of the integrated Databricks-MosaicML stack as of this writing (December 2023).
We recommend completing all of the prerequisites before attempting to run any of the notebooks. Any key resources created in the prerequisites and needed in the workshop will be referenced in the `config` notebook. We will identify them as we go.
If you don't already have an account, request access to the MosaicML playground & fine-tuning API through go/getfinetuning. More details are available in the fine-tuning launch email sent to bricksters.
- Go to the MosaicML console and create an API key
- Create Databricks secrets in e2-dogfood. In the `config` notebook, use this secret scope and key to populate the `mcli_secret_scope` and `mcli_api_key` values
- (Optional, recommended) If you prefer to use a terminal or IDE, follow the Getting Started docs to set up mcli locally
Follow the guidance in the Field Eng Cloud Resources Guide. Because we need to create IAM Roles, we will operate within the aws-sandbox-field-eng AWS account, which provides temporary resources only. Follow the guide's instructions to log into this account and create the following:
- An S3 bucket (`s3_bucket` config value)
- Folders in the bucket to hold training data and model checkpoints (`s3_folder_continued_pretrain_train`, `s3_folder_continued_pretrain_validation`, `s3_folder_checkpoints_cpt`, `s3_folder_checkpoints_ift` config values). The values do not need to be updated from the defaults; just make sure they match the paths of the folders you actually create.
- An IAM User. This user needs full S3 permissions on the bucket you created
- An AWS Access Key for the IAM User. Record these values; you will need them in two places in the demo:
- Set up authentication from MosaicML to your S3 bucket by following the steps in the MosaicML S3 docs. This is most easily done from your local terminal, where you hopefully set up mcli in the previous step.
- Your Databricks cluster will need access to S3, so be sure to add them as secrets to e2-dogfood (the `aws_secret_scope`, `aws_access_key` and `aws_secret_access_key` config values)
MosaicML will need access to a Databricks workspace to log metrics and model checkpoints.
- Create a PAT in e2-dogfood
- Create a Databricks secret in mcli using this PAT
We recommend creating a UC schema (`uc_schema` config) for this workshop. It will hold the training data tables (`uc_table` prefix config) and the registered, fine-tuned LLM. The fine-tuning API logs the final model checkpoint directly to the UC model registry in the `transformers` MLflow model flavor, so it can easily be deployed for optimized model serving.
We recommend creating an MLflow experiment for each combination of training data and model training objective. In this case, that's an experiment for the continued pretraining step (`mlflow_experiment_name_cpt` config) and an experiment for the instruction fine-tuning step (`mlflow_experiment_name_ift` config). The fine-tuning API will log the training run configuration, metrics, and (optionally) sample prompt generations directly to MLflow.
The heavy lifting of LLM fine-tuning will be handled by MosaicML's serverless compute service, MCloud. On the Databricks side, you just need a modest compute cluster to run data prep workloads, and to interact with the model serving endpoint. For this, a small cluster running MLR 13.3 LTS or later will suffice. Only CPUs are needed.
When you provision the cluster, be sure to create environment variables in the cluster configs that give the `boto3` client secure access to the S3 credentials you established previously.
AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}
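To see how those references resolve: the `{{secrets/<scope>/<key>}}` syntax tells Databricks to substitute the secret value into the environment variable at cluster launch, and `boto3` then picks up `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` from the environment automatically, so no credentials appear in notebook code. A tiny sketch of building those references (the scope name `scope` here is a placeholder for your actual `aws_secret_scope` value):

```python
def secret_ref(scope: str, key: str) -> str:
    """Build a Databricks {{secrets/<scope>/<key>}} env-var reference.

    Databricks resolves this template against the secret scope when the
    cluster starts, so the raw value never appears in the config.
    """
    return f"{{{{secrets/{scope}/{key}}}}}"

# Placeholder scope name; substitute your aws_secret_scope config value.
print("AWS_SECRET_ACCESS_KEY=" + secret_ref("scope", "aws_secret_access_key"))
print("AWS_ACCESS_KEY_ID=" + secret_ref("scope", "aws_access_key_id"))
```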
- Complete all the prerequisites, populating the config notebook as you go
- Run the `feature_transforms` notebook
- Run the `finetune` notebook
- (Future) Run the `deployment` notebook
- (Optional) Review the yaml files for continued pretraining and instruction fine-tuning
- (Optional) Compare sample prompts in the AI Playground between the custom model you've deployed and llama2-70b-chat for SEC-related questions
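For orientation before reviewing those yaml files, here is an illustrative sketch of the general shape of an mcli fine-tuning config. All field names and values below are assumptions based on the MosaicML fine-tuning API docs as of this writing; defer to the actual yaml files shipped with the demo.

```yaml
# Illustrative only -- field names are assumptions; use the demo's own yaml files.
model: mosaicml/mpt-7b                        # base model (placeholder)
task_type: INSTRUCTION_FINETUNE               # vs. CONTINUED_PRETRAIN for the CPT step
train_data_path: s3://<s3_bucket>/ift/train/  # placeholder path into your bucket
save_folder: s3://<s3_bucket>/checkpoints/ift/
training_duration: 1ep                        # e.g. one epoch
```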