
Vector Institute Compute Playbook

A starter repository for Vector Institute researchers running high-performance computing workloads on the Bon Echo and Killarney clusters. This playbook covers running machine learning experiments at scale, from basic cluster usage to advanced distributed training workflows.

🚀 What's Inside

This repository provides two main components:

📚 Getting Started Documentation

  • Cluster Introduction: Complete guide to connecting to and using Vector compute resources
  • Slurm Examples: Real-world examples showing how to submit jobs, run distributed training, and use cluster services
  • Migration Guide: Instructions for moving from legacy Bon Echo to the new Killarney cluster

🧪 ML Training Templates

  • Ready-to-run examples for different ML domains (LLM, VLM, MLP, RL)
  • Hydra + Submitit integration for configurable experiments and hyperparameter sweeps
  • Cluster-optimized configs for different hardware setups (A40, A100, H100, L40S)
  • Checkpointing & requeue support for long-running jobs
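As a rough illustration of what checkpointing with requeue support involves, the sketch below saves progress after every step and resumes from the last saved step when the job is restarted. It is a minimal standard-library example, not the templates' actual implementation: the function names, the JSON checkpoint format, and the `stop_after` parameter (which simulates a preemption) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the checkpoint.

    The rename is atomic on POSIX filesystems, so a job killed mid-write
    never leaves a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0}


def train(path, total_steps, stop_after=None):
    """Resume from the last saved step; stop_after simulates a preemption."""
    state = load_checkpoint(path)
    steps_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_run >= stop_after:
            break
        state["step"] += 1  # stand-in for one real training step
        steps_run += 1
        save_checkpoint(path, state)
    return state["step"]
```

Calling `train` a second time with the same checkpoint path picks up where the first call stopped, which is the behaviour a requeued Slurm job relies on.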

๐Ÿƒโ€โ™‚๏ธ Quick Start

1. Prerequisites

  • Access to Vector Institute compute clusters (Bon Echo or Killarney)
  • uv package manager installed

2. Clone and Setup

# Clone the repository
git clone https://github.com/VectorInstitute/vec-playbook.git
cd vec-playbook

# Install dependencies
uv sync

3. Configure Your Account

Edit templates/configs/user.yaml with your Slurm account details:

user:
  slurm:
    account: YOUR_ACCOUNT
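Before submitting a job, it can help to confirm that the placeholder has actually been replaced. The sketch below is a line-based check using only the standard library. It is not a YAML parser and not part of the repository, and the helper names are hypothetical.

```python
import re

PLACEHOLDER = "YOUR_ACCOUNT"


def read_slurm_account(yaml_text):
    """Return the value on the first `account:` line, or None if absent."""
    match = re.search(
        r"^[ \t]*account:[ \t]*(\S+)[ \t]*$", yaml_text, flags=re.MULTILINE
    )
    return match.group(1) if match else None


def account_is_configured(yaml_text):
    """True only if an account is set and is not the template placeholder."""
    account = read_slurm_account(yaml_text)
    return account is not None and account != PLACEHOLDER


# The unedited template from above fails the check.
template = "user:\n  slurm:\n    account: YOUR_ACCOUNT\n"
print(account_is_configured(template))
```

To check the real file, pass in the contents of `templates/configs/user.yaml`, for example via `pathlib.Path(...).read_text()`.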

4. Run Your First Job

# Simple MLP training on Killarney L40S
uv run python -m mlp.single.launch compute=killarney/l40s_1x requeue=off --multirun
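The command above is a module path followed by `key=value` overrides and the `--multirun` flag. For scripted launches, the same command can be assembled in Python. This is a sketch only: `build_launch_command` is a hypothetical helper, and the only overrides it uses are the two shown above (`compute` and `requeue`).

```python
import shlex


def build_launch_command(module, overrides, multirun=True):
    """Assemble the argv list for a template launch via `uv run`."""
    args = ["uv", "run", "python", "-m", module]
    args += [f"{key}={value}" for key, value in overrides.items()]
    if multirun:
        args.append("--multirun")
    return args


command = build_launch_command(
    "mlp.single.launch",
    {"compute": "killarney/l40s_1x", "requeue": "off"},
)
# Prints the same command line as the Quick Start example.
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run(command, check=True)` from the repository root, which avoids shell quoting issues.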

📖 Navigation Guide

For New Users

  1. Start here: Getting Started Documentation - Learn the basics of Vector compute
  2. Try examples: Slurm Examples - Run simple jobs to get familiar
  3. Use templates: Templates - Run ML training experiments

For Experienced Users

🖥️ Supported Hardware

Bon Echo Cluster

  • A40 GPUs: 1x, 4x configurations
  • A100 GPUs: 1x, 4x configurations

Killarney Cluster

  • H100 GPUs: 1x, 8x configurations
  • L40S GPUs: 1x, 2x configurations
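The hardware list above can be encoded as a small lookup table, so a launch script can reject an unsupported request before anything is submitted. This is an illustrative sketch: the table copies the configurations listed above, while `SUPPORTED_GPUS` and `is_supported` are hypothetical names, not part of the repository.

```python
# GPU counts per node configuration, copied from the list above.
SUPPORTED_GPUS = {
    "Bon Echo": {"A40": (1, 4), "A100": (1, 4)},
    "Killarney": {"H100": (1, 8), "L40S": (1, 2)},
}


def is_supported(cluster, gpu, count):
    """Return True if the cluster lists this GPU type with this GPU count."""
    return count in SUPPORTED_GPUS.get(cluster, {}).get(gpu, ())


print(is_supported("Killarney", "L40S", 1))  # listed above
print(is_supported("Bon Echo", "H100", 1))   # H100 is listed for Killarney only
```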

📚 Documentation Structure

vec-playbook/
├── getting-started/           # 📖 Learning resources
│   ├── introduction-to-vector-compute/  # Cluster basics
│   └── slurm-examples/        # 🧪 Hands-on examples
├── templates/                # 🧬 ML training templates
│   ├── src/                  # Template source code
│   └── configs/              # Cluster & experiment configs
└── README.md                 # This file

🤝 Contributing

We welcome contributions, including:

  • New training templates
  • Additional cluster configurations
  • Documentation improvements
  • Bug fixes