AI Hypercomputer
Reference implementations, benchmarks, recipes, and all things Google Cloud AI Hypercomputer
Pinned Repositories
gpu-recipes
Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.
JetStream
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future; PRs welcome).
jetstream-pytorch
PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference
maxdiffusion
maxtext
A simple, performant, and scalable JAX LLM!
ml-goodput-measurement
RecML
torchprime
torchprime is a reference model implementation for PyTorch on TPU.
tpu-recipes
xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
AI Hypercomputer's Repositories
AI-Hypercomputer/maxtext
A simple, performant, and scalable JAX LLM!
AI-Hypercomputer/JetStream
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future; PRs welcome).
AI-Hypercomputer/maxdiffusion
AI-Hypercomputer/RecML
AI-Hypercomputer/xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
AI-Hypercomputer/gpu-recipes
Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.
AI-Hypercomputer/jetstream-pytorch
PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference
AI-Hypercomputer/tpu-recipes
AI-Hypercomputer/torchprime
torchprime is a reference model implementation for PyTorch on TPU.
AI-Hypercomputer/ml-goodput-measurement
AI-Hypercomputer/cloud-accelerator-diagnostics
AI-Hypercomputer/pathways-utils
Package of Pathways-on-Cloud utilities
AI-Hypercomputer/inference-benchmark
AI-Hypercomputer/ray-tpu
AI-Hypercomputer/kithara
AI-Hypercomputer/cloud-tpu-monitoring-debugging
AI-Hypercomputer/cloud-diagnostics-xprof
AI-Hypercomputer/resiliency
This project provides resilient training logic to improve the training efficiency of large-scale distributed workloads on Google Cloud.
AI-Hypercomputer/accelerator-microbenchmarks
AI-Hypercomputer/google-cloud-mldiagnostics
AI-Hypercomputer/aotc