
Distributed Inference for Large Language Models: Embracing Heterogeneous Hardware

Serve Llama 2 (7B/13B/70B) Large Language Models efficiently at scale by leveraging heterogeneous Dell™ PowerEdge™ Rack servers in a distributed manner.

This developer documentation provides a comprehensive guide for serving Llama 2 Large Language Models (LLMs) in a distributed environment, using Dell™ PowerEdge™ Rack servers networked with Broadcom® Ethernet Network Adapters. It aims to give developers the knowledge and tools necessary to serve LLMs at maximum inference performance by creating GPU and CPU clusters. By combining the computational power of Dell™ PowerEdge™ Rack servers, the networking capabilities of Broadcom® Ethernet Network Adapters, and the scalability of Kubernetes, developers can efficiently serve LLMs to meet their application requirements.

Prerequisites

Software Requirements

Before we start, make sure you have the following software components set up on your Dell™ PowerEdge™ Rack servers:

Hardware Requirements

Additionally, ensure you have the following hardware components at your disposal:

The solution was tested with the following hardware stack:
| Server | CPU | RAM | Disk | GPU |
|---|---|---|---|---|
| Dell™ PowerEdge™ XE9680 | Intel® Xeon® Platinum 8480+ | 2 TB | 3 TB | 8x NVIDIA® A100 Tensor Core 80 GB SXM GPUs |
| Dell™ PowerEdge™ XE8545 | AMD EPYC™ 7763 64-Core Processor | 1 TB | 2 TB | 4x NVIDIA® A100 Tensor Core 80 GB SXM GPUs |
| Dell™ PowerEdge™ R760xa | Intel® Xeon® Platinum 8480+ | 1 TB | 1 TB | 4x NVIDIA® H100 Tensor Core 80 GB PCIe GPUs |
| Dell™ PowerEdge™ R7625 | AMD EPYC™ 9354 32-Core Processor | 1.5 TB | 1 TB | 1x AMD Instinct™ MI210 Accelerator GPU |

Deployment Guide

1. Setting up a Distributed Cluster

Estimated Time: 40 mins ⏱️

To set the stage for distributed inferencing, we'll start by configuring the cluster. Using Kubernetes as our orchestration system, we'll establish a cluster comprising a designated server/head node and multiple agent/worker nodes. The flexibility lies in your hands: you can include the head node in the distributed inferencing process, or run an additional worker node on the same machine as the head node to add its compute to the cluster.

To make the accelerators usable by Kubernetes, we'll install the NVIDIA® device plugin and the AMD device plugin for Kubernetes using Helm.

These plugins advertise the GPUs on each node to Kubernetes as schedulable resources, ensuring that our distributed inferencing environment has the acceleration capabilities it needs.

Get started with setting up your distributed cluster by following these steps.
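
Once the cluster is up, a quick way to confirm that the device plugins are advertising GPUs to Kubernetes is to list each node's allocatable resources. This is only an illustrative sketch; it assumes the `kubernetes` Python client is installed and your kubeconfig points at the newly created cluster.

```python
# Minimal sketch: confirm that the NVIDIA/AMD device plugins expose GPUs as
# schedulable Kubernetes resources. Assumes `pip install kubernetes` and a
# kubeconfig for the cluster created above.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = {
        resource: quantity
        for resource, quantity in allocatable.items()
        if resource in ("nvidia.com/gpu", "amd.com/gpu")
    }
    print(f"{node.metadata.name}: {gpus or 'no GPUs advertised'}")
```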

2. KubeRay Installation and Kubernetes Secrets

Estimated Time: 10 mins ⏱️

We'll introduce KubeRay to our Kubernetes cluster and configure a Kubernetes secret for pulling the inference Docker images. This step provides a robust and scalable foundation for our distributed inference.
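
For illustration, the image-pull secret can also be created programmatically with the Kubernetes Python client. This is only a sketch: the registry address, credentials, and secret name below are placeholders, and KubeRay itself is installed separately (for example via its Helm chart).

```python
# Minimal sketch: create an image-pull secret for the inference Docker images.
# The registry, credentials, and secret name are placeholders; adjust to your setup.
import base64
import json

from kubernetes import client, config

config.load_kube_config()

registry = "registry.example.com"             # placeholder private registry
username, password = "myuser", "mypassword"   # placeholder credentials

docker_config = {
    "auths": {
        registry: {
            "username": username,
            "password": password,
            "auth": base64.b64encode(f"{username}:{password}".encode()).decode(),
        }
    }
}

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="inference-registry-secret"),  # placeholder name
    type="kubernetes.io/dockerconfigjson",
    data={
        ".dockerconfigjson": base64.b64encode(
            json.dumps(docker_config).encode()
        ).decode()
    },
)
client.CoreV1Api().create_namespaced_secret(namespace="default", body=secret)
```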

3. Serving the LLM Models

Estimated Time: 1hr ⏱️

We will run distributed inference of the Llama 2 70B model developed by Meta AI. To accomplish this, we use Ray Serve together with inference backends optimized for the available hardware cluster, ensuring an efficient and scalable workflow.

Let's deploy the Llama 2 models on your cluster for distributed inference.
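
The full deployment lives in the linked guide; purely as a rough illustration of the pattern, the sketch below wraps a vLLM engine in a Ray Serve deployment. The model name, GPU counts, and route are assumptions for a single-GPU case, and the actual setup uses the backend that matches your hardware (for example vLLM, Intel BigDL, or Optimum ONNX Runtime, listed under Further Reading).

```python
# Minimal sketch of a Ray Serve deployment backed by vLLM. The model name,
# tensor_parallel_size, and route are illustrative; the guide covers the
# hardware-specific backends and the multi-node 70B configuration.
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(ray_actor_options={"num_gpus": 1})
class LlamaServer:
    def __init__(self):
        # Loads the weights onto the GPU(s) assigned to this replica.
        self.llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
                       tensor_parallel_size=1)
        self.sampling = SamplingParams(temperature=0.7, max_tokens=256)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        outputs = self.llm.generate([prompt], self.sampling)
        return {"generated_text": outputs[0].outputs[0].text}


app = LlamaServer.bind()
# serve.run(app, route_prefix="/llama")  # exposes POST /llama on the cluster
```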

4. Monitoring Cluster and Serving

The Ray Dashboard offers real-time insights into serving. Accessible at http://127.0.0.1:30265, the dashboard's Serve tab lets you monitor the deployed Serve applications and their logs, while the Cluster tab provides a comprehensive overview of device statistics across your cluster, keeping you in control.

The Grafana dashboards provide comprehensive system and cluster metrics, with in-depth insights into Kubernetes cluster-level and node-level metrics, offering a seamless way to monitor and troubleshoot your nodes and deployments.

Dive into monitoring your cluster.
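
If you prefer to check status from a script rather than the browser, the Ray dashboard also exposes a Serve REST API. A rough sketch, assuming the dashboard NodePort above (30265) is reachable and the Ray version in use serves the `/api/serve/applications/` endpoint:

```python
# Minimal sketch: query the Ray dashboard's Serve REST API for application
# status. Assumes the dashboard NodePort from above (30265) is reachable.
import requests

DASHBOARD = "http://127.0.0.1:30265"

resp = requests.get(f"{DASHBOARD}/api/serve/applications/", timeout=10)
resp.raise_for_status()

for name, details in resp.json().get("applications", {}).items():
    print(f"{name}: {details.get('status')}")
```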

5. Running Inference with Chatbot

Use the application endpoint to run a chatbot using Gradio.
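
The repository ships its own chatbot application; purely as an illustration of the idea, a minimal Gradio client that forwards each message to the Serve endpoint might look like the sketch below. The endpoint URL and the JSON request/response fields are assumptions and should be matched to your deployment.

```python
# Minimal sketch of a Gradio chatbot that calls the Serve endpoint. The URL
# and request/response fields are assumptions; match them to your deployment.
import gradio as gr
import requests

ENDPOINT = "http://127.0.0.1:8000/llama"  # assumed Ray Serve HTTP endpoint


def chat(message, history):
    resp = requests.post(ENDPOINT, json={"prompt": message}, timeout=120)
    resp.raise_for_status()
    return resp.json()["generated_text"]


gr.ChatInterface(chat, title="Llama 2 Distributed Inference").launch(
    server_name="0.0.0.0", server_port=7860
)
```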

Further Reading

To gain a comprehensive understanding of every facet involved in serving a Large Language Model (LLM), explore the following resources:

| Resource | Link |
|---|---|
| Llama 2 LLM Model | Meta AI Llama 2 Page, Llama 2 Research Paper |
| Ray Serve | Ray Serve GitHub repo |
| Ray Service | Ray Service Documentation |
| vLLM | vLLM GitHub |
| Intel BigDL | BigDL GitHub |
| Optimum ONNXRuntime | HuggingFace Optimum ONNX Runtime, ONNX Runtime, Optimum ONNXRuntime GitHub |