Efficiently-Serving-LLMs

Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase’s LoRAX framework inference server.


💻 Welcome to the "Efficiently Serving Large Language Models" course! Instructed by Travis Addair, Co-Founder and CTO at Predibase, this course will deepen your understanding of serving LLM applications efficiently.

Course Summary

In this course, you'll delve into the optimization techniques necessary to efficiently serve Large Language Models (LLMs) to a large number of users. Here's what you can expect to learn and experience:

  1. 🤖 Auto-Regressive Models: Understand how auto-regressive large language models generate text token by token (see the KV-cached decoding sketch after this list).

  2. 💻 LLM Inference Stack: Implement foundational elements of a modern LLM inference stack, including KV caching, continuous batching, and model quantization.

  3. 🛠️ LoRA Adapters: Explore the details of how Low Rank Adapters (LoRA) work and how batching techniques allow different LoRA adapters to be served to multiple customers simultaneously.

  4. 🚀 Hands-On Experience: Get hands-on with Predibase’s LoRAX framework inference server to see optimization techniques in action.
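
To make the first two ideas concrete, here is a minimal sketch (not the course notebooks) of greedy, token-by-token decoding with a KV cache, using the Hugging Face `transformers` API; the model name (`gpt2`), prompt, and generation length are illustrative choices only:

```python
# Minimal sketch: greedy, token-by-token decoding with a KV cache.
# Assumes the Hugging Face `transformers` library; GPT-2 is just a small stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
past_key_values = None  # the KV cache: attention keys/values from earlier steps

with torch.no_grad():
    for _ in range(20):
        out = model(input_ids=input_ids,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values  # reuse the cache on the next step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
        input_ids = next_token  # only the new token is fed in the next iteration
        print(tokenizer.decode(next_token[0]), end="", flush=True)
```

Without `past_key_values`, each step would recompute attention over the entire sequence; with the cache, each step only computes keys and values for the single new token, which is the speed-up the course builds on.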

Key Points

  • 🔎 Learn techniques like KV caching to speed up text generation in Large Language Models (LLMs).
  • 💻 Write code to efficiently serve LLM applications to a large number of users while considering performance trade-offs.
  • 🛠️ Explore the fundamentals of Low Rank Adapters (LoRA) and how Predibase implements them in the LoRAX framework inference server (a minimal sketch of the LoRA computation follows this list).
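
The LoRA computation itself is small: a frozen base weight plus a low-rank update. Below is a minimal sketch in plain PyTorch (not LoRAX code); the dimensions, rank, and initialization scale are arbitrary values chosen for illustration:

```python
# Minimal sketch of a LoRA-adapted linear layer: y = x @ (W + B @ A).T
# Plain PyTorch for illustration; dimensions and rank are arbitrary.
import torch

d_in, d_out, rank = 1024, 1024, 8
x = torch.randn(4, d_in)             # a batch of 4 activations

W = torch.randn(d_out, d_in)         # frozen base weight (pretrained)
A = torch.randn(rank, d_in) * 0.01   # trainable low-rank factor (rank x d_in)
B = torch.zeros(d_out, rank)         # zero-initialized, so the adapter starts as a no-op

base = x @ W.T                       # shared base computation
update = (x @ A.T) @ B.T             # two skinny matmuls instead of a full d_out x d_in delta
y = base + update
```

Because the base matmul is identical for every request, a server can batch requests that use different adapters together: only the small per-request (A, B) products differ, which is the core idea behind serving many LoRA adapters from a single base model.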

About the Instructor

🌟 Travis Addair is the Co-Founder and CTO at Predibase, bringing extensive expertise to guide you through efficiently serving Large Language Models (LLMs).

🔗 To enroll in the course or for further information, visit deeplearning.ai.