awesome-cheap-llms

Cost reduction tools and techniques for LLM based systems

πŸ’› Costs of RAG based applications
πŸ’™ Follow Joanna on LinkedIn βž• Follow Magdalena on LinkedIn
🀍 Sign up to DataTalksClub LLM Zoomcamp
⭐ Give this repository a star to support the initiative!



πŸ‘‰ Let’s make sure that your LLM application doesn’t burn a hole in your pocket.
πŸ‘‰ Let’s instead make sure your LLM application generates a positive ROI for you, your company and your users.
πŸ‘‰ A nice side effect of choosing cheaper models over expensive ones: response times are usually shorter!

Techniques to reduce costs


1) πŸ“˜ Choose model family and type

Selecting a suitable model, or a combination of models, based on factors such as speciality, size and benchmark results lays the foundation for developing cost-sensible LLM applications. The aim is to choose a model that fits the complexity of the task: just as you wouldn't drive a BMW 8 Series M to the grocery store, you don't need a high-end LLM for simple tasks.

Papers

Tools & frameworks

Blog posts & courses

2) πŸ“˜ Reduce model size

After choosing a suitable model family, you should consider models with fewer parameters and other techniques that reduce model size:

  • Model parameter size (i.e. 7B, 13B ... 175B)
  • Quantization (= reducing the precision of the model's parameters)
  • Pruning (= removing unnecessary weights, neurons, channels or layers)
  • Knowledge Distillation (= training smaller model that mimics a larger model)

Papers

Tools & frameworks

Blog posts & courses

3) πŸ“˜ Use open source models

Consider self-hosting models instead of using proprietary models if you have capable developers in house. Still, keep the Total Cost of Ownership in view when benchmarking managed LLMs against setting everything up on your own.
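
For instance, a minimal sketch of serving an open-source model yourself with vLLM (one serving option among many; the model name is only an example):

```python
# Minimal sketch: self-hosting an open-source model with vLLM (assumed installed).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(
    ["Summarize in one sentence: why do LLM costs scale with token count?"],
    params,
)
print(outputs[0].outputs[0].text)
```

The GPU, energy and engineering hours behind such a setup are exactly what the Total Cost of Ownership comparison should capture.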

Papers

  • πŸ—£οΈ call-for-contributions πŸ—£οΈ

Tools & frameworks

Blog posts & courses

4) πŸ“˜ Reduce input/output tokens

A key cost driver is the number of input tokens (user prompt + context) and output tokens that you allow for your LLM. Different techniques to reduce the token count help save costs.
Input tokens:

  • Chunking of input documents
  • Compression of input tokens
  • Summarization of input tokens
  • Test viability of zero-shot prompting before adding few-shot examples
  • Experiment with simple, concise prompts before adding verbose explanations and details

Output tokens:

  • Prompting to instruct the LLM how many output tokens are desired
  • Prompting to instruct the LLM to be concise in the answer, adding no explanation text to the expected answer

Papers

  • πŸ—£οΈ call-for-contributions πŸ—£οΈ

Tools & frameworks

  • LLMLingua by Microsoft to compress input prompts
  • πŸ—£οΈ call-for-contributions πŸ—£οΈ

Blog posts & courses

5) πŸ“˜ Prompt and model routing

Send incoming user prompts to a model router (= Python logic + a small language model) that automatically chooses a suitable model to actually answer the question. Follow the Least-Model Principle: by default, use the simplest possible logic or LM to answer a user's question, and only route to more complex LMs if necessary (aka "LLM cascading").
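
A minimal sketch of such a router (the heuristic and model names are illustrative; a production router would use a trained classifier or an SLM):

```python
# Minimal routing sketch: cheap model by default, premium model only if needed.
from openai import OpenAI

client = OpenAI()
CHEAP_MODEL = "gpt-4o-mini"  # example model names, swap in your own
PREMIUM_MODEL = "gpt-4o"

def looks_complex(prompt: str) -> bool:
    # Naive heuristic stand-in for a real classifier/SLM-based router.
    return len(prompt.split()) > 150 or "step by step" in prompt.lower()

def answer(prompt: str) -> str:
    model = PREMIUM_MODEL if looks_complex(prompt) else CHEAP_MODEL
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```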

Tools & frameworks

Blog posts & courses

6) πŸ“˜ Caching

If your users tend to send semantically similar or repetitive prompts to your LLM system, you can reduce costs with different caching techniques. The key lies in a caching strategy that matches not only exact duplicates but also semantically similar prompts, so that you achieve a decent cache hit ratio.
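
A minimal semantic-cache sketch using sentence-transformers and a brute-force in-memory lookup (a production setup would typically use a dedicated tool such as GPTCache or a vector store; the similarity threshold is illustrative):

```python
# Minimal semantic cache: reuse answers for semantically similar prompts.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached answer)

def lookup(prompt: str, threshold: float = 0.9) -> str | None:
    query = encoder.encode(prompt)
    for emb, answer in cache:
        cos = np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb))
        if cos >= threshold:
            return answer  # cache hit: no LLM call, no cost
    return None

def store(prompt: str, answer: str) -> None:
    cache.append((encoder.encode(prompt), answer))
```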

  • πŸ—£οΈ call-for-contributions πŸ—£οΈ

Tools & frameworks

Blog posts & courses

  • πŸ—£οΈ call-for-contributions πŸ—£οΈ

7) πŸ“˜ Rate limiting

Make sure a single customer cannot hammer your LLM and skyrocket your bill. Track the number of prompts per month per user and either hard-limit them to a maximum number of prompts or slow down responses once a user hits the limit. In addition, detect unnatural/sudden spikes in user requests: similar to DDoS attacks, users or competitors can harm your business by sending tons of requests to your model.
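
A minimal in-memory sliding-window limiter in native Python (single process only; a distributed setup would back this with something like Redis; the limits are illustrative):

```python
# Minimal rate limiter: at most MAX_REQUESTS per user per sliding window.
import time
from collections import defaultdict

MAX_REQUESTS = 100     # allowed requests per user...
WINDOW_SECONDS = 3600  # ...within a one-hour sliding window

_requests: dict[str, list[float]] = defaultdict(list)

def allow_request(user_id: str) -> bool:
    now = time.time()
    recent = [t for t in _requests[user_id] if now - t < WINDOW_SECONDS]
    _requests[user_id] = recent
    if len(recent) >= MAX_REQUESTS:
        return False  # reject or queue instead of calling the LLM
    recent.append(now)
    return True
```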

Tools & frameworks

  • Simple tracking and rate limiting logic can be implemented in native Python
  • πŸ—£οΈ call-for-contributions πŸ—£οΈ

Blog posts & courses

8) πŸ“˜ Cost tracking

"You can't improve what you don't measure" --> Make sure to know where your costs are coming from. Is it super active users? Is it a premium model? etc.

Tools & frameworks

  • Simple tracking and cost attribution logic can be implemented in native Python
  • πŸ—£οΈ call-for-contributions πŸ—£οΈ

Blog posts & courses

  • πŸ—£οΈ call-for-contributions πŸ—£οΈ

9) πŸ“˜ During development time

  • Make sure to not send endless API calls to your LLM during development and manual testing.
  • Make sure to not send automated API calls to your LLM via automated CICD workflows, integration tests etc.

Contributions welcome

  • We’re happy to review and accept your Pull Request on LLM cost reduction techniques and tools.
  • We plan to divide the content into subpages to further structure all chapters.