Recently, natural language processing has advanced rapidly, especially through pre-trained large language models (LLMs). I believe the applications of LLMs excite every one of you, as they do me. So in this post I want to walk through how LLMs are applied in practice and build up a technology stack of our own.
It is common to run into the awkward scenario where we don't have enough GPU memory to deploy an LLM; without deployment, we cannot bring LLMs into our own domain for inference. So I first introduce some quantization techniques and serving tools that relieve GPU pressure, as follows:
- Quantization
Perhaps you might be interested in Dan Alistarh, one of the authors of GPTQ. GPTQ has ...... For more details, you can refer to Dan Alistarh's work.
- Transformers library https://huggingface.co/blog/llama2#using-transformers
- vLLM
- TGI
- Text generation web UI
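Before reaching for a full toolkit like GPTQ, the core idea behind quantization is easy to see in a few lines: map float weights onto a small integer grid and keep a scale factor to map back. Below is a minimal NumPy sketch of symmetric int8 round-to-nearest quantization; it is illustrative only, and real methods such as GPTQ are considerably more sophisticated:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Quantize a random "weight matrix" and measure the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()
print(f"int8 storage is 4x smaller than fp32; max abs error = {max_err:.4f}")
```

Each int8 weight takes 1 byte instead of 4, which is exactly the kind of saving that lets a model fit on a smaller GPU, at the cost of a bounded rounding error (at most half the scale per weight).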
First, I recommend some straightforward leaderboards that track the performance of existing LLMs, as follows:
- English
  - open-llm-leaderboard
  - alpaca
- Chinese
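For intuition on how Alpaca-style leaderboards score models: many of them reduce to a win rate over pairwise comparisons against a reference model. The helper below is a hypothetical illustration (not the actual leaderboard code; the function name and the tie-counts-as-half convention are my own assumptions):

```python
from collections import Counter

def win_rate(judgments):
    """judgments: list of 'win'/'loss'/'tie' outcomes for a candidate
    model judged against a reference model; ties count as half a win."""
    counts = Counter(judgments)
    n = len(judgments)
    return (counts["win"] + 0.5 * counts["tie"]) / n

# 2 wins + half credit for 1 tie over 4 comparisons -> 0.625
print(win_rate(["win", "win", "tie", "loss"]))
```

A single number like this is convenient for ranking, but it hides which kinds of prompts a model wins or loses on, so leaderboard positions should be read with that caveat.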
Some frameworks and papers to explore:
- Megatron
- DeepSpeed
- vLLM
- k8s
- Docker
- FlashAttention 1 and 2
- RLHF
- training custom LLMs
- QLoRA
- Fine-tuning
- Landmark attention
- mysys
- CUDA kernels
- ByteDance AML, Alibaba PAI
- GPTCache
- Operators
- CUDA
- Inference engines
- Training frameworks
- Machine learning platforms
- CoT (chain of thought)
- FLAN
- Orca
- Platypus
- PEFT
- ds
- RLHF
- RHAHL
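Several entries above (PEFT, QLoRA, fine-tuning) build on low-rank adapters. The essence of LoRA is to freeze the pretrained weight W and learn a small low-rank update BA on top of it. Here is a minimal NumPy sketch (shapes and the zero-initialization of B are my reading of the standard LoRA setup, not code from any of the libraries listed):

```python
import numpy as np

rng = np.random.default_rng(42)
d, r = 64, 4  # hidden size and low rank, with r << d

W = rng.normal(size=(d, d)).astype(np.float32)              # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d)).astype(np.float32)  # trainable, small random init
B = np.zeros((d, r), dtype=np.float32)                      # trainable, zero init

def lora_forward(x):
    # y = x W^T + x (B A)^T : frozen base path plus low-rank update
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(2, d)).astype(np.float32)
# With B zero-initialized, the adapter starts as an exact no-op.
assert np.allclose(lora_forward(x), x @ W.T)
print("trainable params:", A.size + B.size, "vs full matrix:", W.size)
```

Only A and B are updated during fine-tuning, so the number of trainable parameters drops from d*d to 2*r*d; this is what makes QLoRA-style fine-tuning feasible on a single GPU.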
- A Full List on TinyML http://tinyml.seas.harvard.edu/courses/
- MIT 6.5940 https://hanlab.mit.edu/courses/2023-fall-65940
- ESE3600 https://tinyml.seas.upenn.edu/
- A Full List on CS https://github.com/Developer-Y/cs-video-courses