Pre-training loss, not model size or data size, is the key factor behind downstream-task performance and emergent abilities.
SLMs are well suited to fine-tuning for specific tasks and domains (a minimal adapter sketch follows the list below):
- Octopus: On-device language model for function calling of software APIs
- Octopus v2: On-device language model for super agent
- Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
- Octopus v4: Graph of language models
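As an illustration of how cheaply an SLM can be specialized, here is a minimal LoRA-style adapter in plain PyTorch (a generic sketch of the technique, not the Octopus training recipe; `rank` and `alpha` are illustrative hyperparameters): the pretrained weight stays frozen and only a low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scaling * B A x ; only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 16, 512))   # shape: (2, 16, 512)
```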
Injecting knowledge is crucial. RAG or fine-tuning? That is the question (a toy RAG sketch follows the list below):
- Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
- Injecting New Knowledge into Large Language Models via Supervised Fine-tuning
- Adapting Large Language Models via Reading Comprehension
- RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture
- RAFT: Adapting Language Model to Domain Specific RAG
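To make the RAG side of the comparison concrete, here is a toy retrieval-augmented prompt pipeline (a sketch: `llm` is a hypothetical generation callable, and the naive token-overlap retriever stands in for a real dense retriever):

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def rag_answer(query: str, docs: list[str], llm) -> str:
    context = "\n".join(retrieve(query, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)   # new knowledge arrives via the prompt; weights unchanged
```

Fine-tuning instead writes the same facts into the weights; RAFT sits between the two, fine-tuning the model specifically to answer from retrieved context.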
Different architectures of SLMs:
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, which uses parameter-sharing techniques (layer sharing is sketched after this list).
- Gemma 2, which interleaves local (sliding-window) attention and global attention (the attention masks are sketched after this list).
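A sketch of MobileLLM-style immediate block-wise weight sharing: each transformer block is applied twice in a row, buying extra depth without extra parameters (the block count and repeat factor here are illustrative, not MobileLLM's configuration).

```python
import torch
import torch.nn as nn

class SharedStack(nn.Module):
    """Apply each block `repeats` times in a row, reusing the same weights."""
    def __init__(self, blocks: nn.ModuleList, repeats: int = 2):
        super().__init__()
        self.blocks, self.repeats = blocks, repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            for _ in range(self.repeats):   # same weights, applied repeatedly
                x = block(x)
        return x

blocks = nn.ModuleList(nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(4))
x = SharedStack(blocks)(torch.randn(1, 10, 64))   # 8 layer applications, 4 layers of weights
```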
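And a sketch of the interleaved local/global attention pattern, building only the masks (the window size of 4 and the even/odd alternation are illustrative stand-ins for Gemma 2's actual layer layout):

```python
import torch

def causal_mask(t: int) -> torch.Tensor:
    """Global attention: every token sees all previous tokens."""
    return torch.tril(torch.ones(t, t, dtype=torch.bool))

def sliding_window_mask(t: int, window: int = 4) -> torch.Tensor:
    """Local attention: every token sees at most `window` previous tokens."""
    i = torch.arange(t)
    return causal_mask(t) & (i[:, None] - i[None, :] < window)

# alternate local and global masks across layers
masks = [sliding_window_mask(8) if layer % 2 else causal_mask(8) for layer in range(4)]
```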
Knowledge distillation is effective for SLMs:
- Gemma 2, which uses knowledge distillation in both the pre-training (PT) and instruction-tuning (IT) stages.
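A minimal sketch of the generic distillation objective (the standard KL-on-soft-targets recipe, not necessarily Gemma 2's exact setup): the student is trained to match the teacher's temperature-softened token distribution rather than one-hot labels.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                 T: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)   # standard T^2 scaling

loss = distill_loss(torch.randn(4, 32000), torch.randn(4, 32000))
```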
More data and longer training help (but returns scale only logarithmically):
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies, which uses the Warmup-Stable-Decay (WSD) learning-rate scheduler to enable longer training (sketched below).
- Index-1.9B, which also uses WSD.
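A sketch of a WSD schedule with its three phases: linear warmup, a long constant plateau, then a short decay (the phase fractions and the exponential decay form below are illustrative choices; MiniCPM reports that an exponential decay stage works well).

```python
def wsd_lr(step: int, total: int, peak: float = 1e-3,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: ramp up, hold the peak, then anneal at the end."""
    warmup_end = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup_end:                     # warmup: linear ramp to peak
        return peak * step / max(warmup_end, 1)
    if step < decay_start:                    # stable: hold the peak LR
        return peak
    frac = (step - decay_start) / max(total - decay_start, 1)
    return peak * (0.1 ** frac)               # decay: anneal toward 10% of peak

lrs = [wsd_lr(s, total=10_000) for s in range(10_000)]
```

Because the stable phase holds a constant learning rate, training can simply continue on more data without replanning the whole schedule; the decay phase is run only when a usable checkpoint is needed.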
More inference-time compute benefits LLMs, so a natural question follows: is it better to use cheaper, faster SLMs instead of LLMs for complex reasoning at inference time?
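A minimal sketch of one such strategy, self-consistency via repeated sampling (`slm_generate` is a hypothetical SLM inference callable, assumed to return a final answer string):

```python
from collections import Counter

def self_consistency(question: str, slm_generate, n: int = 16) -> str:
    """Sample n answers from a small model and return the majority vote."""
    answers = [slm_generate(question, temperature=0.8) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Whether N cheap SLM samples can match one expensive LLM pass is exactly the cost/quality trade-off this question is asking about.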