Pre-training loss, not model size or data size, is the key factor behind downstream-task performance and emergent abilities.
SLMs are well suited to fine-tuning for specific tasks and domains (a minimal adapter sketch follows the list below):
- Octopus: On-device language model for function calling of software APIs
- Octopus v2: On-device language model for super agent
- Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
- Octopus v4: Graph of language models
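As an illustration of how cheaply an SLM can be specialized, here is a minimal LoRA-style adapter in plain PyTorch (a generic sketch of the technique, not the Octopus training recipe; `rank` and `alpha` are illustrative hyperparameters): the pretrained weight stays frozen and only a low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scaling * B A x ; only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 16, 512))   # shape: (2, 16, 512)
```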
Injecting knowledge is crucial. RAG or fine-tuning? That is the question (a toy RAG sketch follows the list below):
- Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
- Injecting New Knowledge into Large Language Models via Supervised Fine-tuning
- Adapting Large Language Models via Reading Comprehension
- RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture
- RAFT: Adapting Language Model to Domain Specific RAG
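To make the RAG side of the comparison concrete, here is a toy retrieval-augmented prompt pipeline (a sketch: `llm` is a hypothetical generation callable, and the naive token-overlap retriever stands in for a real dense retriever):

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def rag_answer(query: str, docs: list[str], llm) -> str:
    context = "\n".join(retrieve(query, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)   # new knowledge arrives via the prompt; weights unchanged
```

Fine-tuning instead writes the same facts into the weights; RAFT sits between the two, fine-tuning the model specifically to answer from retrieved context.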
Different architectures of SLMs:
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, which uses parameter-sharing techniques (layer sharing is sketched after this list).
- Gemma 2, which interleaves local (sliding-window) attention and global attention (the attention masks are sketched after this list).
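A sketch of MobileLLM-style immediate block-wise weight sharing: each transformer block is applied twice in a row, buying extra depth without extra parameters (the block count and repeat factor here are illustrative, not MobileLLM's configuration).

```python
import torch
import torch.nn as nn

class SharedStack(nn.Module):
    """Apply each block `repeats` times in a row, reusing the same weights."""
    def __init__(self, blocks: nn.ModuleList, repeats: int = 2):
        super().__init__()
        self.blocks, self.repeats = blocks, repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            for _ in range(self.repeats):   # same weights, applied repeatedly
                x = block(x)
        return x

blocks = nn.ModuleList(nn.TransformerEncoderLayer(64, 4, batch_first=True) for _ in range(4))
x = SharedStack(blocks)(torch.randn(1, 10, 64))   # 8 layer applications, 4 layers of weights
```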
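And a sketch of the interleaved local/global attention pattern, building only the masks (the window size of 4 and the even/odd alternation are illustrative stand-ins for Gemma 2's actual layer layout):

```python
import torch

def causal_mask(t: int) -> torch.Tensor:
    """Global attention: every token sees all previous tokens."""
    return torch.tril(torch.ones(t, t, dtype=torch.bool))

def sliding_window_mask(t: int, window: int = 4) -> torch.Tensor:
    """Local attention: every token sees at most `window` previous tokens."""
    i = torch.arange(t)
    return causal_mask(t) & (i[:, None] - i[None, :] < window)

# alternate local and global masks across layers
masks = [sliding_window_mask(8) if layer % 2 else causal_mask(8) for layer in range(4)]
```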
Knowledge distillation is effective for SLMs:
- Gemma 2, which uses knowledge distillation in both the pre-training (PT) and instruction-tuning (IT) stages.
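A minimal sketch of the generic distillation objective (the standard KL-on-soft-targets recipe, not necessarily Gemma 2's exact setup): the student is trained to match the teacher's temperature-softened token distribution rather than one-hot labels.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                 T: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)   # standard T^2 scaling

loss = distill_loss(torch.randn(4, 32000), torch.randn(4, 32000))
```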
More data and longer training help (but returns scale only logarithmically):
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies, which uses the Warmup-Stable-Decay (WSD) learning-rate scheduler to enable longer training (sketched below).
- Index-1.9B, which also uses WSD.
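A sketch of a WSD schedule with its three phases: linear warmup, a long constant plateau, then a short decay (the phase fractions and the exponential decay form below are illustrative choices; MiniCPM reports that an exponential decay stage works well).

```python
def wsd_lr(step: int, total: int, peak: float = 1e-3,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: ramp up, hold the peak, then anneal at the end."""
    warmup_end = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup_end:                     # warmup: linear ramp to peak
        return peak * step / max(warmup_end, 1)
    if step < decay_start:                    # stable: hold the peak LR
        return peak
    frac = (step - decay_start) / max(total - decay_start, 1)
    return peak * (0.1 ** frac)               # decay: anneal toward 10% of peak

lrs = [wsd_lr(s, total=10_000) for s in range(10_000)]
```

Because the stable phase holds a constant learning rate, training can simply continue on more data without replanning the whole schedule; the decay phase is run only when a usable checkpoint is needed.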
More inference-time compute benefits LLMs, so a natural question follows: is it better to use cheaper, faster SLMs instead of LLMs for complex reasoning at inference time?
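A minimal sketch of one such strategy, self-consistency via repeated sampling (`slm_generate` is a hypothetical SLM inference callable, assumed to return a final answer string):

```python
from collections import Counter

def self_consistency(question: str, slm_generate, n: int = 16) -> str:
    """Sample n answers from a small model and return the majority vote."""
    answers = [slm_generate(question, temperature=0.8) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Whether N cheap SLM samples can match one expensive LLM pass is exactly the cost/quality trade-off this question is asking about.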