Arctic (Dense-MoE) | Snowflake | 480B Active 17B | Arctic is a dense-MoE hybrid transformer architecture pre-trained from scratch. It combines a 10B dense transformer model with a residual 128x3.66B MoE MLP, resulting in 480B total and 17B active parameters chosen using top-2 gating (a rough parameter-count sketch follows this table). | HuggingFace Github Blog
Llama 3 | Meta AI | 8B 70B | Llama 3 is a family of large language models, a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes. It is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). | HuggingFace Blog Github
Phi 3 | Microsoft | 3.8B | Phi-3-Mini is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. This dataset includes both synthetic data and publicly available website data, with an emphasis on high-quality and reasoning-dense properties. Microsoft positions the Phi-3 models as among the most capable and cost-effective small language models (SLMs) available. | HuggingFace Blog
OpenELM | Apple | 270M 450M 1.1B 3B | OpenELM is a family of Open-source Efficient Language Models. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. It was trained on RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling approximately 1.8 trillion tokens. Both pretrained and instruction-tuned models are released with 270M, 450M, 1.1B and 3B parameters. | HuggingFace OpenELM HuggingFace OpenELM-Instruct
Mixtral 8x22B (MoE) | Mistral AI | 176B Active 40B | Mixtral-8x22B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. It has a context length of 65,000 tokens. | HuggingFace Blog
Command-R+ | Cohere | 104B | C4AI Command R+ is an open-weights research release of a 104 billion parameter model with highly advanced capabilities, including Retrieval Augmented Generation (RAG) and tool use to automate sophisticated tasks. Command R+ is optimized for a variety of use cases including reasoning, summarization, and question answering. | Hugging Face
Jamba (MoE) | AI21 Labs | 52B Active 12B | Jamba is a state-of-the-art, hybrid SSM-Transformer LLM. It delivers throughput gains over traditional Transformer-based models. It is a pretrained, mixture-of-experts (MoE) generative text model, with 12B active parameters and a total of 52B parameters across all experts. It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU. | HuggingFace Blog
DBRX (MoE) | Databricks | 132B Active 36B | DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters, of which 36B are active on any input. It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts, which improves model quality (a worked combinatorics check follows this table). | HuggingFace Github Blog
Grok 1.0 (MoE) |
xAI |
314B |
Grok 1.0 uses Mixture of 8 Experts (MoE). Grok 1.0 is not fine-tuned for specific applications like dialogue but showcases strong performance compared to other models like GPT-3.5 and Llama 2. It is larger than GPT-3/3.5. |
Github HuggingFace |
Gemma | Google | 2B 7B | Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. | HuggingFace Kaggle Github Blog
Recurrent Gemma | Google | 2B | RecurrentGemma is a family of open language models built on a novel recurrent architecture. Like Gemma, RecurrentGemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Because of its novel architecture, RecurrentGemma requires less memory than Gemma and achieves faster inference when generating long sequences. | HuggingFace Kaggle
Mixtral 8x7B (MoE) | Mistral AI | 45B Active 12B | Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks. | HuggingFace Kaggle Blog
Qwen1.5-MoE (MoE) | Alibaba | 14.3B Active 2.7B | Qwen1.5-MoE is a transformer-based MoE decoder-only language model pretrained on a large amount of data. It employs a Mixture of Experts (MoE) architecture, where the models are upcycled from dense language models. It has 14.3B parameters in total and 2.7B activated parameters at runtime; while achieving performance comparable to Qwen1.5-7B, it requires only 25% of the training resources. | HuggingFace
Mistral 7B | Mistral AI | 7B | The Mistral-7B-v0.1 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. Mistral-7B-v0.1 outperforms Llama 2 13B on most benchmarks. | Github HuggingFace Kaggle Blog
Mistral 7B v2 | Mistral AI | 7B | Mistral 7B v2 has the following changes compared to Mistral 7B: a 32k context window (vs 8k context in v0.1), rope-theta = 1e6, and no sliding-window attention. | HuggingFace
Llama 2 | Meta AI | 7B 13B 70B | Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. It is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. | HuggingFace Kaggle Github Blog
Dolly v2 | Databricks | 3B 7B 12B | Dolly v2 is a causal language model created by Databricks, derived from EleutherAI's Pythia models and fine-tuned on a ~15K record instruction corpus. | HuggingFace Dolly3B HuggingFace Dolly7B HuggingFace Dolly12B Kaggle Github
Command-R | Cohere | 35B | Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities. | HuggingFace Kaggle
Qwen1.5 | Alibaba | 0.5B 1.8B 4B 7B 14B 32B 72B | Qwen1.5 is a transformer-based decoder-only language model pretrained on a large amount of data. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, a mixture of sliding window attention and full attention, etc. | HuggingFace Github
Vicuna v1.5 | LMSYS | 7B 13B | Vicuna v1.5 is fine-tuned from Llama 2 with supervised instruction fine-tuning. The training data is around 125K conversations collected from ShareGPT.com. The primary use of Vicuna is research on large language models and chatbots. | HuggingFace Vicuna7B HuggingFace Vicuna13B
Phi 2 | Microsoft | 2.7B | Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showcased nearly state-of-the-art performance among models with less than 13 billion parameters. | HuggingFace Kaggle Blog
Orca 2 | Microsoft | 7B 13B | Orca 2 is built for research purposes only and provides a single turn response in tasks such as reasoning over user given data, reading comprehension, math problem solving and text summarization. The model is designed to excel particularly in reasoning. The model is not optimized for chat and has not been trained with RLHF or DPO. | HuggingFace Blog
Smaug | Abacus AI | 34B 72B | Smaug is created using a new fine-tuning technique, DPO-Positive (DPOP), and new pairwise preference versions of ARC, HellaSwag, and MetaMath (as well as other existing datasets). | HuggingFace
MPT | MosaicML | 1B 7B 30B | MPT is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code. These models use a modified transformer architecture optimized for efficient training and inference. These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing positional embeddings with Attention with Linear Biases (ALiBi); a short ALiBi sketch follows this table. | HuggingFace Kaggle Github
Falcon | TII | 7B 40B 180B | Falcon is a family of 7B/40B/180B-parameter causal decoder-only models built by TII and trained on 1,000B/1,500B/3,500B tokens of RefinedWeb enhanced with curated corpora. | HuggingFace
YaLM | Yandex | 100B | YaLM 100B is a GPT-like neural network for generating and processing text. It was trained on a cluster of 800 A100 GPUs over 65 days. | HuggingFace Github
DeciLM | DeciAI | 6B 7B | DeciLM is a decoder-only text generation model. With support for an 8K-token sequence length, this highly efficient model uses variable Grouped-Query Attention (GQA) to achieve a superior balance between accuracy and computational efficiency. | HuggingFace
BERT | Google | 110M to 350M | BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way, using an automatic process to generate inputs and labels from those texts. | HuggingFace Kaggle GitHub
OLMo | AllenAI | 1B 7B | OLMo is a series of Open Language Models designed to enable the science of language models. The OLMo models are trained on the Dolma dataset. | HuggingFace Github
OpenChat 3.5 | OpenChat | 7B | OpenChat 3.5 is a 7B LLM that its developers report as the best-performing open 7B model at the time of its release. | HuggingFace Github
BLOOM | BigScience | 176B | BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. | HuggingFace
Hermes 2 Pro Mistral | Nous Research | 7B | Hermes 2 Pro on Mistral 7B is the new flagship 7B Hermes. Hermes 2 Pro is an upgraded, retrained version of Nous Hermes 2, consisting of an updated and cleaned version of the OpenHermes 2.5 Dataset, as well as a newly introduced Function Calling and JSON Mode dataset developed in-house. This new version of Hermes maintains its excellent general task and conversation capabilities while also excelling at Function Calling and JSON Structured Outputs. | HuggingFace
Hermes 2 Mixtral 8x7B (MoE) | Nous Research | Active 12B | Nous Hermes 2 Mixtral 8x7B DPO is the new flagship Nous Research model trained over the Mixtral 8x7B MoE LLM. The model was trained on over 1,000,000 entries of primarily GPT-4 generated data, as well as other high quality data from open datasets across the AI landscape, achieving state of the art performance on a variety of tasks. This is the SFT + DPO version of Mixtral Hermes 2. | HuggingFace
Merlinite | IBM | 7B | Merlinite-7b is a Mistral-7b-derivative model trained with the LAB methodology, using Mixtral-8x7b-Instruct as a teacher model. | HuggingFace
Labradorite | IBM | 13B | Labradorite-13b is a LLaMA-2-13b-derivative model trained with the LAB methodology, using Mixtral-8x7b-Instruct as a teacher model. | HuggingFace
XGen | Salesforce | 7B | XGen is a family of large language models with context lengths of 4K and 8K, optimized for long sequence tasks. | HuggingFace Github
Solar | Upstage | 10.7B | SOLAR-10.7B is an advanced large language model (LLM) with 10.7 billion parameters that demonstrates superior performance in various natural language processing (NLP) tasks. It is compact yet remarkably powerful, and demonstrates state-of-the-art performance among models with under 30B parameters. | HuggingFace
GPT-NeoX | EleutherAI | 20B | GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile using the GPT-NeoX library. Its architecture intentionally resembles that of GPT-3, and is almost identical to that of GPT-J-6B. | HuggingFace GitHub
Flan-T5 | Google | 80M to 11B | FLAN-T5 is a modified version of T5 with the same number of parameters; these models have been fine-tuned on more than 1,000 additional tasks, also covering more languages. Available sizes: flan-t5-small, flan-t5-base, flan-t5-large, flan-t5-xxl. | HuggingFace Kaggle
OPT | Meta AI | 125M to 175B | OPT is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. It was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. | HuggingFace
Stable LM 2 | Stability AI | 1.6B 12B | Stable LM 2 models are decoder-only language models pre-trained on 2 trillion tokens of diverse multilingual and code datasets for two epochs. | HuggingFace
Stable LM Zephyr | Stability AI | 3B | StableLM Zephyr 3B is an auto-regressive language model based on the transformer decoder architecture. It is a 3 billion parameter model that was trained on a mix of publicly available datasets and synthetic datasets using Direct Preference Optimization (DPO). | HuggingFace
Aya | Cohere | 13B | The Aya model is a transformer-style autoregressive massively multilingual generative language model that follows instructions in 101 languages. It has the same architecture as mT5-xxl. | HuggingFace Kaggle Blog
Nemotron 3 | Nvidia | 8B | Nemotron-3 models are large language foundation models for enterprises to build custom LLMs. This foundation model has 8 billion parameters and supports a context length of 4,096 tokens. Nemotron-3 is a family of enterprise-ready generative text models compatible with the NVIDIA NeMo Framework. | HuggingFace
Neural Chat v3 | Intel | 7B | Neural Chat is a 7B parameter LLM fine-tuned on the Intel Gaudi 2 processor from mistralai/Mistral-7B-v0.1 on the open-source dataset Open-Orca/SlimOrca. The model was aligned using the Direct Preference Optimization (DPO) method (a minimal DPO loss sketch follows this table). | HuggingFace
Yi | 01 AI | 6B 9B 34B | The Yi series models are the next generation of open-source large language models. They are targeted as bilingual language models and trained on a 3T-token multilingual corpus, showing promise in language understanding, commonsense reasoning, reading comprehension, and more. | HuggingFace Github
Starling LM | Nexusflow | 7B | Starling LM is an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). It is trained from Openchat-3.5-0106 with the reward model Starling-RM-34B and the PPO-based policy optimization method from Fine-Tuning Language Models from Human Preferences. | HuggingFace
NexusRaven v2 | Nexusflow | 13B | NexusRaven is an open-source and commercially viable function calling LLM that surpasses the state-of-the-art in function calling capabilities. NexusRaven-V2 is capable of generating deeply nested function calls, parallel function calls, and simple single calls. It can also justify the function calls it generated. | HuggingFace
DeepSeek LLM | Deepseek AI | 7B 67B | DeepSeek LLM is an advanced language model. It has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. | HuggingFace Github
Deepseek VL (Multimodal) | Deepseek AI | 1.3B 7B | DeepSeek-VL is an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. It uses a hybrid vision encoder supporting 1024 x 1024 image input and is constructed on top of DeepSeek-7b-base, which was trained on an approximate corpus of 2T text tokens. | HuggingFace Github
Llava 1.6 (Multimodal) | Llava HF | 7B 13B 34B | LLaVa combines a pre-trained large language model with a pre-trained vision encoder for multimodal chatbot use cases. Available models: Llava-v1.6-34b-hf, Llava-v1.6-Mistral-7b-hf, Llava-v1.6-Vicuna-7b-hf, Llava-v1.6-vicuna-13b-hf. | Hugging Face HuggingFace
Yi VL (Multimodal) | 01 AI | 6B 34B | Yi-VL model is the open-source, multimodal version of the Yi Large Language Model (LLM) series, enabling content comprehension, recognition, and multi-round conversations about images. | HuggingFace YiVL6B HuggingFace YiVL34B
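
As a sanity check on the Arctic row, the sketch below reproduces the quoted parameter counts from nothing but the figures in the table (10B dense trunk, 128 experts of ~3.66B each, top-2 gating). It is rough back-of-the-envelope arithmetic, not Snowflake's accounting.

```python
# Rough parameter accounting for Arctic, using only the numbers quoted in the table row.
dense = 10e9                                   # dense transformer trunk
experts, expert_size, top_k = 128, 3.66e9, 2   # residual MoE MLP with top-2 gating

total = dense + experts * expert_size    # ~478B, quoted as ~480B total parameters
active = dense + top_k * expert_size     # ~17.3B, quoted as ~17B active parameters per token
print(f"total ~ {total / 1e9:.0f}B, active ~ {active / 1e9:.1f}B")
```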
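
The "65x more possible combinations of experts" figure in the DBRX row is plain combinatorics: choosing 4 of 16 experts versus 2 of 8. A minimal check:

```python
from math import comb

# DBRX routes each token to 4 of its 16 experts; Mixtral-8x7B and Grok-1 route to 2 of 8.
dbrx_subsets = comb(16, 4)       # 1820 possible expert subsets per token
mixtral_subsets = comb(8, 2)     # 28 possible expert subsets per token
print(dbrx_subsets, mixtral_subsets, dbrx_subsets // mixtral_subsets)  # 1820 28 65 -> the "65x" figure
```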
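
The MPT row notes that positional embeddings are replaced with Attention with Linear Biases (ALiBi). The sketch below is a minimal NumPy illustration of the published ALiBi formulation, not MosaicML's implementation: each head adds a distance-proportional penalty to its causal attention logits, with a head-specific geometric slope (shapes and head count here are illustrative).

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Head-specific linear biases added to causal attention logits (ALiBi)."""
    # Geometric slopes per head: 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Relative position j - i; only past/current keys matter under the causal mask.
    rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    rel = np.minimum(rel, 0)                         # 0 on the diagonal, negative for the past
    return slopes[:, None, None] * rel[None, :, :]   # shape: (num_heads, seq_len, seq_len)

bias = alibi_bias(seq_len=6, num_heads=4)
print(bias.shape)  # (4, 6, 6); added to attention logits before softmax, future positions are masked anyway
```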
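
Several rows above (Smaug, Stable LM Zephyr, Neural Chat, Nous Hermes 2 Mixtral) mention alignment with Direct Preference Optimization (DPO). As a rough illustration of that objective, here is a minimal PyTorch sketch of the DPO loss from Rafailov et al. (2023); it is not the training code used by any of these models, and the example log-probabilities are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Implicit reward of a response: beta * log(pi(y|x) / pi_ref(y|x)).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss on the reward margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage: summed log-probs of chosen/rejected responses under the policy and a frozen reference.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```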