- LlaMA-like (`LlamaForCausalLM`):
  - All LlaMA-1 models
  - LlaMA-2: Chat-7B, etc.
  - LlaMA-3: Instruct-8B, Instruct-70B, other derivations such as Llama3-8B-Chinese-Chat
  - LlaMA-3.1: Instruct-8B, Instruct-70B
  - LlaMA-3.2: Instruct-1B, Instruct-3B
  - CodeLlaMA: Instruct-7B (`-a CodeLlaMA`)
  - LLM-Compiler: 7B, 7B-FTD, 13B, 13B-FTD
  - DeepSeek: Chat-7B (`-a DeepSeek`), Coder-6.7B (`-a DeepSeekCoder`), Coder-Instruct-1.3B (`-a DeepSeekCoder`) 🔥
  - Yi (`-a Yi`):
    - v1: Chat-6B, Chat-34B
    - v1.5: Chat-6B, Chat-9B, Chat-34B, Chat-9B-16K, Chat-34B-16K
    - Coder: Chat-1.5B, Chat-9B
  - WizardLM: LM 7B (`-a WizardLM`), LM 13B (`-a WizardLM`), Coder Python-7B (`-a WizardCoder`)
  - TigerBot: Chat-7B, Chat-13B (`-a TigerBot`)
  - CodeFuse-DeepSeek: 33B (`-a CodeFuseDeepSeek`)
  - MAP-Neo: Instruct-7B (`-a MAP-Neo`)
  - Index: Chat-1.9B, Character-1.9B
  - NuminaMath: 7B-TIR
  - SmolLM (`-a SmolLM`):
    - v1: Instruct-1.7B
    - v2: Instruct-1.7B
  - Groq: Llama-3-Groq-8B-Tool-Use (`-a Llama-3-Groq-8B-Tool-Use`)

  For other models that use the `LlamaForCausalLM` architecture, for example aiXcoder-7B, try `-a Yi` (see the conversion sketch below).
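  As a sketch of that conversion for a hypothetical local copy of aiXcoder-7B (the paths are placeholders; `convert.py` with `-i`/`-o`/`-a` is used as shown later in this document):

  ```sh
  # Treat the checkpoint as a Yi-style LlamaForCausalLM model (paths are placeholders)
  python convert.py -i /path/to/aiXcoder-7B -o aixcoder-7b.bin -a Yi
  ```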
- Baichuan (`BaichuanForCausalLM`)

- ChatGLM (`ChatGLMModel`):
  - ChatGLM: 6B
  - ChatGLM2 family: ChatGLM2 6B, CodeGeeX2 6B, ChatGLM3 6B

    Tip on CodeGeeX2: code completion only, no context. Use the system prompt to specify the language, e.g. `-s "# language: python"` (see the sketch after this list).

  - CharacterGLM: 6B (`-a CharacterGLM`)

    Note: Use additional key-value pair arguments to specify characters, `--kv user_name "..." bot_name "..." user_info "..." bot_info "..."`, as in the sketch after this list.

  - GLM-4: Chat-9B-128k, Chat-9B-1M
  - CodeGeeX4: 9B (`-a CodeGeeX4`)
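  The two tips above might translate into commands like the following sketch; the binary path and the `-m` model flag are assumptions about your local build, while `-s` and `--kv` are the options described above:

  ```sh
  # CodeGeeX2: code completion only; select the language via the system prompt
  # (binary path and -m are assumed; adjust to your build)
  ./build/bin/main -m codegeex2.bin -s "# language: python"

  # CharacterGLM: describe both characters with key-value pairs (values are illustrative)
  ./build/bin/main -m characterglm.bin --kv user_name "Alice" bot_name "Bob" user_info "a curious student" bot_info "a patient tutor"
  ```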
- InternLM (`InternLMForCausalLM`, `InternLM2ForCausalLM`)
  - v1: Chat-7B, Chat-7B v1.1, Chat-20B
  - v2: Chat-1.8B, Chat-7B, Chat-20B, Math-Plus-1.8B, Math-Plus-7B, Math-Plus-20B
  - v2.5: Chat-1.8B, Chat-7B, Chat-7B-1M, Chat-20B

- Mistral (`MistralForCausalLM`, `MixtralForCausalLM`)
  - Mistral: Instruct-7B-v0.2, Instruct-7B-v0.3
  - OpenChat: 3.5 (`-a OpenChat`) 🔥

    Tip: Use the system prompt to select modes: `-s GPT4` (default mode), `-s Math` (mathematical reasoning mode). See the sketch after this list.

  - Starling: 7B-beta (`-a Starling`)

    Note: This is based on OpenChat and is fully compatible with OpenChat GPT4 mode.

  - WizardLM: Math 7B (`-a WizardMath`)
  - Mixtral: Instruct-8x7B 🔥, Instruct-8x22B

    Three implementations of sliding-window attention are provided (see `SlidingWindowAttentionImpl`):
    - Full cache: more RAM is needed.
    - Partial cache: less RAM is needed, and faster than ring cache (default).
    - Ring cache (i.e. rolling cache): least RAM, but the current implementation is naive (slow). 💣

    Note: the precision of these implementations differs, which causes different results.

  - NeuralBeagle14: 7B (`-a NeuralBeagle`)
  - WizardLM-2: WizardLM-2-8x22B (official link is gone) (`-a WizardLM-2-MoE`)

    Note: For `MixtralForCausalLM` models, `--experts ...` is supported to select a subset of experts when converting. For example, `--experts 0,1,2,3` selects the first 4 experts (see the sketch after this list).

  - Codestral: 22B-v0.1
  - Mistral-Nemo: Nemo-Instruct-2407
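  Two sketches for the notes above; the chat binary path and `-m` are assumptions about your build, while `-s`, `--experts`, and the `convert.py` options follow the descriptions in this list:

  ```sh
  # OpenChat / Starling: switch to the mathematical reasoning mode via the system prompt
  # (binary path and -m are assumed)
  ./build/bin/main -m openchat-3.5.bin -s Math

  # Mixtral: keep only the first 4 experts when converting (paths are placeholders)
  python convert.py -i /path/to/Mixtral-8x7B-Instruct-v0.1 -o mixtral-4x7b.bin --experts 0,1,2,3
  ```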
- Phi (`PhiForCausalLM`, `Phi3ForCausalLM`)

  Tip: `--temp 0` is recommended. Don't forget to try `--format qa` (see the sketch after this list).

  - Dolphin Phi-2 (`-a DolphinPhi2`) 🐬
  - Phi-3: Mini-Instruct-4k, Mini-Instruct-128k, Medium-Instruct-4k, Medium-Instruct-128k
  - Phi-3.5: Mini-Instruct, MoE-Instruct
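  A minimal sketch of the tip above (the binary path and `-m` are assumptions about your build):

  ```sh
  # Greedy sampling plus the QA prompt format, as recommended for Phi models
  ./build/bin/main -m phi-3-mini-4k.bin --temp 0 --format qa
  ```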
- QWen (`QWenLMHeadModel`, `Qwen2ForCausalLM`, `Qwen2MoeForCausalLM`)
  - v1: Chat-7B, Chat-14B, QAnything-7B
  - v1.5: Chat-0.5B, Chat-1.8B, Chat-4B, Chat-7B, Chat-14B, CodeQwen-Chat-7B (`-a CodeQwen`)
  - v1.5 MoE: Chat-A2.7B
  - v2: Instruct-0.5B, Instruct-1.5B, Instruct-7B, Instruct-72B
  - v2 MoE: Instruct-57B-A14B (💣 not tested)
  - v2.5: Instruct-0.5B, Instruct-1.5B, Instruct-7B, Instruct-14B, Instruct-32B, Instruct-72B
  - v2.5-Coder: Instruct-1.5B, Instruct-7B
  - v2.5-Math: Instruct-1.5B, Instruct-7B, Instruct-72B
  - Marco-o1 (`-a Marco-o1`)
  - QwQ-32B-Preview (`-a QwQ`)

- BlueLM (`BlueLMForCausalLM`)

- Orion (`OrionForCausalLM`)

- MiniCPM (`MiniCPMForCausalLM`, `MiniCPM3ForCausalLM`)

- Adept Persimmon (`PersimmonForCausalLM`)

- Gemma (`GemmaForCausalLM`)
  - v1.0: Instruct-2B, Instruct-7B
  - v1.1: Instruct-2B, Instruct-7B
  - CodeGemma v1.1: Instruct-7B
  - v2: Instruct-2B, Instruct-9B, Instruct-27B

- Cohere (`CohereForCausalLM`)
  - C4AI Command-R
  - Aya-23-8B, Aya-23-35B (`-a Aya-23`, fully compatible with Command-R)

- Zhinao (`ZhinaoForCausalLM`)

- DeepSeek (`DeepseekV2ForCausalLM`)
  - V2-Chat (💣 not tested), V2-Lite-Chat
  - Coder-V2-Instruct (💣 not tested), Coder-V2-Lite-Instruct

  Two optimization modes are defined: speed (default) and memory. See `BaseMLAttention`.

- XVERSE (`XverseForCausalLM`)

  Note: Tokenizer's behavior is not 100% identical.

- AllenAI (`OlmoeForCausalLM`)
  - OLMoE: Instruct-7B

- Granite (`GraniteForCausalLM`, `GraniteMoeForCausalLM`)
Please use `--format completion` for these models.

- LlaMA-like (`LlamaForCausalLM`):
  - DeepSeek: Coder-Base-1.3B (`-a DeepSeekCoder`), Coder-Base-6.7B (`-a DeepSeekCoder`)

- DeepSeek (`DeepseekV2ForCausalLM`)
  - Coder-V2-Base (💣 not tested), Coder-V2-Lite-Base

- Mistral (`MistralForCausalLM`, `MixtralForCausalLM`)
  - Mistral: Base-7B-v0.1, Base-7B-v0.3

- Gemma (`GemmaForCausalLM`)

- Grok-1

- StarCoder (`Starcoder2ForCausalLM`)

- Stable-LM (`StableLMEpochModel`)
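For these base models, a run might look like the following sketch; only `--format completion` comes from the note above, while the binary path, `-m`, and `-p` are assumptions about your local build:

```sh
# Base (non-chat) models: plain text completion instead of a chat template (-m/-p are assumed flags)
./build/bin/main -m mistral-7b-v0.3-base.bin --format completion -p "def quicksort(arr):"
```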
- Text Embedding (`XLMRobertaModel`)
  - BGE-M3 (`-a BGE-M3`)

    Note: Only dense embedding is implemented.

- QA Ranking (`XLMRobertaForSequenceClassification`)
  - BCE-ReRanker
  - BGE-ReRanker-M3 (`-a BGE-Reranker-M3`)
These LoRA models have been tested:

- Meta-AI multi-token prediction models checkpoints

  Download at least one multi-token prediction checkpoint (such as 7B_1T_4). Assume it is stored at /path/to/llama-multi-predict/7B_1T_4. Make sure `tokenizer.model` is downloaded to /path/to/llama-multi-predict.

  To convert it with `-a llama-multi-token-prediction-ckpt`:

  `python convert.py -i /path/to/llama-multi-predict/7B_1T_4 -o llama-multi.bin -a llama-multi-token-prediction-ckpt`

  This is a base model, so remember to use `--format completion`.

  Tip: Use `--kv n_future_tokens N` to change the number of future tokens, N = [1, 4].
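  A possible invocation after conversion, shown only as a sketch: `--format completion` and `--kv n_future_tokens` are taken from the notes above, while the binary path, `-m`, and `-p` are assumptions about your local build:

  ```sh
  # Predict 2 future tokens per step; this is a base model, so use the completion format
  ./build/bin/main -m llama-multi.bin --format completion --kv n_future_tokens 2 -p "Once upon a time"
  ```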