- LlaMA-like (`LlamaForCausalLM`):
  - All LlaMA-1 models
  - LlaMA-2: Chat-7B, etc.
  - LlaMA-3: Instruct-8B, Instruct-70B, other derivations such as Llama3-8B-Chinese-Chat
  - LlaMA-3.1: Instruct-8B, Instruct-70B
  - LlaMA-3.2: Instruct-1B, Instruct-3B
  - CodeLlaMA: Instruct-7B (`-a CodeLlaMA`)
  - LLM-Compiler: 7B, 7B-FTD, 13B, 13B-FTD
  - DeepSeek: Chat-7B (`-a DeepSeek`), Coder-6.7B (`-a DeepSeekCoder`), Coder-Instruct-1.3B (`-a DeepSeekCoder`) 🔥
  - Yi (`-a Yi`):
    - v1: Chat-6B, Chat-34B
    - v1.5: Chat-6B, Chat-9B, Chat-34B, Chat-9B-16K, Chat-34B-16K
    - Coder: Chat-1.5B, Chat-9B
  - WizardLM: LM 7B (`-a WizardLM`), LM 13B (`-a WizardLM`), Coder Python-7B (`-a WizardCoder`)
  - TigerBot: Chat-7B, Chat-13B (`-a TigerBot`)
  - CodeFuse-DeepSeek: 33B (`-a CodeFuseDeepSeek`)
  - MAP-Neo: Instruct-7B (`-a MAP-Neo`)
  - Index: Chat-1.9B, Character-1.9B
  - NuminaMath: 7B-TIR
  - SmolLM (`-a SmolLM`):
    - v1: Instruct-1.7B
    - v2: Instruct-1.7B
  - Groq: Llama-3-Groq-8B-Tool-Use (`-a Llama-3-Groq-8B-Tool-Use`)

  For other models that use the `LlamaForCausalLM` architecture, for example aiXcoder-7B, try `-a Yi` (see the conversion sketch below).
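  As a sketch of that conversion for a hypothetical local copy of aiXcoder-7B (the paths are placeholders; `convert.py` with `-i`/`-o`/`-a` is used as shown later in this document):

  ```sh
  # Treat the checkpoint as a Yi-style LlamaForCausalLM model (paths are placeholders)
  python convert.py -i /path/to/aiXcoder-7B -o aixcoder-7b.bin -a Yi
  ```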
- Baichuan (`BaichuanForCausalLM`)

- ChatGLM (`ChatGLMModel`):
  - ChatGLM: 6B
  - ChatGLM2 family: ChatGLM2 6B, CodeGeeX2 6B, ChatGLM3 6B

    Tip on CodeGeeX2: code completion only, no context. Use the system prompt to specify the language, e.g. `-s "# language: python"` (see the sketch after this list).

  - CharacterGLM: 6B (`-a CharacterGLM`)

    Note: Use additional key-value pair arguments to specify characters, `--kv user_name "..." bot_name "..." user_info "..." bot_info "..."`, as in the sketch after this list.

  - GLM-4: Chat-9B-128k, Chat-9B-1M
  - CodeGeeX4: 9B (`-a CodeGeeX4`)
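  The two tips above might translate into commands like the following sketch; the binary path and the `-m` model flag are assumptions about your local build, while `-s` and `--kv` are the options described above:

  ```sh
  # CodeGeeX2: code completion only; select the language via the system prompt
  # (binary path and -m are assumed; adjust to your build)
  ./build/bin/main -m codegeex2.bin -s "# language: python"

  # CharacterGLM: describe both characters with key-value pairs (values are illustrative)
  ./build/bin/main -m characterglm.bin --kv user_name "Alice" bot_name "Bob" user_info "a curious student" bot_info "a patient tutor"
  ```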
- InternLM (`InternLMForCausalLM`, `InternLM2ForCausalLM`)
  - v1: Chat-7B, Chat-7B v1.1, Chat-20B
  - v2: Chat-1.8B, Chat-7B, Chat-20B, Math-Plus-1.8B, Math-Plus-7B, Math-Plus-20B
  - v2.5: Chat-1.8B, Chat-7B, Chat-7B-1M, Chat-20B

- Mistral (`MistralForCausalLM`, `MixtralForCausalLM`)
  - Mistral: Instruct-7B-v0.2, Instruct-7B-v0.3
  - OpenChat: 3.5 (`-a OpenChat`) 🔥

    Tip: Use the system prompt to select modes: `-s GPT4` (default mode), `-s Math` (mathematical reasoning mode). See the sketch after this list.

  - Starling: 7B-beta (`-a Starling`)

    Note: This is based on OpenChat and is fully compatible with OpenChat GPT4 mode.

  - WizardLM: Math 7B (`-a WizardMath`)
  - Mixtral: Instruct-8x7B 🔥, Instruct-8x22B

    Three implementations of sliding-window attention are provided (see `SlidingWindowAttentionImpl`):
    - Full cache: more RAM is needed.
    - Partial cache: less RAM is needed, and faster than ring cache (default).
    - Ring cache (i.e. rolling cache): least RAM, but the current implementation is naive (slow). 💣

    Note: the precision of these implementations differs, which causes different results.

  - NeuralBeagle14: 7B (`-a NeuralBeagle`)
  - WizardLM-2: WizardLM-2-8x22B (official link is gone) (`-a WizardLM-2-MoE`)

    Note: For `MixtralForCausalLM` models, `--experts ...` is supported to select a subset of experts when converting. For example, `--experts 0,1,2,3` selects the first 4 experts (see the sketch after this list).

  - Codestral: 22B-v0.1
  - Mistral-Nemo: Nemo-Instruct-2407
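  Two sketches for the notes above; the chat binary path and `-m` are assumptions about your build, while `-s`, `--experts`, and the `convert.py` options follow the descriptions in this list:

  ```sh
  # OpenChat / Starling: switch to the mathematical reasoning mode via the system prompt
  # (binary path and -m are assumed)
  ./build/bin/main -m openchat-3.5.bin -s Math

  # Mixtral: keep only the first 4 experts when converting (paths are placeholders)
  python convert.py -i /path/to/Mixtral-8x7B-Instruct-v0.1 -o mixtral-4x7b.bin --experts 0,1,2,3
  ```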
- Phi (`PhiForCausalLM`, `Phi3ForCausalLM`)

  Tip: `--temp 0` is recommended. Don't forget to try `--format qa` (see the sketch after this list).

  - Dolphin Phi-2 (`-a DolphinPhi2`) 🐬
  - Phi-3: Mini-Instruct-4k, Mini-Instruct-128k, Medium-Instruct-4k, Medium-Instruct-128k
  - Phi-3.5: Mini-Instruct, MoE-Instruct
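  A minimal sketch of the tip above (the binary path and `-m` are assumptions about your build):

  ```sh
  # Greedy sampling plus the QA prompt format, as recommended for Phi models
  ./build/bin/main -m phi-3-mini-4k.bin --temp 0 --format qa
  ```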
- QWen (`QWenLMHeadModel`, `Qwen2ForCausalLM`, `Qwen2MoeForCausalLM`)
  - v1: Chat-7B, Chat-14B, QAnything-7B
  - v1.5: Chat-0.5B, Chat-1.8B, Chat-4B, Chat-7B, Chat-14B, CodeQwen-Chat-7B (`-a CodeQwen`)
  - v1.5 MoE: Chat-A2.7B
  - v2: Instruct-0.5B, Instruct-1.5B, Instruct-7B, Instruct-72B
  - v2 MoE: Instruct-57B-A14B (💣 not tested)
  - v2.5: Instruct-0.5B, Instruct-1.5B, Instruct-7B, Instruct-14B, Instruct-32B, Instruct-72B
  - v2.5-Coder: Instruct-1.5B, Instruct-7B
  - v2.5-Math: Instruct-1.5B, Instruct-7B, Instruct-72B
  - Marco-o1 (`-a Marco-o1`)
  - QwQ-32B-Preview (`-a QwQ`)

- BlueLM (`BlueLMForCausalLM`)

- Orion (`OrionForCausalLM`)

- MiniCPM (`MiniCPMForCausalLM`, `MiniCPM3ForCausalLM`)

- Adept Persimmon (`PersimmonForCausalLM`)

- Gemma (`GemmaForCausalLM`)
  - v1.0: Instruct-2B, Instruct-7B
  - v1.1: Instruct-2B, Instruct-7B
  - CodeGemma v1.1: Instruct-7B
  - v2: Instruct-2B, Instruct-9B, Instruct-27B

- Cohere (`CohereForCausalLM`)
  - C4AI Command-R
  - Aya-23-8B, Aya-23-35B (`-a Aya-23`, fully compatible with Command-R)

- Zhinao (`ZhinaoForCausalLM`)

- DeepSeek (`DeepseekV2ForCausalLM`)
  - V2-Chat (💣 not tested), V2-Lite-Chat
  - Coder-V2-Instruct (💣 not tested), Coder-V2-Lite-Instruct

  Two optimization modes are defined: speed (default) and memory. See `BaseMLAttention`.

- XVERSE (`XverseForCausalLM`)

  Note: Tokenizer's behavior is not 100% identical.

- AllenAI (`OlmoeForCausalLM`)
  - OLMoE: Instruct-7B

- Granite (`GraniteForCausalLM`, `GraniteMoeForCausalLM`)
Please use `--format completion` for these models.

- LlaMA-like (`LlamaForCausalLM`):
  - DeepSeek: Coder-Base-1.3B (`-a DeepSeekCoder`), Coder-Base-6.7B (`-a DeepSeekCoder`)

- DeepSeek (`DeepseekV2ForCausalLM`)
  - Coder-V2-Base (💣 not tested), Coder-V2-Lite-Base

- Mistral (`MistralForCausalLM`, `MixtralForCausalLM`)
  - Mistral: Base-7B-v0.1, Base-7B-v0.3

- Gemma (`GemmaForCausalLM`)

- Grok-1

- StarCoder (`Starcoder2ForCausalLM`)

- Stable-LM (`StableLMEpochModel`)
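For these base models, a run might look like the following sketch; only `--format completion` comes from the note above, while the binary path, `-m`, and `-p` are assumptions about your local build:

```sh
# Base (non-chat) models: plain text completion instead of a chat template (-m/-p are assumed flags)
./build/bin/main -m mistral-7b-v0.3-base.bin --format completion -p "def quicksort(arr):"
```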
- Text Embedding (`XLMRobertaModel`)
  - BGE-M3 (`-a BGE-M3`)

    Note: Only dense embedding is implemented.

- QA Ranking (`XLMRobertaForSequenceClassification`)
  - BCE-ReRanker
  - BGE-ReRanker-M3 (`-a BGE-Reranker-M3`)
These LoRA models have been tested:

- Meta-AI multi-token prediction models checkpoints

  Download at least one multi-token prediction checkpoint (such as 7B_1T_4). Assume it is stored at /path/to/llama-multi-predict/7B_1T_4. Make sure `tokenizer.model` is downloaded to /path/to/llama-multi-predict.

  To convert it with `-a llama-multi-token-prediction-ckpt`:

  `python convert.py -i /path/to/llama-multi-predict/7B_1T_4 -o llama-multi.bin -a llama-multi-token-prediction-ckpt`

  This is a base model, so remember to use `--format completion`.

  Tip: Use `--kv n_future_tokens N` to change the number of future tokens, N = [1, 4].
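  A possible invocation after conversion, shown only as a sketch: `--format completion` and `--kv n_future_tokens` are taken from the notes above, while the binary path, `-m`, and `-p` are assumptions about your local build:

  ```sh
  # Predict 2 future tokens per step; this is a base model, so use the completion format
  ./build/bin/main -m llama-multi.bin --format completion --kv n_future_tokens 2 -p "Once upon a time"
  ```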