This RFC proposes a unified FP8 usage interface between vLLM and third-party quantizers (e.g. the AMD quantizer, NVIDIA AMMO) or any harness built around them.
The goal is for vLLM to work against a simple, clearly defined, and concise interface for FP8 features.
Scope:
- per-tensor scaling is supported
- per-channel scaling planned for the future
- per-block scaling planned for the future
External quantizers supported:
- AMD quantizer
- AMMO
Interface Content (example in schema)
N.B. The vLLM interface below isn't necessarily a specific quantizer's direct output.
Integer keys below are transformer layer indices, starting at 0 for the first layer defined in the model.
Only OCP FP8 E4M3 is recommended.
"_comment" entries below are for documentation purposes only. Illustrative sketches of how this schema might be consumed follow the example.
{
"_comment": "intent of this schema design is not to support llama only"
"model_type": "llama",
"kv_cache": {
"_comment": "have kv_cache session separate is to enable it as a standalone feature, other than everything else"
"dtype": "float8_e4m3fn",
"scaling_factor": {
"_comment": "this section exists for kv cache in fp8 e4m3fn"
"rank0": {
"_comment": "integer index key below is the layer id defined in model, 0 as the first"
"0": 0.05,
"1": 0.04,
... ...
"31": 0.1
},
"rank1": {
... ...
},
... ...
}
},
"mlp": {
"activation": {
"dtype": "float16" | "bfloat16" | "float8_e4m3fn",
"_comment": "for float16/bfloat16 input X, scaling factor below is used to do e4m3fn quant, then its inverse",
"_comment": "is used at gemm output; for float8_e4m3fn input X, scaling factor is used at gemm output",
"target_dtype": "float8_e4m3fn",
"scaling_factor": {
"_comment": "this section exists for fp8 mfma"
"rank0": {
"0": {
"_comment": "keys: gate_proj, up_proj, down_proj or fc1, fc2 best to be suffix for those defined in HF model"
"gate_proj": 0.04,
"up_proj": 0.04,
"down_proj": 0.03
},
"1": {
"_comment": "for activation/X, scaling factors to gate_proj and up_proj should be identical, as same X"
"gate_proj": 0.06,
"up_proj": 0.06,
"down_proj": 0.04
},
... ...
"31": {
"_comment": "but we do not merge it to one gate_up_proj, to be smarter than quantizer, and to do differently than Weight section below"
"gate_proj": 0.03,
"up_proj": 0.03,
"down_proj": 0.05
}
},
"rank1": {
... ...
},
... ...
}
},
"weight": {
"dtype": "float16" | "bfloat16" | "float8_e4m3fn" | "int8" | "int4",
"_comment": "for float16/bfloat16 W, scaling factor below is used to do e4m3fn quant, then its inverse",
"_comment": "is used at gemm output; for float8_e4m3fn W, scaling factor is used at gemm output;",
"_comment": "for int4 W, scaling factor is a compound one [scale2half `*` scale2fp8] to do e4m3fn quant,"
"_comment": "then to be used at gemm output",
"target_dtype": "float8_e4m3fn",
"scaling_factor": {
"_comment": "this section exists for fp8 mfma"
"rank0": {
"0": {
"gate_proj": 0.03,
"up_proj": 0.02,
"down_proj": 0.07
},
"1": {
"gate_proj": 0.02,
"up_proj": 0.04,
"down_proj": 0.03
},
... ...
"31": {
"gate_proj": 0.05,
"up_proj": 0.06,
"down_proj": 0.04
}
},
"rank1": {
... ...
},
... ...
}
},
},
"attention": {
"_comment": "to add fp8 compute to attention layers later (after mlp)"
},
"quantized_weights": {
"rank0": "path_to/rank0.safetensors",
"rank1": "path_to/rank1.safetensors",
... ...
}
}
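For illustration, here is a minimal sketch (not part of the proposed interface) of how a consumer such as vLLM might read the kv_cache section above and apply a per-tensor scale to an FP8 KV cache. The helper names, the file path, and the divide-by-scale quantization convention are assumptions for the example; it needs a PyTorch build that exposes `torch.float8_e4m3fn` (2.1 or newer).

```python
# Minimal sketch: load the proposed schema and quantize a KV-cache tensor to
# float8_e4m3fn with its per-tensor scaling factor. File path, rank/layer ids
# and the divide-by-scale convention are illustrative assumptions.
import json

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for OCP FP8 E4M3


def load_kv_scales(schema_path: str, rank: int) -> dict:
    """Return {layer_id: scaling_factor} for one tensor-parallel rank."""
    with open(schema_path) as f:
        schema = json.load(f)
    per_rank = schema["kv_cache"]["scaling_factor"][f"rank{rank}"]
    # Skip documentation keys such as "_comment"; layer ids are integer strings.
    return {int(k): float(v) for k, v in per_rank.items() if k.isdigit()}


def quantize_kv(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize a fp16/bf16 KV tensor to e4m3fn, assuming fp8 = x / scale."""
    return (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)


def dequantize_kv(x_fp8: torch.Tensor, scale: float, dtype=torch.float16) -> torch.Tensor:
    """Recover an approximation of the original values for attention compute."""
    return (x_fp8.float() * scale).to(dtype)


# e.g. scales = load_kv_scales("path_to/schema.json", rank=0); quantize_kv(k, scales[0])
```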
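And a similarly hedged sketch of the mlp GEMM path the activation/weight comments describe: both operands are quantized per tensor to e4m3fn and the two scales are re-applied at the GEMM output. The exact convention (divide vs. multiply by the scale) is up to the quantizer; this emulation upcasts the fp8 operands so the matmul runs on any device, whereas a real kernel (MFMA / FP8 tensor cores) would consume fp8 directly. The function name, shapes, and scale values are illustrative, the scales taken from the layer-0 / rank-0 gate_proj entries above.

```python
# Emulated FP8 GEMM for one MLP projection, using per-tensor scales from the
# schema above. Assumes the convention fp8 = value / scale, with both scales
# re-applied at the GEMM output.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


def fp8_linear(x: torch.Tensor, w: torch.Tensor, x_scale: float, w_scale: float) -> torch.Tensor:
    x_fp8 = (x.float() / x_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    w_fp8 = (w.float() / w_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # A real kernel consumes the fp8 operands directly; upcast here for portability.
    y = x_fp8.float() @ w_fp8.float().t()
    return (y * (x_scale * w_scale)).to(x.dtype)


# Illustrative use with the layer-0 / rank-0 gate_proj scales from the schema:
x = torch.randn(4, 4096, dtype=torch.float16)            # activations
w_gate = torch.randn(11008, 4096, dtype=torch.float16)   # gate_proj weight
y = fp8_linear(x, w_gate, x_scale=0.04, w_scale=0.03)
```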
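Finally, a short sketch of how the quantized_weights entries might be resolved and loaded per tensor-parallel rank, assuming the listed files are standard safetensors checkpoints; the helper name is hypothetical and the paths in the schema are placeholders.

```python
# Sketch: resolve and load the per-rank quantized weight checkpoint listed
# under "quantized_weights".
import json

from safetensors.torch import load_file


def load_quantized_weights(schema_path: str, rank: int) -> dict:
    """Return {tensor_name: tensor} for the given tensor-parallel rank."""
    with open(schema_path) as f:
        schema = json.load(f)
    ckpt_path = schema["quantized_weights"][f"rank{rank}"]
    return load_file(ckpt_path)
```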
Reference
RFC: FP8 in vLLM #2461