A low-latency, high-throughput Stable Diffusion serving system that integrates the most advanced features:
- SLO-aware iteration scheduling.
- Multi-model/LoRA concurrent serving.
- Co-scheduling of inference and fine-tuning tasks.
- Low-bit optimization (fp16 recommended); see the loading sketch after this list.
- xFormers, a toolbox to accelerate research on Transformers, developed by Meta AI.
- DeepSpeed, Extreme Speed and Scale for DL Training and Inference, developed by Microsoft Research.
- OneFlow, a deep learning framework designed to be user-friendly, scalable and efficient.
- Support for Stable Diffusion checkpoints/LoRAs from Civitai.
- Machine Learning Compilation optimization.
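As a minimal sketch of the recommended fp16 setting, a diffusers pipeline can be loaded in half precision as below. The model ID is a placeholder; any compatible Stable Diffusion checkpoint (e.g. one downloaded from Civitai) can be used.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in half precision (fp16); the model ID is a placeholder.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
```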
For the HuggingFace diffusers pipeline, xFormers, and DeepSpeed, use env/base-env.yaml. Full DeepSpeed acceleration relies on a CUTLASS installation.
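A hedged sketch of injecting DeepSpeed's inference kernels into a diffusers pipeline follows; the exact flags and the set of replaced modules vary across DeepSpeed versions, and kernel injection is what requires CUTLASS. Whether the whole pipeline object or individual submodules should be passed may also depend on the installed version.

```python
import torch
import deepspeed
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Injection rewrites supported submodules (UNet, VAE, text encoder) in place,
# so the original `pipe` object is used for generation afterwards.
deepspeed.init_inference(
    pipe,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)
image = pipe("an astronaut riding horse on the moon").images[0]
```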
For OneFlow, use env/OneFlow.yaml. After installation, replace the diffusers.models.unet_2d_condition.forward function with the code in src/unet_forward_with_different_timesteps.py.
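One way to apply this replacement is to monkey-patch the UNet class at import time. A minimal sketch, assuming src/ is importable and that the replacement file exposes a function named forward with the original signature (the function name is an assumption, not stated above):

```python
from diffusers.models import unet_2d_condition

# Assumed name: the replacement forward defined in
# src/unet_forward_with_different_timesteps.py.
from src.unet_forward_with_different_timesteps import forward as patched_forward

# All UNet2DConditionModel instances now use the patched forward.
unet_2d_condition.UNet2DConditionModel.forward = patched_forward
```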
Numbers are collected on Ubuntu 20.04.6 LTS with an RTX 4090 (24 GB) and CUDA 11.8.
All settings use fp16. (Experiments show no noticeable quality loss with fp16 compared to fp32.)
Inference setting:
{
"prompt": "an astronaut riding horse on the moon",
"num_inference_steps": 50,
"height": 512,
"width": 512,
"guidance_scale": 7.5
}
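For reference, a sketch of how these settings map onto a diffusers pipeline call (using the fp16 pipeline loaded earlier):

```python
# One request with the benchmark settings above.
image = pipe(
    prompt="an astronaut riding horse on the moon",
    num_inference_steps=50,
    height=512,
    width=512,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")
```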
Note: if you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling xFormers.
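Enabling xFormers on a diffusers pipeline is a single call; on PyTorch >= 2.0 the built-in scaled dot-product attention already provides a comparable fast path:

```python
# Requires the xformers package installed via env/base-env.yaml.
pipe.enable_xformers_memory_efficient_attention()
```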
| batch_size | PyTorch=2.1.0 + diffusers=0.14.0 | OneFlow=0.9.0 | xFormers=0.0.22 | DeepSpeed=0.12.2 |
| --- | --- | --- | --- | --- |
| 1 | 1.660253 | 0.907718 | 1.837109 | 1.444413 |
| 2 | 2.154117 | 1.481392 | 2.294451 | 2.094967 |
| 4 | 3.949180 | 2.621291 | 4.086211 | 3.907683 |
| 8 | 7.741389 | 5.011853 | 7.610301 | 7.674476 |
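The exact measurement script is not included here; a sketch of a timing loop that can collect per-batch latencies of this kind (one warm-up run plus CUDA synchronization) is shown below. It is an assumption about methodology, not the repo's actual benchmark script.

```python
import time
import torch

def time_batch(pipe, prompt, batch_size, n_runs=5):
    """Return mean wall-clock seconds per batch for a text-to-image call."""
    prompts = [prompt] * batch_size
    kwargs = dict(num_inference_steps=50, height=512, width=512, guidance_scale=7.5)
    pipe(prompts, **kwargs)  # warm-up: excludes one-time CUDA/compilation cost
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompts, **kwargs)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```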