- Open-source model: Gemma.
- Experimental support for continuous batching.
- Open-source model: LLaMA 2.
- Open-source improvement: GPT-J tokenizer.
- Open-source models: LLaMA and GPT-J.
- Improved compatibility with new Cloud TPU systems.
- Fixed multi-host TPU models.
- Fixed single-host GPU models.
- Google Cloud GPU support.
- Model quantization.
- Streaming `lm.generate`.
- A new custom model server type.
- PyTorch model servers.
- ACL settings on models and cells.
- A Pax model server that supports Google Cloud TPU slice serving.
- An admin server that manages model servers.
- Go, C++, and Python clients to manage and use models.
- A command-line tool, `saxutil`.
- Example language and vision model serving parameters.