- LLM Serving Runtimes
  - Support Speculative Decoding with the vLLM runtime [kserve#3800] (a deployment sketch follows this list).
  - Support LoRA adapters [kserve#3750].
  - Support LLM serving runtimes for TensorRT-LLM and TGI, and provide benchmarking comparisons [kserve#3868].
  - Support multi-host, multi-GPU inference runtimes [kserve#2145].
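  For orientation, here is a minimal sketch of what an LLM deployment looks like on today's Hugging Face serving runtime, which can delegate to vLLM as its backend engine. The service name, model id, and GPU count are illustrative assumptions, not part of the roadmap items above.

  ```yaml
  # Minimal sketch: an LLM served via the Hugging Face runtime, which can
  # use vLLM as its backend. Name, model id, and resources are illustrative.
  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: llama-demo                  # hypothetical service name
  spec:
    predictor:
      model:
        modelFormat:
          name: huggingface
        args:
          - --model_name=llama-demo
          - --model_id=meta-llama/Llama-2-7b-chat-hf   # illustrative model id
        resources:
          limits:
            nvidia.com/gpu: "1"
  ```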
- LLM Autoscaling
  - Support model caching with automatic PV/PVC provisioning [kserve#3869].
  - Support autoscaling settings for serving runtimes (the existing per-predictor knobs are sketched after this list).
  - Support autoscaling based on custom metrics [kserve#3561].
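  The runtime-level settings above would complement the per-predictor autoscaling fields that `InferenceService` already exposes; a minimal sketch, with all values illustrative:

  ```yaml
  # Sketch of existing per-predictor autoscaling fields; values illustrative.
  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: llama-demo
  spec:
    predictor:
      minReplicas: 1
      maxReplicas: 4
      scaleMetric: concurrency   # metric the autoscaler watches
      scaleTarget: 2             # per-replica target for that metric
      model:
        modelFormat:
          name: huggingface
  ```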
- LLM RAG/Agent Pipeline Orchestration
  - Support declarative RAG/agent workflows using the KServe Inference Graph [kserve#3829] (see the sketch after this list).
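  A declarative RAG workflow might look roughly like the sketch below, chaining a retriever and a generator with an `InferenceGraph` Sequence node. The `retriever` and `llm` services are hypothetical placeholders.

  ```yaml
  # Hypothetical two-step RAG pipeline as an InferenceGraph; the "retriever"
  # and "llm" InferenceServices are assumed to exist.
  apiVersion: serving.kserve.io/v1alpha1
  kind: InferenceGraph
  metadata:
    name: rag-pipeline
  spec:
    nodes:
      root:
        routerType: Sequence
        steps:
          - name: retrieve
            serviceName: retriever    # hypothetical retrieval service
          - name: generate
            serviceName: llm          # hypothetical LLM service
            data: $response           # feed the retriever's output to the LLM
  ```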
- Open Inference Protocol extension to GenAI Task APIs
  - Community-maintained Open Inference Protocol repo for the OpenAI schema [https://docs.google.com/document/d/1odTMdIFdm01CbRQ6CpLzUIGVppHSoUvJV_zwcX6GuaU].
  - Support vertical GenAI task APIs such as embedding, Text-to-Image, Text-to-Code, and Doc-to-Text [kserve#3572] (an example request shape follows this list).
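  For the embedding task, one plausible alignment target is the OpenAI embeddings schema referenced above. The request body below (shown as YAML for readability; JSON on the wire) is illustrative, not a settled Open Inference Protocol design, and the model name is a placeholder.

  ```yaml
  # Illustrative OpenAI-style embeddings request body; not a finalized
  # Open Inference Protocol schema. POSTed as JSON to /v1/embeddings.
  model: all-minilm-l6-v2            # hypothetical model name
  input:
    - "KServe is a model inference platform on Kubernetes."
  ```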
- LLM Gateway
  - Support multiple LLM providers.
  - Support token-based rate limiting.
  - Support an LLM router with traffic shaping, fallback, and load balancing.
  - LLM Gateway observability for metrics and cost reporting (a hypothetical policy sketch follows this list).
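  To make the gateway items concrete, here is a purely hypothetical policy sketch; no such resource exists in KServe today, and every kind and field name below is an assumption.

  ```yaml
  # Purely hypothetical gateway policy; illustrates token-based rate limiting
  # and provider fallback, but no such CRD exists in KServe at this time.
  kind: LLMGatewayPolicy             # hypothetical kind
  routes:
    - name: chat
      providers:                     # ordered for fallback
        - inferenceService: llama-demo
        - inferenceService: mistral-demo   # hypothetical fallback target
      loadBalancing: least-request
  rateLimit:
    unit: token                      # count tokens, not requests
    tokensPerMinute: 100000
  ```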
- Promote the `InferenceService` and `ClusterServingRuntime`/`ServingRuntime` CRDs to v1
  - Improve the `InferenceService` CRD for the REST/gRPC protocol interface.
  - Improve the model storage interface.
  - Deprecate the `TrainedModel` CRD and add multiple-model support for co-hosting, draft models, and LoRA adapters to `InferenceService`.
  - Improve the YAML UX for predictor and transformer container collocation (today's collocation YAML is sketched after this list).
  - Close the feature gap between `RawDeployment` and `Serverless` mode.
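  The collocation UX item targets YAML like the sketch below, where the transformer runs as a second container in the predictor pod; the images, args, and port are illustrative placeholders.

  ```yaml
  # Sketch of predictor/transformer collocation today; images and args are
  # illustrative placeholders.
  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: collocated-demo                        # hypothetical name
  spec:
    predictor:
      containers:
        - name: kserve-container                 # the model server
          image: example.com/custom-model:v1     # placeholder image
        - name: transformer-container            # pre/post-processing sidecar
          image: example.com/custom-transformer:v1   # placeholder image
          args:
            - --model_name=collocated-demo
            - --predictor_host=localhost:8080    # predictor in the same pod
  ```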
- Improve the Open Inference Protocol
  - Support batching for the v2 inference protocol.
  - Transformer and Explainer v2 inference protocol interoperability.
  - Improve codecs for the v2 inference protocol (a sample v2 request body is sketched below).
Reference: Control plane issues, Data plane issues, Serving Runtime issues.
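For reference, a v2 (Open Inference Protocol) infer request body looks like the following, shown as YAML for readability; on the wire it is JSON POSTed to `/v2/models/<name>/infer`, and the tensor values here are illustrative.

```yaml
# v2 inference protocol request body (JSON in practice); values illustrative.
id: "42"                # optional request id
inputs:
  - name: input-0
    shape: [1, 4]
    datatype: FP32
    data: [0.1, 0.2, 0.3, 0.4]
```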
- Create standardized model packaging API
- Improve KServe model server observability with metrics and distributed tracing
- Support batch inference
Reference: Python SDK issues, Storage issues.
- Improve the `InferenceGraph` spec for replica and concurrency control
- Support distributed tracing
- Support gRPC for `InferenceGraph`
- Standalone `Transformer` support for `InferenceGraph`
- Support traffic mirroring node
- Improve `RawDeployment` mode for `InferenceGraph` (see the sketch below)
Reference: InferenceGraph issues
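For the `RawDeployment` item, a graph can already opt into raw Kubernetes deployments through the existing deployment-mode annotation, as in this minimal sketch; the node and service names are illustrative.

```yaml
# Sketch: InferenceGraph in RawDeployment mode via the deployment-mode
# annotation; the step's InferenceService is hypothetical.
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: graph-demo
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - name: step-1
          serviceName: model-a     # hypothetical InferenceService
```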
- Document KServe ServiceMesh setup with mTLS
- Support programmatic authentication token
- Implement per-service-level auth
- Add support for SPIFFE/SPIRE identity integration with `InferenceService`
Reference: Auth related issues
- Add ModelMesh docs and explain the use cases for classic KServe and ModelMesh
- Unify the data plane v1 and v2 page formats
- Improve v2 data plane docs to tell the story why and what changed
- Clean up the examples in the kserve repo and unify them with the website's by creating a single source of truth for documentation
- Update any out-of-date documentation and make sure the website as a whole is consistent and cohesive