- Author(s): @markdroth
- Approver: @ejona86, @dfawley
- Status: {Draft, In Review, Ready for Implementation, Implemented}
- Implemented in: <language, ...>
- Last updated: 2024-09-27
- Discussion at: https://groups.google.com/g/grpc-io/c/xawJAoE-pQE
We are adding xDS configuration to control which fields get propagated from ORCA backend metric reports to LRS load reports.
In gRFC A64, we introduced support for propagating backend metrics
from ORCA load reports (see gRFC A51) to LRS load reports.
That design says that we automatically propagate all entries in the
ORCA named_metrics
map to LRS. That approach has been found to be
insufficiently flexible, so we are introducing configuration to explicitly
control which fields are propagated.
In envoyproxy/envoy#35210 and envoyproxy/envoy#35284, a new
lrs_report_endpoint_metrics
field was added in CDS to control the set
of fields to propagate from ORCA to LRS. In addition, new top-level
fields were added to LRS to efficiently support top-level metrics in ORCA.
We will honor all of this configuration in the gRPC client.
Note that this will change the gRPC client default behavior in two ways:
- With this change, if the
lrs_report_endpoint_metrics
field is unset in the CDS resource, nothing will be propagated from ORCA to LRS. To include the same metrics as before, the CDS resource must explicitly addnamed_metrics.*
tolrs_report_endpoint_metrics
. - If keys from the ORCA
named_metrics
map are propagated to LRS by the CDSlrs_report_endpoint_metrics
field, the key names in LRS will be prepended withnamed_metrics.
(e.g., if there is a key in the ORCAnamed_metrics
map calledfoo
, then it will appear in LRS asnamed_metrics.foo
).
The specific changes needed in gRPC implementations are described in the following subsections.
The new lrs_report_endpoint_metrics
field will be validated when
receiving a CDS resource, and the parsed form of this field will be
stored in the parsed form of the CDS resource that is provided to
XdsClient watchers.
To reduce memory usage, implementations may choose a more compact form than the original set of strings. For example, C-core will use the following representation:
struct BackendMetricPropagation {
static constexpr uint8_t kCpuUtilization = 1;
static constexpr uint8_t kMemUtilization = 2;
static constexpr uint8_t kApplicationUtilization = 4;
static constexpr uint8_t kNamedMetricsAll = 8;
// Bit mask of top-level fields to propagate.
uint8_t propagation_bits = 0;
// Keys from named_metrics map to propagate.
// Used only if kNamedMetricsAll is not set in propagation_bits.
absl::flat_hash_set<std::string> named_metric_keys;
};
For forward-compatibility, implementations should ignore any strings that do not match a currently supported ORCA field. Implementations must support the following minimum set of fields:
cpu_utilization
mem_utilization
application_utilization
named_metrics map
The LRS API in XdsClient (or LrsClient, if the implementation has split the LRS APIs into their own interface) currently provides a way to obtain a locality stats handle that is keyed by LRS server, CDS and EDS resource names, and locality. We will add the propagation bits as another dimension of this key.
When the xds_cluster_impl
LB policy reports LRS metrics to the locality
stats handle at the end of an RPC, it will pass the ORCA data (if any).
Because the locality stats handle already has the propagation
configuration, it will know which ORCA fields to copy.
As an example, in C-core, we have the following APIs in LrsClient:
class ClusterLocalityStats {
public:
// ...
// Reports metrics for a finished RPC.
// backend_metrics points to the reported ORCA metrics, or null if none
// were reported.
void AddCallFinished(const BackendMetricData* backend_metrics,
bool fail = false);
};
// Adds locality stats for cluster_name and eds_service_name for the
// specified locality with the specified backend metric propagation.
RefCountedPtr<ClusterLocalityStats> AddClusterLocalityStats(
std::shared_ptr<const XdsBootstrap::XdsServer> lrs_server,
absl::string_view cluster_name, absl::string_view eds_service_name,
RefCountedPtr<XdsLocalityName> locality,
RefCountedPtr<const BackendMetricPropagation> backend_metric_propagation);
When populating the LRS protos, if the propagation configuration says to
include any of the top-level fields (cpu_utilization
,
mem_utilization
, or application_utilization
), then those will be
propagated to the new LRS fields of the same names in the
UpstreamLocalityStats
message.
This option (both reading the new field in the CDS resource
and the new propagation behavior) will be guarded by the
GRPC_EXPERIMENTAL_XDS_ORCA_LRS_PROPAGATION
env var until it passes
interop tests.
Because this feature involves a change to gRPC's default behavior, we will retain this env var for at least a couple of releases as a temporary mechanism to get the old behavior back. The default for the env var will be set to true once it passes interop tests, but users will be able to explicitly set it to false to get the old behavior.
Storing the propagation info inside the locality stats handle from
the XdsClient or LrsClient is more efficient than the perhaps slightly
more obvious solution of handling this in the xds_cluster_impl
policy
itself, since that would require storing the propagation information
both per-subchannel and per-call, which would increase memory usage.
C-core implementation in grpc/grpc#37467.
Will also be implemented in Java and Go.