-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: metrics instrumentation #2131
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach of a centralized place for metrics details 👍
"github.com/TBD54566975/ftl/internal/model" | ||
) | ||
|
||
const name = "ftl.xyz/ftl/runner" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is correct. Many of these metrics can come from the controller if I'm reading this correctly. I also think runner
specific metrics should be handled in backend/runner/
packages probably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple questions from me as well:
- What's xyz?
- We don't really want
/
s in here, do we? - +1 it's weird to have runner logic in the controller directory
Unless this is just a placeholder, in which case feel free to ignore :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dropped and went with other naming feedback (e.g. fsm.call
)
the xyz is a domain that I've seen used in conjunction with ftl. The domain/path was motivated by some of the example otel integration code I learned from while ramping. I personally prefer the naming scheme that you two have suggested here.
metrics = observableMetrics{ | ||
meter: otel.Meter(name), | ||
// TODO: move to a initialization method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something here, but based on how I read the spreadsheet for these, it seems like we would have a meter
called something like ftl.call
then that meter would contain counter/etc. like requests
and latency
. I'm still trying to make the names/columns in the spreadsheet to their counterparts here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change applied w/ the observability.go file decomposition into feature specific files.
@jonathanj-square I wonder if we can make a simplified version of this that won't require Folder structure might be:
Essentially, 1 file for each "signal" in the spreadsheet. Then for each of these files we could define a var meter = otel.GetMeterProvider().Meter("ftl.call")
func RecordCallRequest(ctx context.Context, req *connect.Request[ftlv1.CallRequest]) {
logger := log.FromContext(ctx)
requestCounter, err := meter.Int64Counter(
"ftl.call.requests",
metric.WithDescription("Count of FTL verb calls via the controller"))
if err != nil {
logger.Errorf(err, "Failed to instrument otel metric `ftl.call.requests`")
return
}
requestCounter.Add(ctx, 1, metric.WithAttributes(
attribute.String("ftl.module.name", req.Msg.Verb.Module),
attribute.String("ftl.verb.name", req.Msg.Verb.Name),
))
} This approach might help us easily add new metrics and scan through each "feature" to see what metrics are recorded and what attributes they have. |
|
||
"github.com/alecthomas/types/optional" | ||
|
||
"github.com/TBD54566975/ftl/backend/controller/dal" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll need to fix this import cycle, need to think through how the packages should flow into each other
"github.com/TBD54566975/ftl/internal/model" | ||
) | ||
|
||
const name = "ftl.xyz/ftl/runner" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple questions from me as well:
- What's xyz?
- We don't really want
/
s in here, do we? - +1 it's weird to have runner logic in the controller directory
Unless this is just a placeholder, in which case feel free to ignore :)
CallError optional.Option[error] | ||
} | ||
|
||
func init() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice not having metrics across so many different domains bundled into this same file. This file is going to get pretty dang long over time, and then someone will have to refactor it into separate smaller files anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhhh I totally missed @wesbillman 's sweet comment here when I first wrote this. I love that idea! Lesdoit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, breaking this up now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adopted
func init() { | ||
metrics.calls.requests, _ = metrics.meter.Int64Counter("ftl.call.requests", | ||
metric.WithDescription("number of verb calls"), | ||
metric.WithUnit("{count}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit names all have to come from here so I'm pretty sure we have to leave the counts empty: https://ucum.org/ucum#section-Alphabetic-Index-By-Name
attributes: metricAttributeBuilders{ | ||
moduleName: func(name string) attribute.KeyValue { | ||
return attribute.String("ftl.module.name", name) | ||
}, | ||
featureName: func(name string) attribute.KeyValue { | ||
return attribute.String("ftl.feature.name", name) | ||
}, | ||
destinationVerb: func(name string) attribute.KeyValue { | ||
return attribute.String("ftl.verb.dest", name) | ||
}, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like a lot of this could be simplified/shortened to just const
declarations of the attr names. Since below, it's already a full line, we would save a bit of complexity here and not lose anything below. Ex:
moduleAttr := metrics.attributes.moduleName(fsm.Module)
to
moduleAttr := attribute.String(ATTR_MODULE, fsm.Module)
It's nice having the attributes associated with their metrics, but it's already associated below since we need to actually instantiate all the attrs below anyway. So we don't need to duplicate those associations up here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a difference between the two options worth noting. Attributes have a type associated with them and the second option decouples name and type. Attributes do have a schema to them and the proposal opens up the possibility of accidental type mismatching - I'm not sure where the break would end up but since there is no schema registration up front any breakage would occur at run-time or collection time. The first option is meant to defend against that error class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point. Fortunately, it looks like the typecheck is built into the attr package :) https://pkg.go.dev/go.opentelemetry.io/otel/attribute#String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure that should fail to build but let me confirm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah confirmed:
backend/controller/observability/call.go:36:44: cannot use "msg" (untyped string constant) as int value in argument to attribute.Int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Helper functions do make sense for that. Let's just make sure we're only creating those helper functions once for each attribute. My main concern here was with trying to capture the associations of metrics with attributes in the observableMetrics
struct.
type metricAttributeBuilders struct { | ||
moduleName func(name string) attribute.KeyValue | ||
featureName func(name string) attribute.KeyValue | ||
destinationVerb func(name string) attribute.KeyValue | ||
} | ||
|
||
type callMetrics struct { | ||
meter metric.Meter | ||
requests metric.Int64Counter | ||
failures metric.Int64Counter | ||
active metric.Int64UpDownCounter | ||
latency metric.Int64Histogram | ||
} | ||
|
||
type fsmMetrics struct { | ||
meter metric.Meter | ||
active metric.Int64UpDownCounter | ||
transitions metric.Int64Counter | ||
} | ||
|
||
type observableMetrics struct { | ||
attributes metricAttributeBuilders | ||
calls *callMetrics | ||
fsm *fsmMetrics | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a bit of this is duplicative as well, so we could just get rid of the structs. We need to instantiate the metrics below anyway, so this is just another site we'd need to add a line whenever we need to add a new metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep good call, I'll incorporate that input as I break out the metrics into individual feature based metrics (e.g. observability/calls.go, etc)
Related to @wesbillman 's comment above: here's what I put together real quick locally to play with using attrs for success/failure: https://github.com/TBD54566975/ftl/compare/dli/call-counts?expand=1 It's convenient being able to pass arrays of attributes around for logging the separate variant states of each metric. So it's beyond the scope of what you have now but we'll probably want to incorporate that before merging |
- request count - failure count - latency (regardless of outcome)
b38d307
to
c377c8a
Compare
@@ -57,11 +58,15 @@ func (d *DAL) StartFSMTransition(ctx context.Context, fsm schema.RefKey, executi | |||
} | |||
return fmt.Errorf("failed to start FSM transition: %w", err) | |||
} | |||
|
|||
observability.RecordFsmTransitionBegin(ctx, fsm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: go convention is to fully capitalize all initialisms, so this would be RecordFSMTransitionBegin
. Same elsewhere
callInitOnce.Do(func() { | ||
callRequests, _ = callMeter.Int64Counter("ftl.call.requests", | ||
metric.WithDescription("number of verb calls"), | ||
metric.WithUnit("{count}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see previous comment on units
callRequests, _ = callMeter.Int64Counter("ftl.call.requests", | ||
metric.WithDescription("number of verb calls"), | ||
metric.WithUnit("{count}")) | ||
callFailures, _ = callMeter.Int64Counter("ftl.call.failures", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weren't we going to make this an attr of request instead of a separate metric?
} | ||
|
||
func initCallMetrics() { | ||
callInitOnce.Do(func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you use a sync.Once here instead of just var callRequests = ...
at the package scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- moved away from initialization via
init
- while still desiring the ability to initialize once (e.g. due to concern of instrumentation carrying potentially expensive initialization operations)
- the initialization errors will get logged later (right now they are ignored)
type metricAttributeBuilders struct { | ||
moduleName func(name string) attribute.KeyValue | ||
featureName func(name string) attribute.KeyValue | ||
destinationVerb func(name string) attribute.KeyValue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit excessive, could just do:
func attrModuleName(name string) attribute.KeyValue {...}
func attrFeatureName(name string) attribute.KeyValue {...}
...
Otherwise, it's just more lines to change/maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switched to explicit functions in the other PR
@@ -107,6 +117,18 @@ func Start(ctx context.Context, config Config) error { | |||
go rpc.RetryStreamingClientStream(ctx, backoff.Backoff{}, controllerClient.RegisterRunner, svc.registrationLoop) | |||
go rpc.RetryStreamingClientStream(ctx, backoff.Backoff{}, controllerClient.StreamDeploymentLogs, svc.streamLogsLoop) | |||
|
|||
instanceCounter, err = meter.Int64UpDownCounter("ftl.sys.runner.instance", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make all the metric names {meter}.{counter}
. Rather than hardcode that naming pattern, we can just set var meterName
in each individual otel file, and then all the counter names can be constructed as: fmt.Sprintf("%s.requests", meterName)
(subbing request
for whatever the counter name is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
applied in new branch
metric.WithUnit("{count}")) | ||
|
||
if err != nil { | ||
panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't want to crash the whole runner when otel instrumentation fails. Could you take a look at how the existing code users the logger from ctx?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noted, this meter is also no longer in scope. will apply this feedback in other metric inits
panic(err) | ||
} | ||
|
||
moduleNameAttribute := attribute.String("ftl.module.name", "unknown-module") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use your helper function to guarantee consistency, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
if err := observability.Init(ctx, "ftl-dev", ftl.Version, s.ObservabilityConfig); err != nil { | ||
return fmt.Errorf("failed to initialize observability: %w", err) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no longer necessary since Saf's change merged
var featureName attribute.KeyValue | ||
var moduleName attribute.KeyValue | ||
if len(call.Callers) > 0 { | ||
featureName = metricAttributes.featureName(call.Callers[0].Name) | ||
moduleName = metricAttributes.moduleName(call.Callers[0].Module) | ||
} else { | ||
featureName = metricAttributes.featureName("unknown") | ||
moduleName = metricAttributes.moduleName("unknown") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moduleNameAttr := attrModuleName(lookupModuleName(call.Callers))
or even simpler:
callActive.Add(ctx, 1, metric.WithAttributes(
attrModuleName(lookupModuleName(call.Callers)),
/* rest of the attrs */))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same below
func RecordCallBegin(ctx context.Context, call *CallBegin) { | ||
initCallMetrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm missing the context on why we need to have an init
in these metric funcs. It would be awesome if they were just plain old functions that could define the metric and attributes, without having to initialize the other structures. It might also remove the need for sync.Once
code as well. Maybe we can sync on why the init stuff is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The otel getting started guides pre-initialize their metrics and use the initialized metrics in their instrumentation. The initialization metric process is a black box (for me at least) so my preference is to avoid risking introducing heavy weight operations in instrumentation code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha! I'm cool with that approach too. I might suggest having like a New
function on these and maybe an overall New
that calls the individual "feature" metric files. Then we can just New
it up when we need it vs. having to have these init funcs everywhere.
@jonathanj-square do you still want this branch now that we have the other implementations? |
No description provided.