Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: metrics instrumentation #2131

Closed
wants to merge 5 commits into from
Closed

Conversation

jonathanj-square
Copy link
Contributor

No description provided.

@github-actions github-actions bot changed the title metrics instrumentation feat: metrics instrumentation Jul 22, 2024
@ftl-robot ftl-robot mentioned this pull request Jul 22, 2024
Copy link
Collaborator

@wesbillman wesbillman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach of a centralized place for metrics details 👍

"github.com/TBD54566975/ftl/internal/model"
)

const name = "ftl.xyz/ftl/runner"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is correct. Many of these metrics can come from the controller if I'm reading this correctly. I also think runner specific metrics should be handled in backend/runner/ packages probably.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple questions from me as well:

  • What's xyz?
  • We don't really want /s in here, do we?
  • +1 it's weird to have runner logic in the controller directory

Unless this is just a placeholder, in which case feel free to ignore :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropped and went with other naming feedback (e.g. fsm.call)

the xyz is a domain that I've seen used in conjunction with ftl. The domain/path was motivated by some of the example otel integration code I learned from while ramping. I personally prefer the naming scheme that you two have suggested here.

Comment on lines 48 to 50
metrics = observableMetrics{
meter: otel.Meter(name),
// TODO: move to a initialization method
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something here, but based on how I read the spreadsheet for these, it seems like we would have a meter called something like ftl.call then that meter would contain counter/etc. like requests and latency. I'm still trying to make the names/columns in the spreadsheet to their counterparts here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change applied w/ the observability.go file decomposition into feature specific files.

@wesbillman
Copy link
Collaborator

@jonathanj-square I wonder if we can make a simplified version of this that won't require init(), etc. I was thinking something like this:

Folder structure might be:

- backend/controller
  - observability
    - call.go
    - pubsub.go
    - fsm.go
    - etc....

Essentially, 1 file for each "signal" in the spreadsheet.

Then for each of these files we could define a meter and use it for all the metrics. This is a very basic not complete example based on the metric @deniseli recently added as a test.

var meter = otel.GetMeterProvider().Meter("ftl.call")

func RecordCallRequest(ctx context.Context, req *connect.Request[ftlv1.CallRequest]) {
	logger := log.FromContext(ctx)
	requestCounter, err := meter.Int64Counter(
		"ftl.call.requests",
		metric.WithDescription("Count of FTL verb calls via the controller"))

	if err != nil {
		logger.Errorf(err, "Failed to instrument otel metric `ftl.call.requests`")
		return
	}

	requestCounter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("ftl.module.name", req.Msg.Verb.Module),
		attribute.String("ftl.verb.name", req.Msg.Verb.Name),
	))
}

This approach might help us easily add new metrics and scan through each "feature" to see what metrics are recorded and what attributes they have.

backend/controller/controller.go Show resolved Hide resolved

"github.com/alecthomas/types/optional"

"github.com/TBD54566975/ftl/backend/controller/dal"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll need to fix this import cycle, need to think through how the packages should flow into each other

"github.com/TBD54566975/ftl/internal/model"
)

const name = "ftl.xyz/ftl/runner"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple questions from me as well:

  • What's xyz?
  • We don't really want /s in here, do we?
  • +1 it's weird to have runner logic in the controller directory

Unless this is just a placeholder, in which case feel free to ignore :)

CallError optional.Option[error]
}

func init() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice not having metrics across so many different domains bundled into this same file. This file is going to get pretty dang long over time, and then someone will have to refactor it into separate smaller files anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhhh I totally missed @wesbillman 's sweet comment here when I first wrote this. I love that idea! Lesdoit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, breaking this up now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted

func init() {
metrics.calls.requests, _ = metrics.meter.Int64Counter("ftl.call.requests",
metric.WithDescription("number of verb calls"),
metric.WithUnit("{count}"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit names all have to come from here so I'm pretty sure we have to leave the counts empty: https://ucum.org/ucum#section-Alphabetic-Index-By-Name

Comment on lines 48 to 58
attributes: metricAttributeBuilders{
moduleName: func(name string) attribute.KeyValue {
return attribute.String("ftl.module.name", name)
},
featureName: func(name string) attribute.KeyValue {
return attribute.String("ftl.feature.name", name)
},
destinationVerb: func(name string) attribute.KeyValue {
return attribute.String("ftl.verb.dest", name)
},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like a lot of this could be simplified/shortened to just const declarations of the attr names. Since below, it's already a full line, we would save a bit of complexity here and not lose anything below. Ex:

moduleAttr := metrics.attributes.moduleName(fsm.Module)

to

moduleAttr := attribute.String(ATTR_MODULE, fsm.Module)

It's nice having the attributes associated with their metrics, but it's already associated below since we need to actually instantiate all the attrs below anyway. So we don't need to duplicate those associations up here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a difference between the two options worth noting. Attributes have a type associated with them and the second option decouples name and type. Attributes do have a schema to them and the proposal opens up the possibility of accidental type mismatching - I'm not sure where the break would end up but since there is no schema registration up front any breakage would occur at run-time or collection time. The first option is meant to defend against that error class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point. Fortunately, it looks like the typecheck is built into the attr package :) https://pkg.go.dev/go.opentelemetry.io/otel/attribute#String

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that should fail to build but let me confirm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah confirmed:

backend/controller/observability/call.go:36:44: cannot use "msg" (untyped string constant) as int value in argument to attribute.Int

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't quite capture the schema concern. My concern is more along the lines of...

// call site #1
errorAttr := attribute.String("ftl.error.code", "BAD_REQUEST")

// call site #2
errorAttr := attribute.Int64("ftl.error.code", 400)

This should not yield a compilation failure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Helper functions do make sense for that. Let's just make sure we're only creating those helper functions once for each attribute. My main concern here was with trying to capture the associations of metrics with attributes in the observableMetrics struct.

Comment on lines 19 to 43
type metricAttributeBuilders struct {
moduleName func(name string) attribute.KeyValue
featureName func(name string) attribute.KeyValue
destinationVerb func(name string) attribute.KeyValue
}

type callMetrics struct {
meter metric.Meter
requests metric.Int64Counter
failures metric.Int64Counter
active metric.Int64UpDownCounter
latency metric.Int64Histogram
}

type fsmMetrics struct {
meter metric.Meter
active metric.Int64UpDownCounter
transitions metric.Int64Counter
}

type observableMetrics struct {
attributes metricAttributeBuilders
calls *callMetrics
fsm *fsmMetrics
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a bit of this is duplicative as well, so we could just get rid of the structs. We need to instantiate the metrics below anyway, so this is just another site we'd need to add a line whenever we need to add a new metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep good call, I'll incorporate that input as I break out the metrics into individual feature based metrics (e.g. observability/calls.go, etc)

@deniseli
Copy link
Contributor

Related to @wesbillman 's comment above: here's what I put together real quick locally to play with using attrs for success/failure: https://github.com/TBD54566975/ftl/compare/dli/call-counts?expand=1

It's convenient being able to pass arrays of attributes around for logging the separate variant states of each metric. So it's beyond the scope of what you have now but we'll probably want to incorporate that before merging

@jonathanj-square jonathanj-square force-pushed the jonathanj/otel/e2e_metric branch from b38d307 to c377c8a Compare July 23, 2024 21:56
@@ -57,11 +58,15 @@ func (d *DAL) StartFSMTransition(ctx context.Context, fsm schema.RefKey, executi
}
return fmt.Errorf("failed to start FSM transition: %w", err)
}

observability.RecordFsmTransitionBegin(ctx, fsm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: go convention is to fully capitalize all initialisms, so this would be RecordFSMTransitionBegin. Same elsewhere

callInitOnce.Do(func() {
callRequests, _ = callMeter.Int64Counter("ftl.call.requests",
metric.WithDescription("number of verb calls"),
metric.WithUnit("{count}"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment on units

callRequests, _ = callMeter.Int64Counter("ftl.call.requests",
metric.WithDescription("number of verb calls"),
metric.WithUnit("{count}"))
callFailures, _ = callMeter.Int64Counter("ftl.call.failures",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weren't we going to make this an attr of request instead of a separate metric?

}

func initCallMetrics() {
callInitOnce.Do(func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you use a sync.Once here instead of just var callRequests = ... at the package scope?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. moved away from initialization via init
  2. while still desiring the ability to initialize once (e.g. due to concern of instrumentation carrying potentially expensive initialization operations)
  3. the initialization errors will get logged later (right now they are ignored)

Comment on lines +7 to +11
type metricAttributeBuilders struct {
moduleName func(name string) attribute.KeyValue
featureName func(name string) attribute.KeyValue
destinationVerb func(name string) attribute.KeyValue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit excessive, could just do:

func attrModuleName(name string) attribute.KeyValue {...}
func attrFeatureName(name string) attribute.KeyValue {...}
...

Otherwise, it's just more lines to change/maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched to explicit functions in the other PR

@@ -107,6 +117,18 @@ func Start(ctx context.Context, config Config) error {
go rpc.RetryStreamingClientStream(ctx, backoff.Backoff{}, controllerClient.RegisterRunner, svc.registrationLoop)
go rpc.RetryStreamingClientStream(ctx, backoff.Backoff{}, controllerClient.StreamDeploymentLogs, svc.streamLogsLoop)

instanceCounter, err = meter.Int64UpDownCounter("ftl.sys.runner.instance",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make all the metric names {meter}.{counter}. Rather than hardcode that naming pattern, we can just set var meterName in each individual otel file, and then all the counter names can be constructed as: fmt.Sprintf("%s.requests", meterName) (subbing request for whatever the counter name is)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applied in new branch

metric.WithUnit("{count}"))

if err != nil {
panic(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't want to crash the whole runner when otel instrumentation fails. Could you take a look at how the existing code users the logger from ctx?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted, this meter is also no longer in scope. will apply this feedback in other metric inits

panic(err)
}

moduleNameAttribute := attribute.String("ftl.module.name", "unknown-module")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use your helper function to guarantee consistency, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

Comment on lines +89 to +92
if err := observability.Init(ctx, "ftl-dev", ftl.Version, s.ObservabilityConfig); err != nil {
return fmt.Errorf("failed to initialize observability: %w", err)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer necessary since Saf's change merged

Comment on lines +62 to +70
var featureName attribute.KeyValue
var moduleName attribute.KeyValue
if len(call.Callers) > 0 {
featureName = metricAttributes.featureName(call.Callers[0].Name)
moduleName = metricAttributes.moduleName(call.Callers[0].Module)
} else {
featureName = metricAttributes.featureName("unknown")
moduleName = metricAttributes.moduleName("unknown")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moduleNameAttr := attrModuleName(lookupModuleName(call.Callers))

or even simpler:

callActive.Add(ctx, 1, metric.WithAttributes(
    attrModuleName(lookupModuleName(call.Callers)), 
    /* rest of the attrs */))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same below

Comment on lines +59 to +60
func RecordCallBegin(ctx context.Context, call *CallBegin) {
initCallMetrics()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing the context on why we need to have an init in these metric funcs. It would be awesome if they were just plain old functions that could define the metric and attributes, without having to initialize the other structures. It might also remove the need for sync.Once code as well. Maybe we can sync on why the init stuff is required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The otel getting started guides pre-initialize their metrics and use the initialized metrics in their instrumentation. The initialization metric process is a black box (for me at least) so my preference is to avoid risking introducing heavy weight operations in instrumentation code

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! I'm cool with that approach too. I might suggest having like a New function on these and maybe an overall New that calls the individual "feature" metric files. Then we can just New it up when we need it vs. having to have these init funcs everywhere.

@wesbillman
Copy link
Collaborator

@jonathanj-square do you still want this branch now that we have the other implementations?

@jonathanj-square jonathanj-square deleted the jonathanj/otel/e2e_metric branch August 1, 2024 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants