feat: metrics instrumentation #2131

jonathanj-square · 2024-07-22T21:14:58Z

No description provided.

wesbillman

I like the approach of a centralized place for metrics details 👍

wesbillman · 2024-07-22T22:04:13Z

backend/controller/observability/observability.go

+	"github.com/TBD54566975/ftl/internal/model"
+)
+
+const name = "ftl.xyz/ftl/runner"


I'm not sure this is correct. If I'm reading this correctly, many of these metrics can come from the controller. I also think runner-specific metrics should probably be handled in the backend/runner/ packages.

A couple questions from me as well:

What's xyz?

We don't really want /s in here, do we?

+1 it's weird to have runner logic in the controller directory

Unless this is just a placeholder, in which case feel free to ignore :)

dropped and went with other naming feedback (e.g. fsm.call)

the xyz is a domain that I've seen used in conjunction with ftl. The domain/path was motivated by some of the example otel integration code I learned from while ramping. I personally prefer the naming scheme that you two have suggested here.

wesbillman · 2024-07-22T22:06:59Z

backend/controller/observability/observability.go

+	metrics = observableMetrics{
+		meter: otel.Meter(name),
+		// TODO: move to a initialization method


I might be missing something here, but based on how I read the spreadsheet for these, it seems like we would have a meter called something like ftl.call, and that meter would then contain counters etc. like requests and latency. I'm still trying to map the names/columns in the spreadsheet to their counterparts here.

change applied w/ the observability.go file decomposition into feature specific files.

wesbillman · 2024-07-23T16:47:14Z

@jonathanj-square I wonder if we can make a simplified version of this that won't require init(), etc. I was thinking something like this:

Folder structure might be:

- backend/controller
  - observability
    - call.go
    - pubsub.go
    - fsm.go
    - etc....

Essentially, 1 file for each "signal" in the spreadsheet.

Then for each of these files we could define a meter and use it for all the metrics. This is a very basic, incomplete example based on the metric @deniseli recently added as a test.

var meter = otel.GetMeterProvider().Meter("ftl.call")

func RecordCallRequest(ctx context.Context, req *connect.Request[ftlv1.CallRequest]) {
	logger := log.FromContext(ctx)
	requestCounter, err := meter.Int64Counter(
		"ftl.call.requests",
		metric.WithDescription("Count of FTL verb calls via the controller"))

	if err != nil {
		logger.Errorf(err, "Failed to instrument otel metric `ftl.call.requests`")
		return
	}

	requestCounter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("ftl.module.name", req.Msg.Verb.Module),
		attribute.String("ftl.verb.name", req.Msg.Verb.Name),
	))
}

This approach might help us easily add new metrics and scan through each "feature" to see what metrics are recorded and what attributes they have.

backend/controller/controller.go

deniseli · 2024-07-23T18:52:11Z

backend/controller/observability/observability.go

+
+	"github.com/alecthomas/types/optional"
+
+	"github.com/TBD54566975/ftl/backend/controller/dal"


we'll need to fix this import cycle, need to think through how the packages should flow into each other

deniseli · 2024-07-23T18:56:03Z

backend/controller/observability/observability.go

+	"github.com/TBD54566975/ftl/internal/model"
+)
+
+const name = "ftl.xyz/ftl/runner"


A couple questions from me as well:

What's xyz?

We don't really want /s in here, do we?

+1 it's weird to have runner logic in the controller directory

Unless this is just a placeholder, in which case feel free to ignore :)

deniseli · 2024-07-23T19:06:46Z

backend/controller/observability/observability.go

+	CallError     optional.Option[error]
+}
+
+func init() {


It'd be nice not having metrics across so many different domains bundled into this same file. This file is going to get pretty dang long over time, and then someone will have to refactor it into separate smaller files anyway.

ahhhh I totally missed @wesbillman 's sweet comment here when I first wrote this. I love that idea! Lesdoit

sounds good, breaking this up now

deniseli · 2024-07-23T19:09:39Z

backend/controller/observability/observability.go

+func init() {
+	metrics.calls.requests, _ = metrics.meter.Int64Counter("ftl.call.requests",
+		metric.WithDescription("number of verb calls"),
+		metric.WithUnit("{count}"))


The unit names all have to come from here so I'm pretty sure we have to leave the counts empty: https://ucum.org/ucum#section-Alphabetic-Index-By-Name

deniseli · 2024-07-23T19:17:48Z

backend/controller/observability/observability.go

+		attributes: metricAttributeBuilders{
+			moduleName: func(name string) attribute.KeyValue {
+				return attribute.String("ftl.module.name", name)
+			},
+			featureName: func(name string) attribute.KeyValue {
+				return attribute.String("ftl.feature.name", name)
+			},
+			destinationVerb: func(name string) attribute.KeyValue {
+				return attribute.String("ftl.verb.dest", name)
+			},
+		},


It feels like a lot of this could be simplified/shortened to just const declarations of the attr names. Since below, it's already a full line, we would save a bit of complexity here and not lose anything below. Ex:

moduleAttr := metrics.attributes.moduleName(fsm.Module)

to

moduleAttr := attribute.String(ATTR_MODULE, fsm.Module)

It's nice having the attributes associated with their metrics, but it's already associated below since we need to actually instantiate all the attrs below anyway. So we don't need to duplicate those associations up here

There is a difference between the two options worth noting. Attributes have a type associated with them and the second option decouples name and type. Attributes do have a schema to them and the proposal opens up the possibility of accidental type mismatching - I'm not sure where the break would end up but since there is no schema registration up front any breakage would occur at run-time or collection time. The first option is meant to defend against that error class.

Great point. Fortunately, it looks like the typecheck is built into the attr package :) https://pkg.go.dev/go.opentelemetry.io/otel/attribute#String

I'm pretty sure that should fail to build but let me confirm

yeah confirmed:

backend/controller/observability/call.go:36:44: cannot use "msg" (untyped string constant) as int value in argument to attribute.Int

That doesn't quite capture the schema concern. My concern is more along the lines of...

// call site #1
errorAttr := attribute.String("ftl.error.code", "BAD_REQUEST")

// call site #2
errorAttr := attribute.Int64("ftl.error.code", 400)

This should not yield a compilation failure

Got it. Helper functions do make sense for that. Let's just make sure we're only creating those helper functions once for each attribute. My main concern here was with trying to capture the associations of metrics with attributes in the observableMetrics struct.
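A minimal sketch of the "one helper per attribute" idea discussed above, assuming each attribute key gets exactly one constructor function whose parameter type fixes the value type. `KeyValue` is a local stand-in for `attribute.KeyValue` so the sketch runs with the standard library alone, and `attrErrorCode` is a hypothetical helper for the `ftl.error.code` key used in the example call sites.

```go
package main

import "fmt"

// KeyValue is a stand-in for attribute.KeyValue from
// go.opentelemetry.io/otel/attribute (assumption: used here only so the
// sketch has no external dependencies).
type KeyValue struct {
	Key   string
	Value any
}

// Attribute keys are declared once so call sites cannot misspell them.
const (
	attrKeyModuleName = "ftl.module.name"
	attrKeyErrorCode  = "ftl.error.code" // key taken from the example call sites
)

// attrModuleName is the only constructor for "ftl.module.name".
// Its signature pins the value type to string.
func attrModuleName(name string) KeyValue {
	return KeyValue{Key: attrKeyModuleName, Value: name}
}

// attrErrorCode (hypothetical) is the only constructor for "ftl.error.code".
// Passing an int here is a compile error, so two call sites cannot disagree
// on the value type for the same key.
func attrErrorCode(code string) KeyValue {
	return KeyValue{Key: attrKeyErrorCode, Value: code}
}

func main() {
	fmt.Println(attrModuleName("echo"))
	fmt.Println(attrErrorCode("BAD_REQUEST"))
}
```

With this shape, the name/type pairing lives in one place per attribute, and the association between metrics and attributes stays at the recording call sites.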

deniseli · 2024-07-23T19:19:51Z

backend/controller/observability/observability.go

+type metricAttributeBuilders struct {
+	moduleName      func(name string) attribute.KeyValue
+	featureName     func(name string) attribute.KeyValue
+	destinationVerb func(name string) attribute.KeyValue
+}
+
+type callMetrics struct {
+	meter    metric.Meter
+	requests metric.Int64Counter
+	failures metric.Int64Counter
+	active   metric.Int64UpDownCounter
+	latency  metric.Int64Histogram
+}
+
+type fsmMetrics struct {
+	meter       metric.Meter
+	active      metric.Int64UpDownCounter
+	transitions metric.Int64Counter
+}
+
+type observableMetrics struct {
+	attributes metricAttributeBuilders
+	calls      *callMetrics
+	fsm        *fsmMetrics
+}


Quite a bit of this is duplicative as well, so we could just get rid of the structs. We need to instantiate the metrics below anyway, so this is just another site we'd need to add a line whenever we need to add a new metric.

yep good call, I'll incorporate that input as I break out the metrics into individual feature based metrics (e.g. observability/calls.go, etc)

deniseli · 2024-07-23T19:34:54Z

Related to @wesbillman 's comment above: here's what I put together real quick locally to play with using attrs for success/failure: https://github.com/TBD54566975/ftl/compare/dli/call-counts?expand=1

It's convenient being able to pass arrays of attributes around for logging the separate variant states of each metric. So it's beyond the scope of what you have now but we'll probably want to incorporate that before merging


deniseli · 2024-07-23T22:36:27Z

backend/controller/dal/fsm.go

@@ -57,11 +58,15 @@ func (d *DAL) StartFSMTransition(ctx context.Context, fsm schema.RefKey, executi
 		}
 		return fmt.Errorf("failed to start FSM transition: %w", err)
 	}
+
+	observability.RecordFsmTransitionBegin(ctx, fsm)


nit: Go convention is to fully capitalize all initialisms, so this would be RecordFSMTransitionBegin. Same elsewhere.

deniseli · 2024-07-23T22:37:17Z

backend/controller/observability/calls.go

+	callInitOnce.Do(func() {
+		callRequests, _ = callMeter.Int64Counter("ftl.call.requests",
+			metric.WithDescription("number of verb calls"),
+			metric.WithUnit("{count}"))


see previous comment on units

deniseli · 2024-07-23T22:38:01Z

backend/controller/observability/calls.go

+		callRequests, _ = callMeter.Int64Counter("ftl.call.requests",
+			metric.WithDescription("number of verb calls"),
+			metric.WithUnit("{count}"))
+		callFailures, _ = callMeter.Int64Counter("ftl.call.failures",


Weren't we going to make this an attr of request instead of a separate metric?

deniseli · 2024-07-23T22:40:41Z

backend/controller/observability/calls.go

+}
+
+func initCallMetrics() {
+	callInitOnce.Do(func() {


Why did you use a sync.Once here instead of just var callRequests = ... at the package scope?

moved away from initialization via init

while still desiring the ability to initialize once (e.g. due to concern of instrumentation carrying potentially expensive initialization operations)

the initialization errors will get logged later (right now they are ignored)
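A rough sketch of the trade-off described in this reply, assuming the goal is "initialize once, lazily, and keep the error for later logging". `counter` and `newCounter` are hypothetical stand-ins for an OpenTelemetry instrument and its constructor, and `initCalls` exists only to make the once-only behaviour observable.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// counter is a stand-in for an OpenTelemetry counter instrument (assumption).
type counter struct{ name string }

// newCounter is a hypothetical stand-in for an instrument constructor
// that can fail.
func newCounter(name string) (*counter, error) {
	if name == "" {
		return nil, errors.New("empty metric name")
	}
	return &counter{name: name}, nil
}

var (
	callInitOnce sync.Once
	callRequests *counter
	callInitErr  error
	initCalls    int // demonstration only: how often the init body ran
)

// initCallMetrics runs the init body at most once and returns the stored
// error on every call, so callers can log it instead of ignoring it.
func initCallMetrics() error {
	callInitOnce.Do(func() {
		initCalls++
		callRequests, callInitErr = newCounter("ftl.call.requests")
	})
	return callInitErr
}

func main() {
	for i := 0; i < 3; i++ {
		if err := initCallMetrics(); err != nil {
			fmt.Println("metric init failed:", err)
		}
	}
	fmt.Println("init body ran", initCalls, "time(s)")
}
```

A package-scope `var callRequests = ...` would also run once, but at program start and with no obvious place to surface the error, which is roughly the difference being weighed here.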

deniseli · 2024-07-23T22:43:08Z

backend/controller/observability/observability.go

+type metricAttributeBuilders struct {
+	moduleName      func(name string) attribute.KeyValue
+	featureName     func(name string) attribute.KeyValue
+	destinationVerb func(name string) attribute.KeyValue
+}


This is a bit excessive, could just do:

func attrModuleName(name string) attribute.KeyValue {...}
func attrFeatureName(name string) attribute.KeyValue {...}
...

Otherwise, it's just more lines to change/maintain.

switched to explicit functions in the other PR

deniseli · 2024-07-23T22:47:34Z

backend/runner/runner.go

@@ -107,6 +117,18 @@ func Start(ctx context.Context, config Config) error {
 	go rpc.RetryStreamingClientStream(ctx, backoff.Backoff{}, controllerClient.RegisterRunner, svc.registrationLoop)
 	go rpc.RetryStreamingClientStream(ctx, backoff.Backoff{}, controllerClient.StreamDeploymentLogs, svc.streamLogsLoop)

+	instanceCounter, err = meter.Int64UpDownCounter("ftl.sys.runner.instance",


Let's make all the metric names {meter}.{counter}. Rather than hardcode that naming pattern, we can just set var meterName in each individual otel file, and then all the counter names can be constructed as: fmt.Sprintf("%s.requests", meterName) (subbing request for whatever the counter name is)

applied in new branch
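The `{meter}.{counter}` naming suggested above could look roughly like this. It is only a sketch: `metricName` is a hypothetical helper, and `meterName` is shown as a constant although the comment proposes a `var`.

```go
package main

import "fmt"

// meterName would be declared once per feature file, e.g. in calls.go.
const meterName = "ftl.call"

// metricName (hypothetical helper) builds "{meter}.{counter}" so that the
// naming pattern is written down in exactly one place.
func metricName(counter string) string {
	return fmt.Sprintf("%s.%s", meterName, counter)
}

func main() {
	fmt.Println(metricName("requests")) // prints "ftl.call.requests"
	fmt.Println(metricName("latency"))  // prints "ftl.call.latency"
}
```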

deniseli · 2024-07-23T22:48:14Z

backend/runner/runner.go

+		metric.WithUnit("{count}"))
+
+	if err != nil {
+		panic(err)


We probably don't want to crash the whole runner when otel instrumentation fails. Could you take a look at how the existing code uses the logger from ctx?

noted, this meter is also no longer in scope. will apply this feedback in other metric inits
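One way to read this feedback, sketched under assumptions: log the instrumentation failure and let recording degrade to a no-op instead of panicking. The standard `log` package stands in for the context logger mentioned above, and `upDownCounter`, `newUpDownCounter`, and `newInstanceCounter` are hypothetical stand-ins rather than the OpenTelemetry API.

```go
package main

import (
	"errors"
	"log"
)

// upDownCounter is a stand-in for an OpenTelemetry up/down counter (assumption).
type upDownCounter struct {
	name  string
	value int64
}

// newUpDownCounter is a hypothetical constructor that can fail.
func newUpDownCounter(name string) (*upDownCounter, error) {
	if name == "" {
		return nil, errors.New("empty metric name")
	}
	return &upDownCounter{name: name}, nil
}

// Add is nil-safe: if instrumentation failed earlier, recording is a no-op.
func (c *upDownCounter) Add(delta int64) {
	if c == nil {
		return
	}
	c.value += delta
}

// newInstanceCounter logs a failure instead of panicking and returns nil,
// so the caller keeps running without the metric.
func newInstanceCounter(name string) *upDownCounter {
	c, err := newUpDownCounter(name)
	if err != nil {
		log.Printf("failed to instrument otel metric %q: %v", name, err)
		return nil
	}
	return c
}

func main() {
	good := newInstanceCounter("ftl.sys.runner.instance")
	good.Add(1)

	broken := newInstanceCounter("") // logs the error, does not panic
	broken.Add(1)                    // safe no-op on a nil counter

	log.Printf("runner still running, instance count=%d", good.value)
}
```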

deniseli · 2024-07-23T22:48:26Z

backend/runner/runner.go

+		panic(err)
+	}
+
+	moduleNameAttribute := attribute.String("ftl.module.name", "unknown-module")


This should use your helper function to guarantee consistency, right?

deniseli · 2024-07-23T22:49:03Z

cmd/ftl/cmd_serve.go

+	if err := observability.Init(ctx, "ftl-dev", ftl.Version, s.ObservabilityConfig); err != nil {
+		return fmt.Errorf("failed to initialize observability: %w", err)
+	}
+


no longer necessary since Saf's change merged

deniseli · 2024-07-23T22:52:15Z

backend/controller/observability/calls.go

+	var featureName attribute.KeyValue
+	var moduleName attribute.KeyValue
+	if len(call.Callers) > 0 {
+		featureName = metricAttributes.featureName(call.Callers[0].Name)
+		moduleName = metricAttributes.moduleName(call.Callers[0].Module)
+	} else {
+		featureName = metricAttributes.featureName("unknown")
+		moduleName = metricAttributes.moduleName("unknown")
+	}


moduleNameAttr := attrModuleName(lookupModuleName(call.Callers))

or even simpler:

callActive.Add(ctx, 1, metric.WithAttributes(
	attrModuleName(lookupModuleName(call.Callers)),
	/* rest of the attrs */
))
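The `lookupModuleName` helper is not shown in the thread; a possible shape, assuming it folds the "no callers" branch from the diff into a single function, is sketched below. `Ref` is a hypothetical stand-in for whatever type `call.Callers` holds.

```go
package main

import "fmt"

// Ref is a hypothetical stand-in for the element type of call.Callers.
type Ref struct {
	Module string
	Name   string
}

// lookupModuleName returns the module of the first caller, or "unknown"
// when there are no callers, mirroring the if/else in the diff above.
func lookupModuleName(callers []Ref) string {
	if len(callers) == 0 {
		return "unknown"
	}
	return callers[0].Module
}

func main() {
	fmt.Println(lookupModuleName(nil))
	fmt.Println(lookupModuleName([]Ref{{Module: "echo", Name: "echo"}}))
}
```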

wesbillman · 2024-07-23T23:06:35Z

backend/controller/observability/calls.go

+func RecordCallBegin(ctx context.Context, call *CallBegin) {
+	initCallMetrics()


I think I'm missing the context on why we need to have an init in these metric funcs. It would be awesome if they were just plain old functions that could define the metric and attributes, without having to initialize the other structures. It might also remove the need for sync.Once code as well. Maybe we can sync on why the init stuff is required.

The otel getting started guides pre-initialize their metrics and use the initialized metrics in their instrumentation. The metric initialization process is a black box (for me at least), so my preference is to avoid the risk of introducing heavyweight operations into instrumentation code.

Gotcha! I'm cool with that approach too. I might suggest having like a New function on these and maybe an overall New that calls the individual "feature" metric files. Then we can just New it up when we need it vs. having to have these init funcs everywhere.
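A sketch of the suggested `New` pattern, assuming one constructor per feature file plus an overall `New` that calls them. All type and function names are hypothetical, `counter` stands in for an OpenTelemetry instrument, and `ftl.fsm.transitions` is an assumed metric name.

```go
package main

import (
	"errors"
	"fmt"
)

// counter and newCounter are stand-ins for an OpenTelemetry instrument and
// its constructor (assumption), so the sketch needs only the standard library.
type counter struct{ name string }

func newCounter(name string) (*counter, error) {
	if name == "" {
		return nil, errors.New("empty metric name")
	}
	return &counter{name: name}, nil
}

// CallMetrics bundles the instruments of the "call" feature.
type CallMetrics struct{ requests *counter }

func NewCallMetrics() (*CallMetrics, error) {
	requests, err := newCounter("ftl.call.requests")
	if err != nil {
		return nil, fmt.Errorf("call metrics: %w", err)
	}
	return &CallMetrics{requests: requests}, nil
}

// FSMMetrics bundles the instruments of the "fsm" feature.
type FSMMetrics struct{ transitions *counter }

func NewFSMMetrics() (*FSMMetrics, error) {
	transitions, err := newCounter("ftl.fsm.transitions") // assumed name
	if err != nil {
		return nil, fmt.Errorf("fsm metrics: %w", err)
	}
	return &FSMMetrics{transitions: transitions}, nil
}

// Metrics is the overall bundle created once by whoever needs it.
type Metrics struct {
	Calls *CallMetrics
	FSM   *FSMMetrics
}

// New calls each feature constructor and reports all failures together.
func New() (*Metrics, error) {
	calls, callErr := NewCallMetrics()
	fsm, fsmErr := NewFSMMetrics()
	if err := errors.Join(callErr, fsmErr); err != nil {
		return nil, err
	}
	return &Metrics{Calls: calls, FSM: fsm}, nil
}

func main() {
	m, err := New()
	if err != nil {
		fmt.Println("metrics unavailable:", err)
		return
	}
	fmt.Println(m.Calls.requests.name, m.FSM.transitions.name)
}
```

Initialization still happens once and up front, but the recording functions no longer need to call an init helper or guard it with `sync.Once`.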

wesbillman · 2024-07-30T16:06:08Z

@jonathanj-square do you still want this branch now that we have the other implementations?

github-actions bot changed the title ~~metrics instrumentation~~ feat: metrics instrumentation Jul 22, 2024

ftl-robot mentioned this pull request Jul 22, 2024

Dashboard #728

Open

wesbillman reviewed Jul 22, 2024


deniseli reviewed Jul 23, 2024


jonathanj-square added 5 commits July 23, 2024 14:55

instrumenting the ftl.sys.runner.instance counter.

d3b5d0d

instrumenting call metrics with

- request count
- failure count
- latency (regardless of outcome)

f676b7a

call and fsm metric instrumentation (still WIP)

80a5330

using individual meters for each feature

42f9937

break out metrics into feature specific bundles

c377c8a

jonathanj-square force-pushed the jonathanj/otel/e2e_metric branch from b38d307 to c377c8a Compare July 23, 2024 21:56

deniseli reviewed Jul 23, 2024


wesbillman reviewed Jul 23, 2024


jonathanj-square closed this Aug 1, 2024

jonathanj-square deleted the jonathanj/otel/e2e_metric branch August 1, 2024 18:27

