Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: metrics instrumentation #2131

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 15 additions & 9 deletions backend/controller/controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
"encoding/binary"
"errors"
"fmt"
"github.com/TBD54566975/ftl/backend/controller/observability"
"hash"
"io"
"math/rand"
Expand Down Expand Up @@ -1020,22 +1021,27 @@
ctx = rpc.WithVerbs(ctx, append(callers, verbRef))
headers.AddCaller(req.Header(), schema.RefFromProto(req.Msg.Verb))

observability.RecordCallBegin(ctx, &observability.CallBegin{
DestVerb: verbRef,
Callers: callers,
})

response, err := client.verb.Call(ctx, req)
var resp *connect.Response[ftlv1.CallResponse]
var maybeResponse optional.Option[*ftlv1.CallResponse]
if err == nil {
resp = connect.NewResponse(response.Msg)
maybeResponse = optional.Some(resp.Msg)
}
s.recordCall(ctx, &Call{
deniseli marked this conversation as resolved.
Show resolved Hide resolved
deploymentKey: route.Deployment,
requestKey: requestKey,
startTime: start,
destVerb: verbRef,
callers: callers,
callError: optional.Nil(err),
request: req.Msg,
response: maybeResponse,
observability.RecordCallEnd(ctx, s.dal, &observability.CallEnd{
DeploymentKey: route.Deployment,
RequestKey: requestKey,
StartTime: start,
DestVerb: verbRef,
Callers: callers,
CallError: optional.Nil(err),
Request: req.Msg,
Response: maybeResponse,
})
return resp, err
}
Expand Down Expand Up @@ -1367,7 +1373,7 @@
default:
panic(fmt.Errorf("unsupported async call origin: %v", call.Origin))
}
})

Check failure on line 1376 in backend/controller/controller.go

View workflow job for this annotation

GitHub Actions / Lint

missing return (typecheck)
if err != nil {
return 0, fmt.Errorf("failed to complete async call: %w", err)
}
Expand Down
5 changes: 5 additions & 0 deletions backend/controller/dal/fsm.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ import (
"encoding/json"
"errors"
"fmt"
"github.com/TBD54566975/ftl/backend/controller/observability"
"time"

"github.com/alecthomas/types/optional"
Expand Down Expand Up @@ -57,11 +58,15 @@ func (d *DAL) StartFSMTransition(ctx context.Context, fsm schema.RefKey, executi
}
return fmt.Errorf("failed to start FSM transition: %w", err)
}

observability.RecordFsmTransitionBegin(ctx, fsm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: go convention is to fully capitalize all initialisms, so this would be RecordFSMTransitionBegin. Same elsewhere


return nil
}

func (d *DAL) FinishFSMTransition(ctx context.Context, fsm schema.RefKey, instanceKey string) error {
_, err := d.db.FinishFSMTransition(ctx, fsm, instanceKey)
observability.RecordFsmTransitionSuccess(ctx, fsm)
return dalerrs.TranslatePGError(err)
}

Expand Down
63 changes: 0 additions & 63 deletions backend/controller/observability.go

This file was deleted.

133 changes: 133 additions & 0 deletions backend/controller/observability/calls.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,133 @@
package observability

import (
"context"
"github.com/TBD54566975/ftl/backend/controller/dal"

Check failure on line 5 in backend/controller/observability/calls.go

View workflow job for this annotation

GitHub Actions / Lint

could not import github.com/TBD54566975/ftl/backend/controller/dal (-: import cycle not allowed: import stack: [github.com/TBD54566975/ftl/backend/controller github.com/TBD54566975/ftl/backend/controller/cronjobs github.com/TBD54566975/ftl/backend/controller/dal github.com/TBD54566975/ftl/backend/controller/observability github.com/TBD54566975/ftl/backend/controller/dal]) (typecheck)
ftlv1 "github.com/TBD54566975/ftl/backend/protos/xyz/block/ftl/v1"
"github.com/TBD54566975/ftl/backend/schema"
"github.com/TBD54566975/ftl/internal/log"
"github.com/TBD54566975/ftl/internal/model"
"github.com/alecthomas/types/optional"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"
"sync"
"time"
)

var callMeter = otel.Meter("ftl.call")
var callInitOnce = sync.Once{}

var callRequests metric.Int64Counter
var callFailures metric.Int64Counter
var callActive metric.Int64UpDownCounter
var callLatency metric.Int64Histogram

type CallBegin struct {
DestVerb *schema.Ref
Callers []*schema.Ref
}

type CallEnd struct {
DeploymentKey model.DeploymentKey
RequestKey model.RequestKey
StartTime time.Time
DestVerb *schema.Ref
Callers []*schema.Ref
Request *ftlv1.CallRequest
Response optional.Option[*ftlv1.CallResponse]
CallError optional.Option[error]
}

func initCallMetrics() {
callInitOnce.Do(func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you use a sync.Once here instead of just var callRequests = ... at the package scope?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. moved away from initialization via init
  2. while still desiring the ability to initialize once (e.g. due to concern of instrumentation carrying potentially expensive initialization operations)
  3. the initialization errors will get logged later (right now they are ignored)

callRequests, _ = callMeter.Int64Counter("ftl.call.requests",
metric.WithDescription("number of verb calls"),
metric.WithUnit("{count}"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment on units

callFailures, _ = callMeter.Int64Counter("ftl.call.failures",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weren't we going to make this an attr of request instead of a separate metric?

metric.WithDescription("number of verb call failures"),
metric.WithUnit("{count}"))
callActive, _ = callMeter.Int64UpDownCounter("ftl.call.active",
metric.WithDescription("number of in flight calls"),
metric.WithUnit("{count}"))
callLatency, _ = callMeter.Int64Histogram("ftl.call.latency",
metric.WithDescription("verb call latency"),
metric.WithUnit("{ms}"))
})
}

func RecordCallBegin(ctx context.Context, call *CallBegin) {
initCallMetrics()
Comment on lines +59 to +60
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing the context on why we need to have an init in these metric funcs. It would be awesome if they were just plain old functions that could define the metric and attributes, without having to initialize the other structures. It might also remove the need for sync.Once code as well. Maybe we can sync on why the init stuff is required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The otel getting started guides pre-initialize their metrics and use the initialized metrics in their instrumentation. The initialization metric process is a black box (for me at least) so my preference is to avoid risking introducing heavy weight operations in instrumentation code

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! I'm cool with that approach too. I might suggest having like a New function on these and maybe an overall New that calls the individual "feature" metric files. Then we can just New it up when we need it vs. having to have these init funcs everywhere.


var featureName attribute.KeyValue
var moduleName attribute.KeyValue
if len(call.Callers) > 0 {
featureName = metricAttributes.featureName(call.Callers[0].Name)
moduleName = metricAttributes.moduleName(call.Callers[0].Module)
} else {
featureName = metricAttributes.featureName("unknown")
moduleName = metricAttributes.moduleName("unknown")
}
Comment on lines +62 to +70
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moduleNameAttr := attrModuleName(lookupModuleName(call.Callers))

or even simpler:

callActive.Add(ctx, 1, metric.WithAttributes(
    attrModuleName(lookupModuleName(call.Callers)), 
    /* rest of the attrs */))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same below


destinationVerb := metricAttributes.destinationVerb(call.DestVerb.Name)

callActive.Add(ctx, 1, metric.WithAttributes(moduleName, featureName, destinationVerb))
}

func RecordCallEnd(ctx context.Context, d *dal.DAL, call *CallEnd) {
initCallMetrics()

logger := log.FromContext(ctx)
var sourceVerb optional.Option[schema.Ref]
var featureName attribute.KeyValue
var moduleName attribute.KeyValue

// TODO avoid having to find the source (pass it in `CallEnd` and `CallStart` instead)
if len(call.Callers) > 0 {
sourceVerb = optional.Some(*call.Callers[0])
featureName = metricAttributes.featureName(call.Callers[0].Name)
moduleName = metricAttributes.moduleName(call.Callers[0].Module)
} else {
featureName = metricAttributes.featureName("unknown")
moduleName = metricAttributes.moduleName("unknown")
}

destinationVerb := metricAttributes.destinationVerb(call.DestVerb.Name)

callRequests.Add(ctx, 1, metric.WithAttributes(moduleName, featureName, destinationVerb))
callActive.Add(ctx, -1, metric.WithAttributes(moduleName, featureName, destinationVerb))

var errorStr optional.Option[string]
var stack optional.Option[string]
var responseBody []byte

if callError, ok := call.CallError.Get(); ok {
errorStr = optional.Some(callError.Error())
callFailures.Add(ctx, 1, metric.WithAttributes(moduleName, featureName, destinationVerb))
} else if response, ok := call.Response.Get(); ok {
responseBody = response.GetBody()
if callError := response.GetError(); callError != nil {
errorStr = optional.Some(callError.Message)
stack = optional.Ptr(callError.Stack)
callFailures.Add(ctx, 1, metric.WithAttributes(moduleName, featureName, destinationVerb))
}
}

callLatency.Record(ctx, time.Now().Sub(call.StartTime).Milliseconds(), metric.WithAttributes(moduleName, featureName, destinationVerb))

err := d.InsertCallEvent(ctx, &dal.CallEvent{
Time: call.StartTime,
DeploymentKey: call.DeploymentKey,
RequestKey: optional.Some(call.RequestKey),
Duration: time.Since(call.StartTime),
SourceVerb: sourceVerb,
DestVerb: *call.DestVerb,
Request: call.Request.GetBody(),
Response: responseBody,
Error: errorStr,
Stack: stack,
})
if err != nil {
logger.Errorf(err, "failed to record call")
}
}
46 changes: 46 additions & 0 deletions backend/controller/observability/fsm.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
package observability

import (
"context"
"github.com/TBD54566975/ftl/backend/schema"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/metric"
"sync"
)

var fsmMeter = otel.Meter("ftl.fsm")
var fsmInitOnce = sync.Once{}

var fsmActive metric.Int64UpDownCounter
var fsmTransitions metric.Int64Counter

func initFsmMetrics() {
fsmInitOnce.Do(func() {
fsmActive, _ = fsmMeter.Int64UpDownCounter("ftl.fsm.active",
metric.WithDescription("number of in flight fsm transitions"),
metric.WithUnit("{count}"))

fsmTransitions, _ = fsmMeter.Int64Counter("ftl.fsm.transitions",
metric.WithDescription("number of attempted transitions"),
metric.WithUnit("{count}"))
})
}

func RecordFsmTransitionBegin(ctx context.Context, fsm schema.RefKey) {
initFsmMetrics()

moduleAttr := metricAttributes.moduleName(fsm.Module)
featureAttr := metricAttributes.featureName(fsm.Name)

fsmTransitions.Add(ctx, 1, metric.WithAttributes(moduleAttr, featureAttr))
fsmActive.Add(ctx, 1, metric.WithAttributes(moduleAttr, featureAttr))
}

func RecordFsmTransitionSuccess(ctx context.Context, fsm schema.RefKey) {
initFsmMetrics()

moduleAttr := metricAttributes.moduleName(fsm.Module)
featureAttr := metricAttributes.featureName(fsm.Name)

fsmActive.Add(ctx, -1, metric.WithAttributes(moduleAttr, featureAttr))
}
23 changes: 23 additions & 0 deletions backend/controller/observability/observability.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
package observability

import (
"go.opentelemetry.io/otel/attribute"
)

type metricAttributeBuilders struct {
moduleName func(name string) attribute.KeyValue
featureName func(name string) attribute.KeyValue
destinationVerb func(name string) attribute.KeyValue
}
Comment on lines +7 to +11
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit excessive, could just do:

func attrModuleName(name string) attribute.KeyValue {...}
func attrFeatureName(name string) attribute.KeyValue {...}
...

Otherwise, it's just more lines to change/maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched to explicit functions in the other PR


var metricAttributes = metricAttributeBuilders{
moduleName: func(name string) attribute.KeyValue {
return attribute.String("ftl.module.name", name)
},
featureName: func(name string) attribute.KeyValue {
return attribute.String("ftl.feature.name", name)
},
destinationVerb: func(name string) attribute.KeyValue {
return attribute.String("ftl.verb.dest", name)
},
}
22 changes: 22 additions & 0 deletions backend/runner/runner.go
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,9 @@ import (
"context"
"errors"
"fmt"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"
"math/rand"
"net/http"
"net/http/httputil"
Expand Down Expand Up @@ -39,6 +42,13 @@ import (
"github.com/TBD54566975/ftl/internal/unstoppable"
)

const name = "ftl.xyz/ftl/go/runner"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's go back to the naming scheme in your google sheet - that's way nicer :)


var (
meter = otel.Meter(name)
instanceCounter metric.Int64UpDownCounter
)
Comment on lines +47 to +50
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pull this out into a separate file like you did for the controller? It would be best for all the otel stuff to live in dedicated files. That makes it more obvious to a reader what the pattern is, so when we add new features in the future, it's easy to copy the instrumentation pattern


type Config struct {
Config []string `name:"config" short:"C" help:"Paths to FTL project configuration files." env:"FTL_CONFIG" placeholder:"FILE[,FILE,...]" type:"existingfile"`
Bind *url.URL `help:"Endpoint the Runner should bind to and advertise." default:"http://localhost:8893" env:"FTL_RUNNER_BIND"`
Expand Down Expand Up @@ -107,6 +117,18 @@ func Start(ctx context.Context, config Config) error {
go rpc.RetryStreamingClientStream(ctx, backoff.Backoff{}, controllerClient.RegisterRunner, svc.registrationLoop)
go rpc.RetryStreamingClientStream(ctx, backoff.Backoff{}, controllerClient.StreamDeploymentLogs, svc.streamLogsLoop)

instanceCounter, err = meter.Int64UpDownCounter("ftl.sys.runner.instance",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make all the metric names {meter}.{counter}. Rather than hardcode that naming pattern, we can just set var meterName in each individual otel file, and then all the counter names can be constructed as: fmt.Sprintf("%s.requests", meterName) (subbing request for whatever the counter name is)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applied in new branch

metric.WithDescription("number of runner instances"),
metric.WithUnit("{count}"))

if err != nil {
panic(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't want to crash the whole runner when otel instrumentation fails. Could you take a look at how the existing code users the logger from ctx?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted, this meter is also no longer in scope. will apply this feedback in other metric inits

}

moduleNameAttribute := attribute.String("ftl.module.name", "unknown-module")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use your helper function to guarantee consistency, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep


instanceCounter.Add(ctx, 1, metric.WithAttributes(moduleNameAttribute))

return rpc.Serve(ctx, config.Bind,
rpc.GRPC(ftlv1connect.NewVerbServiceHandler, svc),
rpc.GRPC(ftlv1connect.NewRunnerServiceHandler, svc),
Expand Down
Loading
Loading