
Add AuthN/AuthZ metrics #59557

Open · wants to merge 6 commits into main

Conversation

@MackinnonBuck (Member) commented Dec 18, 2024


Adds ASP.NET Core authentication and authorization metrics.

Description

This PR adds the following metrics:

  • Authentication:
    • Authenticated request duration
    • Challenge count
    • Forbid count
    • Sign in count
    • Sign out count
  • Authorization:
    • Count of requests requiring authorization

Ready to be reviewed, but the counter names, descriptions, and tags need to go through API review before this merges; a rough sketch of what the instruments might look like follows below.

Fixes #47603
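
As a rough sketch of what these instruments might look like (hypothetical: the instrument names, units, and descriptions below are placeholders pending the API review mentioned above, not the final shipped names):

using System.Diagnostics.Metrics;

// Sketch only: names/units/descriptions are placeholders pending API review.
internal sealed class AuthenticationMetrics
{
    public const string MeterName = "Microsoft.AspNetCore.Authentication";

    private readonly Histogram<double> _authenticatedRequestDuration;
    private readonly Counter<long> _challengeCount;
    private readonly Counter<long> _forbidCount;
    private readonly Counter<long> _signInCount;
    private readonly Counter<long> _signOutCount;

    public AuthenticationMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create(MeterName);

        _authenticatedRequestDuration = meter.CreateHistogram<double>(
            "aspnetcore.authentication.request.duration", unit: "s",
            description: "Duration of authenticating the request.");
        _challengeCount = meter.CreateCounter<long>(
            "aspnetcore.authentication.challenges",
            description: "Number of times a scheme was challenged.");
        _forbidCount = meter.CreateCounter<long>(
            "aspnetcore.authentication.forbids",
            description: "Number of times access was forbidden.");
        _signInCount = meter.CreateCounter<long>(
            "aspnetcore.authentication.sign_ins",
            description: "Number of sign-in operations.");
        _signOutCount = meter.CreateCounter<long>(
            "aspnetcore.authentication.sign_outs",
            description: "Number of sign-out operations.");
    }
}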

@Copilot (Copilot AI, Contributor) left a comment

Copilot reviewed 5 out of 16 changed files in this pull request and generated 1 comment.

Files not reviewed (11)
  • src/Http/Authentication.Core/src/Microsoft.AspNetCore.Authentication.Core.csproj: Language not supported
  • src/Http/Authentication.Core/src/PublicAPI.Unshipped.txt: Language not supported
  • src/Security/Authentication/test/Microsoft.AspNetCore.Authentication.Test.csproj: Language not supported
  • src/Security/Authorization/Core/src/Microsoft.AspNetCore.Authorization.csproj: Language not supported
  • src/Security/Authorization/Core/src/PublicAPI/net10.0/PublicAPI.Unshipped.txt: Language not supported
  • src/Security/Authorization/Core/src/PublicAPI/net462/PublicAPI.Unshipped.txt: Language not supported
  • src/Security/Authorization/Core/src/PublicAPI/netstandard2.0/PublicAPI.Unshipped.txt: Language not supported
  • src/Security/Authorization/test/Microsoft.AspNetCore.Authorization.Test.csproj: Language not supported
  • src/Http/Authentication.Core/src/AuthenticationCoreServiceCollectionExtensions.cs: Evaluated as low risk
  • src/Http/Authentication.Core/src/AuthenticationMetrics.cs: Evaluated as low risk
  • src/Http/Authentication.Core/src/AuthenticationService.cs: Evaluated as low risk

@dotnet-policy-service (bot) added the pending-ci-rerun label Dec 27, 2024
@davidfowl (Member) commented:

@JamesNK @lmolkova

@MackinnonBuck requested a review from JamesNK January 3, 2025 23:10
@JamesNK (Member) commented Jan 4, 2025

Looking at this PR from a higher level, what are the main scenarios that people will use these? A bad outcome would be us going through the motions of adding metrics so we can say they're there. We should think about scenarios and make the metrics as useful as possible in real world scenarios.

For example, will people use authn/authz metrics to:

  • View instrument counts to verify that authn/authz is running correctly
  • See how long auth takes
  • Debug exceptions
  • All of the above?

I'm not an auth expert, so I'd like to hear from the folks who work with auth most often.

Debugging auth that isn't working, including exceptions, seems like the most valuable scenario. It would be useful to collect metrics on an app that runs into some of the most common errors that people have with auth and see what can be done to make the output useful for them.

For example, in Kestrel from a developer's perspective it's hard to know why a connection was closed. The server knows why, so we have a set of known reasons for closing a connection. When Kestrel records that a connection is closed non-gracefully it includes that known reason as the error.type tag. See the error.type details at https://opentelemetry.io/docs/specs/semconv/dotnet/dotnet-kestrel-metrics/#metric-kestrelconnectionduration

Applying that idea to authn/authz: common reasons for errors could be included as the error.type in some of these metrics:

  • Policy name not found
  • Scheme not found
  • Scheme not specified
  • etc

Surfacing known error reasons from auth handlers might also be useful. For example, would it be valuable for the cookie auth handler to add metadata to a metric indicating that it wasn't able to authenticate the HTTP request because it couldn't decrypt the cookie payload?
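
A minimal sketch of that idea, with all names illustrative rather than proposed API: the recording site attaches a known, low-cardinality reason as the error.type tag when one is available:

using System.Diagnostics;
using System.Diagnostics.Metrics;

internal static class ErrorTypeSketch
{
    // Hypothetical helper: failureReason is a well-known value surfaced by
    // the framework or handler (e.g. "scheme_not_found", "policy_not_found",
    // "cookie_unprotect_failed"), keeping the tag low-cardinality rather
    // than an arbitrary exception message.
    public static void RecordAuthenticate(
        Counter<long> authenticateCount, string scheme, string? failureReason)
    {
        var tags = new TagList { { "aspnetcore.authentication.scheme", scheme } };
        if (failureReason is not null)
        {
            tags.Add("error.type", failureReason);
        }
        authenticateCount.Add(1, tags);
    }
}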

@halter73 (Member) commented Jan 6, 2025

> Also, do people care how long authenticating takes? Should this be a histogram rather than a counter? A histogram can be used for both measuring count and duration. This question applies to a lot of the metrics here because they're in async methods.

This is good feedback for AuthenticateAsync in particular. Normally it should be very fast because you're usually doing something like decrypting cookies with already-downloaded data protection keys or validating JWTs with already-downloaded JWKs, but this can be slowed down by things like the initial downloading of keys or user-defined callbacks.
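
As a sketch of the histogram approach (the wrapper below is hypothetical, not actual PR code): timing AuthenticateAsync and recording elapsed seconds gives both a count and a duration distribution from a single instrument:

using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Authentication;
using Microsoft.AspNetCore.Http;

internal static class HistogramSketch
{
    // Hypothetical wrapper; authenticateDuration is assumed to be a
    // Histogram<double> created elsewhere with unit "s".
    public static async Task<AuthenticateResult> AuthenticateAsync(
        IAuthenticationService inner, HttpContext context, string? scheme,
        Histogram<double> authenticateDuration)
    {
        var startTimestamp = Stopwatch.GetTimestamp();
        try
        {
            return await inner.AuthenticateAsync(context, scheme);
        }
        finally
        {
            // The histogram's sample count doubles as an authenticate count,
            // so a separate counter isn't needed.
            authenticateDuration.Record(Stopwatch.GetElapsedTime(startTimestamp).TotalSeconds);
        }
    }
}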

I don't think histograms are as valuable for challenge, forbid, sign-in, and sign-out, though. Generally speaking, these are simply setting headers for things like redirects.

> For example, in Kestrel from a developer's perspective it's hard to know why a connection was closed.

I think it's worth differentiating expected error cases like connection closed reasons in Kestrel vs. issues caused by misconfiguration.

I consider things like scheme not found and policy name not found to be misconfiguration. If someone runs into such an error, they should be able to provide repro steps that let a developer try the scenario themselves while cranking up the logging to see what's going on. I don't think we need metrics for these.

On the other hand, it would be interesting to have metrics for errors originating from calls to identity providers via the RemoteAuthenticationOptions.Backchannel HttpClient since these could be transient. Errors reported via the ?error=... query string parameter when identity providers redirect back to the remote authentication handler's CallbackPath could be transient or due to misconfiguration, but I think it'd be worth it to add metrics for these too. The known OAuth error types are defined in https://datatracker.ietf.org/doc/html/rfc6749#section-4.1.2.1

Both of these are specific to remote authentication handlers, so I think they could be addressed in a follow-up PR.
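
For the callback-error case, one way to keep the ?error=... values safe to use as a metric tag would be to normalize them against the RFC 6749 set before recording (a sketch only; the method name is assumed):

internal static class OAuthErrorSketch
{
    // Sketch: normalize the OAuth "error" callback parameter to the set
    // defined in RFC 6749 section 4.1.2.1 before using it as a metric tag,
    // so unexpected values can't blow up tag cardinality.
    public static string Normalize(string? error) => error switch
    {
        "invalid_request" or "unauthorized_client" or "access_denied" or
        "unsupported_response_type" or "invalid_scope" or
        "server_error" or "temporarily_unavailable" => error,
        null => "none",
        _ => "unknown",
    };
}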

@lmolkova left a comment:

Left a few minor naming suggestions. James shared the key point in the comments: these metrics should probably be histograms. Even where a histogram doesn't make sense and some remain counters, the counters should be reported after the operation completes and should include the error.type attribute when an exception has happened.
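
A sketch of that reporting pattern (the helper shape is assumed, not actual PR code): the counter is bumped once the operation completes, with error.type attached only when it threw:

using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading.Tasks;

internal static class ReportAfterCompletionSketch
{
    public static async Task SignInWithMetricsAsync(Func<Task> signIn, Counter<long> signInCount)
    {
        try
        {
            await signIn();
            signInCount.Add(1);
        }
        catch (Exception ex)
        {
            // Count the failed operation too, tagged with the exception type,
            // then rethrow so behavior is unchanged.
            signInCount.Add(1, new KeyValuePair<string, object?>("error.type", ex.GetType().FullName));
            throw;
        }
    }
}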

{
    if (_authorizeCount.Enabled)
    {
        var resultTagValue = result.Succeeded ? "success" : "failure";

Is it always a boolean flag? E.g., could we add an error.type or some status differentiating failures?

@MackinnonBuck (Member, Author) replied:

There could be multiple reasons for a given failure, and each reason is an arbitrary string, so unfortunately I'm not sure we can create distinct categories for failures.

private readonly Counter<long> _signInCount;
private readonly Counter<long> _signOutCount;

public AuthenticationMetrics(IMeterFactory meterFactory)
@lmolkova:

It would be great to have a doc describing the meaning and format of each metric, and hopefully add them to https://github.com/open-telemetry/semantic-conventions/blob/main/docs/dotnet/dotnet-aspnetcore-metrics.md. Happy to help if necessary.

@MackinnonBuck (Member, Author) commented:

Thanks for all the feedback, everyone!

We had a discussion offline about how to address the feedback here, and this is what we've landed on:

  • Change the instrument for AuthenticateAsync() to be a histogram
  • Report exceptions via the error.type tag/attribute
  • Add internal *Impl classes for AuthenticationService and DefaultAuthorizationService to avoid introducing new public API

The latest revisions of this PR include these changes.
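
A rough sketch of the *Impl shape (hypothetical; the real class layout in the PR may differ): the subclass stays internal and is what DI resolves, so no new public API is exposed:

using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Authentication;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Options;

// Sketch: an internal subclass of the existing public AuthenticationService
// layers in metrics; DI registers the internal type, so no new public
// surface is introduced.
internal sealed class AuthenticationServiceImpl : AuthenticationService
{
    private readonly Histogram<double> _authenticateDuration;

    public AuthenticationServiceImpl(
        IAuthenticationSchemeProvider schemes,
        IAuthenticationHandlerProvider handlers,
        IClaimsTransformation transform,
        IOptions<AuthenticationOptions> options,
        IMeterFactory meterFactory)
        : base(schemes, handlers, transform, options)
    {
        // Instrument name is a placeholder pending API review.
        var meter = meterFactory.Create("Microsoft.AspNetCore.Authentication");
        _authenticateDuration = meter.CreateHistogram<double>(
            "aspnetcore.authentication.request.duration", unit: "s");
    }

    public override async Task<AuthenticateResult> AuthenticateAsync(HttpContext context, string? scheme)
    {
        var startTimestamp = Stopwatch.GetTimestamp();
        try
        {
            return await base.AuthenticateAsync(context, scheme);
        }
        finally
        {
            _authenticateDuration.Record(Stopwatch.GetElapsedTime(startTimestamp).TotalSeconds);
        }
    }
}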

> Debugging auth that isn't working, including exceptions, seems like it is the most valuable scenario.

The current implementation just adds an error.type attribute if an exception gets thrown, but I'm not sure how useful that will be if most reported values are just, e.g., System.InvalidOperationException. Its primary use might be determining how often exceptions are getting thrown rather than what the details are. As @halter73 mentioned, the developer could change the log level if they want to debug more deeply. Do others agree with this, or do we need to include more exception/failure info in these metrics?

@mkArtakMSFT added the area-auth label and removed the area-security label Jan 10, 2025
@JamesNK (Member) commented Jan 13, 2025

> The current implementation just adds an error.type attribute if an exception gets thrown, but I'm not sure how useful that will be if most reported values are just, e.g., System.InvalidOperationException. […] Do others agree with this, or do we need to include more exception/failure info in these metrics?

The exception name in error.type isn't that useful. It's basically a way to test whether there is an error, e.g. a PromQL query like aspnetcore.authentication.request.duration{'error.type'!=''}.

Including known errors is what makes something like error.type more useful. For example, someone can see that their app which uses Facebook login has 10,000 successful logins, 100 failures because someone clicked login but decided not to give the app permission to use the Facebook profile, 50 failures because there was a transient network error calling a Facebook API, etc.

Metrics are useful for providing an aggregate, system-level view of operations in the app. Telling people to look at the logs works for someone who needs to debug an error at a particular time, but dev ops folks lose visibility into the overall cause of auth errors when looking at thousands of aggregated metrics. Put yourself in their shoes and think about how errors and failures can be categorized, grouped, and reported.

Because so much of the cause of auth failure is in the handler, I wonder if handlers need an API to specify tags that are then reported on. The higher-level AuthenticationService would then use that API to add tags to the metric.

If it turns out that categorizing and reporting known failure reasons is more work than we want to invest now, then that's fine. But before we decide that, we should see what kind of information we could provide, come up with a design, and estimate the work involved.
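
One possible shape for such an API (entirely hypothetical, loosely modeled on the existing IHttpMetricsTagsFeature):

using System.Collections.Generic;

// Hypothetical feature interface; not proposed API. Handlers would add
// well-known, low-cardinality values, e.g. a cookie handler reporting
// that it couldn't decrypt the ticket.
public interface IAuthenticationMetricsTagsFeature
{
    ICollection<KeyValuePair<string, object?>> Tags { get; }
}

// Inside a handler:
//   context.Features.Get<IAuthenticationMetricsTagsFeature>()?.Tags
//       .Add(new("error.type", "ticket_unprotect_failed"));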

@MackinnonBuck (Member, Author) commented:

We discussed this offline and decided that the current implementation wouldn't block us from adding more detailed error reporting in the future.

The authentication handlers provided by ASP.NET Core already have a set of well-known failure reasons (see 1, 2, 3, for example). From these, we could define a set of values for a new aspnetcore.authentication.failure_reason attribute. This set could be something like:

  • UnprotectTicketFailed
  • UnprotectTokenFailed
  • MissingSessionId
  • MissingIdentityInSession
  • TicketExpired
  • TokenExpired
  • TokenHandlerUnableToValidate
  • InvalidClientCertificate
  • NoChainedCertificates
  • NoSelfSignedCertificates
  • NoPrincipal
  • ValidatorNotFound
  • HandlerDoesNotSupportAuthentication
  • Unknown

(This is just an example. We would inevitably change or shorten this list in an API review.)

We could specify the failure reason on the AuthenticateResult, AuthenticationProperties, or AuthenticationFailureException. The metrics logic would then use the specified reason to populate the "failure reason" attribute.
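
A sketch of the AuthenticationProperties variant (the "failure_reason" key and the helper below are illustrative only, not proposed API):

using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.AspNetCore.Authentication;

internal static class FailureReasonSketch
{
    // Hypothetical: reads a well-known reason stashed on
    // AuthenticationProperties and maps it to the proposed
    // aspnetcore.authentication.failure_reason attribute.
    public static void Record(Histogram<double> authenticateDuration,
        AuthenticateResult result, double durationSeconds)
    {
        var tags = new TagList();
        if (!result.Succeeded)
        {
            tags.Add("aspnetcore.authentication.failure_reason",
                result.Properties?.GetString("failure_reason") ?? "Unknown");
        }
        authenticateDuration.Record(durationSeconds, tags);
    }
}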

We could also consider providing a way to allow handlers to add arbitrary attributes to report, but this may require a more elaborate design (e.g., handlers would need a way of determining whether metrics are enabled before generating custom attributes).

Anyway, the conclusion was that we could go ahead with the PR as it stands and address these further items in a follow-up.

@MackinnonBuck (Member, Author) commented:

/azp run

@dotnet-policy-service (bot) removed the pending-ci-rerun label Jan 15, 2025

Azure Pipelines successfully started running 3 pipeline(s).

Labels: area-auth (Includes: Authn, Authz, OAuth, OIDC, Bearer)
Linked issue: Investigate AuthN/AuthZ metrics in ASP.NET Core (#47603)