Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use time-based event index for app session events #38495

Merged

Conversation

gabrielcorado
Copy link
Contributor

@gabrielcorado gabrielcorado commented Feb 21, 2024

Closes #37049

As described on the issue, the events were being overwritten on DynamoDB events storage because they had the same key pair (SessionID and Index). The problem is that every time a new session chunk is initialized, it uses a new recorder, which would then have the index start at 0. This would cause any previous event from the session to be overwritten.

The solution proposed on this PR is, for app sessions, to generate an event index based on the session start (here, the time the certificate was issued), removing the necessity of a distributed counter.

In addition, the solution also needed to take into account if different servers handle an app session. For app access, the sessions are not ended if the server handling it changes (HA). In case that happens, the recorder would start over at index 0 and overwrite all the events emitted by the proxy and other app server.

NOTE: This change only affects app session recordings; other resources will still use the incremental method.

This PR also adds a DynamoDB condition expression to guarantee events are not being overwritten on the storage layer.

changelog: Fix application access events being overwritten when using DynamoDB as event storage.

@gabrielcorado gabrielcorado self-assigned this Feb 21, 2024
@github-actions github-actions bot added application-access audit-log Issues related to Teleports Audit Log size/md labels Feb 21, 2024
Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

lib/events/dynamoevents/dynamoevents_test.go Outdated Show resolved Hide resolved
lib/events/setter.go Outdated Show resolved Hide resolved
lib/events/setter_test.go Outdated Show resolved Hide resolved
lib/events/recorder/recorder.go Outdated Show resolved Hide resolved
lib/events/setter.go Outdated Show resolved Hide resolved
lib/events/setter.go Show resolved Hide resolved
lib/srv/app/server.go Outdated Show resolved Hide resolved
@@ -1187,3 +1205,13 @@ func newGetConfigForClientFn(log logrus.FieldLogger, client auth.AccessCache, tl
return tlsCopy, nil
}
}

// certFromConnState returns certificate from the connection.
func certFromConn(tlsConn *tls.Conn) *x509.Certificate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func certFromConn(tlsConn *tls.Conn) *x509.Certificate {
func leafCertFromConn(tlsConn *tls.Conn) *x509.Certificate {

lib/srv/app/server.go Outdated Show resolved Hide resolved
lib/srv/app/server.go Outdated Show resolved Hide resolved
lib/events/setter_test.go Show resolved Hide resolved
lib/events/setter_test.go Outdated Show resolved Hide resolved
lib/events/setter_test.go Outdated Show resolved Hide resolved
@codingllama
Copy link
Contributor

Overall looks good, added a few questions as I'm not super familiar with these parts.

Copy link
Contributor

@greedy52 greedy52 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me. The key index is used in app sessions recordings as well. After this change, ppl will see big numbers as key index instead of 1,2,3, so we need to document this behavior somewhere.

TableName: aws.String(l.Tablename),
Item: av,
TableName: aws.String(l.Tablename),
ConditionExpression: aws.String("attribute_not_exists(SessionID) AND attribute_not_exists(EventIndex)"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this break existing environments when auth is upgraded but app server is not yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as soon as they update their auth servers and events from older app servers, they will hit this, at least for the first app.session.chunk (which conflicts with app.session.start). As mentioned in the PR description, if the app server serving the requests changes, there will be more conflicts (generating more errors).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we release this check with the patch, customers's auth will be "spammed" with this before app servers are upgraded. As discussed offline, it is preferred to handle this separately and ideally release it in a new major version. By that time, app servers should have been upgraded with this patch already and we can likely catch other events that have the same problem during the release testing.

Though, since this check does not block using the app, and we are losing events before app servers are upgraded anyway, I don't mind if we really want to release this together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer having an error log to silently overwriting audit events (current behavior).

@codingllama, Any thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't have enough context to have a strong opinion here.

// valid date.
func (s *Server) sessionStartTime(ctx context.Context) time.Time {
var startTime time.Time
if userCert, err := authz.UserCertificateFromContext(ctx); err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed interesting approach. Chunks are every five minutes if I recall. So using nanoseconds ensures chunks are relatively in order. However in HA case, time on different app servers may not be perfectly in sync.

@codingllama
Copy link
Contributor

After this change, ppl will see big numbers as key index instead of 1,2,3, so we need to document this behavior somewhere.

We could also use UnixNano for "smaller" numbers, but it'll still be quite large compared to "1,2,3".

@greedy52 greedy52 self-requested a review February 23, 2024 18:48
Copy link
Contributor

@greedy52 greedy52 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested today and worked as expected 👍 . Do we expect this change affect other event backend?

TableName: aws.String(l.Tablename),
Item: av,
TableName: aws.String(l.Tablename),
ConditionExpression: aws.String("attribute_not_exists(SessionID) AND attribute_not_exists(EventIndex)"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we release this check with the patch, customers's auth will be "spammed" with this before app servers are upgraded. As discussed offline, it is preferred to handle this separately and ideally release it in a new major version. By that time, app servers should have been upgraded with this patch already and we can likely catch other events that have the same problem during the release testing.

Though, since this check does not block using the app, and we are losing events before app servers are upgraded anyway, I don't mind if we really want to release this together.

@gabrielcorado gabrielcorado added this pull request to the merge queue Feb 29, 2024
Merged via the queue into master with commit 89d1aa4 Feb 29, 2024
34 checks passed
@gabrielcorado gabrielcorado deleted the gabrielcorado/avoid-index-collision-app-sessions branch February 29, 2024 03:33
@public-teleport-github-review-bot

@gabrielcorado See the table below for backport results.

Branch Result
branch/v13 Failed
branch/v14 Create PR
branch/v15 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

app.session.start events do not appear in DynamoDB backend
4 participants