-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a retry mechanism to UploadCompleter
's session.end
checker
#41113
base: master
Are you sure you want to change the base?
Conversation
a157a93
to
2171368
Compare
When uploading session assets, `UploadCompleter` checks if the stale upload contains the `session.end` event. This check is done using `StreamSessionEvents` call to Auth Server to retrieve all the events recorded for a particular session. When the session misses the `session.end` event, the upload completer fakes it and creates it in order to Teleport being able to display the session recordings. Teleport requires `session.end` to be able to display them correctly. The upload completer after finalyzing the upload, schedules a goroutine to run 2min after to properly range all events in the session and creating the event if missed. The issue happens if the upload completer can successfuly complete the upload - auth is operational and s3 storage is also operationa -l but audit events backend isn't available at the time. In those cases, although the recording was correctly completed, emiting a `session.end` event will fail and no-one will be able to retrieve or analyze the session recordings content. This PR adds a logic to keep ensuring the session properly has an end event until the audit log backend is operational again. Signed-off-by: Tiago Silva <[email protected]>
2171368
to
6498d6f
Compare
@@ -315,7 +334,24 @@ func (u *UploadCompleter) CheckUploads(ctx context.Context) error { | |||
return nil | |||
} | |||
|
|||
func (u *UploadCompleter) ensureSessionEndEvent(ctx context.Context, uploadData UploadMetadata) error { | |||
func (u *UploadCompleter) ensureAllPendingsessionsHaveEndEvts(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (u *UploadCompleter) ensureAllPendingsessionsHaveEndEvts(ctx context.Context) { | |
func (u *UploadCompleter) ensureAllPendingSessionsHaveEndEvents(ctx context.Context) { |
@@ -416,3 +452,38 @@ loop: | |||
} | |||
return nil | |||
} | |||
|
|||
func (u *UploadCompleter) createIncompleteSession(sessionID session.ID) error { | |||
path := filepath.Join(u.cfg.UnconfirmedSessionEndDir, sessionID.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so we're creating an empty file on disk named after the session ID as in indicator to go back and retry the session.
But then when we retry we have to stream the entire session again to compute what the end event should look like.
Why don't we just write the JSON-encoded end event to this file, so subsequent retries can emit the end event directly rather than having to compute it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we don't have the event every time.
If the audit log backend is suffering a temporary failure, streaming can fail and we can't even validate if the session already has the session.end
event.
Saying that, we need to retry later to:
- pull all session events and check whether we have a
session.end
or not - if audit log failed, create the file and retry later.
- if session doesn't have the
session.end
event, try creating it. If create audit event operation fails we retry later. For this step we can implement the optimization you mentioned but we still need to deal with step 2.
When uploading session assets,
UploadCompleter
checks if the stale upload contains thesession.end
event. This check is done usingStreamSessionEvents
call to Auth Server to retrieve all the events recorded for a particular session. When the session misses thesession.end
event, the upload completer fakes it and creates it in order to Teleport being able to display the session recordings. Teleport requiressession.end
to be able to display them correctly.The upload completer after finalyzing the upload, schedules a goroutine to run 2min after to properly range all events in the session and creating the event if missed.
The issue happens if the upload completer can successfuly complete the upload - auth is operational and s3 storage is also operationa -l but audit events backend isn't available at the time. In those cases, although the recording was correctly completed, emiting a
session.end
event will fail and no-one will be able to retrieve or analyze the session recordings content.This PR adds a logic to keep ensuring the session properly has an end event until the audit log backend is operational again.