Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a retry mechanism to UploadCompleter's session.end checker #41113

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tigrato
Copy link
Contributor

@tigrato tigrato commented May 1, 2024

When uploading session assets, UploadCompleter checks if the stale upload contains the session.end event. This check is done using StreamSessionEvents call to Auth Server to retrieve all the events recorded for a particular session. When the session misses the session.end event, the upload completer fakes it and creates it in order to Teleport being able to display the session recordings. Teleport requires session.end to be able to display them correctly.

The upload completer after finalyzing the upload, schedules a goroutine to run 2min after to properly range all events in the session and creating the event if missed.

The issue happens if the upload completer can successfuly complete the upload - auth is operational and s3 storage is also operationa -l but audit events backend isn't available at the time. In those cases, although the recording was correctly completed, emiting a session.end event will fail and no-one will be able to retrieve or analyze the session recordings content.

This PR adds a logic to keep ensuring the session properly has an end event until the audit log backend is operational again.

@tigrato tigrato force-pushed the tigrato/fix-event-completer branch 2 times, most recently from a157a93 to 2171368 Compare May 1, 2024 20:50
When uploading session assets, `UploadCompleter` checks if the stale
upload contains the `session.end` event. This check is done using
`StreamSessionEvents` call to Auth Server to retrieve all the events
recorded for a particular session. When the session misses the
`session.end` event, the upload completer fakes it and creates it in
order to Teleport being able to display the session recordings. Teleport
requires `session.end` to be able to display them correctly.

The upload completer after finalyzing the upload, schedules a goroutine
to run 2min after to properly range all events in the session and
creating the event if missed.

The issue happens if the upload completer can successfuly complete the
upload - auth is operational and s3 storage is also operationa -l but
audit events backend isn't available at the time. In those cases,
although the recording was correctly completed, emiting a `session.end`
event will fail and no-one will be able to retrieve or analyze the
session recordings content.

This PR adds a logic to keep ensuring the session properly has an end
event until the audit log backend is operational again.

Signed-off-by: Tiago Silva <[email protected]>
@tigrato tigrato force-pushed the tigrato/fix-event-completer branch from 2171368 to 6498d6f Compare May 1, 2024 21:02
@@ -315,7 +334,24 @@ func (u *UploadCompleter) CheckUploads(ctx context.Context) error {
return nil
}

func (u *UploadCompleter) ensureSessionEndEvent(ctx context.Context, uploadData UploadMetadata) error {
func (u *UploadCompleter) ensureAllPendingsessionsHaveEndEvts(ctx context.Context) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (u *UploadCompleter) ensureAllPendingsessionsHaveEndEvts(ctx context.Context) {
func (u *UploadCompleter) ensureAllPendingSessionsHaveEndEvents(ctx context.Context) {

@@ -416,3 +452,38 @@ loop:
}
return nil
}

func (u *UploadCompleter) createIncompleteSession(sessionID session.ID) error {
path := filepath.Join(u.cfg.UnconfirmedSessionEndDir, sessionID.String())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so we're creating an empty file on disk named after the session ID as in indicator to go back and retry the session.

But then when we retry we have to stream the entire session again to compute what the end event should look like.

Why don't we just write the JSON-encoded end event to this file, so subsequent retries can emit the end event directly rather than having to compute it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't have the event every time.
If the audit log backend is suffering a temporary failure, streaming can fail and we can't even validate if the session already has the session.end event.

Saying that, we need to retry later to:

  1. pull all session events and check whether we have a session.end or not
  2. if audit log failed, create the file and retry later.
  3. if session doesn't have the session.end event, try creating it. If create audit event operation fails we retry later. For this step we can implement the optimization you mentioned but we still need to deal with step 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants