Record the time between a task being triggered and it being successful run #217

Merged
merged 10 commits into from
Dec 18, 2024

Contributor

@aaronrosser aaronrosser commented Dec 12, 2024

Context

We would like to record the time between a task being triggered and the task processing being handed over to application code, i.e. the time taken for the task to:

  • Be sent to, and wait in, a Kafka topic
  • Get a concurrency slot

The metric should ideally also account for any other delays, e.g. rebalances, back-offs, retryAfters...
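The measurement described above can be sketched in plain Java, independent of the library. The names `TriggerLatency` and `millisSinceTriggered` are hypothetical, and a `java.time.Clock` is injected only to make the example deterministic; the change under review records the value through a metrics timer instead.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of the library: computes how long a task
// waited between being triggered and being grabbed for processing.
final class TriggerLatency {
    private final Clock clock;

    TriggerLatency(Clock clock) {
        this.clock = clock;
    }

    // Elapsed milliseconds from the trigger instant to "now" on the injected clock.
    // This covers everything in between: topic wait, concurrency slot wait, back-offs.
    long millisSinceTriggered(Instant triggeredAt) {
        return Duration.between(triggeredAt, clock.instant()).toMillis();
    }
}
```

With a fixed clock set 1500 ms after the trigger instant, `millisSinceTriggered` returns 1500.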

Checklist

Comment on lines 199 to 200
long timeTillProcessingStarted = epochMilliBeforeProcessing - taskTriggering.getTriggerAt().toEpochMilli();
meterCache.timer(METRIC_TASKS_TASK_GRABBING_TIME, tags).record(timeTillProcessingStarted, TimeUnit.MILLISECONDS);
Contributor Author

Including the response code in the tags allows filtering for just the successfully run tasks.


coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse);
long epochMilliBeforeProcessing = System.currentTimeMillis();
Contributor Author

Record the time before grabTaskForProcessing to exclude the task processing time from the metric.

Maybe it would be cleaner to add epochMilliBeforeProcessing to ProcessTaskResponse and have it returned from grabTaskForProcessing 🤔

Contributor

Agree with the proposal; it could be part of the response.

Contributor Author

So it turns out that in the successful flow (code = OK), the final task grabbing / execution is handed off to another thread which is not awaited, so we can just use the time at which the metric method is called 😄

tasksGrabbingExecutor.submit(() -> {
    try {
        grabTaskForProcessing0(bucket, task, concurrencyPolicy, taskHandler);
    } catch (Throwable t) {
        log.error("Task grabbing failed for '{}'.", task.getVersionId(), t);
    }
});
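The hand-off behaviour can be illustrated with a minimal sketch, assuming a plain `ExecutorService` rather than the library's `tasksGrabbingExecutor`. `HandOffDemo` and its method are hypothetical names. The point is only that `submit` returns before the submitted work completes, so a timestamp taken right after it does not include the processing time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo, not library code: shows that ExecutorService.submit
// returns to the caller while the submitted task is still unfinished.
final class HandOffDemo {
    // Returns true if the caller got control back before the task finished.
    static boolean callerContinuesBeforeTaskFinishes() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean taskFinished = new AtomicBoolean(false);
        try {
            executor.submit(() -> {
                try {
                    release.await(); // stands in for slow task processing
                    taskFinished.set(true);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // submit() has already returned; the task cannot finish until released.
            boolean returnedEarly = !taskFinished.get();
            release.countDown();
            return returnedEarly;
        } finally {
            // Lets the already submitted task complete, then stops the worker thread.
            executor.shutdown();
        }
    }
}
```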

.setTask(task)
.setBucketId(bucketId)
.setOffset(offset)
.setTriggerAt(Instant.ofEpochSecond(consumerRecord.timestamp()))
Contributor Author

If triggered via Kafka, use the consumer record timestamp (the time the broker registered the event, or the time the producer set when triggering)
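One detail worth spelling out: Kafka record timestamps are expressed in milliseconds since the epoch, so the conversion needs `Instant.ofEpochMilli`. A minimal sketch, with `RecordTimestamps` as a hypothetical helper that takes the raw `long` rather than a real consumer record:

```java
import java.time.Instant;

// Hypothetical helper, not library code: converts a Kafka record timestamp
// (milliseconds since the epoch) into the Instant used as the trigger time.
final class RecordTimestamps {
    static Instant toTriggerInstant(long recordTimestampMillis) {
        // Instant.ofEpochSecond would treat the same number as seconds and
        // yield an instant far in the future, inflating the metric's input.
        return Instant.ofEpochMilli(recordTimestampMillis);
    }
}
```

For example, `1734519240000L` converts to `2024-12-18T10:54:00Z`.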

Comment on lines 184 to 187
TaskTriggering taskTriggering = new TaskTriggering()
    .setTask(task)
    .setBucketId(processingBucketId)
    .setTriggerAt(Instant.now());
Contributor Author

If the task is triggered in the same process (run on the pod that triggered it, without Kafka), just use the current time

@aaronrosser aaronrosser changed the title Example recording time from trigger timestamp to grab before successf… Record the time between a task being triggered and it being successful run Dec 18, 2024
@aaronrosser aaronrosser marked this pull request as ready for review December 18, 2024 10:54
@aaronrosser aaronrosser requested a review from a team as a code owner December 18, 2024 10:54
maru-tw
maru-tw previously approved these changes Dec 18, 2024
Contributor

@maru-tw maru-tw left a comment

Looks good.
Would need changelog and version bump.


coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse);
long epochMilliBeforeProcessing = System.currentTimeMillis();
Contributor

Would it be better to name it epochMilliGrabForProcessing? That would describe the current state instead of the outcome (grabbing the task vs. processing the task).

Contributor Author

Removed as not needed


@wise-github-bot-app

The approval(s) from maru-tw do(es)n't fulfill the approver requirements because:

  • The approver's cost centre, ENGREG, maps to the ENGINEERING business function. As the code that was changed is owned by PLATFORM, this approval won't satisfy our separation of duties check. We'll need an additional approval from someone in PLATFORM. This approval may still help satisfy other codeowner requirements.

.setTask(task)
.setBucketId(bucketId)
.setOffset(offset)
.setTriggerAt(Instant.ofEpochMilli(consumerRecord.timestamp()))
Contributor Author

@aaronrosser aaronrosser Dec 18, 2024


meterCache.counter(METRIC_TASKS_TASK_GRABBING, tags).increment();

long millisSinceTaskTriggered = System.currentTimeMillis() - taskTriggeredAt.toEpochMilli();
meterCache.timer(METRIC_TASKS_TASK_GRABBING_TIME, tags).record(millisSinceTaskTriggered, TimeUnit.MILLISECONDS);
Contributor Author

@aaronrosser aaronrosser Dec 18, 2024

There is one other timer currently exposed by tw tasks:

meterCache.timer(METRIC_TASKS_PROCESSING_TIME, TagsSet.of(TAG_BUCKET_ID, resolvedBucketId, TAG_TASK_TYPE, taskType,
        TAG_PROCESSING_RESULT, processingResult))
    .record(TwContextClockHolder.getClock().millis() - processingStartTimeMs, TimeUnit.MILLISECONDS);

I cannot see it setting histogram buckets via a meter filter or anything similar, so I assume this is left to service owners to override.

normanma-tw
normanma-tw previously approved these changes Dec 18, 2024
@@ -213,8 +213,7 @@ protected ProcessTasksResponse processTasks(GlobalProcessingState.Bucket bucket)
mdcService.put(task);
try {
ProcessTaskResponse processTaskResponse = grabTaskForProcessing(bucketId, task);

- coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse);
+ coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse, taskTriggering.getTriggerAt());
Contributor

Will this cause an NPE with an old version of the TaskTriggering DTO, i.e. one that has no triggeredAt saved yet?

Contributor Author

It looks like TaskTriggering is initialised in two places, and we set triggeredAt in both of them.

The Kafka one comes from a long taken from the event timestamp, and the other one comes from Instant.now(), so I think it should be fine.

I would prefer to make triggeredAt final and force its initialisation in the constructor, but the library seems to exclusively use the init-then-set pattern, so I followed that.

TaskTriggering taskTriggering = new TaskTriggering()
    .setTask(task)
    .setBucketId(processingBucketId)
    .setTriggerAt(Instant.now());

TaskTriggering taskTriggering = new TaskTriggering()
    .setTask(task)
    .setBucketId(bucketId)
    .setOffset(offset)
    .setTriggerAt(Instant.ofEpochMilli(consumerRecord.timestamp()))
    .setTopicPartition(topicPartition);
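For comparison, a minimal sketch of the constructor-enforced alternative mentioned above. `ImmutableTaskTriggering` is a hypothetical class with only two illustrative fields; it is not the library's `TaskTriggering`, which keeps the init-then-set pattern.

```java
import java.time.Instant;
import java.util.Objects;

// Hypothetical alternative to the init-then-set pattern, for illustration only.
final class ImmutableTaskTriggering {
    private final String bucketId;
    private final Instant triggerAt;

    ImmutableTaskTriggering(String bucketId, Instant triggerAt) {
        this.bucketId = Objects.requireNonNull(bucketId, "bucketId");
        // Fails fast at construction instead of surfacing later as an NPE in the metrics code.
        this.triggerAt = Objects.requireNonNull(triggerAt, "triggerAt");
    }

    String getBucketId() {
        return bucketId;
    }

    Instant getTriggerAt() {
        return triggerAt;
    }
}
```

The trade-off is that a missing trigger time is rejected where the object is built, so downstream code such as the metric recording would not need a null check.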

@aaronrosser aaronrosser dismissed stale reviews from normanma-tw and maru-tw via 0aa8dfe December 18, 2024 14:57

meterCache.counter(METRIC_TASKS_TASK_GRABBING, tags).increment();

long millisSinceTaskTriggered = System.currentTimeMillis() - taskTriggeredAt.toEpochMilli();
Contributor

I just think it would be safer to add a sane default here, even if that affects the quality of the metric.

We don't know when someone will change that logic and a null pops up here by accident.

Comment on lines +197 to +199
long millisSinceTaskTriggered = taskTriggeredAt != null
    ? System.currentTimeMillis() - taskTriggeredAt.toEpochMilli()
    : 0;
Contributor Author

@aaronrosser aaronrosser Dec 18, 2024

Could also just have an if condition and not record if null 🤔

0 ms is within reason for same-pod triggering though
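The two options in this thread can be put side by side in a small sketch. `TriggerTimeGuard` and its methods are hypothetical names, and the current time is passed in as a parameter so the example is deterministic; neither method is the merged implementation.

```java
import java.time.Instant;
import java.util.OptionalLong;

// Hypothetical helper contrasting the two null-handling options discussed here.
final class TriggerTimeGuard {
    // Option 1: fall back to 0 ms when the trigger time is missing.
    static long millisOrZero(Instant triggeredAt, long nowMillis) {
        return triggeredAt != null ? nowMillis - triggeredAt.toEpochMilli() : 0L;
    }

    // Option 2: return empty so the caller can skip recording entirely.
    static OptionalLong millisIfKnown(Instant triggeredAt, long nowMillis) {
        return triggeredAt != null
                ? OptionalLong.of(nowMillis - triggeredAt.toEpochMilli())
                : OptionalLong.empty();
    }
}
```

Option 1 always produces a sample, at the cost of mixing artificial 0 ms values into the distribution. Option 2 keeps the distribution clean, and a caller could use `OptionalLong.ifPresent` to record only when a value exists.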

@aaronrosser aaronrosser merged commit 1795a34 into master Dec 18, 2024
13 checks passed