Record the time between a task being triggered and it being successful run #217

aaronrosser · 2024-12-12T15:43:52Z

Context

We would like to record the time between a task being triggered and the task processing being handed over to application code, i.e. the time taken for the task to:

Be sent / wait in a kafka topic
Get a concurrency slot

The metric should ideally also account for any other incidental delays, e.g. rebalances, back-offs and retryAfters.

Checklist

Change meets or does not compromise the Baseline Security Requirements

aaronrosser · 2024-12-18T10:29:38Z

tw-tasks-core/src/main/java/com/transferwise/tasks/helpers/CoreMetricsTemplate.java

+    long timeTillProcessingStarted = epochMilliBeforeProcessing - taskTriggering.getTriggerAt().toEpochMilli();
+    meterCache.timer(METRIC_TASKS_TASK_GRABBING_TIME, tags).record(timeTillProcessingStarted, TimeUnit.MILLISECONDS);


The response code in the tags allows filtering for just the successfully run tasks.
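As a rough illustration of the idea, here is a minimal JDK-only sketch of computing the trigger-to-grab delay and grouping it by response code. `GrabLatencySketch`, `record`, `delaysFor` and the `"OK"` / `"ERROR"` codes are hypothetical names used only for this example; the PR itself records into a timer obtained from `meterCache`.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a tagged timer: delays are grouped by response code.
final class GrabLatencySketch {
  private final Map<String, List<Long>> delaysByResponseCode = new HashMap<>();

  // Records how long the task waited between being triggered and being grabbed.
  void record(String responseCode, Instant triggeredAt, long grabbedAtEpochMilli) {
    long delayMillis = grabbedAtEpochMilli - triggeredAt.toEpochMilli();
    delaysByResponseCode.computeIfAbsent(responseCode, code -> new ArrayList<>()).add(delayMillis);
  }

  // "Filtering on the tag": returns only the delays recorded for one response code.
  List<Long> delaysFor(String responseCode) {
    return delaysByResponseCode.getOrDefault(responseCode, List.of());
  }

  public static void main(String[] args) {
    GrabLatencySketch sketch = new GrabLatencySketch();
    Instant triggeredAt = Instant.ofEpochMilli(1_000L);
    sketch.record("OK", triggeredAt, 1_250L);
    sketch.record("ERROR", triggeredAt, 4_000L);
    System.out.println(sketch.delaysFor("OK")); // prints [250]
  }
}
```

Because the response code is part of the key, delays of failed grabs do not pollute the values seen when only the successful ones are queried.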

aaronrosser · 2024-12-18T10:30:57Z

tw-tasks-core/src/main/java/com/transferwise/tasks/processing/TasksProcessingService.java


-            coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse);
+            long epochMilliBeforeProcessing = System.currentTimeMillis();


Record the time before grabTaskForProcessing so that the task processing time is excluded from the metric.

It might be cleaner to add epochMilliBeforeProcessing to ProcessTaskResponse and have it returned from grabTaskForProcessing 🤔

Agree with the proposal; it could be part of the response.

It turns out that in the successful flow (code = OK) the final task grabbing and execution is handed off to another thread, which is not awaited, so we can just use the time at which the metric method is called 😄

tw-tasks-executor/tw-tasks-core/src/main/java/com/transferwise/tasks/processing/TasksProcessingService.java

Lines 339 to 345 in f1b9e2d

tasksGrabbingExecutor.submit(() -> {
  try {
    grabTaskForProcessing0(bucket, task, concurrencyPolicy, taskHandler);
  } catch (Throwable t) {
    log.error("Task grabbing failed for '{}'.", task.getVersionId(), t);
  }
});
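The hand-off described above can be illustrated with a small JDK-only sketch: `submit()` returns as soon as the work is queued, so a timestamp taken right after it does not include the time the submitted work takes. `SubmitDoesNotWaitSketch` and the latch standing in for the grabbing work are assumptions made for this example and are not part of tw-tasks.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class SubmitDoesNotWaitSketch {
  // Returns true if submit() returned while the submitted work was still blocked.
  static boolean submitReturnedBeforeWorkFinished() {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch release = new CountDownLatch(1);
    AtomicBoolean workFinished = new AtomicBoolean(false);
    try {
      executor.submit(() -> {
        try {
          release.await(); // stands in for the grabbing and processing work
          workFinished.set(true);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      // submit() has already returned here; the work cannot be done yet
      // because the latch has not been released.
      boolean finishedWhenSubmitReturned = workFinished.get();
      release.countDown();
      return !finishedWhenSubmitReturned;
    } finally {
      executor.shutdown();
      try {
        executor.awaitTermination(5, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(submitReturnedBeforeWorkFinished()); // prints true
  }
}
```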

aaronrosser · 2024-12-18T10:31:43Z

tw-tasks-core/src/main/java/com/transferwise/tasks/triggering/KafkaTasksExecutionTriggerer.java

+                .setTask(task)
+                .setBucketId(bucketId)
+                .setOffset(offset)
+                .setTriggerAt(Instant.ofEpochSecond(consumerRecord.timestamp()))


If triggered via Kafka, use the consumer record timestamp (the time the broker registered the event, or the time the producer set when triggering).

aaronrosser · 2024-12-18T10:32:31Z

tw-tasks-core/src/main/java/com/transferwise/tasks/triggering/KafkaTasksExecutionTriggerer.java

+      TaskTriggering taskTriggering = new TaskTriggering()
+          .setTask(task)
+          .setBucketId(processingBucketId)
+          .setTriggerAt(Instant.now());


If the task is triggered in the same process (i.e. it runs on the pod that triggered it, without going through Kafka), just use the current time.

tw-tasks-core/src/main/java/com/transferwise/tasks/helpers/ICoreMetricsTemplate.java

maru-tw

Looks good.
Would need changelog and version bump.

maru-tw · 2024-12-18T11:09:28Z

tw-tasks-core/src/main/java/com/transferwise/tasks/processing/TasksProcessingService.java


-            coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse);
+            long epochMilliBeforeProcessing = System.currentTimeMillis();


Would it be better to name it epochMilliGrabForProcessing? That describes the current state rather than the outcome (grabbing the task vs processing the task).

Removed as not needed

maru-tw · 2024-12-18T11:10:11Z

tw-tasks-core/src/main/java/com/transferwise/tasks/processing/TasksProcessingService.java


-            coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse);
+            long epochMilliBeforeProcessing = System.currentTimeMillis();


Agree with the proposal; it could be part of the response.

wise-github-bot-app · 2024-12-18T11:11:02Z

The approval(s) from maru-tw do(es)n't fulfill the approvers requirements because:

The approver's cost centre, ENGREG, maps to the ENGINEERING business function. As the code that was changed is owned by PLATFORM, this approval won't satisfy our separation of duties check. We'll need an additional approval from someone in PLATFORM. This approval may still help satisfy other codeowner requirements.

aaronrosser · 2024-12-18T12:00:52Z

tw-tasks-core/src/main/java/com/transferwise/tasks/triggering/KafkaTasksExecutionTriggerer.java

+                .setTask(task)
+                .setBucketId(bucketId)
+                .setOffset(offset)
+                .setTriggerAt(Instant.ofEpochMilli(consumerRecord.timestamp()))


While it is not clear in the docs, consumerRecord.timestamp() appears to be epoch time in milliseconds.

https://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/ConsumerRecord.html#timestamp--

https://kafka.apache.org/10/javadoc/org/apache/kafka/connect/data/Timestamp.html

https://stackoverflow.com/questions/67759248/what-is-the-kafka-message-timestamp-represents
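To show why the unit matters, here is a small JDK-only sketch, assuming (as the comment above concludes) that the value is epoch milliseconds. `TimestampUnitSketch` and its method names are hypothetical; the example timestamp is arbitrary.

```java
import java.time.Instant;

final class TimestampUnitSketch {
  // Correct interpretation when the value is in epoch milliseconds.
  static Instant fromEpochMillis(long timestamp) {
    return Instant.ofEpochMilli(timestamp);
  }

  // The mistake: the same number read as epoch seconds lands tens of thousands of years ahead.
  static Instant misreadAsSeconds(long timestamp) {
    return Instant.ofEpochSecond(timestamp);
  }

  public static void main(String[] args) {
    Instant original = Instant.parse("2024-12-18T12:00:52Z");
    long timestampMillis = original.toEpochMilli();

    System.out.println(fromEpochMillis(timestampMillis)); // prints 2024-12-18T12:00:52Z
    // A "trigger time" far in the future would make now - triggerAt hugely negative.
    System.out.println(misreadAsSeconds(timestampMillis).isAfter(original)); // prints true
  }
}
```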

wise-github-bot-app · 2024-12-18T12:11:02Z

The approval(s) from maru-tw do(es)n't fulfill the approvers requirements because:

The approver's cost centre, ENGREG, maps to the ENGINEERING business function. As the code that was changed is owned by PLATFORM, this approval won't satisfy our separation of duties check. We'll need an additional approval from someone in PLATFORM. This approval may still help satisfy other codeowner requirements.

aaronrosser · 2024-12-18T13:14:02Z

tw-tasks-core/src/main/java/com/transferwise/tasks/helpers/CoreMetricsTemplate.java

+    meterCache.counter(METRIC_TASKS_TASK_GRABBING, tags).increment();
+
+    long millisSinceTaskTriggered = System.currentTimeMillis() - taskTriggeredAt.toEpochMilli();
+    meterCache.timer(METRIC_TASKS_TASK_GRABBING_TIME, tags).record(millisSinceTaskTriggered, TimeUnit.MILLISECONDS);


There is one other timer currently exposed by tw-tasks:

tw-tasks-executor/tw-tasks-core/src/main/java/com/transferwise/tasks/helpers/CoreMetricsTemplate.java

Lines 156 to 158 in 3bdf72e

meterCache.timer(METRIC_TASKS_PROCESSING_TIME, TagsSet.of(TAG_BUCKET_ID, resolvedBucketId, TAG_TASK_TYPE, taskType,
        TAG_PROCESSING_RESULT, processingResult))
    .record(TwContextClockHolder.getClock().millis() - processingStartTimeMs, TimeUnit.MILLISECONDS);

I cannot see it setting histogram buckets via a meter filter or anything similar, so I assume this is left to service owners to override.

xSeagullx · 2024-12-18T14:38:36Z

tw-tasks-core/src/main/java/com/transferwise/tasks/processing/TasksProcessingService.java

@@ -213,8 +213,7 @@ protected ProcessTasksResponse processTasks(GlobalProcessingState.Bucket bucket)
          mdcService.put(task);
          try {
            ProcessTaskResponse processTaskResponse = grabTaskForProcessing(bucketId, task);
-
-            coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse);
+            coreMetricsTemplate.registerTaskGrabbingResponse(bucketId, type, priority, processTaskResponse, taskTriggering.getTriggerAt());


Will this cause an NPE with an old version of the TaskTriggering DTO, i.e. one that has no triggeredAt saved yet?

It looks like TaskTriggering is initialised in two places, and triggeredAt is set in both of them.

The Kafka one is derived from the event timestamp (a long) and the other comes from Instant.now(), so I think it should be fine.

I would prefer to make triggeredAt final and force its initialisation in the constructor, but the library seems to use the init-then-set pattern exclusively, so I followed that.

tw-tasks-executor/tw-tasks-core/src/main/java/com/transferwise/tasks/triggering/KafkaTasksExecutionTriggerer.java

Lines 184 to 187 in e995b03

TaskTriggering taskTriggering = new TaskTriggering()
    .setTask(task)
    .setBucketId(processingBucketId)
    .setTriggerAt(Instant.now());

tw-tasks-executor/tw-tasks-core/src/main/java/com/transferwise/tasks/triggering/KafkaTasksExecutionTriggerer.java

Lines 291 to 296 in e995b03

TaskTriggering taskTriggering = new TaskTriggering()
    .setTask(task)
    .setBucketId(bucketId)
    .setOffset(offset)
    .setTriggerAt(Instant.ofEpochMilli(consumerRecord.timestamp()))
    .setTopicPartition(topicPartition);

xSeagullx · 2024-12-18T15:03:01Z

tw-tasks-core/src/main/java/com/transferwise/tasks/helpers/CoreMetricsTemplate.java

+
+    meterCache.counter(METRIC_TASKS_TASK_GRABBING, tags).increment();
+
+    long millisSinceTaskTriggered = System.currentTimeMillis() - taskTriggeredAt.toEpochMilli();


I just think it would be safer to add a sane default here, even if that affects the quality of the metric.

We don't know when someone will change that logic and a null pops up here by accident.

aaronrosser · 2024-12-18T15:07:52Z

tw-tasks-core/src/main/java/com/transferwise/tasks/helpers/CoreMetricsTemplate.java

+    long millisSinceTaskTriggered = taskTriggeredAt != null
+        ? System.currentTimeMillis() - taskTriggeredAt.toEpochMilli()
+        : 0;


We could also just have an if condition and not record anything when it is null 🤔

0 ms is within reason for same-pod triggering, though.
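The two options discussed here can be compared side by side in a minimal JDK-only sketch. `TriggeredAtNullHandlingSketch` and both method names are hypothetical; the first method mirrors the ternary in the diff above, the second is one possible shape for the "do not record if null" alternative.

```java
import java.time.Instant;
import java.util.OptionalLong;

final class TriggeredAtNullHandlingSketch {
  // Option taken in the diff: fall back to 0 ms when no trigger time is available.
  static long millisSinceTriggeredOrZero(Instant taskTriggeredAt, long nowEpochMilli) {
    return taskTriggeredAt != null ? nowEpochMilli - taskTriggeredAt.toEpochMilli() : 0L;
  }

  // Alternative: return empty so that the caller can skip recording altogether.
  static OptionalLong millisSinceTriggered(Instant taskTriggeredAt, long nowEpochMilli) {
    return taskTriggeredAt != null
        ? OptionalLong.of(nowEpochMilli - taskTriggeredAt.toEpochMilli())
        : OptionalLong.empty();
  }

  public static void main(String[] args) {
    long now = 5_000L;
    Instant triggeredAt = Instant.ofEpochMilli(4_400L);

    System.out.println(millisSinceTriggeredOrZero(triggeredAt, now)); // prints 600
    System.out.println(millisSinceTriggeredOrZero(null, now)); // prints 0

    // With the Optional variant nothing is recorded for a missing trigger time.
    millisSinceTriggered(null, now).ifPresent(millis -> System.out.println("record " + millis));
  }
}
```

The trade-off is that defaulting to 0 keeps the timer's count in step with the counter but mixes artificial 0 ms samples into the distribution, whereas skipping keeps the distribution clean at the cost of a lower sample count.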

aaronrosser commented Dec 18, 2024

aaronrosser changed the title ~~Example recording time from trigger timestamp to grab before successf…~~ Record the time between a task being triggered and it being successful run Dec 18, 2024

aaronrosser marked this pull request as ready for review December 18, 2024 10:54

aaronrosser requested a review from a team as a code owner December 18, 2024 10:54

maru-tw previously approved these changes Dec 18, 2024

aaronrosser dismissed maru-tw’s stale review via f1b9e2d December 18, 2024 11:31

aaronrosser added 6 commits December 18, 2024 11:34

Example recording time from trigger timestamp to grab before successf…

f4d51bd

…ul process

Clean up example

7f5eeed

Clean up example

814185c

Checkstyle

2653fd9

Checkstyle

55f849e

Simplify flow

bc7d5d5

aaronrosser force-pushed the CPL_Example-metrics branch from f1b9e2d to 96168e1 Compare December 18, 2024 11:36

Add changelog / bump version

3bdf72e

aaronrosser force-pushed the CPL_Example-metrics branch from 96168e1 to 3bdf72e Compare December 18, 2024 11:37

Fix timestamp uni

e995b03

aaronrosser commented Dec 18, 2024

maru-tw previously approved these changes Dec 18, 2024

aaronrosser commented Dec 18, 2024

normanma-tw previously approved these changes Dec 18, 2024

xSeagullx reviewed Dec 18, 2024

Fix triggeredAt name

0aa8dfe

aaronrosser dismissed stale reviews from normanma-tw and maru-tw via 0aa8dfe December 18, 2024 14:57

xSeagullx reviewed Dec 18, 2024

Default to 0 if triggeredAt not set

e4c9124

aaronrosser commented Dec 18, 2024

xSeagullx approved these changes Dec 18, 2024

aaronrosser merged commit 1795a34 into master Dec 18, 2024
13 checks passed
