
Task events for scheduler #1956

Open
jpbruinsslot opened this issue Oct 23, 2023 · 8 comments · May be fixed by #2451
Assignees
Labels
mula Issues related to the scheduler

Comments

@jpbruinsslot
Contributor


Currently, we only have created_at and modified_at timestamp information for tasks. Status changes can only be derived from the status and modified_at; any other mutations and events are lost.

It would be helpful to know how long a task was queued, and how long it took to run once it was picked up.

The most straightforward solution seems to be either adding more datetime columns for these state changes, or adding an events table with a 1:m relation between a task and its task events, which can then be referenced from the task.
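As a rough sketch of the events-table option, the 1:m relation could look like the following (illustrative only: stdlib sqlite3 stands in for Postgres here, and all table and column names are hypothetical, not from the codebase):

```python
import sqlite3

# In-memory database standing in for the scheduler's Postgres instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        created_at TEXT NOT NULL,
        modified_at TEXT NOT NULL
    );
    -- 1:m relation: each task can have many events.
    CREATE TABLE task_events (
        id INTEGER PRIMARY KEY,
        task_id INTEGER NOT NULL REFERENCES tasks(id),
        event TEXT NOT NULL,
        timestamp TEXT NOT NULL
    );
""")
conn.execute(
    "INSERT INTO tasks VALUES (1, 'completed', '2023-10-23T10:00:00', '2023-10-23T10:05:00')"
)
conn.executemany(
    "INSERT INTO task_events (task_id, event, timestamp) VALUES (?, ?, ?)",
    [
        (1, "queued", "2023-10-23T10:00:00"),
        (1, "dispatched", "2023-10-23T10:01:30"),
        (1, "completed", "2023-10-23T10:05:00"),
    ],
)
# The full status history survives, instead of only the latest modified_at.
events = conn.execute(
    "SELECT event, timestamp FROM task_events WHERE task_id = ? ORDER BY timestamp",
    (1,),
).fetchall()
print(events)
```

With this shape, queued time and runtime fall out of the gaps between consecutive events, without adding extra columns to the task table.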

@jpbruinsslot jpbruinsslot added the mula Issues related to the scheduler label Oct 23, 2023
@jpbruinsslot jpbruinsslot self-assigned this Oct 23, 2023
@jpbruinsslot jpbruinsslot added this to KAT Oct 23, 2023
@github-project-automation github-project-automation bot moved this to Incoming features / Need assessment in KAT Oct 23, 2023
@underdarknl
Contributor

Future work on this could also include which runner instance executed the job and how many resources were consumed (cpu, ram, network i/o), as those could be useful for the scheduler to rank on for future jobs on similar objects.

@jpbruinsslot jpbruinsslot moved this from Incoming features / Need assessment to Todo (In this sprint) in KAT Nov 9, 2023
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to In Progress in KAT Nov 14, 2023
@jpbruinsslot
Contributor Author

jpbruinsslot commented Nov 27, 2023

During a discussion with Donny about the current implementation, we condensed the following considerations:

Considerations

  • If we want, we can still opt to use columns for the runtime, queued, and duration fields instead of adding them to an events table; however, this limits extending it with additional information like cpu, memory, and network i/o load, short of adding yet more database columns.
  • Using a Postgres trigger could be considered a bit of obfuscation, magic done in the 'background'. We probably want to use Python instead, on task updates coming in through the API.
  • Additionally, we probably want to move away from those triggers if the task runner itself starts creating events like cpu and memory load.
  • Scalability: under high load we can end up with a large events table, so partitioning, or compacting by date, should be considered.
  • The queries for duration, queued, and runtime should only be run when requested; they should not be executed for the task list endpoint.
  • The fields type, context, and event may be superfluous and could be combined into one field, perhaps using dot notation.
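To make the last bullet concrete, combining the type, context, and event fields into a single dotted identifier could look like this (a hypothetical sketch; none of these names exist in the codebase):

```python
# Hypothetical: collapse type, context, and event into one dotted string,
# e.g. "task.status.dispatched" instead of three separate columns.
def make_event_name(type_: str, context: str, event: str) -> str:
    return f"{type_}.{context}.{event}"

def split_event_name(name: str) -> tuple[str, str, str]:
    # maxsplit=2 keeps any further dots inside the event part intact.
    type_, context, event = name.split(".", 2)
    return type_, context, event

name = make_event_name("task", "status", "dispatched")
print(name)                    # task.status.dispatched
print(split_event_name(name))  # ('task', 'status', 'dispatched')
```

One string column is easier to extend than three fixed columns, at the cost of pushing the structure into a naming convention.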

@underdarknl thoughts?

@jpbruinsslot jpbruinsslot moved this from In Progress to Blocked / To be discussed in KAT Nov 28, 2023
@underdarknl
Contributor

  • Adding columns would allow us to more easily add indices for events; however, I'm not sure we have a good use case for them. Having a single json blob with extensible fields is more likely to give us flexibility in consuming metrics as delivered by the various plugin runners (e.g. clouds have different resource counters compared to local execution).
  • Database triggers should not be used in OpenKAT, I'd say, as they split the business logic over two domains. Deterministic view logic (e.g. combining tables, or calculating averages) is something I would consider doing in database land, since it can be exposed as clearly named columns.
  • The main use case for these metrics is twofold:
    • Show the user what resources a specific tool has consumed. This would cover all historical data, but is incidental and does not suffer from scalability issues as much.
    • Calculate what a proposed job is likely to consume based on 'recent similar jobs'. This means we don't need all the data, just the most recent set. Maybe we can have a daily job that trims this materialized view down to a two-week or N-item window, to keep the math required for this calculation reasonable?
  • If we introduce blob fields, let's stick to something Postgres has querying tools for, like json/jsonb.
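The 'recent similar jobs' window from the second use case could be sketched roughly as follows (window size, field names, and the choice of median are all illustrative assumptions, not from the codebase):

```python
from statistics import median

WINDOW = 50  # hypothetical: keep only the N most recent similar jobs

def trim(history: list[dict]) -> list[dict]:
    """Keep only the most recent WINDOW entries, as a daily trim job might."""
    return history[-WINDOW:]

def estimate(history: list[dict], metric: str) -> float:
    """Predict a proposed job's consumption from recent similar jobs."""
    recent = trim(history)
    return median(job[metric] for job in recent)

history = [{"cpu_seconds": s} for s in (1.0, 2.0, 3.0, 100.0)]
print(estimate(history, "cpu_seconds"))  # 2.5 (median is robust to the outlier)
```

Keeping the window small bounds both the storage and the per-estimate math, which matches the intent of trimming the view to a recent set.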

@jpbruinsslot
Contributor Author


Awesome, thanks for the feedback! Regarding indices: we still have the option to add indices on specific fields within the jsonb column if we want to, which is really great.
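For reference, an expression index on a single field inside a JSON blob can be sketched as below. This uses stdlib sqlite3 and its `json_extract` so the example is runnable; in Postgres the jsonb equivalent would be an expression index along the lines of `CREATE INDEX ... ((data->>'cpu'))`, and all names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_events (id INTEGER PRIMARY KEY, data TEXT)")
# Expression index on one field inside the JSON blob, so lookups on that
# field don't have to scan and re-parse every row's blob.
conn.execute(
    "CREATE INDEX idx_cpu ON task_events (json_extract(data, '$.cpu'))"
)
conn.execute(
    """INSERT INTO task_events (data) VALUES ('{"cpu": 0.5, "memory": 128}')"""
)
row = conn.execute(
    "SELECT json_extract(data, '$.cpu') FROM task_events"
    " WHERE json_extract(data, '$.cpu') > 0.1"
).fetchone()
print(row[0])  # 0.5
```

The blob stays schemaless for the runners, while the few fields the scheduler actually filters on can still be indexed.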

@jpbruinsslot jpbruinsslot moved this from Blocked / To be discussed to In Progress in KAT Dec 5, 2023
@jpbruinsslot jpbruinsslot moved this from In Progress to Review in KAT Dec 7, 2023
@underdarknl
Contributor

If we can store the metrics in the boefje meta, we might not even need the 'archived' storage, since for most lookups on historical jobs we would already load the boefje meta from bytes anyway. This would simplify things to holding just the data the scheduler needs for its own priority calculations.
Depending on how much data we'd want for Bayesian-like predictions, this would be a less useful route though, as reading lots of meta files from bytes is much more expensive than using an unindexed database table.

@stephanie0x00
Contributor

Checklist for QA:

  • I have checked out this branch, and successfully ran a fresh make reset.
  • I confirmed that there are no unintended functional regressions in this branch:
    • I have managed to pass the onboarding flow
    • Objects and Findings are created properly
    • Tasks are created and completed properly
  • I confirmed that the PR's advertised feature or hotfix works as intended.

What works:

From a UI point of view everything still works as it's supposed to. The onboarding flow works, additional boefjes are scheduled as intended, and they complete.

What doesn't work:

n/a.

Bug or feature?:

n/a

@jpbruinsslot jpbruinsslot removed their assignment Dec 11, 2023
@jpbruinsslot jpbruinsslot moved this from Review to Blocked / To be discussed in KAT Dec 12, 2023
@jpbruinsslot jpbruinsslot self-assigned this Dec 12, 2023
@jpbruinsslot jpbruinsslot moved this from Blocked / To be discussed to In Progress in KAT Dec 18, 2023
@jpbruinsslot jpbruinsslot moved this from In Progress to Todo (In this sprint) in KAT Dec 20, 2023
@underdarknl underdarknl added this to the OpenKAT v1.15 milestone Dec 22, 2023
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to In Progress in KAT Dec 28, 2023
@jpbruinsslot jpbruinsslot linked a pull request Jan 2, 2024 that will close this issue
@jpbruinsslot jpbruinsslot moved this from In Progress to Review in KAT Jan 2, 2024
@jpbruinsslot jpbruinsslot removed their assignment Jan 18, 2024
@jpbruinsslot jpbruinsslot moved this from Review to Backlog / Refined tasks in KAT Jan 30, 2024
@jpbruinsslot jpbruinsslot moved this from Backlog / Refined tasks to In Progress in KAT Jan 31, 2024
@jpbruinsslot jpbruinsslot linked a pull request Feb 7, 2024 that will close this issue
@jpbruinsslot jpbruinsslot moved this from In Progress to Review in KAT Feb 7, 2024
@jpbruinsslot jpbruinsslot moved this from Review to Todo (In this sprint) in KAT Mar 4, 2024
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to In Progress in KAT Mar 6, 2024
@jpbruinsslot jpbruinsslot moved this from In Progress to Review in KAT Mar 7, 2024
@madelondohmen madelondohmen moved this from Review to To be discussed in KAT Mar 14, 2024
@jpbruinsslot jpbruinsslot removed the status in KAT Mar 28, 2024
@jpbruinsslot
Contributor Author

jpbruinsslot commented Apr 11, 2024

TLDR

Potential solutions:

  • Creation of an events table with a 1:m relation between a task and its events, tracking status-change events
    • @dekkers: will potentially create a lot of records and might not be scalable with the expected high throughput
    • @jpbruinsslot: perhaps other stream-processing solutions are better suited for this
  • Use the available created_at and modified_at to calculate the time from creation to completion
    • Does not necessarily communicate the running time of the task runner
  • Add a duration field to a Task as a 32-bit integer (@dekkers)
    • This only tracks the running time of a task runner and exposes no information on the duration of other statuses, unless additional duration fields are added per status; then we need to consider whether to keep those durations in the task table or to normalize them, so as not to clutter the task table. (@jpbruinsslot)
    • @Donnype would rather see timestamps instead of durations: this allows sorting on start_at, and easily calculating the "total lifetime" of a task (queued -> done) and its queuing time (queued -> start_at)
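With the timestamp approach @Donnype prefers, the derived values are simple subtractions (the field names here are illustrative, not actual task columns):

```python
from datetime import datetime

# Hypothetical per-status timestamps on a task record.
queued_at = datetime.fromisoformat("2024-04-11T10:00:00")
start_at = datetime.fromisoformat("2024-04-11T10:02:00")
done_at = datetime.fromisoformat("2024-04-11T10:07:30")

queuing_time = start_at - queued_at   # queued -> start_at
runtime = done_at - start_at          # start_at -> done
total_lifetime = done_at - queued_at  # queued -> done

print(queuing_time.total_seconds())    # 120.0
print(runtime.total_seconds())         # 330.0
print(total_lifetime.total_seconds())  # 450.0
```

Storing the timestamps keeps every derived duration computable later, whereas storing a single duration discards the underlying instants.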

Other discussions:

  • CPU/memory usage metrics seem specific to a boefje runner and might not necessarily apply to other task runners. Additionally, we're not considering other metrics like disk and network usage, which might not be relevant to other task runners either. Given the agnostic nature of the scheduler, we need to evaluate whether to normalize those metrics fields as well. (@jpbruinsslot)
    • We concluded that this needs to be discussed in another ticket, since it also raises UX questions: how is this going to be used, queried, and visualized, and are other tools that track these metrics (e.g. cluster metrics about individual task runners in containers) better suited than integrating this into the scheduler?

@jpbruinsslot jpbruinsslot moved this to Incoming features / Need assessment in KAT Apr 11, 2024
@jpbruinsslot
Contributor Author

@underdarknl fyi see the comment above this one for the latest updates on this discussion

@dekkers dekkers removed this from the OpenKAT v1.16 milestone Jul 19, 2024