
Task events for scheduler #1956

Open
jpbruinsslot opened this issue Oct 23, 2023 · 8 comments · May be fixed by #2451
Assignees
Labels
mula Issues related to the scheduler

Comments

@jpbruinsslot
Contributor


Currently, we only have created_at and modified_at timestamp information for tasks. Status changes can only be derived from the status and modified_at; any other mutations and events are lost.

It would be helpful to know how long a task was queued, and how long it took to run once it was picked up.

The most straightforward solution seems to be either adding more datetime columns for these state changes, or adding an events table with a 1:m relation between a task and its task events, which can then be referenced from the task.
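As a rough sketch of the events-table option, the 1:m relation could look like the following (illustrative only: stdlib sqlite3 stands in for Postgres here, and all table and column names are hypothetical, not from the codebase):

```python
import sqlite3

# In-memory database standing in for the scheduler's Postgres instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        created_at TEXT NOT NULL,
        modified_at TEXT NOT NULL
    );
    -- 1:m relation: each task can have many events.
    CREATE TABLE task_events (
        id INTEGER PRIMARY KEY,
        task_id INTEGER NOT NULL REFERENCES tasks(id),
        event TEXT NOT NULL,
        timestamp TEXT NOT NULL
    );
""")
conn.execute(
    "INSERT INTO tasks VALUES (1, 'completed', '2023-10-23T10:00:00', '2023-10-23T10:05:00')"
)
conn.executemany(
    "INSERT INTO task_events (task_id, event, timestamp) VALUES (?, ?, ?)",
    [
        (1, "queued", "2023-10-23T10:00:00"),
        (1, "dispatched", "2023-10-23T10:01:30"),
        (1, "completed", "2023-10-23T10:05:00"),
    ],
)
# The full status history survives, instead of only the latest modified_at.
events = conn.execute(
    "SELECT event, timestamp FROM task_events WHERE task_id = ? ORDER BY timestamp",
    (1,),
).fetchall()
print(events)
```

With this shape, queued time and runtime fall out of the gaps between consecutive events, without adding extra columns to the task table.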

@jpbruinsslot jpbruinsslot added the mula Issues related to the scheduler label Oct 23, 2023
@jpbruinsslot jpbruinsslot self-assigned this Oct 23, 2023
@jpbruinsslot jpbruinsslot added this to KAT Oct 23, 2023
@github-project-automation github-project-automation bot moved this to Incoming features / Need assessment in KAT Oct 23, 2023
@underdarknl
Contributor

Future work on this could also include which runner instance executed the job and how many resources were consumed (cpu, ram, network i/o), as those could be useful for the scheduler to rank on for future jobs on similar objects.

@jpbruinsslot jpbruinsslot moved this from Incoming features / Need assessment to Todo (In this sprint) in KAT Nov 9, 2023
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to In Progress in KAT Nov 14, 2023
@jpbruinsslot
Contributor Author

jpbruinsslot commented Nov 27, 2023

During a discussion with Donny about the current implementation, we condensed the following considerations:

Considerations

  • If we want, we can still opt to use columns for the runtime, queued, and duration fields instead of adding them to an events table; however, this limits extending it with additional information like cpu, memory, and network i/o load, short of adding yet more database columns.
  • Using a Postgres trigger could be considered a bit of obfuscation, magic done in the 'background'. We probably want to use Python instead, on task updates coming in through the API.
  • Additionally, we probably want to move away from those triggers if the task runner itself starts creating events like cpu and memory load.
  • Scalability: under high load we can end up with a large events table, so partitioning, or compacting by date, should be considered.
  • The queries for duration, queued, and runtime should only be run when requested; they should not be executed for the task list endpoint.
  • The fields type, context, and event may be superfluous and could be combined into one field, perhaps using dot notation.
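To make the last bullet concrete, combining the type, context, and event fields into a single dotted identifier could look like this (a hypothetical sketch; none of these names exist in the codebase):

```python
# Hypothetical: collapse type, context, and event into one dotted string,
# e.g. "task.status.dispatched" instead of three separate columns.
def make_event_name(type_: str, context: str, event: str) -> str:
    return f"{type_}.{context}.{event}"

def split_event_name(name: str) -> tuple[str, str, str]:
    # maxsplit=2 keeps any further dots inside the event part intact.
    type_, context, event = name.split(".", 2)
    return type_, context, event

name = make_event_name("task", "status", "dispatched")
print(name)                    # task.status.dispatched
print(split_event_name(name))  # ('task', 'status', 'dispatched')
```

One string column is easier to extend than three fixed columns, at the cost of pushing the structure into a naming convention.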

@underdarknl thoughts?

@jpbruinsslot jpbruinsslot moved this from In Progress to Blocked / To be discussed in KAT Nov 28, 2023
@underdarknl
Contributor

  • Adding columns would allow us to more easily add indices for events; however, I'm not sure we have a good use case for them. Having a single json blob with extensible fields is more likely to give us flexibility in consuming metrics as delivered by the various plugin runners (e.g. clouds have different resource counters compared to local execution).
  • Database triggers should not be used in OpenKAT, I'd say, as they split the business logic over two domains. Deterministic view logic (e.g. combining tables, or calculating averages) is something I would consider doing in database land, since it can be exposed as clearly named columns.
  • The main use case for these metrics is twofold:
    • Show the user what resources a specific tool has consumed. This would cover all historical data, but is incidental and does not suffer from scalability issues as much.
    • Calculate what a proposed job is likely to consume based on 'recent similar jobs'. This means we don't need all the data, just the most recent set. Maybe we can have a daily job that trims this materialized view down to a two-week or N-item window, to keep the math required for this calculation reasonable?
  • If we introduce blob fields, let's stick to something Postgres has querying tools for, like json/jsonb.
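The 'recent similar jobs' window from the second use case could be sketched roughly as follows (window size, field names, and the choice of median are all illustrative assumptions, not from the codebase):

```python
from statistics import median

WINDOW = 50  # hypothetical: keep only the N most recent similar jobs

def trim(history: list[dict]) -> list[dict]:
    """Keep only the most recent WINDOW entries, as a daily trim job might."""
    return history[-WINDOW:]

def estimate(history: list[dict], metric: str) -> float:
    """Predict a proposed job's consumption from recent similar jobs."""
    recent = trim(history)
    return median(job[metric] for job in recent)

history = [{"cpu_seconds": s} for s in (1.0, 2.0, 3.0, 100.0)]
print(estimate(history, "cpu_seconds"))  # 2.5 (median is robust to the outlier)
```

Keeping the window small bounds both the storage and the per-estimate math, which matches the intent of trimming the view to a recent set.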

@jpbruinsslot
Contributor Author


Awesome, thanks for the feedback! Regarding indices: we still have the option to add indices on specific fields within the jsonb column if we want to, which is really great.
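For reference, an expression index on a single field inside a JSON blob can be sketched as below. This uses stdlib sqlite3 and its `json_extract` so the example is runnable; in Postgres the jsonb equivalent would be an expression index along the lines of `CREATE INDEX ... ((data->>'cpu'))`, and all names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_events (id INTEGER PRIMARY KEY, data TEXT)")
# Expression index on one field inside the JSON blob, so lookups on that
# field don't have to scan and re-parse every row's blob.
conn.execute(
    "CREATE INDEX idx_cpu ON task_events (json_extract(data, '$.cpu'))"
)
conn.execute(
    """INSERT INTO task_events (data) VALUES ('{"cpu": 0.5, "memory": 128}')"""
)
row = conn.execute(
    "SELECT json_extract(data, '$.cpu') FROM task_events"
    " WHERE json_extract(data, '$.cpu') > 0.1"
).fetchone()
print(row[0])  # 0.5
```

The blob stays schemaless for the runners, while the few fields the scheduler actually filters on can still be indexed.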

@jpbruinsslot jpbruinsslot moved this from Blocked / To be discussed to In Progress in KAT Dec 5, 2023
@jpbruinsslot jpbruinsslot moved this from In Progress to Review in KAT Dec 7, 2023
@underdarknl
Contributor

If we can store the metrics in the boefje meta, we might not even need the 'archived' storage, since for most lookups on historical jobs we would already load the boefje meta from bytes anyway. This would simplify things to holding just the data the scheduler needs for its own priority calculations.
Depending on how much data we'd want for Bayesian-like predictions, this would be a less useful route though, as reading lots of meta files from bytes is much more expensive than using an unindexed database table.

@stephanie0x00
Contributor

Checklist for QA:

  • I have checked out this branch, and successfully ran a fresh make reset.
  • I confirmed that there are no unintended functional regressions in this branch:
    • I have managed to pass the onboarding flow
    • Objects and Findings are created properly
    • Tasks are created and completed properly
  • I confirmed that the PR's advertised feature or hotfix works as intended.

What works:

From a UI point of view everything still works as it's supposed to. The onboarding flow works, additional boefjes are scheduled as intended, and they complete.

What doesn't work:

n/a.

Bug or feature?:

n/a

@jpbruinsslot jpbruinsslot removed their assignment Dec 11, 2023
@jpbruinsslot jpbruinsslot moved this from Review to Blocked / To be discussed in KAT Dec 12, 2023
@jpbruinsslot jpbruinsslot self-assigned this Dec 12, 2023
@jpbruinsslot jpbruinsslot moved this from Blocked / To be discussed to In Progress in KAT Dec 18, 2023
@jpbruinsslot jpbruinsslot moved this from In Progress to Todo (In this sprint) in KAT Dec 20, 2023
@underdarknl underdarknl added this to the OpenKAT v1.15 milestone Dec 22, 2023
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to In Progress in KAT Dec 28, 2023
@jpbruinsslot jpbruinsslot linked a pull request Jan 2, 2024 that will close this issue
@jpbruinsslot jpbruinsslot moved this from In Progress to Review in KAT Jan 2, 2024
@jpbruinsslot jpbruinsslot removed their assignment Jan 18, 2024
@jpbruinsslot jpbruinsslot moved this from Review to Backlog / Refined tasks in KAT Jan 30, 2024
@jpbruinsslot jpbruinsslot moved this from Backlog / Refined tasks to In Progress in KAT Jan 31, 2024
@jpbruinsslot jpbruinsslot linked a pull request Feb 7, 2024 that will close this issue
@jpbruinsslot jpbruinsslot moved this from In Progress to Review in KAT Feb 7, 2024
@jpbruinsslot jpbruinsslot moved this from Review to Todo (In this sprint) in KAT Mar 4, 2024
@jpbruinsslot jpbruinsslot moved this from Todo (In this sprint) to In Progress in KAT Mar 6, 2024
@jpbruinsslot jpbruinsslot moved this from In Progress to Review in KAT Mar 7, 2024
@madelondohmen madelondohmen moved this from Review to To be discussed in KAT Mar 14, 2024
@jpbruinsslot jpbruinsslot removed the status in KAT Mar 28, 2024
@jpbruinsslot
Contributor Author

jpbruinsslot commented Apr 11, 2024

TLDR

Potential solutions:

  • Creation of an events table with a 1:m relation between a task and its events, tracking status-change events
    • @dekkers: will potentially create a lot of records and might not be scalable with the expected high throughput
    • @jpbruinsslot: perhaps other stream-processing solutions are better suited for this
  • Use the available created_at and modified_at to calculate the time from creation to completion
    • Does not necessarily communicate the running time of the task runner
  • Add a duration field to a Task as a 32-bit integer (@dekkers)
    • This only tracks the running time of a task runner and exposes no information on the duration of other statuses, unless additional duration fields are added per status; then we need to consider whether to keep those durations in the task table or to normalize them, so as not to clutter the task table. (@jpbruinsslot)
    • @Donnype would rather see timestamps instead of durations: this allows sorting on start_at, and easily calculating the "total lifetime" of a task (queued -> done) and its queuing time (queued -> start_at)
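With the timestamp approach @Donnype prefers, the derived values are simple subtractions (the field names here are illustrative, not actual task columns):

```python
from datetime import datetime

# Hypothetical per-status timestamps on a task record.
queued_at = datetime.fromisoformat("2024-04-11T10:00:00")
start_at = datetime.fromisoformat("2024-04-11T10:02:00")
done_at = datetime.fromisoformat("2024-04-11T10:07:30")

queuing_time = start_at - queued_at   # queued -> start_at
runtime = done_at - start_at          # start_at -> done
total_lifetime = done_at - queued_at  # queued -> done

print(queuing_time.total_seconds())    # 120.0
print(runtime.total_seconds())         # 330.0
print(total_lifetime.total_seconds())  # 450.0
```

Storing the timestamps keeps every derived duration computable later, whereas storing a single duration discards the underlying instants.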

Other discussions:

  • CPU/memory usage metrics seem specific to a boefje runner and might not necessarily apply to other task runners. Additionally, we're not considering other metrics like disk and network usage, which might not be relevant to other task runners either. Given the agnostic nature of the scheduler, we need to evaluate whether to normalize those metrics fields as well. (@jpbruinsslot)
    • We concluded that this needs to be discussed in another ticket, since it also raises UX questions: how is this going to be used, queried, and visualized, and are other tools that track these metrics (e.g. cluster metrics about individual task runners in containers) better suited than integrating this into the scheduler?

@jpbruinsslot jpbruinsslot moved this to Incoming features / Need assessment in KAT Apr 11, 2024
@jpbruinsslot
Contributor Author

@underdarknl fyi see the comment above this one for the latest updates on this discussion

@dekkers dekkers removed this from the OpenKAT v1.16 milestone Jul 19, 2024