Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broadcast: don't auto-expire broadcasts #6308

Open
ColemanTom opened this issue Aug 20, 2024 · 10 comments
Open

broadcast: don't auto-expire broadcasts #6308

ColemanTom opened this issue Aug 20, 2024 · 10 comments
Assignees
Labels
bug Something is wrong :( code refactor Large code refactors
Milestone

Comments

@ColemanTom
Copy link
Contributor

ColemanTom commented Aug 20, 2024

[edit] We have decided to change our approach to broadcast expiry in light of this issue. I have now changed the issue title to match the new scope.

For a summary of the proposed changes, skip to #6308 (comment)

Discussion took place in https://cylc.discourse.group/t/cylc-broadcast-is-being-cleared-automatically/991/3

I'm posting this here to hope it doesn't get lost/forgotten.

Two of us encountered this same issue today. When you do a cylc broadcast on a task which has already run and succeeded, the broadcast is automatically cleared.

I was using CYLC_VERSION=8.3.1 and my colleague using CYLC_VERSION=8.3.3. We both first noticed it via the WUI but we have shown it also happens via CLI. I don’t know the behaviour in 8.2.

From the scheduler log:

2024-07-31T02:00:40Z INFO - Command "broadcast" received.
    broadcast(cycle_points=['20240327T0000Z'], mode=put_broadcast, namespaces=['ct_to_simpler'], settings=[{'environment': {'FORCE_RESEND': 'no'}}])
2024-07-31T02:00:40Z INFO - Broadcast set:
    + [20240327T0000Z/ct_to_simpler] [environment]FORCE_RESEND=no
2024-07-31T02:00:40Z INFO - Broadcast cancelled:
    - [20240327T0000Z/ct_to_simpler] [environment]FORCE_RESEND=no

Pausing the workflow, triggering the task, and then broadcasting allowed the broadcast to be kept.

2024-07-31T02:04:48Z INFO - Pausing the workflow
2024-07-31T02:04:48Z INFO - Command "pause" actioned. ID=0ea0b0b8-5b94-4d06-b90b-81b43962e8a9
2024-07-31T02:05:06Z INFO - Command "force_trigger_tasks" received. ID=8055b39c-10e3-4ccf-97ac-ed3a9bdd5d17
    force_trigger_tasks(flow=['all'], flow_wait=False, tasks=['20240327T0000Z/ct_to_simpler'])
2024-07-31T02:05:07Z INFO - [20240327T0000Z/ct_to_simpler:waiting(runahead)] => waiting
2024-07-31T02:05:07Z INFO - [20240327T0000Z/ct_to_simpler:waiting] => waiting(queued)
2024-07-31T02:05:07Z INFO - Command "force_trigger_tasks" actioned. ID=8055b39c-10e3-4ccf-97ac-ed3a9bdd5d17
2024-07-31T02:06:07Z INFO - Command "broadcast" received.
    broadcast(cycle_points=['20240327T0000Z'], mode=put_broadcast, namespaces=['ct_to_simpler'], settings=[{'environment': {'FORCE_RESEND': 'no'}}])
2024-07-31T02:06:07Z INFO - Broadcast set:
    + [20240327T0000Z/ct_to_simpler] [environment]FORCE_RESEND=no
2024-07-31T02:07:13Z INFO - Command "resume" received. ID=d74fcf91-6398-4317-b4e2-a251c290029c
    resume()
2024-07-31T02:07:14Z INFO - RESUMING the workflow now
2024-07-31T02:07:14Z INFO - Command "resume" actioned. ID=d74fcf91-6398-4317-b4e2-a251c290029c
2024-07-31T02:07:14Z INFO - [20240327T0000Z/ct_to_simpler:waiting(queued)] => waiting
2024-07-31T02:07:14Z INFO - [20240327T0000Z/ct_to_simpler:waiting] => preparing
2024-07-31T02:07:19Z INFO - [20240327T0000Z/ct_to_simpler/05:preparing] submitted to user-dm_calthunder_d:pbs[6195899]
2024-07-31T02:07:19Z INFO - [20240327T0000Z/ct_to_simpler/05:preparing] => submitted
2024-07-31T02:07:27Z INFO - [20240327T0000Z/ct_to_simpler/05:submitted] => running
2024-07-31T02:07:31Z INFO - [20240327T0000Z/ct_to_simpler/05:running] => succeeded
2024-07-31T02:07:31Z INFO - [20240327T0000Z/ct_archive:succeeded] already finished and completed (flows=1))
2024-07-31T02:07:32Z INFO - Broadcast cancelled:
    - [20240327T0000Z/ct_to_simpler] [environment]FORCE_RESEND=no

What were we trying to do? Say we generated some data, and something was found wrong with it in a future task, or the disk got corrupted and we want to rerun the task again with a slightly different setting to delete the old data perhaps. We modify an environment variable in the runtime and run the task. As we can see the task in the WUI and broadcast to it via CLI, we were under the impression we could just do the broadcast and then trigger the task, but that does not work. We can’t see a way to do trigger new task with modified runtime, so our only option appears to be to pause the whole suite. We did not test hold -> trigger -> broadcast -> release.

Desired behaviour

We would like a way to modify the runtime of a succeeded task and trigger it without having to pause, trigger, edit runtime, unpause a workflow to do it.

@ColemanTom ColemanTom added the bug Something is wrong :( label Aug 20, 2024
@MetRonnie MetRonnie added this to the 8.3.x milestone Aug 20, 2024
@oliver-sanders
Copy link
Member

I don't think that a broadcast made to a succeeded task should be automatically cleared. This may be leftover Cylc 7 logic, it really doesn't play well with the Cylc 8 concept of new-flows (which may have the same broadcast requirements as the original flow).

Hopefully an easy fix.

We would like a way to modify the runtime of a succeeded task and trigger it without having to pause, trigger, edit runtime, unpause a workflow to do it.

Note, cylc vr can achieve this.

@wxtim wxtim self-assigned this Aug 30, 2024
@wxtim
Copy link
Member

wxtim commented Sep 5, 2024

Currently broadcasts are "expired" (removed) when the cycles to which they refer to have passed. Disabling this seems simple. But I don't think that we want to allow broadcasts to accumulate without housekeeping forever.

I'm going to propose that tasks can consume broadcasts, somewhere towards the run of a task.

Question - should the broadcast be considered cleanable on completion or success?

@wxtim wxtim added the question Flag this as a question for the next Cylc project meeting. label Sep 5, 2024
@oliver-sanders
Copy link
Member

Question - should the broadcast be considered cleanable on completion or success?

That's the idea.

When the task is removed from the task-pool as completed, the broadcast can be cleared.

(note task output completion and task pool removal aren't strictly equivalent)

@oliver-sanders oliver-sanders changed the title Modify runtime of a succeeded task so you can trigger it broadcast: modify runtime of a succeeded task so you can trigger it Oct 7, 2024
@simonathompson
Copy link

Just wanted to add, that I've just had a chat with Ronnie in the cylc surgery describing (completely independently) this issue. My context is a regular cycling workflow with no cycle-to-cycle dependencies, where I want to rerun a past cycle. I issue a broadcast at the beginning of a cycle to the family that over-arches all the other tasks in a cycle, that effectively tells it which HPC hall to point to. If I try this in cylc 8.3.3 the broadcast just vanishes as there are no active tasks, and because it's the first thing a cycle does. Fortunately I can get it to do what I want using a new cylc flow. Either way, you end up with either housekeeping broadcasts, or housekeeping flows! Thanks.

@hjoliver
Copy link
Member

hjoliver commented Oct 9, 2024

If I try this in cylc 8.3.3 the broadcast just vanishes as there are no active tasks,

Broadcast information is held in the scheduler, to be used for future tasks - which don't have to be in the active pool yet. So lack of active tasks isn't the problem here, it's that that you're targeting past tasks that you want to rerun, and the broadcast clearing mechanism assumes that past tasks are done with, so the broadcast can be cleared. (Which is indeed exactly what this issue is about ... just clarifying the logic in this case).

@oliver-sanders
Copy link
Member

Meeting 2024-11-04

This issue was recently discussed at a Cylc developers meeting, conclusions:

  • Auto expiring broadcasts on cycle completion (i.e. housekeeping) doesn't make sense.
  • While this has always been the case, Cylc 8 makes the consequences more impactfull:
    • It is easier to re-run historical tasks and multi-flow functionality encourages this.
    • Edit runtime forms are likely to be inaccurate for n>0 tasks.
    • SoD makes it slightly harder for users to tell what cycles are "open" vs "closed" (need to change the window to n=0).
  • The intention of this housekeeping was likely memory management, not functionality.
  • We discussed and agreed an alternative approach where broadcasts are not housekept:
    • Do not auto-expire broadcasts when cycles "complete". Allow the broadcasts to pile up in the DB.
    • Only hold broadcasts that match cycles in the data-store window in memory.
    • When a new cycle is added to the data store (when a task is added for a cycle not present in the data store), retrieve any matching broadcasts from the DB.
    • When a cycle is remove from the data store (when a task is removed leaving no others in the data store), ditch any broadcasts that are no longer relevant.
  • Concerns for development:

@wxtim wxtim removed the question Flag this as a question for the next Cylc project meeting. label Nov 5, 2024
@oliver-sanders oliver-sanders added the code refactor Large code refactors label Nov 5, 2024
@oliver-sanders
Copy link
Member

oliver-sanders commented Nov 6, 2024

From initial investigations:

  • The broadcast_events table in the DB is not housekept, so we are already preserving a record of all broadcasts.
  • At present the BroadcastManager state is assembled in memory and probably(?) cannot be reconstructed from the broadcast_events alone. Hence, we have the broadcast_states table.
  • The broadcast_states table in the DB (which backs up the BroadcastManager state) can be removed with this change (as the BroadcastMananger state can be reconstructed from the DB). This will have impacts on cylc-review.
  • Broadcasts already have unique IDs in the form of (high precision) timestamps (broadcast: operate on broadcasts by ID #6402).
  • The SQL BETWEEN operator should allow us to efficiently select matching entries from the broadcast_events table.
  • At present, an all-cycle broadcast can take one of three values *, all-cycles or all-cycle-points, all of which are equivalent. We should standardise on * with this change for simplicity.

@oliver-sanders oliver-sanders self-assigned this Nov 6, 2024
@hjoliver
Copy link
Member

hjoliver commented Nov 7, 2024

At present, an all-cycle broadcast can take one of three values *, all-cycles or all-cycle-points, all of which are equivalent.

Huh, I wonder how we ended up with that?

@oliver-sanders oliver-sanders modified the milestones: 8.3.x, 8.5.0 Nov 26, 2024
@oliver-sanders oliver-sanders changed the title broadcast: modify runtime of a succeeded task so you can trigger it broadcast: don't auto-expire broadcasts Nov 26, 2024
@oliver-sanders
Copy link
Member

oliver-sanders commented Nov 26, 2024

At present, an all-cycle broadcast can take one of three values *, all-cycles or all-cycle-points, all of which are equivalent.

Huh, I wonder how we ended up with that?

No idea, but I intend to wipe it out!


Implementing the above requires a small/medium size code refactor. I've got a proof-of-concept branch working, but there's a lot in the pipes at the moment so I have tagged against 8.5.0.

Although the Cylc 8 behavior is unchanged in relation to Cylc 7, it's not very nice and adds caveats to interventions performed via broadcasts so I have added this to the roadmap.

@oliver-sanders
Copy link
Member

My POC for anyone interested: https://github.com/oliver-sanders/cylc-flow/pull/new/broadcast-refactor

(This is a parallel implementation, swap out the existing broadcast manager in the Scheduler class to activate it. Note, it ain't gonna work just yet)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :( code refactor Large code refactors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants