Test worker_steal_count hangs sometimes #6470

Closed
Darksonn opened this issue Apr 7, 2024 · 3 comments · Fixed by #6932
Labels: A-ci (Area: The continuous integration setup), A-tokio (Area: The main tokio crate), E-help-wanted (Call for participation: Help is requested to fix this issue), M-metrics (Module: tokio/runtime/metrics)

Comments

Darksonn (Contributor) commented Apr 7, 2024

I tried to fix this in the past with #6327, but it seems that attempt was not enough, because the hang is still happening.

#[test]
fn worker_steal_count() {
    // This metric only applies to the multi-threaded runtime.
    //
    // We use a blocking channel to backup one worker thread.
    use std::sync::mpsc::channel;

    let rt = threaded_no_lifo();
    let metrics = rt.metrics();

    rt.block_on(async {
        let (tx, rx) = channel();

        // Move to the runtime.
        tokio::spawn(async move {
            // Spawn the task that sends to the channel
            //
            // Since the lifo slot is disabled, this task is stealable.
            tokio::spawn(async move {
                tx.send(()).unwrap();
            });

            // Blocking receive on the channel.
            rx.recv().unwrap();
        })
        .await
        .unwrap();
    });

    drop(rt);

    let n: u64 = (0..metrics.num_workers())
        .map(|i| metrics.worker_steal_count(i))
        .sum();

    assert_eq!(1, n);
}

For now, I've ignored the test in #6471, but it should be fixed.
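
For reference, ignoring a test in Rust looks roughly like this (a minimal sketch; the exact attribute and comment used in #6471 may differ):

#[test]
#[ignore]
// Ignored because it hangs sometimes; see tokio-rs/tokio#6470.
fn worker_steal_count() {
    // test body as quoted above
}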

Darksonn added the A-tokio, A-ci, and M-metrics labels on Apr 7, 2024
Darksonn added the E-help-wanted label on Apr 7, 2024
ineol commented May 10, 2024

I couldn't figure it out, but I have a few observations that may be useful.

  1. The usual cause of non-termination is that the inner task is never scheduled, yet one worker gets parked nonetheless.
  2. The bug is easier to reproduce if one adds a call to std::thread::sleep in runtime::scheduler::multi_thread::worker::Context::run after trying to sleep and before parking (this line).
  3. Always notifying in this function, which is called just before parking a core, seems to improve the situation: without the thread::sleep mentioned above, I cannot reproduce the bug anymore (in 10^6 runs). The code in question is (a sketch of the "always notify" variant follows this list):

     self.is_searching = false;
     if is_last_searcher {
         worker.handle.notify_if_work_pending();
     }

     With this change, the send on line 196 of the test runs, and the receive happens in the outer task, but it still does not terminate sometimes, which confused me.
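
A sketch of the always-notify variant from observation 3, for illustration only; self.is_searching, is_last_searcher, and worker.handle.notify_if_work_pending are the names from the snippet above, and the change is simply to drop the is_last_searcher guard:

// Hypothetical sketch: notify unconditionally just before parking,
// rather than only when this worker was the last searcher.
self.is_searching = false;
worker.handle.notify_if_work_pending();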

It could be that, somehow:

  1. while worker 1 is searching for work in the global queue, worker 2 does not transition to searching because of this condition, and then it fails to find work;
  2. worker 1 then runs the task, enqueues a task to its local queue, and gets stuck at rx.recv();
  3. worker 2 then parks: transition_to_parked runs, there are zero workers in the searching state, so it does not try to notify, and the worker is parked;
  4. the final result is that one worker is executing the task, waiting on the channel, and the other worker is parked.

Instead of always trying to notify, changing the condition below to (is_searching as usize) == prev.num_searching() has the same effect as the always-notify change in point 3 of the observations above (a sketch of the modified function follows the snippet below).

/// Returns `true` if this is the final searching worker.
fn dec_num_unparked(cell: &AtomicUsize, is_searching: bool) -> bool {
    let mut dec = 1 << UNPARK_SHIFT;

    if is_searching {
        dec += 1;
    }

    let prev = State(cell.fetch_sub(dec, SeqCst));
    is_searching && prev.num_searching() == 1
}
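
For concreteness, a sketch of that alternative condition applied to the function above; everything except the final expression is unchanged, and State, UNPARK_SHIFT, and the surrounding idle-state bookkeeping are assumed to be tokio's existing internals, not defined here:

use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};

/// Returns `true` if the parking worker should notify.
/// Sketch of the suggested condition change; not the code as merged.
fn dec_num_unparked(cell: &AtomicUsize, is_searching: bool) -> bool {
    let mut dec = 1 << UNPARK_SHIFT;

    if is_searching {
        dec += 1;
    }

    let prev = State(cell.fetch_sub(dec, SeqCst));

    // Original condition: `is_searching && prev.num_searching() == 1`.
    // The suggested condition is additionally true when a *non-searching*
    // worker parks while no other worker is searching (0 == 0), which matches
    // the hypothesized race described above.
    (is_searching as usize) == prev.num_searching()
}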

jofas (Contributor) commented Oct 21, 2024

Isn't this the same problematic blocking of the runtime that was observed in #6847 and could be fixed by applying the fix in #6916 to this test?

Darksonn (Contributor, Author) commented

That sounds likely!
