Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch failure to open Worker and write out port #85

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jkrumbiegel
Copy link

We encountered a failure on CI where starting a Malt worker resulted in an obscure error. Something like:

julia> worker = Malt.Worker(; exeflags = ["-t invalid"])

ERROR: ArgumentError: input string is empty or only contains whitespace
Stacktrace:
 [1] tryparse_internal(::Type{UInt16}, s::String, startpos::Int64, endpos::Int64, base_::Int64, raise::Bool)
   @ Base ./parse.jl:115
 [2] parse(::Type{UInt16}, s::String; base::Nothing)
   @ Base ./parse.jl:254
 [3] parse
   @ ./parse.jl:253 [inlined]
 [4] Malt.Worker(; env::Vector{String}, exeflags::Vector{String})
   @ Malt ~/.julia/dev/Malt/src/Malt.jl:118
 [5] top-level scope

Turns out when a worker crashes before a port is written back, stdout is read back as "" so you get an error when parsing that as the port number. But the underlying reason is not visible.

I tried here to read from stderr to report back what happened if reading the port fails. I didn't know off the top of my head how to do this more elegantly. I'm also not sure if proc could stall if the worker manages to start up correctly but then writes stuff there over time but nobody reads the pipe.

With this change you get this behavior on crashes:

julia> worker = Malt.Worker(; exeflags = ["-t invalid"])
ERROR: Failed to start worker process correctly. Expected to read port from stdout, got "" instead. Stderr output was:
ERROR: julia: -t,--threads=<n>[,auto|<m>]; n must be an integer >= 1

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] Malt.Worker(; env::Vector{String}, exeflags::Vector{String})
   @ Malt ~/.julia/dev/Malt/src/Malt.jl:131
 [3] top-level scope
   @ REPL[34]:1

cc @MichaelHatherly

Base.kill(proc, Base.SIGTERM)
close(err.in)
err_t = Threads.@spawn read(err, String)
err_output = fetch(err_t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just err_output = read(err, String)? Can you add a comment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I think this was an artifact of me moving things around. It doesn't make sense this way anymore.

test/exceptions.jl Outdated Show resolved Hide resolved
@fonsp
Copy link
Member

fonsp commented Oct 3, 2024

Thanks! Sorry for the delay, I took a small break :)

My hesitation is what happens when the worker writes to stderr during normal use (after launch). I can't remember where this currently goes, but we need to make sure in this PR that this is unaffected.

If stderr is currently ignored, then this PR is actually a good opportunity to fix that as well :)

@fonsp
Copy link
Member

fonsp commented Oct 3, 2024

Ah there is #82 about this

@fonsp
Copy link
Member

fonsp commented Oct 10, 2024

Before this PR

julia> Malt.remote_eval_fetch(worker, :(println(stdout, 123)))

julia> Malt.remote_eval_fetch(worker, :(println(stderr, 123)))
123

After this PR

julia> Malt.remote_eval_fetch(worker, :(println(stdout, 123)))

julia> Malt.remote_eval_fetch(worker, :(println(stderr, 123)))

It looks like both implementations are incorrect/undesirable, see #82. Let's implement this PR together with fixing #82.

@fonsp
Copy link
Member

fonsp commented Oct 10, 2024

I wonder how to do that, first capture a process's stdout to a buffer (to read the port on stdout and launch error on stderr), and the switching to redirecting it to stdout).

EDIT: looks like Distributed does it with @async read write: #82 (comment) . That shouldn't be too hard to implement!

@svilupp
Copy link

svilupp commented Dec 16, 2024

This PR is a great idea! We've had several of these failures today and are not sure why it started happening:

PumasAI/QuartoNotebookRunner.jl#223

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants