Seeded RNGs only work reproducibly within a cell #192

Closed
penelopeysm opened this issue Oct 10, 2024 · 22 comments · Fixed by #194

@penelopeysm

penelopeysm commented Oct 10, 2024

Description

Calling Random.seed! at the top of a notebook doesn't ensure reproducibility of random number generation in other cells; it only works within the same cell where seed! is called.

I'm not sure if this is intentional, but it's slightly counterintuitive behaviour, so if it is intentional it might be good to document it somewhere.

A related issue (and perhaps an example of how this behaviour is confusing!) is that the Pumas tutorial on reproducibility says that the second batch of samples in Random.seed!(123); rand(3); rand(3) should always result in "the same behavior [...] every time we run the code". This is true when done in a REPL, but the Quarto source has the second rand(3) in a different cell, and consequently the second rand(3) will not always return the same numbers.
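
(For comparison, the claim does hold within a single REPL session or a single cell; a minimal sketch:)

using Random

# Seeding and drawing within one block is fully deterministic,
# so the second rand(3) is identical on every run of this block.
Random.seed!(123)
rand(3)
rand(3)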

MWE

The following notebook demonstrates the behaviour. The first call to randn() should always give the same result no matter how many times the notebook is rendered; the second call, however, will not.

---
engine: julia
---

```{julia}
versioninfo()
```

```{julia}
using Random
Random.seed!(42)
randn()
```

```{julia}
randn()
```

My versioninfo from the notebook above:

Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_LOAD_PATH = @:@stdlib
@devmotion
Contributor

devmotion commented Oct 11, 2024

I think this might be an upstream issue, but I'm not too familiar with Malt, so maybe it's intended and/or could be avoided:

julia> using Malt

julia> w = Malt.Worker();

julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> Malt.remote_eval_fetch(w, quote
           randn()
       end)
0.8653602653617717

julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> Malt.remote_eval_fetch(w, quote
           randn()
       end)
-0.6715764280647957

MichaelHatherly self-assigned this Oct 11, 2024
@MichaelHatherly
Collaborator

Will investigate, thanks for the thorough issue description @penelopeysm.

@jkrumbiegel
Collaborator

jkrumbiegel commented Oct 11, 2024

Maybe this has something to do with task-local RNG state? It could be that Malt evaluates each fetch in a different task, and that this affects the RNG. If Malt itself were doing some random number generation that changed the RNG state, that would still produce the same numbers on every run as long as the same number of operations happened between the two remote_eval_fetches, so that can't be it.

@devmotion
Contributor

Hmm, the RNG object id seems to be stable, so it seems to be the same RNG?

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

@jkrumbiegel
Collaborator

I think that's because it's TaskLocalRNG and that's a singleton type:

julia> TaskLocalRNG() === TaskLocalRNG()
true

The RNG state for that is stored in the task itself.
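
(As a small illustration of that point, a sketch not taken from this thread; the exact values vary per session:)

using Random

Random.seed!(1)

# TaskLocalRNG() is a singleton object, but the state it reads and writes
# lives in fields of the current Task. A freshly spawned task gets its own
# (deterministically forked) state, so it produces a different stream even
# though both tasks "share" the same RNG object.
child  = fetch(Threads.@spawn randn())
parent = randn()
child == parent   # false in general: two tasks, two task-local states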

@MichaelHatherly
Collaborator

cc @fonsp in case you happen to have any insights.

@jkrumbiegel
Collaborator

jkrumbiegel commented Oct 11, 2024

This shows that we get a different task-local RNG state in a new remote_eval_fetch:

julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(123)

           rngfields = filter(fieldnames(Task)) do field
               startswith(string(field), "rng")
           end
           state = NamedTuple(map(rngfields) do field
               field => getfield(Base.current_task(), field)
           end)
       end)
(rngState0 = 0xfefa8d41b8f5dca5, rngState1 = 0xf80cc98e147960c1, rngState2 = 0x20e2ccc17662fc1d, rngState3 = 0xea7a7dcb2e787c01, rngState4 = 0xf4e85a418b9c4f80)

julia> Malt.remote_eval_fetch(w, quote
           rngfields = filter(fieldnames(Task)) do field
               startswith(string(field), "rng")
           end
           state = NamedTuple(map(rngfields) do field
               field => getfield(Base.current_task(), field)
           end)
       end)
(rngState0 = 0x4e291591396795b9, rngState1 = 0xc9a680b08b4c9006, rngState2 = 0x3bb5088d1c600f58, rngState3 = 0xe2bb063e7e800a67, rngState4 = 0xe264ff418d928577)

Probably because each invocation is a different task?

julia> Malt.remote_eval_fetch(w, quote
           objectid(Base.current_task())
       end)
0xb68e01fb526e2337

julia> Malt.remote_eval_fetch(w, quote
           objectid(Base.current_task())
       end)
0x49246d797797f2ad

@jkrumbiegel
Collaborator

If this is hard to fix in Malt, I guess we could store the rngState fields after a task runs a cell, and then, first thing in the next cell, mutate the new task's rngState fields so they take the same values. I'm not sure if Malt can be configured to use only a single task.
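
(A minimal sketch of that save/restore idea, assuming the copy/copy! methods that exist for the task-local RNG; the helper names here are made up:)

using Random

# Hypothetical helpers: snapshot the task-local RNG state when a cell finishes,
# and write it back into whichever task runs the next cell.
save_rng_state() = copy(Random.default_rng())                  # independent Xoshiro copy of the state
restore_rng_state!(saved) = copy!(Random.default_rng(), saved) # overwrite the current task's state

saved = save_rng_state()     # end of cell N, in task A
restore_rng_state!(saved)    # start of cell N+1, possibly in task B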

@devmotion
Contributor

Maybe one could take some inspiration from how Test resets the task-local RNG: https://github.com/JuliaLang/julia/blob/1438b1578941d0f1cc8f8c958cf3bd2927fd482c/stdlib/Test/src/Test.jl#L1694-L1720

@jkrumbiegel
Collaborator

Interesting, it seems this looked quite different on 1.10, where Random.get_tls_seed() did not exist:

https://github.com/JuliaLang/julia/blob/6f3fdf7b36250fb95f512a2b927ad2518c07d2b5/stdlib/Test/src/Test.jl#L1567-L1578

@fonsp

fonsp commented Oct 11, 2024

@Pangoraw knows best about RNGs and threading in Pluto :)

@Pangoraw

remote_eval_fetch indeed spawns a new task at each call. One solution would be for QuartoNotebookWorker to manage its own persistent task for evaluation and then communicate with this task using channels.

@devmotion
Contributor

The more I thought about this issue, the more I got the feeling that this is an upstream issue in Malt. The behaviour seems completely surprising, in particular given that everything else is consistent between subsequent Malt.remote_eval_fetch calls and that no tasks were involved in the expressions sent to the worker (additionally, in my example above the worker used a single thread). It also seemed that the behaviour was not documented, so I assumed it was possibly not even intentional.

But then I tried the same example with Distributed:

julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> remotecall_fetch(eval, 2, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> remotecall_fetch(eval, 2, quote
           randn()
       end)
0.08641660475950964

julia> remotecall_fetch(eval, 2, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> remotecall_fetch(eval, 2, quote
           randn()
       end)
1.559594774368473

julia> remotecall_fetch(eval, 2, quote
           objectid(Base.current_task())
       end)
0xebc117a52ab2a7be

julia> remotecall_fetch(eval, 2, quote
           objectid(Base.current_task())
       end)
0xe14f2bb195cecba7

So it seems Malt is consistent with Distributed. I couldn't find any issue related to it in the Distributed repo, but in the Julia repo someone had opened an issue regarding this behaviour: JuliaLang/julia#51225. It also links to a Pluto issue: fonsp/Pluto.jl#2290.

So my takeaway is that the behaviour of Malt is consistent with Distributed, but that this behaviour, both in Distributed and in Malt, is confusing and surprising for users. Based on the discussion in the Julia issue, I assume Distributed won't be changed, but since there are issues from both Pluto and Quarto users, maybe it could be addressed in Malt? Maybe Malt could expose an API for evaluations within a persistent task?

@devmotion
Contributor

devmotion commented Oct 11, 2024

This is an approach for Distributed (maybe it can be done more elegantly or efficiently?):

julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> const in = RemoteChannel(() -> Channel{Expr}(0));

julia> const out = RemoteChannel(() -> Channel{Any}(0));

julia> @everywhere function evalloop(out, in)
           while true
               expr = take!(in)
               put!(out, eval(expr))
           end
       end

julia> remote_do(evalloop, 2, out, in)

julia> put!(in, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(out)
0.7883556016042917

julia> put!(in, quote
           randn()
       end);

julia> take!(out)
-0.8798585959543993

julia> put!(in, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(out)
0.7883556016042917

julia> put!(in, quote
           randn()
       end);

julia> take!(out)
-0.8798585959543993

What's the best way to do the same with Malt?

I tried the following without success:

julia> using Malt

julia> w = Malt.Worker();

julia> cin = Malt.worker_channel(w, :(cin = Channel{Expr}(0)));

julia> cout = Malt.worker_channel(w, :(cout = Channel{Any}(0)));

julia> Malt.remote_eval(w, quote
           while true
               expr = take!(cin)
               put!(cout, eval(expr))
           end
       end);

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);
ERROR: Remote exception from Malt.Worker on port 9591 with PID 96591:

MethodError: Cannot `convert` an object of type Float64 to an object of type Expr
The function `convert` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  Expr(::Any...)
   @ Core boot.jl:271
  convert(::Type{T}, !Matched::T) where T
   @ Base Base.jl:126

Stacktrace:
 [1] put!(c::Channel{Expr}, v::Float64)
   @ Base ./channels.jl:357
 [2] top-level scope
   @ REPL[7]:4
 [3] eval
   @ ./boot.jl:430 [inlined]
 [4] (::var"#1#2"{Sockets.TCPSocket, UInt64, Bool, @Kwargs{}, Tuple{Module, Expr}, typeof(Core.eval)})()
   @ Main ~/.julia/packages/Malt/Z3YQq/src/worker.jl:120
Stacktrace:
 [1] unwrap_worker_result(worker::Malt.Worker, result::Malt.WorkerResult)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:50
 [2] _wait_for_response(worker::Malt.Worker, msg_id::UInt64)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:325
 [3] _send_receive
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:336 [inlined]
 [4] #remote_call_wait#43
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:422 [inlined]
 [5] remote_call_wait
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:421 [inlined]
 [6] remote_eval_wait
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:491 [inlined]
 [7] put!(rc::Malt.RemoteChannel{Any}, v::Expr)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:532
 [8] top-level scope
   @ REPL[7]:1

OK, the Malt example doesn't error anymore when I remove the type parameter of the Channel (seems like a bug?). But then the random numbers in subsequent evaluations are still not reproducible:

julia> using Malt

julia> w = Malt.Worker();

julia> cin = Malt.worker_channel(w, :(cin = Channel(0)));

julia> cout = Malt.worker_channel(w, :(cout = Channel(0)));

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
0.10690497033454253

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
-0.047221952638015874

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
0.37395225126097864

Should this discrepancy with Distributed be considered a bug? Or am I using Malt incorrectly?

@jkrumbiegel
Collaborator

I thought it would look something like this:

using Malt

function initialize_special_channels(worker)
    @assert Malt.remote_eval_fetch(worker, quote
        var"__special_channel_out" = Channel()
        var"__special_channel_in" = Channel{Expr}() do chan
            for expr in chan
                result = Core.eval(Main, expr)
                put!(var"__special_channel_out", result)
            end
        end
        true
    end)
end

function remote_eval_fetch_channeled(worker, expr::Expr)
    code = quote
        Threads.@spawn put!(var"__special_channel_in", $(QuoteNode(expr)))
        take!(var"__special_channel_out")
    end
    return Malt.remote_eval_fetch(worker, code)
end

w = Malt.Worker()
initialize_special_channels(w)

I do get consistent random numbers with this technique:

[screenshot: consistent random numbers across repeated evaluations]
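
(Roughly, the usage is as follows; a sketch reusing the MWE expressions from the top of the issue, not the exact code from the screenshot:)

remote_eval_fetch_channeled(w, quote
    using Random
    Random.seed!(42)
    randn()
end)

# The second "cell" now runs in the same persistent worker task, so its
# task-local RNG state continues from the seeded state above and the
# result repeats on every render.
remote_eval_fetch_channeled(w, :(randn()))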

@devmotion
Contributor

Should we go with that? The reproducibility issue came up in our triage call today, and we thought we should inform users in the documentation about this problem if it takes more time to fix it. But if we can fix it right away, this might be the better option?

@MichaelHatherly
Collaborator

Should we go with that?

Sounds fine to me 👍

@jkrumbiegel
Collaborator

I'll attempt a PR

@fonsp

fonsp commented Oct 14, 2024

Nice! Could you also make a PR to Malt to add this consistency as a test?

@jkrumbiegel
Collaborator

Nice! Could you also make a PR to Malt to add this consistency as a test?

@fonsp Wait, I thought I'd implement this logic in QuartoNotebookRunner; did you think it was a good fit for Malt itself?

@fonsp

fonsp commented Oct 14, 2024

No, my suggestion is to add a test to Malt for this behaviour, to guarantee that it will continue working in future Malt versions, since you depend on it.

@jkrumbiegel
Collaborator

To my understanding, I'm not depending on anything new in Malt with this workaround; I'm depending on Julia's channel and task behavior. If you change your implementation some day to use a stable task for all evals, then the workaround might not be necessary anymore, but it shouldn't break.
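
(For reference, the kind of consistency test being discussed might look roughly like this; purely an illustrative sketch reusing the channel helpers above, not code from any actual PR:)

using Malt, Test

w = Malt.Worker()
initialize_special_channels(w)

# One seeded draw plus one follow-up draw, both routed through the persistent task.
draws() = (
    remote_eval_fetch_channeled(w, quote
        using Random
        Random.seed!(42)
        randn()
    end),
    remote_eval_fetch_channeled(w, :(randn())),
)

# Both the seeded draw and the follow-up draw should repeat exactly.
@test draws() == draws()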
