Native support for a reproducible parallel RNG streams? #6
Comments
This sounds like a good idea, and much better than having doRNG sitting off to the side. I'm a bit short on spare cycles at the moment though. If anyone wants to contribute a PR, I'm happy to merge it. |
@richcalaway just wondering, would you have any comments on this? Did the issue of RNG streams ever come up in the history of foreach? |
Indeed, it did. However, I'm afraid I punted on the whole idea once doRNG became available--for all of my examples, "it just worked", so I didn't spend any time trying to integrate it further. |
Just chiming in with my 2c: As foreach is currently structured, support for parallel RNG streams needs to be backend adapter specific, because each adapter can have run-time options that affect the operation of the RNG; see, for instance, the very general L'Ecuyer support in doRedis here: https://github.com/bwlewis/doRedis. That adapter needs to support run-time chunk size options, which affect the RNG. It was a similar situation in the original foreach implementation with the old NetWorkSpaces backend adapter (which supported similar run-time options). The upshot is that each parallel adapter should provide support for this! I'm working on a guide to writing foreach adapters here: https://github.com/bwlewis/writing_foreach_adapters/ but the first draft does not cover RNGs. |
Hi @bwlewis, it's great to hear from more Revo alumni! If I understand correctly, does this mean that RNG support should still be part of the individual adapter packages? |
This thread poses an important question with many possibilities to consider.
Clearly, specific adapter packages need to manage details associated
with parallel RNGs. For instance, doRedis not only needs to deal with
run-time dynamic cluster sizes but also variable "chunk" sizes
(multiple loop iterations per worker). Even worse, doRedis is fault
tolerant, so worker R processes are free to join or leave the
computation while it is running! Thus, foreach needs to make sure that
a re-scheduled task has the correct parallel RNG state to guarantee
reproducible streams.
So yes, there are cases where adapters absolutely need to manage that.
Having said that, we might consider a few possibilities to make things
more bulletproof generally:
Perhaps foreach can implement a default L'Ecuyer RNG scheme, but allow
fancy adapters like doRedis to override that if they need to.
Alternatively, maybe foreach can simply specify an RNG intention as
part of its adapter API and leave it up to the adapter to implement
things.
And we should probably coordinate with Henrik to come up with a
scheme that preserves interoperability as much as possible between
future and foreach.
Just some ideas...
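To make the chunking and re-scheduling concern above concrete, here is a minimal sketch of a per-task L'Ecuyer scheme using only base R's parallel package. It is not how foreach or doRedis does it; `make_task_streams()`, `run_task()` and `run_in_chunks()` are hypothetical helpers, and the chunking is simulated sequentially rather than on real workers. The point is only that when every task installs its own precomputed stream, the result does not depend on chunk size or on which worker reruns a task.

```r
library(parallel)

# Hypothetical helper: one L'Ecuyer-CMRG stream per task, derived from one seed.
# Note: this changes the RNG kind and state of the calling session.
make_task_streams <- function(n_tasks, seed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  streams <- vector("list", n_tasks)
  streams[[1]] <- get(".Random.seed", envir = globalenv())
  for (i in seq_len(n_tasks)[-1]) {
    streams[[i]] <- nextRNGStream(streams[[i - 1]])
  }
  streams
}

# Hypothetical task runner: install the task's own stream, then draw.
run_task <- function(i, streams) {
  assign(".Random.seed", streams[[i]], envir = globalenv())
  runif(1)
}

# Simulate an adapter that hands out tasks in chunks of a given size.
run_in_chunks <- function(n_tasks, chunk_size, streams) {
  chunks <- split(seq_len(n_tasks), ceiling(seq_len(n_tasks) / chunk_size))
  unlist(lapply(chunks, function(idx) {
    vapply(idx, run_task, numeric(1), streams = streams)
  }), use.names = FALSE)
}

streams    <- make_task_streams(10, seed = 42)
res_chunk2 <- run_in_chunks(10, chunk_size = 2, streams)
res_chunk5 <- run_in_chunks(10, chunk_size = 5, streams)
res_rerun  <- run_task(7, streams)  # task 7 "re-scheduled" later

identical(res_chunk2, res_chunk5)
identical(res_rerun, res_chunk2[7])
```

An adapter with different needs could replace `make_task_streams()` with its own scheme, which is roughly the "default plus override" idea mentioned above.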
…On 1/29/20, Hong Ooi ***@***.***> wrote:
Hi @bwlewis, it's great to hear from more Revo alumni!
If I understand correctly, does this mean that RNG support should still be
part of the individual adapter packages?
|
Hi,
jumping in with some comments and a bit of a delay.
I think that users and foreach backend developers could benefit a lot from
having RNG management handled within foreach. Simpler than having the extra
doRNG layer.
Usage could be toggled via a global runtime option to ensure backward
compatibility.
Backends could also declare at registration whether they want to use the
built-in RNG manager or do their own thing.
Integrating the RNG stream management with foreach should be relatively
straightforward. In the end, what I do in doRNG is prepare the sequence of
RNG streams and prepend to the expression of each task a single
command that sets the RNG to the right stream. This should even work
with Redis re-scheduling since Redis uses the modified task and would
re-run the RNG seeding at the beginning of the task.
The current doRNG implementation is admittedly hacky and fragile in the absence
of a pre-run hook that could be configured to be executed before each task.
Notably, I rely on some of foreach's internal implementation details that
could change at any time.
In the end the main `foreach` function has everything needed to do this
part.
Hope this helps.
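The "prepend a single seeding command to each task" idea described above can be sketched in a few lines of base R. This is an illustration only, not doRNG's actual code: `with_rng_stream()` is a hypothetical helper, and the stream is built with `parallel::nextRNGStream()` purely for the example. Because the seeding travels with the task expression, evaluating the same task again (as a re-scheduling backend would) reproduces the same draws.

```r
library(parallel)

# Hypothetical helper: return a new expression that first installs `stream`
# as the RNG state and then evaluates the original task expression.
with_rng_stream <- function(expr, stream) {
  bquote({
    assign(".Random.seed", .(stream), envir = globalenv())
    .(expr)
  })
}

# Build one L'Ecuyer-CMRG stream for the task (changes the session's RNG kind).
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
stream <- nextRNGStream(.Random.seed)

task <- with_rng_stream(quote(rnorm(3)), stream)

first_run  <- eval(task)
invisible(runif(100))  # unrelated RNG use in between
second_run <- eval(task)

identical(first_run, second_run)
```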
|
I thought I'd add/ask the question: are 'reproducible' results possible at all with foreach/dopar? It looks to me like the way the inputs are partitioned is non-deterministic. For example, most of the time the call to `identical(A, B)` below returns FALSE:
require(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
# mark each core
clusterEvalQ(cl, a <- 0)
clusterApply(cl, 1:5, function(x) a <<- x)
A <- foreach(i=1:20) %dopar% {
a
}
B <- foreach(i=1:20) %dopar% {
a
}
identical(A, B)
# [1] FALSE
# :'(
stopCluster(cl)
As I have just come to understand, the way doRNG circumvents that is by generating a stream for every iteration, rather than, say, one RNG stream per core. |
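For comparison, here is a minimal sketch of the per-iteration approach with doRNG itself, assuming the doParallel and doRNG packages are installed and that two local workers are available (the worker count is an arbitrary choice for the example). Each iteration draws from its own stream, so which worker happens to pick up an iteration should not matter, and two runs started from the same seed should compare identical.

```r
library(doParallel)
library(doRNG)

cl <- makeCluster(2)
registerDoParallel(cl)

# Same seed before each loop; %dorng% derives one RNG stream per iteration.
set.seed(123)
A <- foreach(i = 1:20) %dorng% {
  runif(1)
}

set.seed(123)
B <- foreach(i = 1:20) %dorng% {
  runif(1)
}

same <- identical(A, B)
same

stopCluster(cl)
```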
Hi @stephematician, I think your use-case/example is a bit unusual because there each worker is unique and it requires that one worker must not be substituted by another worker. Unless there is a really good reason for having this setup, I suggest rethinking the design. Parallel processing is much easier if you can treat all workers identically and if one worker goes down you can bring in another worker to perform the same task. In your example, I would make sure to pass |
Thanks for the reply @HenrikBengtsson I fully agree; I was 'thinking out loud' about how reproducibility can be thwarted if there is some state that is updated within each process.
Ideally, hopefully, these are always passed via the iterator, yeah. It came up a while ago when I was trying to decide how to have multiple RNG streams within each process. It was easier than I first thought. I think one (correct?) way to handle that is something like:
foreach(i=1:n, .options.RNG=reproducible_seed) %dorng% {
  # get `n_streams` independent streams (RNGseq returns a list of seeds)
  process_RNG_seq <- RNGseq(n_streams)
  # helper to evaluate `ex` using a specific RNG stream `j`
  eval_using_process_RNG <- function(j, ex, envir=parent.frame()) {
    # switch to stream j; setRNG returns the previous RNG state
    ws_RNG <- setRNG(process_RNG_seq[[j]])
    tryCatch({
      eval(ex, envir=envir)
    }, finally={
      # restore the previous RNG and save the advanced state of stream j;
      # careful to use <<- here
      process_RNG_seq[[j]] <<- setRNG(ws_RNG)
    })
  }
  # ... now do the usual stuff in the loop, any time I need to use a
  # specific RNG stream, I call `eval_using_process_RNG()`
}
I am fairly confident (although I haven't checked carefully) that the numbers will be random enough for most problems. In the above loop; (edit) perhaps more usefully, if the streams need to be repeated for each iteration, then the value of |
It would indeed be really useful to have |
Currently, the {doRNG} package fills the gap for reproducible parallel streams in combination with the %dopar% operator. @HenrikBengtsson and I were wondering if there ever was a discussion about integrated support for this in the {foreach} package?
Currently, there are multiple ways to achieve this in R, but none is really documented well here or in {doRNG}. We are a bit worried about possible confusion for the end user and the lack of documentation.
Would there be motivation/resources from your side to simplify things here?
cc @renozao