Native support for a reproducible parallel RNG streams? #6

pat-s · 2020-01-14T20:33:05Z

Currently, the {doRNG} package fills the gap for reproducible parallel streams in combination with the %dopar% operator.

@HenrikBengtsson and I were wondering if there ever was a discussion about an integrated support for this in the {foreach} package?

Currently, there are multiple ways to achieve this in R, but none is really documented well here or in {doRNG}. We are a bit worried about possible confusion for the end user and the lack of documentation.

Would there be motivation/resources from your side to simplify things here?

cc @renozao


hongooi73 · 2020-01-15T10:52:55Z

This sounds like a good idea, and much better than having doRNG sitting off to the side. I'm a bit short on spare cycles at the moment though. If anyone wants to contribute a PR, I'm happy to merge it.

hongooi73 · 2020-01-15T11:14:44Z

@richcalaway just wondering, would you have any comments on this? Did the issue of RNG streams ever come up in the history of foreach?

richcalaway · 2020-01-15T19:41:41Z

Indeed, it did. However, I'm afraid I punted on the whole idea once doRNG became available--for all of my examples, "it just worked", so I didn't spend any time trying to integrate it further.

bwlewis · 2020-01-29T07:02:23Z

Just chiming in with my 2c: As foreach is currently structured, support for parallel RNG streams needs to be specific to each backend adapter, because each adapter can have run-time options that affect the operation of the RNG. See, for instance, the very general L'Ecuyer support in doRedis here: https://github.com/bwlewis/doRedis. That adapter needs to support run-time chunk size options, which affect the RNG.

It was a similar situation in the original foreach implementation with the old network spaces back end adapter (which supported similar run time options).

The upshot is that each parallel adapter should provide support for this!

I'm working on a guide to writing foreach adapters here: https://github.com/bwlewis/writing_foreach_adapters/ but the first draft does not cover RNGs.

hongooi73 · 2020-01-29T09:16:18Z

Hi @bwlewis, it's great to hear from more Revo alumni!

If I understand correctly, does this mean that RNG support should still be part of the individual adapter packages?

bwlewis · 2020-01-29T19:24:38Z

This thread poses an important question with many possibilities to consider. Clearly, specific adapter packages need to manage details associated with parallel RNGs. For instance, doRedis not only needs to deal with run-time dynamic cluster sizes but also variable "chunk" sizes (multiple loop iterations per worker). Even worse, doRedis is fault tolerant, so worker R processes are free to join or leave the computation while it is running! Thus, foreach needs to make sure that a re-scheduled task has the correct parallel RNG state to guarantee reproducible streams. So yes, there are cases where adapters absolutely need to manage that.

Having said that, we might consider a few possibilities to make things more bulletproof generally. Perhaps foreach can implement a default L'Ecuyer RNG scheme, but allow fancy adapters like doRedis to override that if they need to. Alternatively, maybe foreach can simply specify an RNG intention as part of its adapter API and leave it up to the adapter to implement things.

We should also probably coordinate with Henrik to come up with a scheme that preserves interoperability as much as possible between future and foreach. Just some ideas...
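As a rough illustration of what a default L'Ecuyer scheme could look like, here is a minimal base-R sketch. It assumes one L'Ecuyer-CMRG stream per loop iteration (rather than per worker or per chunk), so the chunk size does not affect the numbers drawn. The helpers `make_streams()`, `run_iteration()` and `run_chunked()` are hypothetical names for this example; they are not part of foreach or of any adapter.

```r
# Use the L'Ecuyer-CMRG generator, which supports independent streams.
RNGkind("L'Ecuyer-CMRG")

# Hypothetical helper: derive one RNG stream (seed vector) per iteration.
make_streams <- function(n, seed) {
  set.seed(seed)
  streams <- vector("list", n)
  s <- .Random.seed
  for (i in seq_len(n)) {
    streams[[i]] <- s
    s <- parallel::nextRNGStream(s)
  }
  streams
}

# Hypothetical helper: run iteration i on its own stream.
run_iteration <- function(i, streams) {
  assign(".Random.seed", streams[[i]], envir = globalenv())
  runif(1)
}

# Hypothetical helper: process the iterations in chunks of a given size,
# the way an adapter might hand several iterations to one worker.
run_chunked <- function(n, chunk_size, streams) {
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
  unlist(lapply(chunks, function(idx) {
    vapply(idx, run_iteration, numeric(1), streams = streams)
  }), use.names = FALSE)
}

streams <- make_streams(10, seed = 42)
res_chunk1 <- run_chunked(10, chunk_size = 1, streams)
res_chunk4 <- run_chunked(10, chunk_size = 4, streams)

# Same numbers regardless of how iterations were grouped into chunks.
identical(res_chunk1, res_chunk4)
```

Because every iteration re-seeds from its own stream, the grouping of iterations into chunks only changes where the work happens, not which random numbers each iteration sees.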


renozao · 2020-02-26T15:31:59Z

Hi, jumping in with some comments and a bit of a delay.

I think that users and foreach backend developers could benefit a lot from having RNG management handled within foreach. It would be simpler than having the extra doRNG layer. Usage could be toggled by a global runtime option to ensure backward compatibility. Backends could also declare at registration whether they want to use the built-in RNG manager or do their own thing.

Integrating the RNG stream management with foreach should be relatively straightforward. In the end, what doRNG does is prepare the sequence of RNG streams and prepend the expression of each task with a single command that sets the RNG to the right stream. This should even work with Redis re-scheduling, since Redis uses the modified task and would re-run the RNG seeding at the beginning of the task.

The current doRNG implementation is admittedly hacky and fragile in the absence of a pre-run hook that could be configured to be executed before each task. I notably rely on some of foreach's internal implementation details, which could change at any time. In the end, the main `foreach` function has everything needed to do this part.

Hope this helps.
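The "prepend a seeding command to each task" idea can be sketched in base R as follows. This is only an illustration under stated assumptions, not doRNG's actual code: `with_stream()` is a hypothetical helper, and the streams come from `parallel::nextRNGStream()`.

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(1)

# Two independent streams, one per task.
seed1 <- .Random.seed
seed2 <- parallel::nextRNGStream(seed1)

# Hypothetical helper: return a new expression that first sets the RNG
# state to `seed` and then evaluates the original task expression.
with_stream <- function(expr, seed) {
  bquote({
    assign(".Random.seed", .(seed), envir = globalenv())
    .(expr)
  })
}

task  <- quote(rnorm(3))
task1 <- with_stream(task, seed1)
task2 <- with_stream(task, seed2)

first_run <- eval(task1)
invisible(runif(100))  # disturb the RNG, as if the worker did other work
rerun <- eval(task1)   # simulate the same task being re-scheduled
other <- eval(task2)   # a different task uses a different stream

identical(first_run, rerun)
```

Since the seeding travels with the task expression itself, a re-run of the modified task starts from the same RNG state no matter what the worker did before.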

stephematician · 2020-08-16T08:19:56Z

I thought I'd add/ask the question: are 'reproducible' results possible at all with foreach/dopar? It looks to me like the way the inputs are partitioned across workers is non-deterministic. For example, most of the time the call to identical() here will return FALSE:

require(doParallel)

cl <- makeCluster(5)
registerDoParallel(cl)

# mark each core
clusterEvalQ(cl, a <- 0)
clusterApply(cl, 1:5, function(x) a <<- x)

A <- foreach(i=1:20) %dopar% {
    a
}
B <- foreach(i=1:20) %dopar% {
    a
}

identical(A, B)
# [1] FALSE
# :'(

stopCluster(cl)

As I have just come to understand, the way doRNG circumvents that is by generating a stream for every iteration, rather than, say, one RNG stream per core.
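The difference between one stream per core and one stream per iteration can be simulated sequentially in base R. The sketch below is an illustration only; it does not use doRNG, and `next_streams()`, `draw_from()`, `per_worker()` and `per_iteration()` are hypothetical helpers. Two schedules assign the same six iterations to two simulated workers in different ways.

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(2020)

# Hypothetical helper: n independent L'Ecuyer-CMRG streams.
next_streams <- function(n) {
  out <- vector("list", n)
  s <- .Random.seed
  for (k in seq_len(n)) {
    out[[k]] <- s
    s <- parallel::nextRNGStream(s)
  }
  out
}

# Hypothetical helper: draw one number from a given RNG state and
# return both the value and the advanced state.
draw_from <- function(seed) {
  assign(".Random.seed", seed, envir = globalenv())
  value <- runif(1)
  list(value = value, seed = get(".Random.seed", envir = globalenv()))
}

streams <- next_streams(6)

# One stream per worker: each worker keeps drawing from its own state,
# so the result of iteration i depends on which worker ran it and when.
per_worker <- function(schedule) {
  state <- streams[1:2]
  res <- numeric(length(schedule))
  for (i in seq_along(schedule)) {
    d <- draw_from(state[[schedule[i]]])
    res[i] <- d$value
    state[[schedule[i]]] <- d$seed
  }
  res
}

# One stream per iteration: the worker assignment is irrelevant.
per_iteration <- function(schedule) {
  vapply(seq_along(schedule),
         function(i) draw_from(streams[[i]])$value, numeric(1))
}

sched_a <- c(1, 2, 1, 2, 1, 2)
sched_b <- c(1, 1, 2, 2, 1, 2)

identical(per_worker(sched_a), per_worker(sched_b))
identical(per_iteration(sched_a), per_iteration(sched_b))
```

With per-worker streams the two schedules give different results, while per-iteration streams give the same results under both schedules.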

HenrikBengtsson · 2020-08-17T19:42:00Z

Hi @stephematician, I think your use-case/example is a bit unusual because there each worker is unique and it requires that one worker must not be substituted by another worker. Unless there is a really good reason for having this setup, I suggest rethinking the design. Parallel processing is much easier if you can treat all workers identically and if one worker goes down you can bring in another worker to perform the same task. In your example, I would make sure to pass a as part of the map-reduce call, i.e. here the foreach() call, and not as part of the worker setup/initialization. Is there a reason why you wouldn't do the latter?

stephematician · 2020-08-18T00:22:10Z

> Hi @stephematician, I think your use-case/example is a bit unusual because there each worker is unique and it requires that one worker must not be substituted by another worker. Unless there is a really good reason for having this setup, I suggest rethinking the design.

Thanks for the reply @HenrikBengtsson I fully agree; I was 'thinking out loud' about how reproducibility can be thwarted if there is some state that is updated within each process.

> In your example, I would make sure to pass a as part of the map-reduce call, i.e. here the foreach() call, and not as part of the worker setup/initialization. Is there a reason why you wouldn't do the latter?

Ideally, hopefully, these are always passed via the iterator, yeah. It came up a while ago when I was trying to decide how to have multiple rng streams within each process. It was easier than I first thought. I think one (correct?) way to handle that is something like:

foreach(i=1:n, .options.RNG=reproducible_seed) %dorng% {

    # get `n_streams` independent streams (a list of seeds)
    process_RNG_seq <- RNGseq(n_streams)

    # helper to use a specific RNG stream
    eval_using_process_RNG <- function(j, ex, envir=parent.frame()) {
        # switch to stream j, keeping the RNG state we came from
        ws_RNG <- setRNG(process_RNG_seq[[j]])
        tryCatch({
            eval(ex, envir=envir)
        }, finally={
            # restore the previous RNG and save the advanced state of
            # stream j; careful to use <<- here
            process_RNG_seq[[j]] <<- setRNG(ws_RNG)
        })
    }

    # ... now do the usual stuff in the loop, any time I need to use a
    # specific RNG stream, I call `eval_using_process_RNG()`

}

I am fairly confident (although I haven't checked carefully) that the numbers will be random enough for most problems. In the above loop, process_RNG_seq is visible everywhere, which is a little 'sloppy', but that can be fixed with little effort.

(edit) perhaps more usefully, if the streams need to be repeated for each iteration, then the value of process_RNG_seq is determined before the loop, exported to each process, and copied at the start of each iteration (/edit)

GitHunter0 · 2021-03-29T01:53:13Z

It would indeed be really useful to have foreach handling random numbers with %dopar% without the need to change to %dorng%. The lack of direct support makes foreach / %dopar% way less universal since random numbers are everywhere. I really hope that integration happens.
Despite that, foreach is a brilliant package. Thank you

hongooi73 added the "help wanted" label Jan 15, 2020

HenrikBengtsson mentioned this issue Feb 26, 2020

About a best practice vignette renozao/doRNG#15

Open

HenrikBengtsson mentioned this issue Dec 12, 2020

Discussion: About reproducible parallel RNG streams HenrikBengtsson/doFuture#41

Closed

HenrikBengtsson mentioned this issue May 4, 2021

Unreliable random numbers produced when using doFuture backend tidymodels/tune#377

Closed
