Native support for a reproducible parallel RNG streams? #6

Open
pat-s opened this issue Jan 14, 2020 · 11 comments
Labels
help wanted Extra attention is needed

Comments

@pat-s

pat-s commented Jan 14, 2020

Currently, the {doRNG} package fills the gap for reproducible parallel streams in combination with the %dopar% operator.

@HenrikBengtsson and I were wondering if there ever was a discussion about integrated support for this in the {foreach} package?

Currently, there are multiple ways to achieve this in R, but none is really documented well here or in {doRNG}. We are a bit worried about possible confusion for the end user and the lack of documentation.

Would there be motivation/resources from your side to simplify things here?

cc @renozao

@hongooi73
Contributor

This sounds like a good idea, and much better than having doRNG sitting off to the side. I'm a bit short on spare cycles at the moment though. If anyone wants to contribute a PR, I'm happy to merge it.

hongooi73 added the help wanted label Jan 15, 2020
@hongooi73
Contributor

@richcalaway just wondering, would you have any comments on this? Did the issue of RNG streams ever come up in the history of foreach?

@richcalaway
Contributor

Indeed, it did. However, I'm afraid I punted on the whole idea once doRNG became available--for all of my examples, "it just worked", so I didn't spend any time trying to integrate it further.

@bwlewis

bwlewis commented Jan 29, 2020

Just chiming in with my 2c: as foreach is currently structured, support for parallel RNG streams needs to be backend-adapter specific, because each adapter can have run-time options that affect the operation of the RNG. See for instance the very general L'Ecuyer support in doRedis here: https://github.com/bwlewis/doRedis. That adapter needs to support run-time chunk size options, which affect the RNG.

It was a similar situation in the original foreach implementation with the old network spaces back end adapter (which supported similar run time options).

The upshot is that each parallel adapter should provide support for this!
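To illustrate why chunking options interact with the RNG, here is a minimal sketch in base R. It is not how doRedis or any other adapter is implemented; the helpers `make_streams()`, `use_stream()`, `make_chunks()`, `per_task()` and `per_chunk()` are hypothetical names made up for this example, and the "workers" are simulated sequentially. It compares giving every task its own L'Ecuyer-CMRG stream with giving every chunk one stream.

```r
library(parallel)  # ships with R; provides nextRNGStream()

# Hypothetical helper: derive `n` L'Ecuyer-CMRG stream seeds from one integer seed.
make_streams <- function(seed, n) {
  old_kind <- RNGkind("L'Ecuyer-CMRG")[1]  # switch generator, remember the old kind
  on.exit(RNGkind(old_kind))               # restore the previous generator kind
  set.seed(seed)
  streams <- vector("list", n)
  s <- .Random.seed
  for (i in seq_len(n)) {
    streams[[i]] <- s
    s <- nextRNGStream(s)                  # jump ahead to the next stream
  }
  streams
}

# Install a stream as the current RNG state, as a worker process would.
# Note: this overwrites the RNG state of the current session.
use_stream <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
}

# Split task indices into consecutive chunks of `chunk_size`.
make_chunks <- function(n, chunk_size) {
  split(seq_len(n), ceiling(seq_len(n) / chunk_size))
}

# One stream per task: every task installs its own stream before drawing.
per_task <- function(streams, chunk_size) {
  chunks <- make_chunks(length(streams), chunk_size)
  unlist(lapply(chunks, function(idx) {
    vapply(idx, function(i) {
      use_stream(streams[[i]])
      runif(1)
    }, numeric(1))
  }), use.names = FALSE)
}

# One stream per chunk: the stream is installed once and shared by the chunk.
per_chunk <- function(streams, chunk_size) {
  chunks <- make_chunks(length(streams), chunk_size)
  unlist(lapply(seq_along(chunks), function(k) {
    use_stream(streams[[k]])
    runif(length(chunks[[k]]))
  }), use.names = FALSE)
}

streams <- make_streams(seed = 42, n = 10)

# Does changing the chunk size from 2 to 5 leave the results unchanged?
same_per_task  <- identical(per_task(streams, 2),  per_task(streams, 5))
same_per_chunk <- identical(per_chunk(streams, 2), per_chunk(streams, 5))
```

With one stream per task, the draws do not depend on how tasks are grouped, so `same_per_task` holds. With one stream per chunk, the third task draws from a different stream depending on the chunk size, so the two runs disagree. An adapter that seeds per chunk therefore has to account for its own chunking options.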

I'm working on a guide to writing foreach adapters here: https://github.com/bwlewis/writing_foreach_adapters/ but the first draft does not cover RNGs.

@hongooi73
Contributor

Hi @bwlewis, it's great to hear from more Revo alumni!

If I understand correctly, does this mean that RNG support should still be part of the individual adapter packages?

@bwlewis

bwlewis commented Jan 29, 2020 via email

@renozao

renozao commented Feb 26, 2020 via email

@stephematician

stephematician commented Aug 16, 2020

I thought I'd add/ask the question: are 'reproducible' results possible at all with foreach/dopar? It looks to me like the way the inputs are partitioned across workers is non-deterministic. For example, most of the time the call to identical() here will return FALSE:

require(doParallel)

cl <- makeCluster(5)
registerDoParallel(cl)

# mark each core
clusterEvalQ(cl, a <- 0)
clusterApply(cl, 1:5, function(x) a <<- x)

A <- foreach(i=1:20) %dopar% {
    a
}
B <- foreach(i=1:20) %dopar% {
    a
}

identical(A, B)
# [1] FALSE
# :'(

stopCluster(cl)

As I have just come to understand, the way doRNG circumvents that is by generating a stream for every iteration, rather than, say, one RNG stream per core.
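As a minimal sketch of that behaviour, assuming the doParallel and doRNG packages are installed: the `.options.RNG` argument and the `%dorng%` operator come from doRNG, while the seed value 1234, the two-worker cluster, and the names `r1` and `r2` are arbitrary choices for illustration.

```r
library(doParallel)
library(doRNG)  # provides %dorng% and the .options.RNG argument

cl <- makeCluster(2)
registerDoParallel(cl)

# Same seed for both loops: doRNG derives one RNG stream per iteration from it,
# so the draws should not depend on which worker runs which iteration.
r1 <- foreach(i = 1:20, .options.RNG = 1234) %dorng% {
    runif(1)
}
r2 <- foreach(i = 1:20, .options.RNG = 1234) %dorng% {
    runif(1)
}

identical(r1, r2)

stopCluster(cl)
```

Because the stream is tied to the iteration index rather than to the worker, the scheduling differences shown in the example above should no longer affect the random draws.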

@HenrikBengtsson

HenrikBengtsson commented Aug 17, 2020

Hi @stephematician, I think your use-case/example is a bit unusual because there each worker is unique and it requires that one worker must not be substituted by another worker. Unless there is a really good reason for having this setup, I suggest rethinking the design. Parallel processing is much easier if you can treat all workers identically and if one worker goes down you can bring in another worker to perform the same task. In your example, I would make sure to pass a as part of the map-reduce call, i.e. here the foreach() call, and not as part of the worker setup/initialization. Is there a reason why you wouldn't do the latter?
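A minimal sketch of that suggestion, assuming doParallel is installed: the vector `labels` is a hypothetical stand-in for whatever per-iteration value was previously stored on each worker, and it is passed to foreach() as a second iteration variable named `a`.

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical per-iteration values, supplied through the foreach() call
# instead of being assigned on each worker during setup.
labels <- rep(1:5, length.out = 20)

A <- foreach(i = 1:20, a = labels) %dopar% {
    a
}
B <- foreach(i = 1:20, a = labels) %dopar% {
    a
}

identical(A, B)

stopCluster(cl)
```

Here the value of `a` depends only on the iteration, not on the worker that happens to run it, so the two results should agree and any worker can be substituted for another.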

@stephematician

stephematician commented Aug 18, 2020

> Hi @stephematician, I think your use-case/example is a bit unusual because there each worker is unique and it requires that one worker must not be substituted by another worker. Unless there is a really good reason for having this setup, I suggest rethinking the design.

Thanks for the reply, @HenrikBengtsson. I fully agree; I was 'thinking out loud' about how reproducibility can be thwarted if there is some state that is updated within each process.

> In your example, I would make sure to pass a as part of the map-reduce call, i.e. here the foreach() call, and not as part of the worker setup/initialization. Is there a reason why you wouldn't do the latter?

Ideally, hopefully, these are always passed via the iterator, yeah. It came up a while ago when I was trying to decide how to have multiple RNG streams within each process. It was easier than I first thought. I think one (correct?) way to handle that is something like:

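library(doRNG)  # provides %dorng%; RNGseq() and setRNG() come from rngtools

# `n`, `n_streams` (> 1) and `reproducible_seed` are assumed to be defined beforehand
foreach(i=1:n, .options.RNG=reproducible_seed) %dorng% {

    # get `n_streams` independent streams (a list with one seed per stream)
    process_RNG_seq <- RNGseq(n_streams)

    # helper to use a specific RNG stream
    eval_using_process_RNG <- function(j, ex, envir=parent.frame()) {
        # switch to stream j; setRNG() returns the previous (workspace) RNG state
        ws_RNG <- setRNG(process_RNG_seq[[j]])
        tryCatch({
            eval(ex, envir=envir)
        }, finally={
            # restore the workspace RNG and save the advanced state of stream j;
            # careful to use <<- here
            process_RNG_seq[[j]] <<- setRNG(ws_RNG)
        })
    }

    # ... now do the usual stuff in the loop, any time I need to use a
    # specific RNG stream, I call `eval_using_process_RNG()`

}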

I am fairly confident (although I haven't checked carefully) that the numbers will be random enough for most problems. In the above loop, process_RNG_seq is visible everywhere, which is a little 'sloppy', but that can be fixed with little effort.

(edit) perhaps more usefully, if the streams need to be repeated for each iteration, then the value of process_RNG_seq is determined before the loop, exported to each process, and copied at the start of each iteration (/edit)

@GitHunter0

It would indeed be really useful to have foreach handling random numbers with %dopar% without the need to change to %dorng%. The lack of direct support makes foreach / %dopar% way less universal since random numbers are everywhere. I really hope that integration happens.
Despite that, foreach is a brilliant package. Thank you.


8 participants