REQ - Guidance on writing R code that will run in parallel #10

Open
terrymclaughlin opened this issue Mar 9, 2023 · 6 comments · May be fixed by #110
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@terrymclaughlin
Contributor

For example:

  • Explain single-threaded vs multi-threaded and what parallel processing is
  • Explore the futureverse! 🚀
    • Correctly identify the number of CPUs available to the session with {parallelly}
    • Write functions and iterate using the {furrr} package, rather than {purrr}
  • Use the {multidplyr} backend with {dplyr}
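
As a rough sketch of the {parallelly} point above: on a shared server, the number of cores the hardware reports is usually not the number a session has been allocated. The variable names below are illustrative, and leaving one core free for the main R process is an assumption, not an agreed recommendation.

```r
library(parallelly)

# Cores the hardware reports. On a shared server this is often far more
# than the session has actually been allocated (and it can be NA).
hardware_cores <- parallel::detectCores()

# availableCores() also respects limits set through R options, environment
# variables, job schedulers and (on Linux) cgroups, so it is the safer number.
session_cores <- availableCores()

# Assumption for illustration: keep one core for the main R process,
# but never drop below one worker.
n_workers <- max(1L, session_cores - 1L)

n_workers
```

The resulting `n_workers` value could then be passed to whichever backend is chosen, for example as the `workers` argument of `future::plan()`.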

Other useful links:

@terrymclaughlin terrymclaughlin self-assigned this Mar 9, 2023
@terrymclaughlin
Contributor Author

@ciarag01 @jakeybob @Moohan @rmccreath

Just alerting you to this issue. My plan is to draft some guidance, as it's clear that most R code in PHS is written as single-threaded and doesn't take advantage of the multiple CPUs available in a Posit Workbench session. Running code in parallel could give significant performance improvements when processing large datasets.

@terrymclaughlin terrymclaughlin added the enhancement New feature or request label Mar 9, 2023
@terrymclaughlin
Contributor Author

@CliveWG @fraserstirrat

In case you see any queries coming in requesting guidance on parallel processing, you can tell people that this is on our radar and we're developing guidance for this.

@Moohan
Member

Moohan commented Mar 9, 2023

furrr is low-hanging fruit. It requires the code to already be written with purrr, but if it is, it's a super simple switch. There is a bit of overhead in setting up the workers, so it's a subjective call on when it's worth it. I guess that applies to all of the parallelisation methods though!
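
To illustrate how small the switch can be, here is a minimal sketch. The function `slow_square()` and the choice of two workers are made up for the example; real code would need a task slow enough to outweigh the worker start-up cost mentioned above.

```r
library(purrr)
library(future)
library(furrr)

# Hypothetical stand-in for a slow piece of work
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Sequential version with purrr
res_seq <- map_dbl(1:8, slow_square)

# Parallel version: set a plan once, then swap map_dbl() for future_map_dbl()
plan(multisession, workers = 2)
res_par <- future_map_dbl(1:8, slow_square)

# Go back to sequential processing, which also shuts the workers down
plan(sequential)

# Both versions should give the same answer
all.equal(res_seq, res_par)
```

Wrapping each version in `system.time()` is one way to judge whether the set-up overhead is worth it for a given job.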

@jakeybob

jakeybob commented Mar 9, 2023

I'm not sure what the best route is here. All the different available methods make it quite thorny.

I don't think purrr is used that widely internally at the moment. Or at least, I suspect any code that uses purrr heavily was probably written by a techy person who would be able to convert to furrr easily on their own.

And, I feel like any guidance along the lines of "here are several different ways you can do this" won't be well received.

So do we choose one way to recommend...? This would be better for consistency and support/training but a) I'm not convinced this is the best idea and b) even if it is, I don't know which method would be the best to pick...

Should probably sidestep the foreach and doParallel side of things and go with furrr or multidplyr though I guess? They're both tidyverse friendly. multidplyr probably slots into existing dplyr code blocks the easiest and has the smaller mental overhead, but 🤷🏻
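
For comparison, a minimal sketch of how multidplyr might slot into an existing dplyr block. The data frame, column names and the two-worker cluster are invented for the example, and this is not a recommendation of one method over the other.

```r
library(dplyr)
library(multidplyr)

# Start a small cluster of worker processes (two is an arbitrary choice)
cluster <- new_cluster(2)

# Toy data: four groups of 25 rows each
df <- tibble(
  group = rep(letters[1:4], each = 25),
  value = seq_len(100)
)

result <- df %>%
  group_by(group) %>%
  partition(cluster) %>%               # spread the groups across the workers
  summarise(mean_value = mean(value)) %>%
  collect() %>%                        # bring the results back to the main session
  arrange(group)

result
```

The only additions to ordinary dplyr code are `new_cluster()`, `partition()` and `collect()`. Copying the data out to the workers has its own cost, so this is more likely to pay off on large grouped datasets.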

@rmccreath rmccreath transferred this issue from Public-Health-Scotland/R-Resources Mar 22, 2023
@rmccreath rmccreath changed the title Write guidance on writing R code that will run in parallel i.e. on multiple CPUs REQ - Guidance on writing R code that will run in parallel Mar 22, 2023
@rmccreath rmccreath added documentation Improvements or additions to documentation and removed enhancement New feature or request labels Mar 22, 2023
@Moohan
Member

Moohan commented Mar 29, 2023

Thought this might be the best place to ask this question, and if no one knows it's just another thing to add to future guidance!

If I use plan(multisession), which is the one you're led to when using RStudio, will this create new nodes on PWB?

For example, if I have a session with 8 CPUs and 4GB of RAM, will this be shared among the 'sessions' or will it spawn new nodes for the new sessions, in which case what limits/specs do they have?

@jakeybob

I suspect this will run in the current session only and the workers spawned will be more equivalent to "background jobs" (running as independent R processes but sharing the parent session total resources) than "workbench jobs" (starting new sessions with their own resources).

Only one way to find out for sure though – give it a punt and see what happens? 😀
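
One way to give it a punt, as a rough sketch: ask a worker for its process ID and host name and compare them with the main session. This only shows whether the workers are separate R processes on the same machine. It says nothing about how CPU or memory limits are applied, which would still need checking on PWB itself.

```r
library(future)

plan(multisession, workers = 2)

# Details of the main R session
main_pid  <- Sys.getpid()
main_host <- Sys.info()[["nodename"]]

# Ask one of the workers for the same details
worker_info <- value(future(list(
  pid  = Sys.getpid(),
  host = Sys.info()[["nodename"]]
)))

plan(sequential)

# A different PID means a separate R process; the same host name
# suggests the worker is running alongside the main session.
worker_info$pid != main_pid
identical(worker_info$host, main_host)
```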

@terrymclaughlin terrymclaughlin linked a pull request Jun 18, 2024 that will close this issue

4 participants