
Restructure Parallel Runner #22

Open
wants to merge 1 commit into main

Conversation


@DE0CH DE0CH commented Dec 13, 2022

Instead of letting irace launch new target runners, we let Python control the target-runner workers. This improves performance and also fixes the parallel runner crashing on Windows (fixes #16). I ran tests/test_dual_annealing: excluding import and setup, the run time of the tuning was reduced from 1.46s to 0.87s.

A few implementation details to explain:

  • This does not make use of exec.target.runner, but instead uses check.output.target.runner (PR: Export check output target runner MLopez-Ibanez/irace#49) to check the output of the target runner. I did this because I noticed that passing an R function crashes the queue so badly that it just becomes empty forever without raising any exception. I'm not sure why; perhaps one day when I have more time I'll investigate further and report the bug or submit a PR to rpy2 or Python. As a result, this does not benefit from the retrying feature in exec.target.runner, but it doesn't seem to be missing anything else.
  • Instead of passing the scenario object given by irace to the target runner, it keeps track of the scenario object (before it has been processed by irace). I think this is faster because it avoids R-to-Python conversion, and R objects sometimes crash a queue as well.
  • It removes the parallel field from scenario before passing it to irace, because otherwise irace seems to still read the parallel field and tries to do something that makes it crash on Windows.
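The restructuring can be sketched, very roughly, as a plain multiprocessing worker pool fed through queues (all names below are illustrative, not the actual iracepy code; squaring a number stands in for a target-runner evaluation):

```python
from multiprocessing import Process, Queue

def worker(task_q, result_q):
    # Each worker keeps evaluating jobs until it sees the None sentinel,
    # instead of irace launching a fresh process per evaluation.
    while True:
        job = task_q.get()
        if job is None:
            break
        job_id, x = job
        result_q.put((job_id, x * x))  # stand-in for the target runner

def run_parallel(jobs, n_workers=2):
    task_q, result_q = Queue(), Queue()
    workers = [Process(target=worker, args=(task_q, result_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for job in enumerate(jobs):
        task_q.put(job)
    for _ in workers:
        task_q.put(None)  # one sentinel per worker
    # Drain results before joining so workers never block on a full pipe.
    results = dict(result_q.get() for _ in jobs)
    for w in workers:
        w.join()
    return [results[i] for i in range(len(jobs))]
```

The point is that the pool is created once and reused across evaluations, which is where the reported speedup would come from.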

Depends on MLopez-Ibanez/irace#49.

@DE0CH DE0CH force-pushed the restructure-parallel branch 2 times, most recently from bed7752 to cefa1c1 Compare December 13, 2022 16:52
@MLopez-Ibanez

* It removes the `parallel` field from scenario before passing it to irace, because otherwise irace seems to still read the parallel field and tries to do something that makes it crash on Windows.

I think I know why this is happening. Could you try again with the latest version from github? I don't want to work-around bugs in irace. I'd rather fix irace and avoid having to do weird things in iracepy.

@MLopez-Ibanez

* Instead of passing the scenario object given by irace to the target runner, it keeps track of the scenario object (before it has been processed by irace). I think this is faster because it avoids R-to-Python conversion, and R objects sometimes crash a queue as well.

The downside is that irace will set defaults and do other things that change the scenario object and are crucial for correct behavior. Perhaps an intermediate solution is to convert the scenario returned by irace back to Python once and save it.

@DE0CH

DE0CH commented Dec 15, 2022

The downside is that irace will set defaults and do other things that change the scenario object and are crucial for correct behavior. Perhaps an intermediate solution is to convert the scenario returned by irace back to Python once and save it.

I see. Perhaps I could use checkScenario in irace?

@MLopez-Ibanez

I see. Perhaps I could use checkScenario in irace?

Sure, that is its purpose.

@DE0CH

DE0CH commented Dec 17, 2022

Oh, but I noticed there's a FIXME: "FIXME: This function should only do checks and return TRUE/FALSE." Does this mean the function might return TRUE/FALSE in the future, and whatever relies on it will break?

@MLopez-Ibanez

Oh, but I noticed there's a FIXME: "FIXME: This function should only do checks and return TRUE/FALSE." Does this mean the function might return TRUE/FALSE in the future, and whatever relies on it will break?

That is the long-term plan but the chances of this happening soon are very small. It will need to be split into two functions anyway and one of them will set the defaults.

@DE0CH DE0CH force-pushed the restructure-parallel branch from a7be18b to 4c5b379 Compare December 17, 2022 19:33
@codecov-commenter

codecov-commenter commented Dec 17, 2022

Codecov Report

Merging #22 (68ff4ae) into main (b2177a6) will decrease coverage by 0.70%.
The diff coverage is 87.03%.

❗ Current head 68ff4ae differs from pull request most recent head 8b94a72. Consider uploading reports for the commit 8b94a72 to get more accurate results

@@            Coverage Diff             @@
##             main      #22      +/-   ##
==========================================
- Coverage   92.37%   91.66%   -0.71%     
==========================================
  Files          10       11       +1     
  Lines         354      384      +30     
==========================================
+ Hits          327      352      +25     
- Misses         27       32       +5     
Impacted Files Coverage Δ
tests/utils.py 73.33% <73.33%> (ø)
src/irace/__init__.py 88.98% <86.95%> (-0.85%) ⬇️
tests/test_data_conversion.py 86.20% <100.00%> (+5.25%) ⬆️
tests/test_dual_annealing.py 94.87% <100.00%> (+1.12%) ⬆️


@DE0CH DE0CH changed the base branch from dev to main December 17, 2022 19:59
@DE0CH DE0CH force-pushed the restructure-parallel branch from 4c5b379 to c9bcba7 Compare December 19, 2022 01:44
@DE0CH DE0CH marked this pull request as draft December 19, 2022 15:11
@DE0CH

DE0CH commented Dec 19, 2022

There's a bug with this. Let me fix it first.

@DE0CH DE0CH force-pushed the restructure-parallel branch 3 times, most recently from 15f8cb6 to 0eb8401 Compare December 19, 2022 17:59
@DE0CH DE0CH force-pushed the restructure-parallel branch 2 times, most recently from 1a39749 to b3604a2 Compare December 21, 2022 03:22
@DE0CH DE0CH marked this pull request as ready for review December 21, 2022 03:23
@DE0CH DE0CH requested a review from MLopez-Ibanez December 21, 2022 03:23
@DE0CH DE0CH force-pushed the restructure-parallel branch from b3604a2 to 3c88734 Compare December 21, 2022 03:29
@@ -149,10 +187,15 @@ def set_initial(self, x):

def run(self):
"""Returns a Pandas DataFrame, one column per parameter and the row index are the configuration ID."""
self.scenario['targetRunner'] = self.r_target_runner
self.r_target_runner_parallel = make_target_runner_parallel(self.target_aq, self.target_rq, self._pkg.check_output_target_runner, self.scenario.copy())
self.scenario['targetRunnerParallel'] = self.r_target_runner_parallel
Contributor

This seems to always use self.scenario['targetRunnerParallel'] but it should only be used if the user actually wants to run in parallel.

Collaborator Author

Good catch. I thought about just spawning a worker regardless, but that might create a problem on platforms where the user's process count is limited. I'll change it.

Collaborator Author

In the change, I made targetRunnerParallel call the function directly if parallel is 0 or 1. I hope this is ok.
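The fallback described here amounts to a small dispatch check (hypothetical helper names, not the actual iracepy code):

```python
def dispatch(experiments, parallel, run_one, run_pool):
    """Run each experiment directly when there is no real parallelism,
    otherwise hand the whole batch to the worker pool."""
    if parallel <= 1:  # parallel == 0 or 1: no workers to spawn
        return [run_one(e) for e in experiments]
    return run_pool(experiments)
```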

Collaborator Author

If everything is otherwise the same, I would prefer to use targetRunnerParallel instead of targetRunner even when the user does not use parallel, because then I could pass pure Python functions and pure Python objects through the queues.

Contributor

I still think that your targetRunnerParallel is much more complex than the code it replaces. Is the main benefit avoiding conversions between R and Python to increase speed? Also, forcing users to use targetRunnerParallel means that they cannot use any of the other parallelization mechanisms in irace, such as MPI/qsub and their own targetRunnerParallel; they cannot use the automatic retries, and maybe other things that I have not even thought of.

Thus, I would rather prefer to add an option to irace.init(), for example "python_parallel=True", that selects either your Python-based targetRunnerParallel or simply the R implementation, without changing targetRunnerParallel as it is now.

Collaborator Author

To be honest, I think it's partly due to my not understanding how to use MPI and not seeing its benefit. I started this because I thought it would make things faster, and there was also a garbage-collection problem with the old code which I can't reproduce and whose cause I don't know, but I thought spawning new Python threads from an embedded R might trigger some edge case in the GC.

It is possible to avoid conversion between R and Python without this targetRunnerParallel. But does MPI create new threads for each target-runner run, or does it reuse the same thread like a worker pool? If it creates a new thread every time, it might have more overhead than the worker queue that I wrote here.

The only remaining benefit is that it fixes the crash on Windows (#16). I think the proper way to fix that is in the irace R code / MPI / rpy2, but the error seems to be caused by some low-level dynamically linked library / shared symbol issue, and I don't think I have the expertise to understand and fix it.

Thus, I would rather prefer to add an option to irace.init(), for example "python_parallel=True", that selects either your Python-based targetRunnerParallel or simply the R implementation, without changing targetRunnerParallel as it is now.

I would actually prefer not to do this. It creates some confusion for the user, and it also creates more work for us in maintaining two separate code paths for things like #30. So I would propose that we either close this PR and declare it a failed experiment, or we replicate the irace functionality: using MPI/qsub, allowing users to set targetRunnerParallel, and implementing retries.

@DE0CH DE0CH force-pushed the restructure-parallel branch 5 times, most recently from 9b5c721 to 5bf567a Compare December 23, 2022 08:51
@DE0CH DE0CH mentioned this pull request Dec 23, 2022
@DE0CH DE0CH requested a review from MLopez-Ibanez December 23, 2022 09:01
@DE0CH DE0CH force-pushed the restructure-parallel branch from 5bf567a to 8b94a72 Compare December 28, 2022 16:53
if 'targetRunnerParallel' in scenario:
raise NotImplementedError("targetRunnerParallel is not yet supported. If you need this feature, consider opening an issue to show us some people actually want to use this.")

def run_irace(irace, args, q: Queue):
Contributor

What is this function doing? It does not seem to be used anywhere.

Collaborator Author

Sorry, I forgot to delete this.

self.target_aq = None
self.target_rq = None
self.workers: list[Process] = []
if self.worker_count != 1:
Contributor

Why not move this code inside the above `if self.worker_count != 1:`?

@@ -4,16 +4,12 @@
from irace import irace
import pandas as pd
from multiprocessing import Process
import os

import json
Contributor

Why is this needed here?

Collaborator Author

I discovered that daemonic processes cannot start any child processes, meaning that the target runner cannot create new processes. Python's multiprocessing.Pool uses daemonic processes, so this is here to prevent a regression.
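The limitation is easy to demonstrate in isolation: `multiprocessing` refuses to start a child from a daemonic process (a standalone sketch, unrelated to the iracepy code itself):

```python
from multiprocessing import Process, Queue

def child():
    pass

def try_spawn(q):
    # Starting a child from a daemonic process raises an AssertionError:
    # "daemonic processes are not allowed to have children".
    try:
        p = Process(target=child)
        p.start()
        p.join()
        q.put("ok")
    except AssertionError:
        q.put("refused")

def spawn_from(daemon):
    q = Queue()
    w = Process(target=try_spawn, args=(q,), daemon=daemon)
    w.start()
    result = q.get()
    w.join()
    return result
```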

from rpy2.rinterface_lib.sexp import NACharacterType
from multiprocessing import Queue, Process
import json
from rpy2.rinterface_lib.sexp import NACharacterType
Contributor

Duplicated?

Collaborator Author

Will remove it.

from rpy2.robjects.conversion import localconverter
from rpy2 import rinterface as ri
from rpy2.rinterface_lib import na_values
from rpy2.rinterface_lib.sexp import NACharacterType
from multiprocessing import Queue, Process
import json
Contributor

Is this unused?

Collaborator Author

Yeah, that belongs in #30.


self.scenario = r_to_python(self.r_scenario)
self.scenario.pop('targetRunnerParallel', None)
scenario_a[0] = self.scenario
scenario_a[1] = self.r_scenario
Contributor

This may need a comment to explain what is going on: you pass scenario_a to make_target_runner_parallel with None values, but then assign values here? How does that work?

Collaborator Author

make_target_runner_parallel needs to know the scenario, but to build the scenario (pass checkScenario), targetRunnerParallel must already be set, creating a chicken-and-egg problem. scenario_a acts like a pointer: I can give it a null pointer first and then set the value later.
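The "pointer" here is just a one-element Python list used as a mutable cell: the closure captures the list object at creation time, and the real value is dropped in later once checkScenario has run (a generic illustration with a made-up field name, not the actual iracepy code):

```python
def make_runner(scenario_cell):
    # The closure holds a reference to the list itself, not to its
    # current contents, so updating scenario_cell[0] later is visible.
    def runner():
        scenario = scenario_cell[0]
        return scenario["maxExperiments"]
    return runner

scenario_cell = [None]                      # "null pointer" for now
runner = make_runner(scenario_cell)         # created before the scenario exists
scenario_cell[0] = {"maxExperiments": 180}  # filled in after checkScenario
```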

@DE0CH DE0CH mentioned this pull request Jan 19, 2023

Successfully merging this pull request may close these issues.

Parallel on Windows not Working