Avoid workers waiting for training of surrogate model to finish #1170
Hi @bbudescu, However, a new RF model might still rely on the existing packages (e.g., sklearn). Since we are only a small team maintaining SMAC, we might not have enough manpower to customize a new random forest package. However, if you have any good ideas on how to implement this efficiently (in Python), or if you would like to create a new PR for the RF replacement, we are happy to help you (e.g., with how to integrate it into SMAC).
Hi @dengdifan, Thanks for your reply above. Yes, I am aware of #1116; however, the main point I was trying to get across in this issue was making workers not wait for the surrogate model to finish (re)training before getting a new suggestion. Now, I haven't looked into the code, but I assume this doesn't have anything to do with the choice of random forest package, but just with running RF training in a separate thread. Perhaps one of the optional suggestions I made, namely the one about adding GPU support, might be relevant to the choice of RF implementation.
Hi @bbudescu, Second, SMAC is implemented based on BO, an algorithm designed for optimizing expensive black-box functions, such as training a deep neural network. Usually, the cost of evaluating the function is much more significant than that of training the surrogate model. It is still unclear whether directly starting a new configuration is better than waiting until the current surrogate model provides a new candidate. Additionally, making ...
Hi @dengdifan
I've been working on something, and I just added #1178 to share it here. It's not yet tested, but it's a start, I guess.
That's quite right :). I had to do a bit of reading before finally deciding, at least, on what to try.
Now, while looking into this, I've taken a few ideas into consideration. Other than the fact that I used ...
Of course, because of the GIL, the main optimization loop gets less CPU time, but although I haven't tested it, I expect that the main thread would spend most of its time in the training section anyway. And I don't think the other parts of the loop are very CPU intensive. Or are they? Do you know if, perhaps, the inference or something else takes really long?
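For what it's worth, here's a minimal sketch of the kind of setup I have in mind (illustrative only, not the actual PR code; names like `BackgroundTrainer`, `submit_data`, and `latest_model` are made up): a daemon thread keeps refitting the surrogate on whatever the newest data is, while the main loop always reads the last model that actually finished training. How much the GIL hurts depends on the RF implementation; implementations that do the heavy lifting in compiled code can release the GIL for much of the fit.

```python
import threading


class BackgroundTrainer:
    """Retrains a surrogate on the newest data in a daemon thread.

    Readers always get the most recently *finished* model and never block
    on an ongoing fit.
    """

    def __init__(self, train_fn):
        self._train_fn = train_fn      # e.g. lambda X, y: RandomForestRegressor().fit(X, y)
        self._lock = threading.Lock()
        self._latest_model = None
        self._pending = None           # newest (X, y) that no model has been trained on yet
        self._wakeup = threading.Event()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit_data(self, X, y):
        """Called whenever a trial finishes; returns immediately."""
        with self._lock:
            self._pending = (X, y)
        self._wakeup.set()

    def latest_model(self):
        """Return the last fully trained model (possibly slightly stale), or None."""
        with self._lock:
            return self._latest_model

    def _loop(self):
        while True:
            self._wakeup.wait()
            self._wakeup.clear()
            with self._lock:
                data, self._pending = self._pending, None
            if data is None:
                continue
            model = self._train_fn(*data)   # the long-running fit happens off the main thread
            with self._lock:
                self._latest_model = model
```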
Well, other than expanding SMAC's use cases to things like database knob tuning and other tasks that also have a low evaluation time for the target cost function, it's also impeding SMAC's scalability. E.g., let's say you want to train a neural net that takes, on average, 2 hours on a GPU to fit to a dataset, and you need to find the optimal parameters fast. In this case you might opt to do something like renting 256 machines on AWS, so that you increase the chance of getting a decent neural net within a tight budget like 24 hours. In this case you will face the same problem, because you'll get new results every 30 seconds anyway, not because it takes little time to train a neural net, but because of the number of training processes.
If you're paying for an AWS instance with a GPU, you'd hate to see it stay idle. You'd prefer it even running a random config rather than just doing nothing.
Well, I'm not sure I understand this correctly, but if I do, then I don't think that is the case other than in the beginning. So, here's an example at the opposite end: let's say that you have already finished 1,000,000 trials and trained the RF on that data. Let's now say that a worker process finishes running the 1,000,001st trial, reports results and asks for a new configuration to try. What happens in the current setting is that the worker needs to wait for the model to retrain taking into consideration that last result, so it can benefit from the extra information added by that extra sample in the RF training set. Now, my hypothesis is that there is only a minor benefit in doing this, as I don't think the expected improvement in the cost function caused by making the suggestion based on 1,000,001 samples vs. 1,000,000 is going to be that different anyway. And also, when setting `retrain_after` to a value larger than 1, suggestions are already being made from a model that hasn't seen the very latest results anyway. So, to be clear, I'm not suggesting using random configurations, but, rather, to adapt the `retrain_after` parameter dynamically. LE: Here is a comparison between the performance of some IPC methods when sharing numpy data.
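(For illustration only, and not taken from that comparison: one of the IPC options for handing a growing numpy training set to a separate trainer process without re-pickling it on every update is `multiprocessing.shared_memory`; the helper names below are made up.)

```python
import numpy as np
from multiprocessing import shared_memory


def publish_array(name, array):
    """Copy `array` into a named shared-memory block another process can map."""
    shm = shared_memory.SharedMemory(create=True, size=array.nbytes, name=name)
    view = np.ndarray(array.shape, dtype=array.dtype, buffer=shm.buf)
    view[:] = array
    return shm  # keep a reference alive so the block isn't released


def attach_array(name, shape, dtype):
    """Map the block published by `publish_array` from the trainer process."""
    shm = shared_memory.SharedMemory(name=name)
    return np.ndarray(shape, dtype=dtype, buffer=shm.buf), shm
```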
Hi @bbudescu, Now I am confused by your two examples.
In your first example (the neural network training one), you suggest sampling a new configuration randomly:
This is exactly the case for training an RF in a separate thread. Assume that you have two cores, and one core has already finished its configuration evaluation. Since your RF is still under training (you cannot expect the RF training and the configuration evaluation to end at exactly the same clock time on both CPUs), you cannot predict how long it will take to finish the RF training on the other core (thread). So in that case you simply start a new configuration randomly (as suggested by your AWS example), even if the RF training would end just one CPU clock cycle later. Therefore, in this case, you always have core 1 running BO + evaluation and core 2 running random evaluation.
Here you suggest adapting the `retrain_after` parameter dynamically instead of sampling new configurations randomly. However, how should this value be scheduled? To which degree should we increase it? Should we always apply this strategy, or only apply it if the number of budgets is higher than a threshold? Since we are aiming at arbitrary budgets and numbers of evaluations, all of this needs to be carefully designed to avoid performance degradation.
Sorry, I tried to make a comparison. Conceptually, training in a different thread would be somewhat similar to when `retrain_after` is set to a value larger than 1.
Yes, the key word here is "even". I think even a random config is better than nothing, but what I propose is to use a trained random forest to suggest new configs; we just shouldn't care so much that it was trained on the absolute latest results, just as in the case when `retrain_after` is larger than 1.
Ok, so look, let's say we have 2 cores. One core always trains (because training takes longer than evaluation), and one core always evaluates the cost function (i.e., it's a worker that runs trials):
NOTE: Indeed, at step 5 above, it's also possible for thread A to start running another random trial, but this is NOT the subject of this feature request, but rather a potential further optimization similar to what you said above.
NOTE 2: Actually, what I think would be more relevant and more in line with what you asked (and I replied to), is that at step 4 above thread B could just start a random config without waiting for the first model to finish training. However, again, this would be just an optimization, and NOT the point of this issue and the associated PR.
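To make the scheme above concrete, here's a rough sketch of what the worker-facing side could look like (again, an illustrative pseudo-API, not SMAC's: `trainer` is the `BackgroundTrainer` sketched earlier, and the two callables are placeholders):

```python
def next_config(trainer, sample_random_config, maximize_acquisition):
    """Pick the next configuration without ever blocking on surrogate training."""
    model = trainer.latest_model()
    if model is None:
        # No surrogate has finished training yet: fall back to a random
        # configuration instead of waiting (cf. NOTE 2 above).
        return sample_random_config()
    # Otherwise use the most recently finished surrogate, even if a newer
    # one is still being fitted on fresher data in the background.
    return maximize_acquisition(model)
```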
Hi @bbudescu, As you said, we could simply set `retrain_after` to a larger value.
Well, the problem with setting `retrain_after` to a fixed, larger value is that it's hard to pick one that works well throughout the whole session.
A tradeoff is hard to find, and even the best tradeoff will be suboptimal. And, after all, I've just pushed some code to implement this feature. I know it's a potential risk, but, as far as I have come to understand, you guys are ok with at least offering some support for integration (I'm not talking about you implementing it yourself). LE: It's not done yet, i.e., there are no tests, I haven't added a way to switch to the old behavior, and I still have a few questions about, e.g., what to do with the ...
This should also be done dynamically, because, as a user, it's kinda hard to anticipate how long training will take for the current number of finished trials, and how long running a trial will take (it also depends on the hyperparameters in the suggested config, which evolve over time), etc. I'm thinking there would be value in adding estimators for all of these, but I feel that just running a second thread on the main CPU whenever there's new data available is easier. LE: The alternative for the user would be to do trial and error until they get the schedule right, but that defeats the whole purpose of making the session faster / more efficient.
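Just to illustrate why I'd rather avoid a user-picked schedule: even a back-of-the-envelope estimator (a made-up helper, not part of the PR) already needs running estimates of both training time and trial duration, e.g.:

```python
import math


def dynamic_retrain_after(recent_train_times, recent_trial_times, n_workers):
    """Retrain roughly once per 'training-time worth' of incoming results."""
    if not recent_train_times or not recent_trial_times:
        return 1  # no estimates yet: retrain on every new result
    avg_train = sum(recent_train_times) / len(recent_train_times)
    avg_trial = sum(recent_trial_times) / len(recent_trial_times)
    results_per_second = n_workers / avg_trial     # results arriving while one fit runs
    return max(1, math.ceil(avg_train * results_per_second))
```

And even this ignores that trial duration drifts as the suggested configurations change, which is exactly the part that's hard to anticipate.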
Hi @dengdifan, I think I was able to bring the code to a state in which it can be tested. As per my #1178 (comment), could this, perhaps, be solved by a compromise like merging the pull request and exposing the option of using concurrent background training as an experimental feature, at the user's (documented) risk?
Hi @bbudescu
Hi @dengdifan, Thanks for your response. I'm really sorry, but I still don't understand what you mean, unfortunately. Would you care to maybe elaborate a bit further, or perhaps try to explain it in a different fashion than before? Perhaps take a look at the code I propose in PR #1178 and show me where this might happen. I mean, the behavior you describe is definitely not what I was after with my code, so by all means, if you can provide some insight, I'm really interested. Perhaps also take into account the latest measurements and conclusions I posted in this #1178 (comment). It seemed to me from @eddiebergman's comments in the PR that he'd be on board with what I proposed. @eddiebergman, could you, maybe, please help us get on the same page? LE: I'm only guessing here, so I could be dead wrong, so please forgive me if I am. Do you, perhaps, think that I'm using Python's ...
Hi @bbudescu, let's still use the example of the two-workers scenario. Assume that we have two workers, and that at time 0:00, ...
Motivation
Here's what the CPU load graph looks like for a multi-objective optimization session using the multi-fidelity facade that ran for about 46 hours on a 64-core machine (no hyperthreading, `n_workers=64`), finishing almost 20k trials on a bit over 16k distinct configurations (two rungs). One can see that CPU utilization decreases to less than 50% after the first 12 hours. It then drops to under 40% after another 10 hours (by this time, 12.6k trials had been finished in total).
Previous Discussion
I thought that another cause for this degradation in performance might be Hyperband, and that using ASHA (#1169) instead would help eliminate that hypothesis; however, after @eddiebergman's #1169 (comment), I understand the problem is caused by workers waiting to get another suggestion from the surrogate model.
Potential solution