This patch fixes several problems found since v2.11.
Fix
Resume
Resuming experiments is now supported for the slurm training service. Because trials are managed by slurm, they are not affected by a failure of the NNI manager: if NNI stops due to an error, your existing trials will continue to run. You may resume the experiment later using nnictl resume <experiment_id> (docs) or a Python script (docs). On resume, this patch reads the trial logs and updates each trial's status, instead of ignoring the running trials. Below is an example of an NNI timeline with a trial concurrency of 2.
Time line ---+--------------+-----------+---------+-----------+----------------+------------------------------
User Start NNI -- Go to sleep Wake up, Resume ----------------------
NNI +--------------------------+------ Error Resume -----------------+---------
Trial 1 +---------------------- Finish
Trial 2 +----------------------------------------------Finish Register Result
Trial 3 +----------------------------- Register Progress -------- Finish
Trial 4 +------------------------------
Trial 5 +---------
Note that if you stop the experiment manually (e.g. with the command nnictl stop ...), NNI will cancel all running trials in order to release the resources.
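To illustrate the status update performed on resume, here is a minimal Python sketch of inferring trial statuses from log files. It is not the code used by this patch: the markers TRIAL_FINISHED and TRIAL_FAILED, the file name trial.log, and the helpers infer_status and reconcile are all hypothetical and chosen only for illustration.

```python
from enum import Enum
from pathlib import Path
from typing import Dict


class TrialStatus(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Hypothetical log markers; the real trial log format may differ.
FINISH_MARKER = "TRIAL_FINISHED"
ERROR_MARKER = "TRIAL_FAILED"


def infer_status(log_path: Path) -> TrialStatus:
    """Infer a trial's status from its log file.

    A missing log, or a log without any marker, is treated as a trial
    that is still running under slurm.
    """
    if not log_path.exists():
        return TrialStatus.RUNNING
    text = log_path.read_text(errors="replace")
    if ERROR_MARKER in text:
        return TrialStatus.FAILED
    if FINISH_MARKER in text:
        return TrialStatus.SUCCEEDED
    return TrialStatus.RUNNING


def reconcile(trial_dirs: Dict[str, Path]) -> Dict[str, TrialStatus]:
    """Map each trial id to the status inferred from <trial_dir>/trial.log."""
    return {
        trial_id: infer_status(trial_dir / "trial.log")
        for trial_id, trial_dir in trial_dirs.items()
    }
```

In the timeline above, a pass like this after the resume would mark Trial 2 as finished so its result can be registered, and would keep Trial 3 as running so its progress can be picked up.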
Wandb Upload
wandb sync now runs more efficiently: each time, only the delta of the trial logs is synchronized. This may speed up the upload and reduce the system workload.
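The idea of synchronizing only a delta can be sketched in Python as follows. This is an illustration under assumptions, not the patch's implementation: the class DeltaReader is hypothetical, it tracks a byte offset per log file, and the actual upload step is left out.

```python
from pathlib import Path
from typing import Dict


class DeltaReader:
    """Remember how many bytes of each file were already handed out."""

    def __init__(self) -> None:
        self._offsets: Dict[Path, int] = {}

    def read_delta(self, path: Path) -> bytes:
        """Return only the bytes appended since the previous call."""
        offset = self._offsets.get(path, 0)
        size = path.stat().st_size
        if size < offset:
            # The file was truncated or replaced: start again from the top.
            offset = 0
        with path.open("rb") as f:
            f.seek(offset)
            chunk = f.read()
        self._offsets[path] = offset + len(chunk)
        return chunk
```

Each call returns just the newly appended bytes, so a periodic sync touches a small tail of each log rather than re-reading the whole file.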
Troubleshooting
Error: tuner_command_channel: Tuner did not connect in 10 seconds. Please check tuner (dispatcher) log.
In most cases, this error means that the login node is too slow (heavy CPU and memory load). This patch extends the connection timeout from 10 seconds to 120 seconds.
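For readers unfamiliar with this kind of timeout, the following Python sketch shows a generic wait-with-deadline loop. It is not NNI's code: the function wait_until and its parameters are hypothetical, and the 120-second default only mirrors the value mentioned above.

```python
import time
from typing import Callable


def wait_until(
    ready: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 0.5,
) -> bool:
    """Poll ready() until it returns True or the timeout expires.

    Returns True if ready() succeeded in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

A longer timeout does not slow down the normal case, since the loop returns as soon as the condition holds; it only gives a heavily loaded login node more time before the error is raised.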