
Release v2.11.1

why-in-Shanghaitech released this 12 Apr 13:48
· 1 commit to dev-v2.10-slurm since this release

This patch fixes several problems found since v2.11.

Fix

Resume

The slurm training service now supports resuming experiments. Because trials are managed by the slurm system, they are not affected if NNI manager fails: if NNI stops due to an error, your existing trials continue to run. You can resume the experiment later with nnictl resume <experiment_id> (docs) or with a Python script (docs). On resume, this patch reads the trial logs and updates each trial's status instead of ignoring the running trials. Below is an example NNI timeline with a trial concurrency of 2.

Time line ---+--------------+-----------+---------+-----------+----------------+------------------------------

User      Start NNI -- Go to sleep                                      Wake up, Resume ----------------------

NNI          +--------------------------+------ Error                       Resume -----------------+---------

Trial 1      +---------------------- Finish
Trial 2      +----------------------------------------------Finish      Register Result
Trial 3                                 +----------------------------- Register Progress -------- Finish
Trial 4                                                                        +------------------------------
Trial 5                                                                                             +---------

Note that if you stop the experiment manually (e.g. with the command nnictl stop ...), NNI cancels all running trials in order to release the resources.
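
The status update on resume can be pictured with a small sketch. This is not the patch's actual code: the log format (one JSON record per line with a "type" of PERIODICAL or FINAL), the function name reconcile_trial, and the metric values are all assumptions made up for illustration. It only shows the idea from the timeline above: a trial that finished while NNI was down gets its result registered, and a trial that is still running gets its progress registered.

```python
import json
from typing import Iterable


def reconcile_trial(log_lines: Iterable[str]) -> dict:
    """Rebuild one trial's state from its log lines (hypothetical format)."""
    progress = []   # intermediate metrics reported so far
    final = None    # final metric, if the trial already finished
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["type"] == "PERIODICAL":
            progress.append(record["value"])
        elif record["type"] == "FINAL":
            final = record["value"]
    # A final metric means the trial finished while the manager was down.
    status = "SUCCEEDED" if final is not None else "RUNNING"
    return {"status": status, "progress": progress, "final": final}


# Made-up logs mirroring Trial 2 (finished) and Trial 3 (still running).
trial_2 = ['{"type": "PERIODICAL", "value": 0.71}',
           '{"type": "FINAL", "value": 0.83}']
trial_3 = ['{"type": "PERIODICAL", "value": 0.64}']

print(reconcile_trial(trial_2)["status"])
print(reconcile_trial(trial_3)["status"])
```

A real implementation would also have to ask slurm whether a job without a final metric is still alive or has failed; the sketch treats every such trial as running.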

Wandb Upload

wandb sync is now done in a more elegant way: each sync uploads only the delta of the trial logs. This may speed up the upload and reduce the system workload.
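
As a rough illustration of what syncing only a delta can mean, the sketch below remembers a byte offset per log file and reads only what was appended since the last pass. It does not call wandb and is not taken from the patch; the helper name read_delta, the file name trial.log, and the log contents are hypothetical.

```python
import os
import tempfile
from typing import Tuple


def read_delta(path: str, offset: int) -> Tuple[bytes, int]:
    """Return the bytes appended to `path` since `offset` and the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data, offset + len(data)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "trial.log")  # hypothetical trial log

    with open(log_path, "ab") as f:
        f.write(b"epoch 1\n")
    first, offset = read_delta(log_path, 0)        # first pass reads everything

    with open(log_path, "ab") as f:
        f.write(b"epoch 2\n")
    second, offset = read_delta(log_path, offset)  # second pass reads only new bytes

print(first, second, offset)
```

Each pass then costs time proportional to the new log output rather than to the whole file, which is where the reduced workload would come from.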

Troubleshooting

Error: tuner_command_channel: Tuner did not connect in 10 seconds. Please check tuner (dispatcher) log.

In most cases, this error means that the login node is too slow (heavy CPU and memory load). This patch extends the connection timeout from 10 seconds to 120 seconds.
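
To show why a longer timeout helps on a slow login node, here is a minimal sketch of a bounded wait. It is not NNI's implementation: wait_for_connection and the use of threading.Event are assumptions for illustration, and only the 120-second default comes from this release note.

```python
import threading


def wait_for_connection(connected: threading.Event, timeout_s: float = 120.0) -> bool:
    """Block until `connected` is set or `timeout_s` elapses; True means connected."""
    return connected.wait(timeout=timeout_s)


# A slow "tuner" that connects after 0.05 s: a generous timeout succeeds.
slow_tuner = threading.Event()
timer = threading.Timer(0.05, slow_tuner.set)
timer.start()
ok_with_long_timeout = wait_for_connection(slow_tuner, timeout_s=1.0)
timer.join()

# A tuner that never connects within a very short timeout: the wait gives up.
missing_tuner = threading.Event()
ok_with_short_timeout = wait_for_connection(missing_tuner, timeout_s=0.01)

print(ok_with_long_timeout, ok_with_short_timeout)
```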