
Get checkpoint based on create time is probably not a good idea #331

Open
hvgazula opened this issue May 10, 2024 · 6 comments
hvgazula (Contributor) commented May 10, 2024

latest = max(checkpoints, key=os.path.getctime)

What am I trying to do?
Initialize from a previous checkpoint to resume training for more epochs.

For example, the following snippet

try:
    bem = Segmentation.init_with_checkpoints(
        "unet",
        model_args=dict(batchnorm=True),
        checkpoint_filepath=checkpoint_filepath,
    )
except Exception:  # a bare `except:` here also swallows KeyboardInterrupt etc.
    bem = Segmentation(
        unet,
        model_args=dict(batchnorm=True),
        multi_gpu=True,
        checkpoint_filepath=checkpoint_filepath,
    )

should initialize from a checkpoint if checkpoint_filepath exists. However, the getctime-based lookup also picks up other folders created during training (e.g. predictions or log directories), so it can return the wrong path.
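The bare `try/except` in the snippet above hides the real failure mode. A minimal sketch of an explicit guard instead: `has_checkpoint` is a hypothetical helper (not part of the nobrainer API) that checks whether a usable, non-empty checkpoint directory exists before attempting to resume.

```python
import os

def has_checkpoint(checkpoint_filepath):
    """Return True only if the checkpoint path exists and is non-empty.

    Hypothetical guard: call init_with_checkpoints only when this
    returns True, instead of catching every exception it raises.
    """
    return os.path.isdir(checkpoint_filepath) and bool(os.listdir(checkpoint_filepath))
```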

Solution:

  • Find a more robust way to look up checkpoints (i.e. not based on filesystem creation time)
hvgazula self-assigned this May 10, 2024
hvgazula (Contributor, Author) commented May 10, 2024

In a nutshell, resuming from an existing checkpoint through the API tools is still not working cleanly. It works just fine with TF's built-in BackupAndRestore callback.

hvgazula (Contributor, Author):

Also see #332

hvgazula (Contributor, Author):

Appending / to checkpoint_filepath resolved this. See neuronets/nobrainer_training_scripts@d5d1de0 🤦‍♂️

hvgazula reopened this May 11, 2024
hvgazula (Contributor, Author):

The getctime-based lookup only works if the checkpoint filepath has the epoch number in its name.

For example, if checkpoint_filepath = f"output/{output_dirname}/nobrainer_ckpts/" + "{epoch:02d}", then the output directory (alongside the other folders) looks as follows:

[Screenshot of the output directory listing, 2024-05-11]

Explanation of the folders:

  1. backup is from the BackupAndRestore callback (this will go away now)
  2. logs are the TensorBoard logs
  3. model_ckpts is me manually saving the model weights at the end of each epoch
  4. nobrainer_ckpts is from ModelCheckpoint (provided by the API); it looks like it could replace step 3 with a few extra flags
  5. predictions are plots/outputs produced at test time right after each epoch; this can be separated out (if needed) once step 4 gives us a checkpoint for every epoch

Summary:

  1. Setting checkpoint_filepath with the epoch placeholder and dropping step 3 will let us load from checkpoints cleanly. Otherwise, we need improved loading logic for the case where no per-epoch folders are created.
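For reference, the epoch placeholder expands via standard Python string formatting: Keras substitutes the (1-based) epoch number into `{epoch:02d}` each time ModelCheckpoint saves, so every epoch gets a distinct path. A minimal sketch, with `output_dirname` replaced by a placeholder value "run1":

```python
# Path template as in the comment above; "run1" stands in for output_dirname.
checkpoint_filepath = "output/run1/nobrainer_ckpts/" + "{epoch:02d}"

# What ModelCheckpoint would write at the end of epoch 3:
path_for_epoch_3 = checkpoint_filepath.format(epoch=3)
# -> "output/run1/nobrainer_ckpts/03"
```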

hvgazula (Contributor, Author) commented May 11, 2024

In hindsight, we should use BackupAndRestore in addition to ModelCheckpoint, because the latter only saves a checkpoint at the end of each epoch. That is not enough if the model passes through the entire data and fails just before writing, whereas BackupAndRestore has a save_freq argument that can save more frequently.
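A toy illustration (no TensorFlow required) of why per-epoch saving alone is insufficient: with checkpoints only at epoch boundaries, a crash just before the write loses nearly a full epoch of steps, while a step-frequency backup (like BackupAndRestore with an integer save_freq) loses at most save_freq - 1 steps. The step counts below are made-up numbers.

```python
def steps_lost_on_failure(failure_step, save_freq):
    """Steps of work lost when training crashes at `failure_step`,
    given checkpoints written every `save_freq` steps."""
    return failure_step % save_freq

steps_per_epoch = 1000

# ModelCheckpoint-style saving (once per epoch): a crash at step 999
# loses almost the entire epoch.
print(steps_lost_on_failure(999, steps_per_epoch))  # -> 999

# BackupAndRestore with save_freq=100: at most 99 steps are lost.
print(steps_lost_on_failure(999, 100))  # -> 99
```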

hvgazula (Contributor, Author) commented May 11, 2024

Ouch: keras-team/tf-keras#430. Looks like we will have to stick with ModelCheckpoint for now. 😞 This is because I also intend to save the best model.
