
Get checkpoint based on create time is probably not a good idea #331

Open
hvgazula opened this issue May 10, 2024 · 6 comments
hvgazula (Contributor) commented May 10, 2024

latest = max(checkpoints, key=os.path.getctime)

What am I trying to do?
Initialize from a previous checkpoint to resume training for more epochs.

For example, the following snippet

try:
    bem = Segmentation.init_with_checkpoints(
        "unet",
        model_args=dict(batchnorm=True),
        checkpoint_filepath=checkpoint_filepath,
    )
except Exception:  # a bare `except:` here also swallows KeyboardInterrupt etc.
    bem = Segmentation(
        unet,
        model_args=dict(batchnorm=True),
        multi_gpu=True,
        checkpoint_filepath=checkpoint_filepath,
    )

should initialize from a checkpoint if checkpoint_filepath exists. However, the getctime-based lookup also picks up other folders created during training (e.g. predictions or log directories), so it can return the wrong path.
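The bare `try/except` in the snippet above hides the real failure mode. A minimal sketch of an explicit guard instead: `has_checkpoint` is a hypothetical helper (not part of the nobrainer API) that checks whether a usable, non-empty checkpoint directory exists before attempting to resume.

```python
import os

def has_checkpoint(checkpoint_filepath):
    """Return True only if the checkpoint path exists and is non-empty.

    Hypothetical guard: call init_with_checkpoints only when this
    returns True, instead of catching every exception it raises.
    """
    return os.path.isdir(checkpoint_filepath) and bool(os.listdir(checkpoint_filepath))
```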

Solution:

  • Find a more robust way to look up checkpoints (i.e. not based on filesystem creation time)
hvgazula self-assigned this May 10, 2024
hvgazula (Contributor, Author) commented May 10, 2024

In a nutshell, resuming from an existing checkpoint through the API tools is still not working cleanly. It works just fine with TF's built-in BackupAndRestore callback.

hvgazula (Contributor, Author):

Also see #332

hvgazula (Contributor, Author):

Appending / to checkpoint_filepath resolved this. See neuronets/nobrainer_training_scripts@d5d1de0 🤦‍♂️

hvgazula reopened this May 11, 2024
hvgazula (Contributor, Author):

The getctime-based lookup only works if the checkpoint filepath has the epoch number in its name.

For example, if checkpoint_filepath = f"output/{output_dirname}/nobrainer_ckpts/" + "{epoch:02d}", then the output directory (alongside the other folders) looks as follows:

[Screenshot of the output directory listing, 2024-05-11]

Explanation of the folders:

  1. backup is from the BackupAndRestore callback (this will go away now)
  2. logs are the TensorBoard logs
  3. model_ckpts is me manually saving the model weights at the end of each epoch
  4. nobrainer_ckpts is from ModelCheckpoint (provided by the API); it looks like it could replace step 3 with a few extra flags
  5. predictions are plots/outputs produced at test time right after each epoch; this can be separated out (if needed) once step 4 gives us a checkpoint for every epoch

Summary:

  1. Setting checkpoint_filepath with the epoch placeholder and dropping step 3 will let us load from checkpoints cleanly. Otherwise, we need improved loading logic for the case where no per-epoch folders are created.
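For reference, the epoch placeholder expands via standard Python string formatting: Keras substitutes the (1-based) epoch number into `{epoch:02d}` each time ModelCheckpoint saves, so every epoch gets a distinct path. A minimal sketch, with `output_dirname` replaced by a placeholder value "run1":

```python
# Path template as in the comment above; "run1" stands in for output_dirname.
checkpoint_filepath = "output/run1/nobrainer_ckpts/" + "{epoch:02d}"

# What ModelCheckpoint would write at the end of epoch 3:
path_for_epoch_3 = checkpoint_filepath.format(epoch=3)
# -> "output/run1/nobrainer_ckpts/03"
```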

hvgazula (Contributor, Author) commented May 11, 2024

In hindsight, we should use BackupAndRestore in addition to ModelCheckpoint, because the latter only saves a checkpoint at the end of each epoch. That is not enough if the model passes through the entire data and fails just before writing, whereas BackupAndRestore has a save_freq argument that can save more frequently.
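A toy illustration (no TensorFlow required) of why per-epoch saving alone is insufficient: with checkpoints only at epoch boundaries, a crash just before the write loses nearly a full epoch of steps, while a step-frequency backup (like BackupAndRestore with an integer save_freq) loses at most save_freq - 1 steps. The step counts below are made-up numbers.

```python
def steps_lost_on_failure(failure_step, save_freq):
    """Steps of work lost when training crashes at `failure_step`,
    given checkpoints written every `save_freq` steps."""
    return failure_step % save_freq

steps_per_epoch = 1000

# ModelCheckpoint-style saving (once per epoch): a crash at step 999
# loses almost the entire epoch.
print(steps_lost_on_failure(999, steps_per_epoch))  # -> 999

# BackupAndRestore with save_freq=100: at most 99 steps are lost.
print(steps_lost_on_failure(999, 100))  # -> 99
```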

hvgazula (Contributor, Author) commented May 11, 2024

Ouch: keras-team/tf-keras#430. Looks like we will have to stick with ModelCheckpoint for now. 😞 This is because I also intend to save the best model.
