Fix QAT resume with BN models, checkpoint name #260
Merged
I noticed these bugs recently. This PR is independent of the bug fixes and updates I mentioned in last week's meeting.
When resuming from a checkpoint, a new optimizer is created from the model. Because batchnorm fusing reduces the number of model parameters, the optimizer created when resuming from a qat_checkpoint has fewer parameters than the optimizer state_dict stored in the checkpoint, so the two no longer match.
Therefore, I added an update_optimizer function, called at the initiate_qat state, that strips the batchnorm parameters from the optimizer.
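For reviewers who want a picture of the idea, here is a minimal sketch of what stripping fused-away parameters from an optimizer could look like. It is not the code in this PR: the signature, the return value, and the identity-based matching are assumptions made for illustration. It relies only on the `param_groups` and `state` attributes that PyTorch optimizers expose and on `model.parameters()`.

```python
def update_optimizer(optimizer, model):
    """Hypothetical sketch: drop optimizer entries for parameters that no
    longer exist in the model (e.g. batchnorm weights removed by fusing).

    Assumes `optimizer` has `param_groups` (list of dicts with a "params"
    list) and `state` (dict keyed by parameter), as torch.optim optimizers
    do, and that `model.parameters()` yields the surviving parameters.
    Returns the number of parameters removed.
    """
    # Compare by identity: tensor equality is elementwise, so `in` on a
    # list of tensors would not do what we want.
    live_ids = {id(p) for p in model.parameters()}

    removed = 0
    for group in optimizer.param_groups:
        kept = []
        for p in group["params"]:
            if id(p) in live_ids:
                kept.append(p)
            else:
                # Also discard per-parameter state (momentum buffers etc.).
                optimizer.state.pop(p, None)
                removed += 1
        group["params"] = kept
    return removed
```

Called once right after batchnorm fusing, something like this would keep the saved optimizer state_dict the same size as the one a freshly created optimizer has on resume.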
Also, checkpoint names were incorrect after resuming from a qat_checkpoint. This PR includes a simple fix for that as well.
Limitations: