[Bug] mmrazor.engine.runner.quantization_loops.QATValLoop calls after_val_epoch hook twice with different keys, causing mmengine.hooks.checkpoint_hook._save_best_checkpoint() to fail with KeyError for the save_best config
#637
Labels: bug
Describe the bug
During QAT training of models with publicly available config files, such as RTMPose-tiny, I ran into this issue. The original config file sets:
default_hooks.checkpoint.save_best='coco/AP'
This works normally in non-quantized training.
However, when inheriting the _base_ in the QAT config, mmrazor.engine.runner.quantization_loops.QATValLoop calls the after_val_epoch hook twice with different keys, as seen here, causing mmengine.hooks.checkpoint_hook._save_best_checkpoint() to fail unless save_best is overwritten with 'auto'.

Edit: 'auto' still causes only the first occurrence to be set as key_indicator, from this line.

However, one might need a setting other than 'auto' in some fringe circumstances. I suggest that the code attempt to save the best checkpoint only once, or that the prefixes 'qat.' and 'original.' each be added temporarily to the key set in key_indicators and then removed each time the hook is called.

Reproduces the error - error message
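The mechanism described above can be illustrated with a toy stand-in. This is only a sketch: ToyCheckpointHook, toy_qat_val_loop and try_save_best are hypothetical names, not the mmengine or mmrazor classes, and the scores are made-up placeholders. It assumes the hook looks up one fixed key in the metrics dict, and that the loop calls the hook once with 'qat.'-prefixed keys and once with 'original.'-prefixed keys.

```python
# Toy stand-ins only: these are NOT the mmengine/mmrazor classes.
class ToyCheckpointHook:
    """Mimics a hook that looks up a fixed save_best key in the metrics."""

    def __init__(self, save_best):
        self.key_indicator = save_best
        self.best_score = None

    def after_val_epoch(self, metrics):
        # Raises KeyError when the configured key is absent from `metrics`.
        score = metrics[self.key_indicator]
        if self.best_score is None or score > self.best_score:
            self.best_score = score


def toy_qat_val_loop(hook, qat_metrics, original_metrics):
    """Calls the hook twice, each time with differently prefixed keys."""
    hook.after_val_epoch({'qat.' + k: v for k, v in qat_metrics.items()})
    hook.after_val_epoch(
        {'original.' + k: v for k, v in original_metrics.items()})


def try_save_best(save_best):
    """Returns a description of the KeyError, or None if no error occurs."""
    hook = ToyCheckpointHook(save_best)
    try:
        # Placeholder scores, chosen only for illustration.
        toy_qat_val_loop(hook, {'coco/AP': 0.5}, {'coco/AP': 0.75})
    except KeyError as err:
        return f'KeyError: {err}'
    return None


for key in ('coco/AP', 'qat.coco/AP', 'original.coco/AP'):
    print(key, '->', try_save_best(key))
```

Under these assumptions no single fixed key survives both calls: the unprefixed key is missing from the first call, and each prefixed key is missing from one of the two calls.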
Suggested fix
Edit: I changed the fix to prefer the qat metrics to save the best checkpoint rather than the architecture metrics.
The only issue is that the after_val_epoch hook doesn't get called for the QAT architecture model, so some of the logs are incomplete.
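One possible reading of this fix, sketched with toy code rather than the real mmrazor loop: call the hook a single time, using the QAT metrics, so that an inherited save_best key such as 'coco/AP' still resolves. RecordingHook and toy_val_loop_prefer_qat are hypothetical names, the scores are placeholders, and passing the QAT metrics under unprefixed keys is an assumption; the issue does not state whether the actual patch keeps the prefixes.

```python
# Toy sketch only: not the actual patch, and not the mmengine/mmrazor API.
class RecordingHook:
    """Toy hook that tracks the best value of one configured metric key."""

    def __init__(self, save_best):
        self.key_indicator = save_best
        self.best_score = None
        self.calls = 0

    def after_val_epoch(self, metrics):
        self.calls += 1
        score = metrics[self.key_indicator]
        if self.best_score is None or score > self.best_score:
            self.best_score = score


def toy_val_loop_prefer_qat(hook, qat_metrics, original_metrics):
    # Single hook call with the QAT metrics under their unprefixed names
    # (assumption), so the inherited save_best key is found.
    hook.after_val_epoch(dict(qat_metrics))
    # Both metric sets are still returned with prefixes for the caller to log.
    merged = {'qat.' + k: v for k, v in qat_metrics.items()}
    merged.update({'original.' + k: v for k, v in original_metrics.items()})
    return merged


hook = RecordingHook('coco/AP')
# Placeholder scores, chosen only for illustration.
merged = toy_val_loop_prefer_qat(hook, {'coco/AP': 0.5}, {'coco/AP': 0.75})
print(hook.calls, hook.best_score)
print(sorted(merged))
```

The trade-off matches the one noted above: because the hook runs only once, anything that other hooks would record during a second after_val_epoch call is skipped.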