fix metric calculation with multiple GPUs for semantic segmentation #175
base: master
Conversation
Thanks for this PR! I went through your code and I have concerns regarding the order in which you gather the metrics from each process and the calls to .reset(). I have not tested your modifications on a machine, but I suspect some of these changes break the single-device behavior. Have you checked that the behavior and metrics are still the same in the single-device scenario? Please see my comments in the code for more details.
Also, if you feel up to it, would you mind having a shot at the distributed panoptic segmentation metrics too (i.e. src/models/semantic.py)? I have not looked much into it, but I presume you should be able to accumulate the internal states of the PanopticQuality3D and MeanAveragePrecision3D objects from multiple devices by leveraging the .update() method of each of these classes (see their respective implementations to understand what is happening under the hood).
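For illustration, here is a minimal sketch of the kind of cross-device accumulation suggested above. It assumes a torchmetrics-style metric object exposing .update(); the actual arguments expected by PanopticQuality3D and MeanAveragePrecision3D may differ, and gather_and_update is a hypothetical helper, not a function from this repository.

```python
import torch
import torch.distributed as dist


def gather_and_update(metric, local_preds, local_targets):
    """Gather every process's (preds, targets) payload and feed them all to
    the local metric via .update(), so each rank ends up with the same
    accumulated state before .compute() is called.

    NOTE: hypothetical sketch; the real metric classes may expect different
    update() arguments or offer built-in distributed synchronization.
    """
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size == 1:
        # Single-device case: behave exactly as before.
        metric.update(local_preds, local_targets)
        return metric

    # all_gather_object handles arbitrary picklable payloads.
    payload = (local_preds, local_targets)
    gathered = [None] * world_size
    dist.all_gather_object(gathered, payload)

    # Accumulate every process's contribution into the local metric state.
    for preds, targets in gathered:
        metric.update(preds, targets)
    return metric
```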
Yes, you are right! I did not pay attention to the influence of my code on single-device training. My project requires vectorization and modeling after the semantic segmentation, and I moved on to this task once I got an acceptable model. I have pushed a new commit that fixes these concerns. As for the panoptic segmentation, I would like to help, but I am not sure when I will have time to work on it... I will try to do it as soon as possible.
Super, thanks for your prompt modifications! Rest assured, I really appreciate the help! Have you tested the single-GPU scenario on your machine? Did you check whether anything breaks and whether the metrics are coherent before and after the modifications? Regarding the panoptic segmentation metrics, may I leave this PR open for now to give you a bit of time to have a shot at it?
I tested the training in the DDP and single-GPU scenarios last night and the training works well. The metrics are now logged coherently in both cases. With the default configuration file on the KITTI-360 dataset (200 epochs), the single-GPU run achieves a best val mIoU of 54.43 while DDP stops at 52.11. I suppose this is caused by the difference in batch size and learning rate, but I am sure the inconsistency between val_miou and best_val_miou no longer exists. As for the panoptic segmentation, I suggest closing this PR if you do not mind. Once I finish the modification I will open a new one, since I cannot guarantee I will finish it in a short time.
Hi @Shanci-Li, OK, please give me a few days to also test your code on my machine before I can merge your PR. Thanks for your contribution 😉
What does this PR do?
This PR gathers the confusion matrices computed on the distinct mini-batches assigned to each GPU when training with DDP, and then calculates the metrics from the aggregated confusion matrix.
Fixes #165
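As a rough illustration of the idea described above, the sketch below sums the per-GPU confusion matrices with an all-reduce before deriving mIoU. The function names, the row/column convention, and the zero-division handling are illustrative assumptions, not the code used in this PR.

```python
import torch
import torch.distributed as dist


def reduce_confusion_matrix(cm: torch.Tensor) -> torch.Tensor:
    """Sum the (num_classes x num_classes) confusion matrix across all DDP
    processes so every rank computes metrics on the full validation set.
    Falls back to a no-op in the single-device case."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(cm, op=dist.ReduceOp.SUM)
    return cm


def miou_from_confusion_matrix(cm: torch.Tensor) -> torch.Tensor:
    """Mean IoU from a confusion matrix, assuming rows are ground truth and
    columns are predictions: IoU_c = TP_c / (TP_c + FP_c + FN_c)."""
    tp = cm.diag().float()
    fp = cm.sum(dim=0).float() - tp  # predicted as c but wrong
    fn = cm.sum(dim=1).float() - tp  # ground truth c but missed
    iou = tp / (tp + fp + fn).clamp(min=1)  # clamp avoids division by zero
    return iou.mean()
```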
Before submitting
- Did you make sure all tests pass by running the pytest command?
- Did you check your code with the pre-commit run -a command?
Did you have fun?
Make sure you had fun coding 🙃