fix metric calculation with multiple GPUs for semantic segmentation #175

Open
Shanci-Li wants to merge 3 commits into master

Conversation


@Shanci-Li commented Nov 26, 2024

What does this PR do?

This PR gathers the confusion matrix computed on the unique mini-batch assigned to each GPU when training with DDP and then calculates the metrics based on the updated confusion matrix.

Fixes #165
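
To illustrate the idea, here is a minimal sketch of the gathering step (not the exact code of this PR; `reduce_confusion_matrix` is a made-up name, and the `torch.distributed` all-reduce shown below is just one way to do it):

```python
import torch
import torch.distributed as dist


def reduce_confusion_matrix(cm: torch.Tensor) -> torch.Tensor:
    """Sum per-rank confusion matrices so every process holds the global one.

    `cm` is a [num_classes, num_classes] tensor accumulated on the
    mini-batches seen by the local rank only. It is assumed to live on the
    device expected by the backend (e.g. CUDA for NCCL). Outside of DDP the
    tensor is returned unchanged, so single-device behavior is unaffected.
    """
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(cm, op=dist.ReduceOp.SUM)
    return cm
```

Computing mIoU from the reduced matrix, rather than averaging per-rank mIoU values, is what keeps val_miou and best_val_miou consistent across devices.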

Before submitting

  • Did you make sure the title is self-explanatory and the description concisely explains the PR?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you test your PR locally with pytest command?
  • Did you run pre-commit hooks with pre-commit run -a command?

Did you have fun?

Make sure you had fun coding 🙃

@drprojects (Owner) left a comment

Thanks for this PR! I went through your code and I have concerns regarding the order in which you gather the metrics from each process and the calls to .reset(). I have not tested your modifications on a machine, but I suspect some of these changes break the single-device behavior. Have you checked that the behavior and metrics are still the same in the single-device scenario? Please see my comments in the code for more details.

Also, if you feel capable, would you mind having a shot at the distributed panoptic segmentation metrics too (i.e. src/models/semantic.py)? I did not look much into it, but I presume you should be able to accumulate the internal states of the PanopticQuality3D and MeanAveragePrecision3D objects from multiple devices by leveraging the .update() method of each of these classes (see their respective implementations to understand what is happening under the hood).
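
One possible pattern for that (a rough sketch only; the exact state layout and .update() signatures of PanopticQuality3D and MeanAveragePrecision3D need to be checked in their implementations, and `gather_metric_inputs` / `local_inputs` below are hypothetical names):

```python
import torch.distributed as dist


def gather_metric_inputs(local_inputs):
    """Collect every rank's metric inputs on all processes.

    `local_inputs` stands for whatever the metric's .update() expects for
    the mini-batches seen by this rank. all_gather_object handles arbitrary
    picklable objects, at the cost of a CPU round-trip.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return [local_inputs]
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_inputs)
    return gathered


# Hypothetical epoch-end usage: feed every rank's inputs into one metric
# object so its internal state covers all devices before .compute().
# for inputs in gather_metric_inputs(local_inputs):
#     panoptic_metric.update(*inputs)
# pq = panoptic_metric.compute()
```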

(Inline review threads on src/models/semantic.py, resolved.)
@Shanci-Li (Author)

Yes, you are right! I did not pay attention to the effect of my code on single-device training. My project requires vectorization and modeling after the semantic segmentation, so I moved on to that task once I had an acceptable model. I have pushed a new commit that fixes these concerns.

As for the panoptic segmentation metrics, I would like to help, but I am not sure when I will have time to work on that. I will try to do it as soon as possible.

@drprojects (Owner)

Super, thanks for your prompt modifications! Rest assured, I really appreciate the help!

Have you tested the single-GPU scenario on your machine? Did you check whether anything breaks and whether the metrics are coherent before and after the modifications?

Regarding the panoptic segmentation metrics, may I leave this PR open for now to give you a bit of time to have a shot at it?

@Shanci-Li (Author)

I tested training in both the DDP and single-GPU scenarios last night, and it works well. The metrics are now logged coherently in both cases. With the default configuration file on the KITTI-360 dataset (200 epochs), single-GPU training achieves a best val mIoU of 54.43 while DDP stops at 52.11. I suppose this is caused by the difference in batch size and learning rate. But I am sure the inconsistency between val_miou and best_val_miou no longer exists.
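
For context on that guess: with DDP each of the N GPUs sees its own mini-batch, so the effective batch size is N times the single-GPU one, and a learning rate tuned for one GPU may no longer be optimal. A minimal sketch of the common linear-scaling heuristic (a hypothetical helper, not part of this repository; `base_lr`, `per_device_batch_size` and `reference_batch_size` are made-up names):

```python
import torch.distributed as dist


def scaled_lr(base_lr: float, per_device_batch_size: int, reference_batch_size: int) -> float:
    """Linear learning-rate scaling heuristic (hypothetical helper).

    Keeps lr / effective_batch_size roughly constant when moving from a
    single-GPU run (reference_batch_size) to DDP, where the effective
    batch size is per_device_batch_size * world_size.
    """
    world_size = dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1
    effective_batch_size = per_device_batch_size * world_size
    return base_lr * effective_batch_size / reference_batch_size
```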

As for the panoptic segmentation metrics, I suggest closing this PR if you do not mind. Once I finish that modification I will open a new one, as so far I cannot guarantee I will finish it in a short time.

@drprojects (Owner)

Hi @Shanci-Li, OK, please give me a few days to also test your code on my machine before I can merge your PR. Thanks for your contribution 😉

Successfully merging this pull request may close these issues:

val/miou_best metric goes wrong when training with multiple GPU (#165)