
Dice Loss vs Dice Loss with smoothing term #42

Open
valosekj opened this issue Feb 14, 2024 · 12 comments

@valosekj
Member

valosekj commented Feb 14, 2024

This issue discusses differences in the implementation of the Dice Loss with and without the smoothing term.

Background: why this issue/discussion was opened

tl;dr:

  • the nnunetv2 nnUNetTrainerDiceCELoss_noSmooth trainer (i.e., without the smoothing term of the Dice loss) prevented the model from collapsing to zero during lesion model training.
Details

Since training with the default nnUNetTrainer trainer was collapsing to zero when training the DCM (degenerative cervical myelopathy) lesion segmentation model, we tried nnUNetTrainerDiceCELoss_noSmooth (i.e., without the smoothing term of the Dice loss).
This trainer was discovered by @naga-karthik in these two nnunet threads (1, 2). It indeed helped, and the model no longer collapsed to zero; see details in this issue.

Note that DCM lesion segmentation presents a high class imbalance (lesions are small objects).

Comparison of the default and nnUNetTrainerDiceCELoss_noSmooth trainers

tl;dr:

  • the default nnUNetTrainer trainer uses smooth: float = 1.
  • nnUNetTrainerDiceCELoss_noSmooth uses 'smooth': 0
Details

nnunetv2 default trainer

The nnunetv2 default trainer uses MemoryEfficientSoftDiceLoss (see L352-L362 in nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py).

This MemoryEfficientSoftDiceLoss (see L58 in nnunetv2/training/loss/dice.py) uses both a smoothing term (self.smooth) and a small constant (1e-8); see L116:

dc = (2 * intersect + self.smooth) / (torch.clip(sum_gt + sum_pred + self.smooth, 1e-8))
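
For intuition, here is a minimal standalone sketch (plain PyTorch, not nnU-Net code) evaluating this expression for an empty ground truth and an empty prediction under different smooth values:

import torch

def nnunet_style_dice(intersect, sum_gt, sum_pred, smooth):
    # Same form as the expression above: smoothing term in the numerator and
    # denominator, plus the 1e-8 clip on the denominator.
    return (2 * intersect + smooth) / torch.clip(sum_gt + sum_pred + smooth, 1e-8)

empty = torch.tensor(0.)
print(nnunet_style_dice(empty, empty, empty, smooth=1.))  # tensor(1.): empty GT + empty pred counted as a perfect match
print(nnunet_style_dice(empty, empty, empty, smooth=0.))  # tensor(0.): the 1e-8 clip only keeps the division defined

So the smoothing term rewards empty predictions on empty ground truths, while the 1e-8 constant merely prevents a division by zero.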

nnunetv2 nnUNetTrainerDiceCELoss_noSmooth trainer

The nnunetv2 nnUNetTrainerDiceCELoss_noSmooth trainer (see L32 in nnunetv2/training/nnUNetTrainer/variants/loss/nnUNetTrainerDiceLoss.py) sets smooth to 0. The small constant (1e-8) is apparently left in place.

What is the smoothing term used for?

tl;dr: hard to say convincingly.

  • keras and ivadomed use only the smoothing term without the small constant
  • nnunetv2 default trainer uses both the smoothing term and the small constant
  • nnunetv2 nnUNetTrainerDiceCELoss_noSmooth trainer uses only the small constant (because the smoothing term is set to zero)
Details

Initially, I incorrectly thought that the nnunetv2 smoothing term was used to prevent division by zero. I got this impression from this comment. But, after a deeper look at the equation in this comment, I found out that the equation uses only the smoothing term and no small constant. Further investigation led me to these two discussions (1, 2) about the Dice implementation in keras. Both discussions again use only the smoothing term, with no small constant:

score = (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

Checking the ivadomed Dice implementation shows that it also uses only the smoothing term (see L63 in ivadomed/losses.py):

return - (2.0 * intersection + self.smooth) / (iflat.sum() + tflat.sum() + self.smooth)
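
Evaluating that expression for an empty ground truth and an empty prediction (a standalone sketch, not ivadomed code) illustrates the point Charley makes below: this negative-Dice loss evaluates to -1, i.e. a Dice score of 1, so empty-on-empty is treated as a perfect match.

import torch

smooth = 1.0
iflat = torch.zeros(1000)   # empty (flattened) prediction
tflat = torch.zeros(1000)   # empty (flattened) ground truth
intersection = (iflat * tflat).sum()
# Same form as the ivadomed return statement above; -1 is the minimal (best) value of this loss.
print(-(2.0 * intersection + smooth) / (iflat.sum() + tflat.sum() + smooth))  # tensor(-1.)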

I also found this comment from Charley Gros providing the following explanation (note that this comment is related to the ivadomed Dice without the small constant):

A very probable reason is the different way these two functions are dealing with empty GT and empty pred.
--> Dice loss returns 1

Both the keras and ivadomed implementations are in contrast with the nnunet implementation, which uses both the smoothing term (self.smooth) and the small constant (1e-8); see L116:

dc = (2 * intersect + self.smooth) / (torch.clip(sum_gt + sum_pred + self.smooth, 1e-8))

Prompting ChatGPT to explain why removing the smoothing term from the nnunet Dice loss helped prevent the collapse to zero provides a relatively reasonable explanation (quoted below):

The smoothing term, as used here, serves a dual purpose:

  1. Stabilization of the Dice Coefficient: By adding the smoothing term (self.smooth) to both the numerator (nominator) and the denominator, it ensures that even when there are no true positives (TP) (e.g., the model predicts no lesion where there is actually no lesion, or it fails to predict a lesion where there is one), the Dice coefficient does not become undefined or result in division by zero. The use of torch.clip with a minimum value of 1e-8 further ensures that the denominator cannot be zero, which prevents division by zero errors.
  2. Influence on Gradient Behavior and Model Sensitivity: The smoothing term affects the gradient of the loss with respect to the model's predictions. When TP is very small or zero, the addition of the smoothing term to both the numerator and denominator ensures that the model still receives a gradient signal that encourages it to improve its predictions. It essentially softens the harshness of the Dice loss, making it more forgiving in the early stages of training or in cases where the model struggles to detect small or sparse lesions. This can lead to a more stable and gradual learning process.

In your specific case, removing the smoothing term with the nnUNetTrainerDiceCELoss_noSmooth trainer likely made the loss more directly reflective of the actual performance of the model in terms of the overlap between the predicted segmentation and the ground truth. For the task of lesion segmentation, where lesions can be small and the balance between lesion and non-lesion areas is crucial, the direct feedback without the smoothing term's moderation might have better aligned the loss with the task's objectives, thereby improving model performance.


Further investigation and experiments comparing the nnunet default nnUNetTrainer trainer and nnUNetTrainerDiceCELoss_noSmooth are in progress.

Tagging @naga-karthik and @plbenveniste, who both also work on lesion segmentation. If either of you has time to go through the investigation above and check that I didn't make any naive mistakes, that would be great.

@jcohenadad
Member

also tagging @Nilser3

@hermancollin
Contributor

What do you mean by collapsing to zero? Class imbalance was too high so that the model output zeros everywhere after 1000 epochs?
If so, what is the behavior of the best checkpoints (as opposed to final checkpoint)?

@valosekj
Member Author

What do you mean by collapsing to zero? Class imbalance was too high so that the model output zeros everywhere after 1000 epochs?

The model was crashing to zero after 100-250 epochs, depending on the fold. See the training progress in this comment.

@hermancollin
Contributor

hermancollin commented Feb 14, 2024

@valosekj ok. nnunet struggles with your second class (which I'm guessing is the lesion class).

Have you tried opening an issue or discussion on the nnunet repo? Last time I checked I remember the main contributor was still pretty active. He might have some good insights on this phenomenon? Because that behavior is a little bit weird

@valosekj
Member Author

@valosekj ok. nnunet struggles with your second class (which I'm guessing is the lesion class).

Exactly!

Have you tried opening an issue or discussion on the nnunet repo? Last time I checked I remember the main contributor was still pretty active. He might have some good insights on this phenomenon? Because that behavior is a little bit weird

We solved the collapse to zero by using the nnUNetTrainerDiceCELoss_noSmooth trainer based on these two nnunet threads (1, 2), as I tried to describe in the first comment. If the first comment is unclear, please let me know, and I will rephrase it.
This discussion aims to figure out what the smoothing term is responsible for and why removing it helped model training.

@hermancollin
Contributor

hermancollin commented Feb 15, 2024

Does this only happen with region-based training? We trained a model on very small objects, although the class imbalance was maybe less pronounced than yours (without collapse)

@valosekj
Member Author

Does this only happen with region-based training? We trained a model on very small objects, although the class imbalance was maybe less pronounced than yours (without collapse)

  • For the region-based model, the training was collapsing to zero after 100-250 epochs (details here).
  • For the multi-channel model (trained using T2w_ax image and spinal cord segmentation as input channels to segment lesions as output), the model was learning nothing (details here). Changing the trainer to nnUNetTrainerDiceCELoss_noSmooth for the multi-channel model helped; see here.

@valosekj
Member Author

Looking into MONAI DiceLoss:

f: torch.Tensor = 1.0 - (2.0 * intersection + self.smooth_nr) / (denominator + self.smooth_dr)

where

smooth_nr: a small constant added to the numerator to avoid zero.
smooth_dr: a small constant added to the denominator to avoid nan.

with default values

smooth_nr: float = 1e-5,
smooth_dr: float = 1e-5,

This indicates that both "smoothing" terms in the MONAI implementation are basically just small constants that keep the division well-defined. This is in contrast with the "smoothing" term equal to 1 as used in keras, ivadomed, and nnunetv2.
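
A quick sanity check of this reading (a minimal sketch assuming MONAI is installed; shapes and values are arbitrary):

import torch
from monai.losses import DiceLoss

# smooth_nr and smooth_dr left at the documented defaults of 1e-5, i.e. tiny numerical constants.
loss_fn = DiceLoss(sigmoid=True, smooth_nr=1e-5, smooth_dr=1e-5)

gt = torch.zeros(1, 1, 8, 8, 8)             # empty ground truth
pred = torch.full((1, 1, 8, 8, 8), -50.0)   # very negative logits -> sigmoid ~ 0, i.e. an (almost) empty prediction
print(loss_fn(pred, gt))                    # ~0: numerator and denominator both reduce to the tiny constants, so empty-on-empty gives a near-zero loss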

@naga-karthik
Member

naga-karthik commented Feb 16, 2024

"smoothing" term equal to 1 as used in keras, ivadomed, and nnunetv2.

I think this is the major issue with the dice loss implementations in those packages. Having a big term (i.e. 1) is interfering with the loss calculation (and consequently the gradient signals passed through the network) when learning to segment small, heavily class-imbalanced objects (i.e. lesions)

@hermancollin
Contributor

hermancollin commented Feb 16, 2024

very interesting. In this comment, Fabian reports a similar problem on the LIDC dataset, which is a lesion segmentation task like yours. From my understanding, the Dice loss can fail in 2 ways:

  • intersection=0; in this case, we would get a Dice loss of 0, regardless of whether the GT/pred are empty (in which case we would like to have a value of 1 instead of 0, as Charley mentioned, hence the smoothing term in the numerator)
  • addition=0 (in the denominator): this would give us a Dice loss of NaN, but the smoothness term makes that impossible.

Based on this, we can safely say the problematic part is the intersection. I think your dataset and the LIDC dataset are problematic because of this intersection term. Because your masks are mostly empty, this intersection is very close to 0 (remember the dice loss takes a softmax as input - not a binary mask - so the intersection can take small fractional values). The signal is too weak and, as @naga-karthik mentioned, the smoothness=1 term overshadows the weak signal you have inside the intersection term.
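
A quick back-of-the-envelope sketch of this overshadowing effect (plain PyTorch, with a hypothetical 2-voxel lesion in a 1000-voxel patch; not code from any of the trainers):

import torch

def soft_dice(pred, gt, smooth):
    # Soft Dice score in the same form as the nnU-Net expression quoted earlier (higher = better overlap).
    intersect = (pred * gt).sum()
    return (2 * intersect + smooth) / torch.clip(pred.sum() + gt.sum() + smooth, 1e-8)

gt = torch.zeros(1000)
gt[:2] = 1.0                    # tiny lesion: 2 foreground voxels out of 1000

collapsed = torch.zeros(1000)   # a model that predicts background everywhere
for smooth in (1.0, 1e-5):
    print(smooth, soft_dice(collapsed, gt, smooth).item())
# smooth=1.0  -> Dice ~0.33: missing the lesion entirely still collects a third of a perfect score
# smooth=1e-5 -> Dice ~0.0 : missing the lesion is penalized as a complete miss

With smooth=1, an all-background prediction already collects a large fraction of the achievable Dice on tiny lesions, which would be consistent with the collapse to zero described above.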

Maybe something to try would be to hardcode a different smoothness term in the dice computation. I reckon a smaller value would not make the training collapse. If that is the case, we could report it back to the nnunet guys, as they didn't seem to know what was going on.
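
In case it helps, a rough sketch of that experiment outside the trainer (assumes nnunetv2 is installed; the import paths are inferred from the file names referenced in this thread and the constructor arguments from the _build_loss snippet quoted further down, so double-check both before running):

import torch
from nnunetv2.training.loss.compound_losses import DC_and_CE_loss
from nnunetv2.training.loss.dice import MemoryEfficientSoftDiceLoss

# Same construction as nnU-Net's _build_loss (non-region-based branch), but with a hand-picked smoothing term.
loss = DC_and_CE_loss({'batch_dice': True, 'smooth': 1e-5, 'do_bg': False, 'ddp': False}, {},
                      weight_ce=1, weight_dice=1, ignore_label=None,
                      dice_class=MemoryEfficientSoftDiceLoss)

logits = torch.randn(2, 2, 16, 16, 16)             # (batch, classes, x, y, z) raw network output
target = torch.randint(0, 2, (2, 1, 16, 16, 16))   # integer label map with a channel dimension of 1
print(loss(logits, target))

A sweep over candidate smooth values on real patches could then inform what to hardcode in a custom trainer variant.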

@naga-karthik
Member

Thanks for your thoughts, @hermancollin! I think we can safely proceed with how MONAI has implemented DiceLoss (i.e. setting smoothing to a small constant such as 1e-5, which should be small enough to work with lesion segmentation problems and others where the object-to-segment is large).

@function2-llx

I think we can safely proceed with how MONAI has implemented DiceLoss (i.e. setting smoothing to a small constant such as 1e-5

FYI, in that comment Fabian explicitly mentioned that 1e-5 may not work.

The 1e-8 should probably not be there and it should use clip instead. No idea why this causes a problem with the default smooth of 1e-5 and does not cause problems with smooth=0.

Also, nnUNet does not actually use a smooth value of 1, but 1e-5. The default value defined in __init__ here is indeed smooth=1:

class SoftDiceLoss(nn.Module):
    def __init__(self, apply_nonlin: Callable = None, batch_dice: bool = False, do_bg: bool = True, smooth: float = 1.,

However, the value actually used is defined here in the nnUNetTrainer, and it is 1e-5:

    def _build_loss(self):
        if self.label_manager.has_regions:
            loss = DC_and_BCE_loss({},
                                   {'batch_dice': self.configuration_manager.batch_dice,
                                    'do_bg': True, 'smooth': 1e-5, 'ddp': self.is_ddp},
                                   use_ignore_label=self.label_manager.ignore_label is not None,
                                   dice_class=MemoryEfficientSoftDiceLoss)
        else:
            loss = DC_and_CE_loss({'batch_dice': self.configuration_manager.batch_dice,
                                   'smooth': 1e-5, 'do_bg': False, 'ddp': self.is_ddp}, {}, weight_ce=1, weight_dice=1,
                                  ignore_label=self.label_manager.ignore_label, dice_class=MemoryEfficientSoftDiceLoss)
