Why does Group Normalization work?
- PyTorch 1.4+
Group Normalization is a normalization technique that divides the channels of a feature map into groups of adjacent channels and normalizes each group using the mean and standard deviation computed over that group. Group Normalization becomes Layer Normalization when all channels are placed in a single group, and becomes Instance Normalization when each group contains only a single channel. The main claim behind Group Norm is that adjacent channels are not independent, so normalizing them together is meaningful.
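A minimal PyTorch sketch of these two special cases: with a single group, `nn.GroupNorm` normalizes over all channels at once (matching Layer Norm over the full feature map), and with one group per channel it normalizes each channel on its own (matching Instance Norm). The tensor shapes here are illustrative only.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 32, 8, 8)  # (batch, channels, height, width)

# One group over all channels == Layer Norm over (C, H, W)
gn_all = nn.GroupNorm(num_groups=1, num_channels=32)
ln = nn.LayerNorm(normalized_shape=[32, 8, 8])
print(torch.allclose(gn_all(x), ln(x), atol=1e-6))   # True

# One group per channel == Instance Norm
gn_single = nn.GroupNorm(num_groups=32, num_channels=32)
inorm = nn.InstanceNorm2d(num_features=32)
print(torch.allclose(gn_single(x), inorm(x), atol=1e-6))  # True
```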
I wanted to investigate whether Group Norm really takes advantage of adjacent-channel statistics. It turns out it does.
To verify the claim, I use the ResNet-18 for CIFAR-10 from this repository.
Compared to vanilla Group Norm (left), Group Shuffle Norm (right) picks channels that are not adjacent to each other, so each group consists of channels that are not adjacent in the original channel order. A sketch of this idea follows.
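One possible way to implement Group Shuffle Norm is to permute the channels with a fixed shuffle before Group Norm, so each group gathers non-adjacent channels, and then restore the original order afterwards. The class name and the use of a random permutation are assumptions for illustration; the actual shuffling scheme in the repository may differ.

```python
import torch
import torch.nn as nn

class GroupShuffleNorm(nn.Module):
    """GroupNorm applied over a fixed shuffle of the channels (illustrative sketch)."""

    def __init__(self, num_groups, num_channels):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)
        # Fixed random permutation of the channel indices and its inverse.
        perm = torch.randperm(num_channels)
        self.register_buffer("perm", perm)
        self.register_buffer("inv_perm", torch.argsort(perm))

    def forward(self, x):
        # Shuffle channels, normalize groups of (now non-adjacent) channels,
        # then undo the shuffle so the layer is a drop-in GroupNorm replacement.
        x = x[:, self.perm]
        x = self.gn(x)
        return x[:, self.inv_perm]

# Drop-in usage with the same constructor arguments as nn.GroupNorm.
y = GroupShuffleNorm(num_groups=32, num_channels=64)(torch.randn(8, 64, 16, 16))
```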
- Batch Size = 128
- Step Size (learning rate) = 0.001
- Number of Groups = 32 (for both Group Norm and Group Shuffle Norm); see the training-setup sketch after this list
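A rough training-setup sketch matching the hyperparameters above. The optimizer choice (Adam) and the use of torchvision's ResNet-18 with a `norm_layer` factory are assumptions for illustration; the CIFAR-10 ResNet-18 in the referenced repository differs (e.g. 3x3 stem, no max-pool) but wires the normalization layer in the same way.

```python
import torch
import torch.nn as nn
import torchvision

def make_group_norm(num_channels):
    # 32 groups, as used in both experiments.
    return nn.GroupNorm(num_groups=32, num_channels=num_channels)

# torchvision's ResNet constructor accepts a norm_layer factory,
# so every BatchNorm layer is replaced by GroupNorm.
model = torchvision.models.resnet18(num_classes=10, norm_layer=make_group_norm)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # step size 0.001
# DataLoader with batch_size=128 ...
```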
| Model | Train Acc. | Test Acc. |
|---|---|---|
| Group Normalization | 96.17% | 85.38% |
| Group Shuffle Normalization | 90.38% | 82.36% |
Hence Group Norm does take advantage of nearby-channel statistics, which in turn gives better results than grouping non-adjacent channels.