Commit
Showing 1 changed file with 8 additions and 8 deletions.
409eda2
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4334
nslayernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4125
nslayernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5417
nslayernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4167
nslayernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
59978
nslayernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10333
nslayernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10167
nslayernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10500
nslayernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10167
nslayernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
416390
nsbias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1166.5
nsbias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3042
nsbias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1208
nsbias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1000
nsbias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18063
nsbias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4084
nsbias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
3958
nsbias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4250
nsbias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4125
nsbias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
109325.5
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
56041
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46084
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46375
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
81834
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36229
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2056625
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2082416.5
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2056666.5
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1995458
nsbatchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
192802
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
172458
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
144854.5
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
148125
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
146125
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166789
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1157666
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1110395.5
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1128416.5
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1120208
nslayernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
516061
nslayernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3583
nslayernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3583.5
nslayernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4229.5
nslayernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3292
nslayernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
69748
nslayernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8792
nslayernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9125
nslayernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9000
nslayernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9209
nslayernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
470533
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15083
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
14875
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
16583
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
14917
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
53475
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
222375
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
213084
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
213250
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213520.5
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
267675
nsbias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
nsbias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
nsbias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
584
nsbias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583
nsbias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17384
nsbias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1500
nsbias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1500
nsbias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1750
nsbias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1583
nsbias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
103376
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7041
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5625
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5709
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9916
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23093
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
227583.5
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
230417
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228000
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
215542
nsbatchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
166208.5
nsdense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3916
nsdense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3875
nsdense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3834
nsdense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3834
nsdense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23533
nsdense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16708
nsdense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16750
nsdense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16791
nsdense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16625
nsdense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
160718
nsdense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
577333
nsdense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
573417
nsdense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
579000
nsdense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
574042
nsdense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113474
nsdense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1432312.5
nsdense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1426250
nsdense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1425917
nsdense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1418000
nsdense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
211622
nslenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1046541
ns1068292
ns0.98
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
965500
ns983291
ns0.98
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1347458
ns1327542
ns1.02
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1290542
ns1373792
ns0.94
lenet(28, 28, 1, 64)/forward/GPU/CUDA
267857
ns281111
ns0.95
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5895833.5
ns6002271
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4588042
ns4660958.5
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4928187
ns5006354
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5737167
ns5624708
ns1.02
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1066176
ns1151478.5
ns0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
nsdense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
nsdense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
nsdense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
nsdense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23460
nsdense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2084
nsdense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2125
nsdense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2292
nsdense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
nsdense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
169490.5
nslayernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5458
nslayernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4000
nslayernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5687.5
nslayernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6250
nslayernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
64594
nslayernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11083
nslayernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11333
nslayernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12041
nslayernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11083.5
nslayernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
444224
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6708
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6416
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7875
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6500
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
51136
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17583
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
16958
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18145.5
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16916
nsgroupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
297812
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
500
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
583
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
31896
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8916
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8667
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9250
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8645.5
nsbatchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
155805
nsdense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64937.5
nsdense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
62625
nsdense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64500
nsdense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64667
nsdense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
110478.5
nsdense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
294791
nsdense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
279125
nsdense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
275479.5
nsdense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
280854.5
nsdense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
185224.5
nsmlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3152041.5
ns3387083
ns0.93
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3026187
ns3112854
ns0.97
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3022520.5
ns2905708
ns1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3964167
ns3940000
ns1.01
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
573818.5
ns570283
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7551166.5
ns7636021
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7449979
ns7442000
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7447000
ns7380521
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8208396
ns8212750
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1327975
ns1364212
ns0.97
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
18867458
ns13685833.5
ns1.38
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
19142541
ns19094334
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
19088834
ns19126041
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
15711167
ns15649500.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
24315583.5
ns23644021
ns1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
33983500
ns34568146
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37046583.5
ns41693959
ns0.89
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34841833
ns34878583
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
2130242
ns1840287
ns1.16
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
192387270.5
ns188357375
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
163943875
ns233488333
ns0.70
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
152577625
ns202742250
ns0.75
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
437847333
ns429823895.5
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
14119852
ns13939550
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
294725229.5
ns291377187.5
ns1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
338344395.5
ns249397167
ns1.36
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
300590083.5
ns300701042
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
396800708.5
ns446062833
ns0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
23687.5
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
23083
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24791
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
23708
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
95862
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103250
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
103458
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
103667
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
102750
nslayernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
494978
nslayernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7083
nslayernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5750
nslayernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6875
nslayernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7000
nslayernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
67128
nslayernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15375
nslayernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15395.5
nslayernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16000
nslayernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14791.5
nslayernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
467877
nsConv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3009166.5
ns3055292
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2067250
ns2092833
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2279667
ns2283687.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4832667
ns4895416.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
581800.5
ns585359
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23921708.5
ns23561833
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18037292
ns18085229
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
16963187.5
ns18562458
ns0.91
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
34623770.5
ns35017833
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3105602
ns3105298.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33780291
ns33378229
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27715666.5
ns27662145.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27451041
ns27887458
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41640208
ns41809854.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
80479
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
72416
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
78354
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
74645.5
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
100885
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
311542
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
224520.5
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
209667
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
257021
nslayernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
539235
nslayernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12500
nslayernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11708
nslayernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12542
nslayernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12833.5
nslayernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
70648
nslayernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26667
nslayernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26958.5
nslayernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27333.5
nslayernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26625
nslayernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
470896
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12791
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12333
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13500
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12875
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
52214
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25959
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
25750
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26500
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26500
nsgroupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
300818.5
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
180750
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
179583
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
183146
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
179250
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
56380
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
593542
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
582459
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
585042
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
594562
nsgroupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
284588
nslayernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6770.5
nslayernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5958
nslayernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7084
nslayernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7125
nslayernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
70103
nslayernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14709
nslayernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14500
nslayernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15291.5
nslayernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13958
nslayernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
460969.5
nsbatchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1217750
nsbatchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1209125
nsbatchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1249750
nsbatchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1326625
nsbatchedmm(512, Bsize=4)/forward/GPU/CUDA
302841
nsbatchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4351270.5
nsbatchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4353042
nsbatchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4630333
nsbatchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
4466479
nsbatchedmm(512, Bsize=4)/zygote/GPU/CUDA
1039570
nsdense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1833
nsdense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1792
nsdense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1833
nsdense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
nsdense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23644
nsdense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4875
nsdense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4875
nsdense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
5042
nsdense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
nsdense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
189061.5
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
6021
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5708
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7042
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
7416
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
54998.5
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
11437.5
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11084
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11666
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
12333
nsgroupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
332242
nsdense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
333
nsdense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
292
nsdense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
333
nsdense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
292
nsdense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22998
nsdense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2667
nsdense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2750
nsdense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2750
nsdense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2709
nsdense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
158762.5
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
13687.5
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11208
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
13958
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
14125
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
57325
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24625
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24250
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25500
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24875
nsgroupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
295945
nsdense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4167
nsdense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4166
nsdense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4167
nsdense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4125
nsdense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
24912
nsdense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16084
nsdense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16209
nsdense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16333.5
nsdense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16208
nsdense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
199034.5
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5708
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5584
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5708
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5708
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
33099
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
21166
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
20458
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21333.5
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
20875
nsbatchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
174613
nsbatchedmm(16, Bsize=512)/forward/CPU/2 thread(s)
383042
nsbatchedmm(16, Bsize=512)/forward/CPU/4 thread(s)
373541
nsbatchedmm(16, Bsize=512)/forward/CPU/8 thread(s)
485896
nsbatchedmm(16, Bsize=512)/forward/CPU/1 thread(s)
532854.5
nsbatchedmm(16, Bsize=512)/forward/GPU/CUDA
66578.5
nsbatchedmm(16, Bsize=512)/zygote/CPU/2 thread(s)
938166
nsbatchedmm(16, Bsize=512)/zygote/CPU/4 thread(s)
847083
nsbatchedmm(16, Bsize=512)/zygote/CPU/8 thread(s)
1235042
nsbatchedmm(16, Bsize=512)/zygote/CPU/1 thread(s)
1418833
nsbatchedmm(16, Bsize=512)/zygote/GPU/CUDA
191164
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
81020.5
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
80354.5
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
82250
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
132458
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
192525
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1945166
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1909584
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1920333
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1914354.5
nsgroupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
402795
nsdense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s)
292
nsdense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s)
292
nsdense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s)
292
nsdense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s)
292
nsdense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA
21790
nsdense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s)
1792
nsdense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s)
1791
nsdense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s)
1916
nsdense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s)
1792
nsdense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA
172681
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
8000
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
6833
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
8334
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
7999.5
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
62227.5
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9375
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
8875
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9625
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
9250
nsgroupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
315550.5
nsConv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s)
159022167
ns121038500
ns1.31
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s)
174256125
ns174268209
ns1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s)
147914021
ns155647417
ns0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s)
102407958
ns103289458
ns0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA
5468366
ns5459016
ns1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s)
678096083
ns592681937.5
ns1.14
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s)
555598625
ns540116125
ns1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s)
453528479
ns460022146
ns0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s)
754205958.5
ns623412250
ns1.21
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA
34940005
ns38146652
ns0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s)
703546875
ns751859749.5
ns0.94
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s)
666832020.5
ns667614542
ns1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s)
585927312.5
ns606980437.5
ns0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s)
742692916
ns744028250
ns1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57542 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
47583 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
47291 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82208 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37135 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1947333 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1971042 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1976458 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1893520.5 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
171380.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
272291 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
265834 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
289417 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
267167 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
135867.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
671917 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
596708 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
696292 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
692687.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
737698 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
2231188 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
2215042 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
2207229 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
2243770.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
133226 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5572500 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5486875 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5511083 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
5495666.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
759202.5 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
652833.5 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
657229 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
639500 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
639791 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA
46976 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1799583 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1724792 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1722792 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
2103895.5 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA
221178.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
56541 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
46833 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46041 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83792 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
28073 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2058250 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2078709 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2093000 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1996646 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
187152 ns
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
13406125 ns
13355875 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
12455458 ns
12430958.5 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
12584792 ns
12600937.5 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
14882959 ns
15122729 ns
0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA
517201.5 ns
518849 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
47687000 ns
47134500 ns
1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
41754625 ns
41671875 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
40922625 ns
41125499.5 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
58112708 ns
58336333 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3212087 ns
3218047 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
74213479 ns
74376750 ns
1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
68010000 ns
68965000 ns
0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
90988625 ns
91496292 ns
0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
76809750 ns
98399104 ns
0.78
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
56917 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
47042 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
47041 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83375 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
46301 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1939854 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1973333 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1974729.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1884375 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
189579 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
291 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
333 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
250 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
31617 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6229.5 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6167 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6458 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6167 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
171396 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s)
250 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s)
250 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s)
250 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s)
250 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA
31328 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s)
2583 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s)
2625 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s)
2792 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s)
2625 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA
161410 ns
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s)
324182500 ns
286107083.5 ns
1.13
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s)
339536042 ns
339607208 ns
1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s)
314625854 ns
321183396 ns
0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s)
273060250 ns
268796333 ns
1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA
7093070 ns
7107764 ns
1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s)
1051455583 ns
971792250 ns
1.08
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s)
941830875 ns
922480542 ns
1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s)
858538271 ns
835684104 ns
1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s)
1153691292 ns
1117474583 ns
1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA
34020243.5 ns
33742759 ns
1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s)
1359481562.5 ns
1448964667 ns
0.94
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s)
1360673729 ns
1371326875 ns
0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s)
1640965792 ns
1656412041 ns
0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s)
1309802292 ns
1663889000 ns
0.79
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1414416.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1409541 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1408500 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1453875 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
127358 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5056229 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5013583 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4954291 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
5017021 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
601067 ns
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s)
170719208 ns
177405459 ns
0.96
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s)
132607979.5 ns
132546709 ns
1.00
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s)
124493437.5 ns
130053917 ns
0.96
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s)
162230500 ns
165568083 ns
0.98
vgg16(32, 32, 3, 32)/forward/GPU/CUDA
4886055.5 ns
4878153.5 ns
1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s)
854987208 ns
643663333 ns
1.33
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s)
644456708 ns
496969000 ns
1.30
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s)
532057834 ns
558568375 ns
0.95
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s)
687805708 ns
654929750 ns
1.05
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA
16138006 ns
18110009 ns
0.89
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s)
9114041.5 ns
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s)
8770313 ns
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s)
7860292 ns
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s)
10147292 ns
batchedmm(512, Bsize=32)/forward/GPU/CUDA
1612586 ns
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s)
37546375 ns
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s)
36886146 ns
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s)
33451021 ns
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s)
38875771 ns
batchedmm(512, Bsize=32)/zygote/GPU/CUDA
6459090.5 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s)
47458.5 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s)
49333 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s)
49583 ns
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s)
47250 ns
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA
18585 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s)
50584 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s)
50416 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s)
50708.5 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s)
50500 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA
216293 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
7979.5 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6791 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
8875 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
8583 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
106035 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10333 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9958 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10500 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10167 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
612658 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
8750 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6438 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
8667 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5875 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
119844.5 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
13375 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
13000 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
13416 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12791 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
517417.5 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
1042 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
958 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
1042 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
1042 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
31817 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
8041 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7750 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
8333 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
8292 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
203048 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
23145.5 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
24541 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
24167 ns
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
23334 ns
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA
18371 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
52542 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
52416 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
52500 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
52334 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA
295739.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1440625 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1400291 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1400875 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1406313 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
194620 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5047479.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5003458.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4836292 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4996708 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
628014 ns
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3062438 ns
3064313 ns
1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2084417 ns
2106875 ns
0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2227208.5 ns
2301542 ns
0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4812250 ns
4944708.5 ns
0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA
579246 ns
586671 ns
0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
24741125 ns
25694166 ns
0.96
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18811521 ns
20092625.5 ns
0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
18691437 ns
19545895.5 ns
0.96
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
36587416 ns
36568812 ns
1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3196070 ns
3200820 ns
1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
34435312 ns
35138250 ns
0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
28306583.5 ns
28420084 ns
1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
28069750 ns
30280062.5 ns
0.93
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41958375 ns
42544854.5 ns
0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s)
145325041 ns
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s)
141848041.5 ns
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s)
123758375 ns
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s)
173196604 ns
batchedmm(512, Bsize=512)/forward/GPU/CUDA
22560824 ns
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s)
942531917 ns
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s)
871530625 ns
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s)
1498315250 ns
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s)
674150833 ns
batchedmm(512, Bsize=512)/zygote/GPU/CUDA
118289465 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
76208 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
75041 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
77875 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
75417 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
273038.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
299708 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
284646 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
191687.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
202979.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1439967 ns
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s)
36345458 ns
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s)
35416645.5 ns
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s)
32239562.5 ns
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s)
40930312.5 ns
batchedmm(512, Bsize=128)/forward/GPU/CUDA
5849412 ns
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s)
151966416 ns
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s)
152232437.5 ns
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s)
136165208.5 ns
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s)
287396625 ns
batchedmm(512, Bsize=128)/zygote/GPU/CUDA
34914778 ns
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s)
158627833 ns
120765334 ns
1.31
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s)
174511667 ns
174275666 ns
1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s)
148215771.5 ns
156098417 ns
0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s)
108212479 ns
103997770.5 ns
1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA
5459784 ns
5461795.5 ns
1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s)
524328229.5 ns
471697125 ns
1.11
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s)
467038291 ns
468205208 ns
1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s)
441190000 ns
455789333 ns
0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s)
741818542 ns
728998166 ns
1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA
32279915 ns
35173763 ns
0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s)
692549750 ns
640412562.5 ns
1.08
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s)
656203708.5 ns
655505917 ns
1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s)
573625208 ns
590476187.5 ns
0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s)
853537834 ns
732032000 ns
1.17
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s)
1226937.5 ns
1249541 ns
0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s)
992979 ns
949958.5 ns
1.05
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s)
904625 ns
764125 ns
1.18
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s)
2085917 ns
2000458 ns
1.04
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA
566912.5 ns
568299.5 ns
1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s)
2909667 ns
2960792 ns
0.98
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s)
2628208 ns
2611021 ns
1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s)
2006333.5 ns
2513020.5 ns
0.80
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s)
3693750.5 ns
3690271 ns
1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA
1796011.5 ns
1319857 ns
1.36
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s)
6757875 ns
6641791 ns
1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s)
6503250 ns
6504791 ns
1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s)
6239125 ns
6489375 ns
0.96
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s)
4454771 ns
4443166 ns
1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7250 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
6167 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6208 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10250 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
24809.5 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
213666 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
220313 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
220125 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
209542 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
276995.5 ns
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s)
315354292 ns
309099062.5 ns
1.02
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s)
221860750 ns
232469666.5 ns
0.95
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s)
197740833.5 ns
216377833 ns
0.91
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s)
312004542 ns
308762583 ns
1.01
vgg16(32, 32, 3, 64)/forward/GPU/CUDA
7676221 ns
7672114 ns
1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s)
1085627020.5 ns
1103432604 ns
0.98
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s)
891084375.5 ns
1001458208 ns
0.89
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s)
865730125 ns
901919771 ns
0.96
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s)
1163266979.5 ns
1293921625 ns
0.90
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA
26544800.5 ns
27115979 ns
0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
6083 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5583 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
7375 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
5270.5 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
178949 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7708 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7292 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7500 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6792 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
667282.5 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
542 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
459 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
542 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
459 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
23245 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
9583.5 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
9167 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
9458.5 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
8792 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
227149 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
352521.5 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
352709 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
352958.5 ns
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
352708 ns
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA
21007 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
828104 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
820292 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
773500 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
828312 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA
289596 ns
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s)
312083.5 ns
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s)
340166.5 ns
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s)
445354 ns
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s)
333520.5 ns
batchedmm(16, Bsize=32)/forward/GPU/CUDA
17918 ns
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s)
691583 ns
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s)
732334 ns
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s)
1026459 ns
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s)
691042 ns
batchedmm(16, Bsize=32)/zygote/GPU/CUDA
273557 ns
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s)
332396 ns
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s)
348875 ns
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s)
409541 ns
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s)
375250 ns
batchedmm(16, Bsize=128)/forward/GPU/CUDA
22378 ns
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s)
755875 ns
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s)
743000 ns
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s)
1068417 ns
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s)
822124.5 ns
batchedmm(16, Bsize=128)/zygote/GPU/CUDA
239682 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s)
3625 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s)
3417 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s)
3583 ns
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s)
3583 ns
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA
17823 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s)
4208 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s)
4167 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s)
4375 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s)
4292 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA
271995 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4792 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3834 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5250 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3625 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
214003.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8354.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8334 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8667 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8417 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
1200425 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
204209 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
210000 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
211875 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
199417 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
34086 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
608520.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
620750 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
620416 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
628625 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
347622 ns
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s)
980000 ns
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s)
929916.5 ns
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s)
954250 ns
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s)
1278542 ns
batchedmm(128, Bsize=128)/forward/GPU/CUDA
206777 ns
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s)
4651729 ns
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s)
4500083 ns
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s)
4296645.5 ns
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s)
6216979.5 ns
batchedmm(128, Bsize=128)/zygote/GPU/CUDA
942518 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s)
3916 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s)
3375 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s)
4667 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s)
3354.5 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA
231395.5 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
7375 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
7292 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
7667 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
7000 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA
1002762 ns
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
1644583 ns
1618667 ns
1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
1174458 ns
1189854.5 ns
0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
1323125 ns
1358375 ns
0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
2461333.5 ns
2360458 ns
1.04
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA
213304.5 ns
211422.5 ns
1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
12444729.5 ns
12284958.5 ns
1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
9564709 ns
9550979.5 ns
1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
9234833 ns
9390791 ns
0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
18020417 ns
18060041.5 ns
1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
1940786 ns
1906624.5 ns
1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
17431792 ns
17280916 ns
1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
14392958.5 ns
14329167 ns
1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
14240000 ns
14463083 ns
0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
21049562.5 ns
21088375 ns
1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 90625 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 88041 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92333 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 136917 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125618 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2061125 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2018458 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1720042 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2024104 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1024038 ns
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 331312 ns
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 343500 ns
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 395083 ns
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 310458.5 ns
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15733 ns
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 699959 ns
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 722062.5 ns
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1018209 ns
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 646375 ns
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 189475.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5958 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5875 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33239 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221625 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219959 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219750 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218375 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 314279 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3667 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22722 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14167 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14334 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14291 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14375 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 475447 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 95166.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 91833 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 96125 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 139167 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125450 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1948250 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1921104.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1669729.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1920708.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 954893.5 ns
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 854375 ns | 861145.5 ns | 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 817542 ns | 826334 ns | 0.99
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1213833.5 ns | 1164604.5 ns | 1.04
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 958895.5 ns | 959395.5 ns | 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 276078 ns | 263975.5 ns | 1.05
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2843334 ns | 2730708 ns | 1.04
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2456145.5 ns | 2455708.5 ns | 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3332000 ns | 3317604.5 ns | 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3419792 ns | 3286521.5 ns | 1.04
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1629171 ns | 1038213 ns | 1.57
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15333 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 14709 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17041 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14333 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 142609.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 262125 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215416.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215250 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221958 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 641081.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221583.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 218625 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222833 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221750 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 271537.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 497750 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 494833 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 497084 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 509000 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1365399 ns
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 315729 ns
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 333917 ns
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 375125 ns
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 322083 ns
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16846 ns
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 710041 ns
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 725063 ns
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1022417 ns
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 663021 ns
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 196884 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17625 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16708 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18792 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17625 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144721 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220104.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212792 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212750 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217250 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 955774 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6042 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4250 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6958 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6541 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 245177 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10583.5 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10250 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10708 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10084 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1099715 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4542 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3208 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4834 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 2875 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 250616.5 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7125 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7375 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7375 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1110249 ns
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 24293729.5 ns | 23602937.5 ns | 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34647499.5 ns | 34462041.5 ns | 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 38065167 ns | 41206708 ns | 0.92
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34799687.5 ns | 34998812.5 ns | 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1834951 ns | 1861561 ns | 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 187799375 ns | 184955020.5 ns | 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159175458 ns | 159249771 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146555271 ns | 150499917 ns | 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 415008291 ns | 390550250 ns | 1.06
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16504056.5 ns | 16472871 ns | 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 437855250 ns | 286689500 ns | 1.53
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 254443000 ns | 244388646 ns | 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 231693624.5 ns | 296120917 ns | 0.78
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 485497958 ns | 440533417 ns | 1.10
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 184229.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 181916 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184084 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182167 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 230730 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 637084 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 586270.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 586583 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631542 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1097701 ns
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3894562.5 ns
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3827292 ns
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3469958 ns
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5353020.5 ns
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 535365 ns
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 18146250 ns
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17166041.5 ns
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16601417 ns
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22202083 ns
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2616593 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 458 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 500 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32123 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9458 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8667 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9167 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9208 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 267754 ns
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 580762562.5 ns | 624998521 ns | 0.93
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 427173312.5 ns | 477642917 ns | 0.89
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 376948624.5 ns | 411867812.5 ns | 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 671986666.5 ns | 656030104 ns | 1.02
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12479261 ns | 12477905 ns | 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2061821458.5 ns | 1873735437.5 ns | 1.10
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1626836125 ns | 1636021583 ns | 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1500724875 ns | 1558895000 ns | 0.96
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2217147562.5 ns | 2103890062.5 ns | 1.05
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 48947892 ns | 49609571 ns | 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1651250 ns | 1650167 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1196959 ns | 1195708 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1346187.5 ns | 1388458 ns | 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2356042 ns | 2498125 ns | 0.94
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 218070 ns | 218867 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12822417 ns | 12700771 ns | 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9953541.5 ns | 9962124.5 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9605000 ns | 9800459 ns | 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18408062.5 ns | 18403354 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2047696.5 ns | 1957280 ns | 1.05
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17771104.5 ns | 17702708 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14762729 ns | 14737000 ns | 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14473917 ns | 14865041 ns | 0.97
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21336042 ns | 21477333.5 ns | 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26250 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26209 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26583 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26209 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24922 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66792 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67000 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66791 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66916 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 410676.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203542 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210583 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210500 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199958 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26405 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 602333 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 621292 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621250 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630584 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 355627 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 657646 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 638729 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 544125 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 677396 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132242 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2305542 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2254292 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1426250 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2248542 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1182706 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17937.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17042 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19500 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16895.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144900 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220000 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218416.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219458 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 261708 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1051792 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 459 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 459 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 458 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23475 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9520.5 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9541 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10166 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9375 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 261505 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6542 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5292 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6625 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7416 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 235631 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7000 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7291 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7250 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7208 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 803793 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2334 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2041 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2292 ns
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2333 ns
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18245.5 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6750 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6459 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6667 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6625 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 333087.5 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 748458 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746645.5 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 746833 ns
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 749417 ns
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21817 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 789125.5 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 772625 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 775145.5 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 787875 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 298327 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5959 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5750 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10792 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32858 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221541 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226958 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 226625 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220292 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 360131.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10250 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9917 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12459 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10583.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 243730.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24834 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24833.5 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24750 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24666 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1133764 ns
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 107061375 ns | 106272125 ns | 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 116928479.5 ns | 117220895.5 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 121136000 ns | 123891541 ns | 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117635875 ns | 117462292 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2659433 ns | 2638590.5 ns | 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 396814083.5 ns | 390984854 ns | 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 366591458 ns | 370181584 ns | 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 425794499.5 ns | 344393625 ns | 1.24
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 482285959 ns | 481330584 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15258375 ns | 15192721.5 ns | 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 769963270.5 ns | 619409458 ns | 1.24
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 576371708 ns | 668415479 ns | 0.86
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 745582312 ns | 816519375 ns | 0.91
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 765495854.5 ns | 916595917 ns | 0.84
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7333
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6334
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7750
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8333
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
237972
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14125
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
13209
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
13417
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13459
nsgroupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
1080162
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
7667
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5583
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
8167
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
8291
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
233794.5
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12542
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11875
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12645.5
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11875
nsgroupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
787815
nsbatchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
332667
nsbatchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
344396
nsbatchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
395770.5
nsbatchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
312500
nsbatchedmm(2, Bsize=128)/forward/GPU/CUDA
16497
nsbatchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
706958.5
nsbatchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
725208
nsbatchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
1019750
nsbatchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
658292
nsbatchedmm(2, Bsize=128)/zygote/GPU/CUDA
198046.5
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
375
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
292
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
292
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
22951
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6542
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6208
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6792
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6208
nsbatchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
237567.5
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5709
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5667
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
5875
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5667
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
24038
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
21958
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
20875
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21625
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21125
nsbatchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
260574.5
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
146812.5
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
143875
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
145917
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
178146
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166659.5
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1355917
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1329374.5
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
861416.5
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1325916
nslayernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1338261
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
23084
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
21458
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24042
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
23958
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
350919.5
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
179500
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
120541
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
118167
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
151208
nslayernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1454020.5
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
292
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
292
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
291
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
22580
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6291
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6334
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6791
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6208
nsbatchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
253799.5
nslayernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5042
nslayernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4250
nslayernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5833.5
nslayernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4666
nslayernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
254794.5
nslayernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10042
nslayernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10042
nslayernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10417
nslayernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10125
nslayernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
1352736
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s): 1625 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s): 1583 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s): 1584 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s): 1542 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA: 23495 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s): 5708 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s): 5667 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s): 5750 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s): 5625 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA: 273637.5 ns
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s): 6842458 ns (previous: 6779291.5 ns, ratio: 1.01)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s): 6343020.5 ns (previous: 6365500 ns, ratio: 1.00)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s): 6507417 ns (previous: 6531583 ns, ratio: 1.00)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s): 7623042 ns (previous: 7635875 ns, ratio: 1.00)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA: 213659 ns (previous: 210025 ns, ratio: 1.02)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s): 24131500 ns (previous: 24055375 ns, ratio: 1.00)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s): 21298104 ns (previous: 21237625 ns, ratio: 1.00)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s): 21004749.5 ns (previous: 21535792 ns, ratio: 0.98)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s): 29792896 ns (previous: 29721771 ns, ratio: 1.00)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA: 2117701 ns (previous: 1973993 ns, ratio: 1.07)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s): 37668083 ns (previous: 37426416 ns, ratio: 1.01)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s): 34323688 ns (previous: 34385895.5 ns, ratio: 1.00)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s): 45641000 ns (previous: 45888792 ns, ratio: 0.99)
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s): 38230313 ns (previous: 49367041.5 ns, ratio: 0.77)
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s): 6459 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s): 5250 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s): 7500 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s): 7458 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA: 235380.5 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s): 8541 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s): 7792 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s): 8292 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s): 9208 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA: 1057995 ns
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s): 1525083 ns (previous: 1528208 ns, ratio: 1.00)
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s): 1258604.5 ns (previous: 1277937.5 ns, ratio: 0.98)
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s): 1613917 ns (previous: 1635937.5 ns, ratio: 0.99)
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s): 2159167 ns (previous: 2136917 ns, ratio: 1.01)
lenet(28, 28, 1, 128)/forward/GPU/CUDA: 273469.5 ns (previous: 277390.5 ns, ratio: 0.99)
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s): 7971979 ns (previous: 7872250 ns, ratio: 1.01)
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s): 6561833.5 ns (previous: 6588000 ns, ratio: 1.00)
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s): 7004875 ns (previous: 7229396.5 ns, ratio: 0.97)
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s): 10476458 ns (previous: 10478041 ns, ratio: 1.00)
lenet(28, 28, 1, 128)/zygote/GPU/CUDA: 1860749 ns (previous: 1130644 ns, ratio: 1.65)
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s): 326083.5 ns
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s): 347292 ns
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s): 379020.5 ns
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s): 343562.5 ns
batchedmm(128, Bsize=4)/forward/GPU/CUDA: 46613.5 ns
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s): 745458 ns
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s): 781417 ns
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s): 1067437.5 ns
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s): 751125 ns
batchedmm(128, Bsize=4)/zygote/GPU/CUDA: 306721.5 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s): 396333 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s): 287916 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s): 288062.5 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s): 751542 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA: 43483 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s): 646375 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s): 531834 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s): 530042 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s): 973417 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA: 188389 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s): 653542 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s): 639041.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s): 545542 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s): 655584 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA: 131455.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s): 2529917 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s): 2399708 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s): 2436833 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s): 2460520.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA: 1513461 ns
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s): 323146 ns
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s): 343771 ns
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s): 394750 ns
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s): 310562 ns
batchedmm(2, Bsize=32)/forward/GPU/CUDA: 15996 ns
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s): 699000 ns
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s): 717792 ns
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s): 1016334 ns
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s): 649937 ns
batchedmm(2, Bsize=32)/zygote/GPU/CUDA: 196510 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s): 1458958 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s): 1506167 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s): 1503458 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s): 1442834 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA: 39862 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s): 5157334 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s): 5010437.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s): 4993104 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s): 4988542 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA: 197580.5 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s): 3709 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s): 3667 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s): 3667 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s): 3708 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA: 32748 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s): 14833 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s): 15125 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s): 15292 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s): 15041 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA: 374855 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s): 71625 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s): 71333 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s): 71333 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s): 71333 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA: 113422 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s): 326208 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s): 318250 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s): 319375 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s): 317917 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA: 192316 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s): 1000 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s): 959 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s): 1083 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s): 1000 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA: 23450 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s): 8042 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s): 7895.5 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s): 8333 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s): 7792 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA: 258455 ns
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s): 465250 ns
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s): 472750 ns
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s): 547875 ns
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s): 554667 ns
batchedmm(128, Bsize=32)/forward/GPU/CUDA: 130091 ns
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s): 1420208 ns
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s): 1378895.5 ns
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s): 1600250 ns
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s): 1587791 ns
batchedmm(128, Bsize=32)/zygote/GPU/CUDA: 274988 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s): 334 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s): 292 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s): 292 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s): 292 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA: 31336 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s): 6625 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s): 5959 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s): 6354.5 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s): 6166 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA: 261129.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s): 1730708 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s): 1721229.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s): 1723750 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s): 1730229 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA: 168441.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s): 4400167 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s): 4366354 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s): 3903958 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s): 4358458 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA: 1240708 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s): 6792 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s): 6584 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s): 6833 ns
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s): 14542 ns
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA: 20531 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s): 32708 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s): 67708 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s): 32833 ns
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s): 51667 ns
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA: 291979.5 ns
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s): 336292 ns
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s): 347187.5 ns
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s): 415021 ns
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s): 324666.5 ns
batchedmm(2, Bsize=512)/forward/GPU/CUDA: 18102.5 ns
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s): 718416.5 ns
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s): 727250 ns
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s): 1030292 ns
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s): 672709 ns
batchedmm(2, Bsize=512)/zygote/GPU/CUDA: 346719.5 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s): 75667 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s): 75208 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s): 75375 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s): 75000 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA: 46739 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s): 333209 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s): 331291 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s): 332729.5 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s): 324292 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA: 208913 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s): 1483875 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s): 1531875 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s): 1529458 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s): 1467834 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA: 51266 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s): 5149875 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s): 5290166.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s): 5287000 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s): 4982583 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA: 202737.5 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s): 28291 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s): 28167 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s): 28291 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s): 28167 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA: 24497 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s): 66625 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s): 66542 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s): 66500 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s): 66500 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA: 532969 ns
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s): 1260875 ns (previous: 1396583.5 ns, ratio: 0.90)
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s): 1118417 ns (previous: 1097333 ns, ratio: 1.02)
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s): 1056541 ns (previous: 939062.5 ns, ratio: 1.13)
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s): 2256375 ns (previous: 2231792 ns, ratio: 1.01)
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA: 573252 ns (previous: 574483.5 ns, ratio: 1.00)
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s): 3028208 ns (previous: 2873417 ns, ratio: 1.05)
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s): 2726937.5 ns (previous: 2715208 ns, ratio: 1.00)
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s): 2733875 ns (previous: 2626645.5 ns, ratio: 1.04)
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s): 3818500 ns (previous: 3813542 ns, ratio: 1.00)
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA: 1997088 ns (previous: 1401203 ns, ratio: 1.43)
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s): 8958062.5 ns (previous: 8821895.5 ns, ratio: 1.02)
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s): 8813834 ns (previous: 8770604 ns, ratio: 1.00)
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s): 8742917 ns (previous: 8763666.5 ns, ratio: 1.00)
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s): 6350021 ns (previous: 6350229.5 ns, ratio: 1.00)
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s): 82895.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s): 80270.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s): 82875 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s): 80167 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA: 192999 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s): 2045708.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s): 2026499.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s): 2015875 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s): 2005042 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA: 797613 ns

This comment was automatically generated by workflow using github-action-benchmark.