chore: use [sources] in Project.toml (#1090)
Showing 15 changed files with 101 additions and 138 deletions.
2331c99
Lux Benchmarks
| Benchmark suite | Current | Previous | Ratio |
| --- | --- | --- | --- |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4125 ns | 3792 ns | 1.09 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4292 ns | 4084 ns | 1.05 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4875 ns | 4834 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4188 ns | 3959 ns | 1.06 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61773 ns | 61509.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10375 ns | 10500 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10250 ns | 10541 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10709 ns | 10250 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10584 ns | 10250 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 433806 ns | 431498.5 ns | 1.01 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1209 ns | 1062.5 ns | 1.14 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1208 ns | 1167 ns | 1.04 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1334 ns | 1417 ns | 0.94 |
| bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1333 ns | 1208 ns | 1.10 |
| bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18632 ns | 18573 ns | 1.00 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3958 ns | 4000 ns | 0.99 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3770.5 ns | 4000 ns | 0.94 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4250 ns | 4209 ns | 1.01 |
| bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3750 ns | 3750 ns | 1 |
| bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 111653 ns | 111184 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57167 ns | 57750 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46708 ns | 38542 ns | 1.21 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47042 ns | 46583 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85000 ns | 82208 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37778 ns | 37503.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2021166.5 ns | 2037645.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2091833 ns | 2095625 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2090417 ns | 1844375 ns | 1.13 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2037250 ns | 2001375 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 197839 ns | 196039 ns | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144125 ns | 145583 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 143687.5 ns | 143584 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 145875 ns | 146458 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144542 ns | 145000 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166264.5 ns | 168190 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 815917 ns | 1114291 ns | 0.73 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1110583 ns | 1150292 ns | 0.97 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1128458 ns | 805500 ns | 1.40 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1161791.5 ns | 1122750 ns | 1.03 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 531966.5 ns | 526921 ns | 1.01 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3834 ns | 3292 ns | 1.16 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3667 ns | 3666 ns | 1.00 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4208 ns | 4167 ns | 1.01 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3875 ns | 3500 ns | 1.11 |
| layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 72027 ns | 72235.5 ns | 1.00 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9666 ns | 10125 ns | 0.95 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9208 ns | 8375 ns | 1.10 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9667 ns | 8792 ns | 1.10 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8791 ns | 8833 ns | 1.00 |
| layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 495388.5 ns | 480020 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17250 ns | 14875 ns | 1.16 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15292 ns | 15000 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17750 ns | 17520.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14875 ns | 14583 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 54800 ns | 53914 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213334 ns | 214792 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213667 ns | 214875 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215625 ns | 214750 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213125 ns | 226813 ns | 0.94 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 273384.5 ns | 272785 ns | 1.00 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 625 ns | 0.80 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 834 ns | 917 ns | 0.91 |
| bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 459 ns | 1.18 |
| bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17538 ns | 17774 ns | 0.99 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1459 ns | 1792 ns | 0.81 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1625 ns | 1417 ns | 1.15 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1541 ns | 1709 ns | 0.90 |
| bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1584 ns | 1417 ns | 1.12 |
| bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 101749 ns | 102929.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 6625 ns | 7167 ns | 0.92 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5833 ns | 5250 ns | 1.11 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6000 ns | 1 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10541 ns | 10000 ns | 1.05 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23308 ns | 23666 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 230042 ns | 225187.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228000 ns | 237479.5 ns | 0.96 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229917 ns | 229334 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215459 ns | 226709 ns | 0.95 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 167869.5 ns | 168739 ns | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3875 ns | 3875 ns | 1 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3959 ns | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3875 ns | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3917 ns | 1.01 |
| dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23769 ns | 23839 ns | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16625 ns | 16792 ns | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16645.5 ns | 16833 ns | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16916 ns | 16958 ns | 1.00 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16542 ns | 16750 ns | 0.99 |
| dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 160993.5 ns | 161365 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 583542 ns | 571458 ns | 1.02 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 582166 ns | 576000 ns | 1.01 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 573083 ns | 574041 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 578334 ns | 571458 ns | 1.01 |
| dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 112908 ns | 113559.5 ns | 0.99 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1416417 ns | 1425375 ns | 0.99 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1413563 ns | 1418875 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1420000 ns | 1418958 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1427041.5 ns | 1422750 ns | 1.00 |
| dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 209512.5 ns | 210833 ns | 0.99 |
| lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1074937.5 ns | 1076645.5 ns | 1.00 |
| lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 961625 ns | 934291 ns | 1.03 |
| lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1349604 ns | 1340187.5 ns | 1.01 |
| lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1275750 ns | 1294270.5 ns | 0.99 |
| lenet(28, 28, 1, 64)/forward/GPU/CUDA | 272786 ns | 271656 ns | 1.00 |
| lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5988250 ns | 5796417 ns | 1.03 |
| lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4453229 ns | 4651792 ns | 0.96 |
| lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4954875 ns | 4918209 ns | 1.01 |
| lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5751250 ns | 5515938 ns | 1.04 |
| lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1067705 ns | 1071316.5 ns | 1.00 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 541 ns | 583 ns | 0.93 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 583 ns | 500 ns | 1.17 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 583 ns | 500 ns | 1.17 |
| dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23552 ns | 23948.5 ns | 0.98 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2167 ns | 0.98 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2209 ns | 0.96 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2125 ns | 1.02 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2125 ns | 1 |
| dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 171901 ns | 169153 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4208.5 ns | 3625 ns | 1.16 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4417 ns | 4084 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5042 ns | 4687.5 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4166 ns | 3709 ns | 1.12 |
| layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 65093 ns | 66303.5 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11292 ns | 11270.5 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11292 ns | 11417 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11875 ns | 11625 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11417 ns | 10667 ns | 1.07 |
| layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 448429 ns | 456550 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7020.5 ns | 6312.5 ns | 1.11 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7041 ns | 6770.5 ns | 1.04 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7625 ns | 7792 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6500 ns | 7083 ns | 0.92 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 52253 ns | 52528 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16979.5 ns | 18375 ns | 0.92 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17833 ns | 17833 ns | 1 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18875 ns | 17791 ns | 1.06 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16875 ns | 16833 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 301549.5 ns | 301396 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 584 ns | 0.93 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32680 ns | 32972 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8750 ns | 9020.5 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8834 ns | 8459 ns | 1.04 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9625 ns | 9041 ns | 1.06 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8667 ns | 8708 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 156693 ns | 159042.5 ns | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64125 ns | 64542 ns | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64291 ns | 64895.5 ns | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64458 ns | 64292 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64584 ns | 64542 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111163 ns | 110877 ns | 1.00 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 280625 ns | 284875 ns | 0.99 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 274250 ns | 297937.5 ns | 0.92 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 278083 ns | 282333 ns | 0.98 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 289292 ns | 274104.5 ns | 1.06 |
| dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 184761.5 ns | 184904.5 ns | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3374250 ns | 3295541 ns | 1.02 |
| mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3022020.5 ns | 2811062.5 ns | 1.08 |
| mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3033167 ns | 3016125 ns | 1.01 |
| mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4059271.5 ns | 3935209 ns | 1.03 |
| mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 577014 ns | 572132 ns | 1.01 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7622583.5 ns | 7478250 ns | 1.02 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7400875 ns | 7348937.5 ns | 1.01 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7463083 ns | 7339479.5 ns | 1.02 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8222208 ns | 8212959 ns | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1350413 ns | 1367334 ns | 0.99 |
| mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 18744750 ns | 18775625 ns | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 19149375 ns | 19121334 ns | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19037709 ns | 19108667 ns | 1.00 |
| mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 15854917 ns | 15653542 ns | 1.01 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23424208 ns | 23560250 ns | 0.99 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33648791 ns | 42472875 ns | 0.79 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37255625 ns | 37127771 ns | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35462146 ns | 34865500 ns | 1.02 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1854361 ns | 1862818 ns | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 189507459 ns | 188025167 ns | 1.01 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 163150563 ns | 176960479.5 ns | 0.92 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 151759708 ns | 152823708 ns | 0.99 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 449307375 ns | 441336000 ns | 1.02 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13915090 ns | 13912250 ns | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 290474792 ns | 290589750 ns | 1.00 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 338390437.5 ns | 276449542 ns | 1.22 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 298728666 ns | 296753875 ns | 1.01 |
| Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 400176437.5 ns | 333259041 ns | 1.20 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24666 ns | 22875 ns | 1.08 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23062.5 ns | 23333 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25125 ns | 24125 ns | 1.04 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21833 ns | 23542 ns | 0.93 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 95619.5 ns | 98041.5 ns | 0.98 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 103041 ns | 103625 ns | 0.99 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 103750 ns | 135834 ns | 0.76 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104584 ns | 105084 ns | 1.00 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 104146 ns | 103250 ns | 1.01 |
| layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 500114.5 ns | 518052 ns | 0.97 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6042 ns | 6209 ns | 0.97 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6500 ns | 6500 ns | 1 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6667 ns | 7041.5 ns | 0.95 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5958 ns | 5959 ns | 1.00 |
| layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 68217 ns | 70884 ns | 0.96 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14833 ns | 15084 ns | 0.98 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16208 ns | 15708 ns | 1.03 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16542 ns | 16250 ns | 1.02 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15541.5 ns | 14770.5 ns | 1.05 |
| layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 474515 ns | 492747 ns | 0.96 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3028583 ns | 3001020.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2072250 ns | 2085333 ns | 0.99 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2258958 ns | 2274000 ns | 0.99 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4727250 ns | 4550083 ns | 1.04 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 581996.5 ns | 589071 ns | 0.99 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23485750 ns | 23511750 ns | 1.00 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18074583 ns | 18279542 ns | 0.99 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17953667 ns | 16979209 ns | 1.06 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36188354.5 ns | 35598583 ns | 1.02 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3102669 ns | 3111231 ns | 1.00 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33313750 ns | 33266500 ns | 1.00 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27588229.5 ns | 28064750 ns | 0.98 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27385167 ns | 27365500 ns | 1.00 |
| Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42266896 ns | 41824541.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72125 ns | 71750 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75625 ns | 74021 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75209 ns | 74875 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72313 ns | 73458 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 102770.5 ns | 104698 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 217709 ns | 314125.5 ns | 0.69 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 264292 ns | 212229 ns | 1.25 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 208812 ns | 323000 ns | 0.65 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 216750 ns | 218042 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 548643 ns | 559024 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11834 ns | 11625 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 13750 ns | 12292 ns | 1.12 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12208 ns | 12500 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11791.5 ns | 11875 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 71431.5 ns | 73943 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26500 ns | 26583 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27375 ns | 26667 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28000 ns | 27708 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27167 ns | 26666 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 474755 ns | 493150 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12292 ns | 12208 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 13250 ns | 12896 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13625 ns | 13916 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12625 ns | 12500 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 53420 ns | 54608 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25708 ns | 26125 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26084 ns | 26000 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26375 ns | 25916.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26209 ns | 26000 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 305780 ns | 315887.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 181833 ns | 179208 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182750 ns | 183145.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 182000 ns | 183166 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 179750 ns | 180125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 56584 ns | 58575 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 582667 ns | 582958.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 589020.5 ns | 596541.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 585562.5 ns | 583833 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 582875 ns | 582834 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 286509.5 ns | 294599.5 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 6292 ns | 0.95 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7000 ns | 6459 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6917 ns | 6750 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6167 ns | 6041 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 71314 ns | 72806 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14041.5 ns | 14542 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15042 ns | 13333 ns | 1.13 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15334 ns | 15667 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15042 ns | 14333 ns | 1.05 |
| layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 465404.5 ns | 482192.5 ns | 0.97 |
| batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1163666 ns | 1177728.5 ns | 0.99 |
| batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1608417 ns | 1356208.5 ns | 1.19 |
| batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1245958 ns | 1250750 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1315062.5 ns | 1317541 ns | 1.00 |
| batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301860.5 ns | 301448 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4119833.5 ns | 4117688 ns | 1.00 |
| batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4367812.5 ns | 4491417 ns | 0.97 |
| batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4633625 ns | 4696854.5 ns | 0.99 |
| batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 4681521 ns | 4452542 ns | 1.05 |
| batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1040008 ns | 1051206.5 ns | 0.99 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1875 ns | 1875 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1875 ns | 1833 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1916 ns | 1875 ns | 1.02 |
| dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23628.5 ns | 24165 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4875 ns | 5000 ns | 0.97 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4875 ns | 4958 ns | 0.98 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4917 ns | 4917 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4875 ns | 4875 ns | 1 |
| dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 188198 ns | 194564.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5959 ns | 6041 ns | 0.99 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6333 ns | 6000 ns | 1.06 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6584 ns | 6145.5 ns | 1.07 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5625 ns | 5958 ns | 0.94 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 55698 ns | 57313.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10958 ns | 11979.5 ns | 0.91 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11875 ns | 11854.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11667 ns | 11042 ns | 1.06 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11041.5 ns | 11292 ns | 0.98 |
| groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 330993.5 ns | 342366 ns | 0.97 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 375 ns | 333 ns | 1.13 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 334 ns | 333 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 375 ns | 333 ns | 1.13 |
| dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23016 ns | 23004 ns | 1.00 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2791 ns | 3000 ns | 0.93 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2750 ns | 2750 ns | 1 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3083 ns | 3000 ns | 1.03 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2792 ns | 2750 ns | 1.02 |
| dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 158081 ns | 159207 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12000 ns | 11583 ns | 1.04 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12292 ns | 11292 ns | 1.09 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12979 ns | 13437.5 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11500 ns | 11708.5 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 56764.5 ns | 57286.5 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25250 ns | 25312.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25292 ns | 25083 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25542 ns | 25334 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25125 ns | 25167 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 293131 ns | 296722 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 ns | 4208 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4209 ns | 4208 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4250 ns | 4167 ns | 1.02 |
| dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24851 ns | 25099 ns | 0.99 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16084 ns | 16125 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16084 ns | 16041 ns | 1.00 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16250 ns | 16166 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16125 ns | 16042 ns | 1.01 |
| dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 193865.5 ns | 199370.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5833 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5833 ns | 5833 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5833 ns | 5792 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5833 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 33648.5 ns | 33986 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20937.5 ns | 21083 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20875 ns | 21125 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21375 ns | 21208 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20833 ns | 20667 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 175295.5 ns | 176941.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 405354.5 ns | 396792 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 383146 ns | 354313 ns | 1.08 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 487375 ns | 489167 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 505333 ns | 521584 ns | 0.97 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67095 ns | 66831 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 921500 ns | 1005417 ns | 0.92 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 879833.5 ns | 876583 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1239500 ns | 1235667 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1413875 ns | 1420854 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 190914 ns | 191762.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80792 ns | 80250 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80625 ns | 80209 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82416.5 ns | 84167 ns | 0.98 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82208.5 ns | 81125 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193084 ns | 193433 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921166 ns | 1916083 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1923375 ns | 1933854 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1702792 ns | 1917917 ns | 0.89 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1942625 ns | 1923708.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 397267 ns | 409629 ns | 0.97 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22298 ns | 22197 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1834 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1875 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1834 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1833 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 171128.5 ns | 170854.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6750 ns | 6791 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7125 ns | 6417 ns | 1.11 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7750 ns | 7375 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6583 ns | 6959 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60207.5 ns | 61202 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9334 ns | 9291.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9458 ns | 9166.5 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9458 ns | 9375 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9500 ns | 9334 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 309332.5 ns | 313492.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 118908083 ns | 120748834 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173905459 ns | 181703729 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148147000 ns | 148437750 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 104063562 ns | 104851584 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5483006 ns | 5474996 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 615077271 ns | 616853125 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 556251208 ns | 579539270.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 456191166.5 ns | 451846854.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 775264354 ns | 757165312.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38217009 ns | 34944567 ns | 1.09 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 651954834 ns | 649889209 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 668816521 ns | 688661771 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 584471208 ns | 592710229 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 743364500 ns | 741917708 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59041 ns | 59750 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47167 ns | 38959 ns | 1.21 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 48042 ns | 48000 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85604.5 ns | 83416 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38577 ns | 37459 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921792 ns | 1922792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1983375 ns | 1985083 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1974021 ns | 1978104 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1888041.5 ns | 1893917 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 177270 ns | 174160 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 267667 ns | 290625 ns | 0.92 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 269500 ns | 266708 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 269000 ns | 271521 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 265375 ns | 268167 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 129439 ns | 132776.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 602875 ns | 657229.5 ns | 0.92 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 667625 ns | 681187.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 589104 ns | 691583 ns | 0.85 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 696166.5 ns | 597417 ns | 1.17 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 698695 ns | 713916 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2214416 ns | 2243937 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2132916.5 ns | 2191895.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2099687.5 ns | 2213542 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2218542 ns | 2180437.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 135139.5 ns | 133381 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5496500 ns | 5496875 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5493084 ns | 5583292 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5512750 ns | 5498250 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5608375 ns | 5492750.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 786813 ns | 753967 ns | 1.04 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 645084 ns | 636833 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 646042 ns | 644417 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 643042 ns | 645333 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 645042 ns | 637292 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47537 ns | 46993.5 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1818666 ns | 1826042 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1720625 ns | 1667083 ns | 1.03 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1727375 ns | 1726542 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2097625 ns | 2105854.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 225809.5 ns | 222295 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58458 ns | 58500 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46958 ns | 38708 ns | 1.21 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47500 ns | 47250 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85709 ns | 84292 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29149.5 ns | 28598 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2024312 ns | 2031041 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2089792 ns | 2099020.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2079417 ns | 2091916.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2030812.5 ns | 1856417 ns | 1.09 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 192873 ns | 190652 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13367875 ns | 13391395.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12448375 ns | 12453250 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12498688 ns | 12557375.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15196500 ns | 15140541 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 515450 ns | 514312 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47301125 ns | 47481750 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41737208 ns | 41986250 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41031917 ns | 40944792 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 59054000 ns | 57945917 ns | 1.02 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3246636.5 ns | 3259544 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 73864187.5 ns | 96867229.5 ns | 0.76 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 90734875 ns | 91436187.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90710083 ns | 90591917 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99247604 ns | 76381625 ns | 1.30 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58667 ns | 59083.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47292 ns | 38750 ns | 1.22 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47625 ns | 47417 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85416.5 ns | 84000 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47961 ns | 46955 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1915542 ns | 1925125 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1967250 ns | 1979250 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1778666.5 ns | 1970729.5 ns | 0.90 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1904791 ns | 1897750 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195659 ns | 191790.5 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 333 ns | 1.13 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32740 ns | 32566 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6167 ns | 6417 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6000 ns | 6458 ns | 0.93 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6459 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6042 ns | 6083 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 176130 ns | 174123.5 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 250 ns | 1.33 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31946 ns | 31409 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2833 ns | 0.93 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 2791 ns | 1.00 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2916 ns | 2834 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2583 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 164970 ns | 161269 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 286577604 ns | 286258979.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 339468333 ns | 346927270.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314095271 ns | 313997291.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 270924375 ns | 270108416 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7117527 ns | 7104986 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1001221667 ns | 998016667 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 939877583 ns | 959348209 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 851361917 ns | 851652541.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1176703208 ns | 1162498166 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33887966 ns | 33999768 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1311845770.5 ns | 1672427541 ns | 0.78 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1679371125 ns | 1705785000 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1604290334 ns | 1631619209 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1668435000 ns | 1314128542 ns | 1.27 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1415333.5 ns | 1406813 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1417520.5 ns | 1416875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1416104 ns | 1459625 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1420146 ns | 1407750 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128175 ns | 127789 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5010542 ns | 5022896 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5020291.5 ns | 5051333 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5037500 ns | 5029542 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5047042 ns | 5031875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 595594 ns | 559312.5 ns | 1.06 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 175229188 ns | 169600250 ns | 1.03 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 123461167 ns | 180340396 ns | 0.68 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 127594250 ns | 130036124.5 ns | 0.98 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 154552916.5 ns | 169790708.5 ns | 0.91 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4884050 ns | 5056885.5 ns | 0.97 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 667971584 ns | 669854958 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 641402625 ns | 604244667 ns | 1.06 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 501342541 ns | 501867209 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 657859875 ns | 684062709 ns | 0.96 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 15872908 ns | 16520518 ns | 0.96 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8987479.5 ns | 8950666 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8781270.5 ns | 8876958.5 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7857729 ns | 7849458.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10412374.5 ns | 10185417 ns | 1.02 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1592095 ns | 1594436 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36150584 ns | 36026541.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36797500 ns | 38047792 ns | 0.97 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33192666.5 ns | 33343417 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 40244625 ns | 38792000 ns | 1.04 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6455577 ns | 6457988 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47417 ns | 47417 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47584 ns | 47375 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47583 ns | 47584 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47333 ns | 47333 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18534 ns | 18535 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 52833.5 ns | 50291 ns | 1.05 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50375 ns | 50375 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50666 ns | 50417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50250 ns | 50083 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 202850 ns | 191873 ns | 1.06 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7459 ns | 6458 ns | 1.16 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7417 ns | 6917 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7312.5 ns | 7750 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7458.5 ns | 6958 ns | 1.07 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 98661 ns | 91345 ns | 1.08 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9792 ns | 10458 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10125 ns | 9916 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10542 ns | 10084 ns | 1.05 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 10208 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 555252.5 ns | 527140.5 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6750 ns | 5625 ns | 1.20 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5917 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7208.5 ns | 6958 ns | 1.04 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6542 ns | 5750 ns | 1.14 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 104446.5 ns | 120543 ns | 0.87 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13125 ns | 13583 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12917 ns | 13354.5 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13292 ns | 13458 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13083 ns | 13000 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 478181 ns | 537999 ns | 0.89 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1125 ns | 1083 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32701 ns | 32473 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8375 ns | 7917 ns | 1.06 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 7917 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 7959 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8083 ns | 8167 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 206369.5 ns | 206314.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23417 ns | 23437.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23500 ns | 23167 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23416 ns | 23584 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23333 ns | 23542 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18592 ns | 18671 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52750 ns | 52458 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 54709 ns | 52541 ns | 1.04 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52917 ns | 53458 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52917 ns | 52062.5 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 283991 ns | 291832.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1399417 ns | 1458937 ns | 0.96 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1396395.5 ns | 1401583 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1396833 ns | 1403833.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1449874.5 ns | 1459708.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196187 ns | 195968 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5003208 ns | 5008771 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5005375 ns | 5044104 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5023834 ns | 5017250 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5050167 ns | 5011916 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 585941 ns | 599687 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3039563 ns | 3061000 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2072875 ns | 2086750 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2275208 ns | 2304917 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4856479 ns | 4539041 ns | 1.07 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 583070 ns | 581670 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24354562.5 ns | 24376958 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18867354 ns | 19122667 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18817521 ns | 19181062.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 37413770.5 ns | 36163041 ns | 1.03 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3176919 ns | 3185287.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33990500 ns | 34039875 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28382208.5 ns | 28717291.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28070021 ns | 28156000 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42353875 ns | 41614584 ns | 1.02 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144782125 ns | 144831583 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 142800542 ns | 143542708 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 123809687.5 ns | 124983229.5 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 168891563 ns | 173618479 ns | 0.97 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22773536 ns | 22558463 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1277305063 ns | 1247182979 ns | 1.02 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1180173271 ns | 836595146 ns | 1.41 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 757990666 ns | 738893583 ns | 1.03 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 688381500 ns | 672803125 ns | 1.02 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118470004 ns | 118329511 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75042 ns | 84666 ns | 0.89 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73625 ns | 73666 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 77166 ns | 76146 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74708 ns | 75688 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 220284.5 ns | 240753.5 ns | 0.91 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 285750 ns | 287042 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 191208 ns | 212354 ns | 0.90 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 192209 ns | 296854 ns | 0.65 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 286417 ns | 284250 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1195118 ns | 1238105 ns | 0.97 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35568917 ns | 35497979 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35278833 ns | 35870917 ns | 0.98 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32149729 ns | 32110833 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 41733750 ns | 40961896 ns | 1.02 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5841675.5 ns | 5843453.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148531084 ns | 149169500 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 153045542 ns | 155980437.5 ns | 0.98 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 136231750 ns | 134845625 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 228329854.5 ns | 287434667 ns | 0.79 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34864707.5 ns | 34879809 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119094187.5 ns | 121767709 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174236667 ns | 181613625 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147985917 ns | 148039291 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 107449375 ns | 104612333.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5482351 ns | 5485164 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 467600417 ns | 472118833 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 465577292 ns | 486130458.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 438034750 ns | 440650208 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 759816229.5 ns | 746192375 ns | 1.02 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35154520.5 ns | 32245076 ns | 1.09 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 709358854.5 ns | 643396416 ns | 1.10 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 655624271 ns | 675303249.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 571617791 ns | 575492166 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 869387791 ns | 856961334 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1327250.5 ns | 1312541 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 905875 ns | 677667 ns | 1.34 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 907750 ns | 963459 ns | 0.94 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2079042 ns | 2093375 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 578714.5 ns | 580070.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2967333.5 ns | 2966541.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2631479.5 ns | 2496854 ns | 1.05 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2620896 ns | 2623959 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3771729 ns | 3704083 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1755565 ns | 1730505 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6610917 ns | 6656375 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6496875 ns | 6477624.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6497437.5 ns | 6431167 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4521833 ns | 4450479.5 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7375 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 5417 ns | 1.13 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 6084 ns | 1 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10542 ns | 9917 ns | 1.06 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25575 ns | 25252 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212875 ns | 212583 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229500 ns | 229770.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221187.5 ns | 220500 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 246625 ns | 206083 ns | 1.20 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 261769.5 ns | 251646.5 ns | 1.04 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 313730896 ns | 301644020.5 ns | 1.04 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 222537125 ns | 280942354.5 ns | 0.79 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 194707917 ns | 189363792 ns | 1.03 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 313279354 ns | 305392479 ns | 1.03 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7673155 ns | 7676597 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1080950395.5 ns | 1087372208.5 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 899873458 ns | 980974208 ns | 0.92 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 834690333 ns | 865965209 ns | 0.96 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1180116917 ns | 1158600916.5 ns | 1.02 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26459206.5 ns | 26533591 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5354.5 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5417 ns | 5375 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6250 ns | 6917 ns | 0.90 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6084 ns | 4958 ns | 1.23 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 162725 ns | 146657 ns | 1.11 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7375 ns | 7395.5 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7084 ns | 7375 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns | 7250 ns | 1.07 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7625 ns | 7250 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 624677.5 ns | 596011.5 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 584 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 666 ns | 625 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 625 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23758 ns | 24031 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 8917 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9291 ns | 9708 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9584 ns | 9583 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9209 ns | 8833 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 225738 ns | 216620.5 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 352000 ns | 353333 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352042 ns | 352041 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 354604.5 ns | 352666.5 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 353833 ns | 352417 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21344 ns | 21463 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 822291 ns | 820625 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 812479 ns | 828917 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 824250 ns | 774875 ns | 1.06 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 831958 ns | 778729 ns | 1.07 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 304872 ns | 269469 ns | 1.13 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 337167 ns | 337187.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 343334 ns | 313687.5 ns | 1.09 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 446875 ns | 444709 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 316354.5 ns | 334500 ns | 0.95 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18389 ns | 17922 ns | 1.03 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 695521 ns | 689958 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 750792 ns | 746333 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1026833 ns | 1025042 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 688999.5 ns | 694854.5 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 282579.5 ns | 242950 ns | 1.16 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 356667 ns | 351417 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 354500 ns | 327270.5 ns | 1.08 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 421500 ns | 414729.5 ns | 1.02 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 347042 ns | 371750 ns | 0.93 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22715 ns | 22559 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 754229 ns | 747208 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 753792 ns | 749416 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1072417 ns | 1069374.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 823125 ns | 815937.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 256204.5 ns | 224503 ns | 1.14 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3583 ns | 3708 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3500 ns | 3625 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3750 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3667 ns | 3291 ns | 1.11 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17612 ns | 17855 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4208 ns | 4208 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4500 ns | 4208 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4292 ns | 4333 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4417 ns | 4208 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 280326.5 ns | 248489.5 ns | 1.13 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4209 ns | 3708 ns | 1.14 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4333 ns | 4167 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4291 ns | 4791 ns | 0.90 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4125 ns | 3792 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 232867.5 ns | 203806 ns | 1.14 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8291 ns | 8667 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8500 ns | 8250 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8562.5 ns | 8458 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8667 ns | 8667 ns | 1 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1214158.5 ns | 1166315.5 ns | 1.04 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203792 ns | 204875 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 211375 ns | 209750 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209083 ns | 209834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 202541 ns | 200000 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34629 ns | 34893 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 645167 ns | 602917 ns | 1.07 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 623895.5 ns | 628833 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 630084 ns | 621584 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 633833 ns | 592041 ns | 1.07 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 349768 ns | 321942.5 ns | 1.09 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 972062.5 ns | 978791 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 937916.5 ns | 937250.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 960125 ns | 960250 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1319708 ns | 1307271 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 208475 ns | 207418 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4500166 ns | 4504084 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4475687.5 ns | 4619604.5 ns | 0.97 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4308250 ns | 4294917 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6508250 ns | 6229292 ns | 1.04 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 944786.5 ns | 936037 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4084 ns | 3354 ns | 1.22 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3750 ns | 3583 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4083 ns | 4417 ns | 0.92 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3542 ns | 3333 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 226002.5 ns | 196464 ns | 1.15 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7334 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7625 ns | 7417 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7291 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7334 ns | 6917 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1008436 ns | 985634 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1647479.5 ns | 1640792 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1203104.5 ns | 1171541.5 ns | 1.03 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1378125 ns | 1327125 ns | 1.04 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2472896 ns | 2384666 ns | 1.04 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213582 ns | 216205.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12309291 ns | 12345499.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9565666 ns | 9603042 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9280334 ns | 9259895.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18216500 ns | 18032958.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1940596 ns | 1950941 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17356917 ns | 17348083 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14358625 ns | 14444583.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14329312.5 ns | 14302167 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21175541 ns | 21057645.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 133834 ns | 87666.5 ns | 1.53 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90000 ns | 89562 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 93687 ns | 90292 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 90750 ns | 88875 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125997 ns | 126565 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2019458 ns | 2024000 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2029375 ns | 2030958.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2029667 ns | 1707583 ns | 1.19 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2049458 ns | 2030042 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1042357 ns | 999913 ns | 1.04 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 347333 ns | 343750 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 349250 ns | 326145.5 ns | 1.07 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 394583 ns | 396833 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 293978.5 ns | 309896 ns | 0.95 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16455.5 ns | 16654 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 709041 ns | 702666 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 741583.5 ns | 733666 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1022875 ns | 1020166 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 644791 ns | 652500 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 197069.5 ns | 190386.5 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7416 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5291 ns | 1.11 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 6000 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10583 ns | 10041 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34401 ns | 34743 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224416.5 ns | 224334 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220375 ns | 229333 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231250 ns | 220959 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 236834 ns | 206292 ns | 1.15 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 318034 ns | 296926 ns | 1.07 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3750 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3792 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3667 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3708 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 23219 ns | 23083 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14375 ns | 14416 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14375 ns | 14209 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14417 ns | 14292 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14167 ns | 14458 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 484400.5 ns | 448235 ns | 1.08 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 97417 ns | 92854 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 94042 ns | 99583 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 97959 ns | 94542 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 95500 ns | 96042 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125837 ns | 125978 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1920250 ns | 1920562.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1649417 ns | 1914937.5 ns | 0.86 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1923437 ns | 1653792 ns | 1.16 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1953916 ns | 1928541 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 974936 ns | 893203 ns | 1.09 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 879729.5 ns | 878750 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 832708 ns | 800021 ns | 1.04 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1229562.5 ns | 1221729 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 939583 ns | 963792 ns | 0.97 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 281248 ns | 277692.5 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2831145.5 ns | 2824834 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2527396 ns | 2464958 ns | 1.03 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3353354.5 ns | 3323271 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3411104.5 ns | 3398958 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1661947.5 ns | 1565101.5 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14854.5 ns | 17667 ns | 0.84 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15583 ns | 15458.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18792 ns | 17250.5 ns | 1.09 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16000 ns | 14645.5 ns | 1.09 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144462 ns | 142432.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 255958 ns | 218209 ns | 1.17 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215583.5 ns | 222958.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 257583 ns | 216334 ns | 1.19 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 262500 ns | 215062.5 ns | 1.22 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 650445 ns | 637432 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221375 ns | 221145.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 220792 ns | 222375 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 223083 ns | 220917 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 220646 ns | 220333 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 273454.5 ns | 280530 ns | 0.97 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 559542 ns | 510354 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 510542 ns | 499375 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 507813 ns | 500021 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 535208.5 ns | 507041 ns | 1.06 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1396532 ns | 1281236 ns | 1.09 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 328770.5 ns | 332250 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 336937 ns | 316000 ns | 1.07 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 370500 ns | 364333 ns | 1.02 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 299625 ns | 323834 ns | 0.93 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17616 ns | 17441 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 711834 ns | 715833.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 732166.5 ns | 735083 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1024479.5 ns | 1022959 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 657917 ns | 667041 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 200486.5 ns | 193588.5 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18520.5 ns | 18666 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19083 ns | 17375 ns | 1.10 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19625 ns | 19167 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18396 ns | 17083.5 ns | 1.08 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 147224 ns | 147781 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213625 ns | 212542 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221875 ns | 214146 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221812.5 ns | 213834 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 237333 ns | 211354.5 ns | 1.12 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 951211 ns | 877964 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4583 ns | 4083 ns | 1.12 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4417 ns | 4291.5 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4917 ns | 5375 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4437.5 ns | 3958 ns | 1.12 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 239868.5 ns | 169898 ns | 1.41 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10625 ns | 10834 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10500 ns | 10542 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10958 ns | 10583 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10625 ns | 10459 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1112681.5 ns | 993411.5 ns | 1.12 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3791 ns | 3417 ns | 1.11 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3541 ns | 3167 ns | 1.12 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4229.5 ns | 4375 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3833 ns | 3062.5 ns | 1.25 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 252769 ns | 203556.5 ns | 1.24 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7334 ns | 7791 ns | 0.94 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7792 ns | 7458 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7916 ns | 7250 ns | 1.09 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7437.5 ns | 7541 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1116124.5 ns | 1041955 ns | 1.07 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23341875 ns | 23557729 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34053354.5 ns | 43140979 ns | 0.79 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37482854.5 ns | 37880833 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35456625 ns | 34954917 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1845777.5 ns | 1859678 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184378291 ns | 184630708 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 158584667 ns | 172192624.5 ns | 0.92 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146193479 ns | 146314396 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 422496166.5 ns | 415449708 ns | 1.02 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16510255 ns | 16494786 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 426674167 ns | 428781042 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 253893875 ns | 259710791 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 232875895.5 ns | 231751208 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 494805750 ns | 484878833 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 184500 ns | 183625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183458 ns | 183375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185583 ns | 184417 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183416.5 ns | 182667 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 231684 ns | 177771.5 ns | 1.30 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 599042 ns | 590604 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 586312.5 ns | 588083 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 636833 ns | 586792 ns | 1.09 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 641125 ns | 586958 ns | 1.09 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1087543.5 ns | 1015783.5 ns | 1.07 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3842645.5 ns | 3860917 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3643229 ns | 3732375 ns | 0.98 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3509333 ns | 3478062.5 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5524187.5 ns | 5358854.5 ns | 1.03 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 534809 ns | 533317.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17462833 ns | 17452375 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17328500.5 ns | 17779209 ns | 0.97 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16632083 ns | 16551750 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 23474479.5 ns | 22184000 ns | 1.06 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2613903 ns | 2614491.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 625 ns | 0.80 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 584 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 625 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32551 ns | 32765 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 9625 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8625 ns | 9542 ns | 0.90 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9792 ns | 9625 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9354.5 ns | 8917 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 264963 ns | 263711.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 500529917 ns | 501494042 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 429131021 ns | 411555459 ns | 1.04 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 390085458 ns | 374781084 ns | 1.04 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 680776812.5 ns | 672198042 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12474289.5 ns | 12477100 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2050021916.5 ns | 2044775145.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1635602292 ns | 1660536667 ns | 0.98 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1501725478.5 ns | 1495631604 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2237822875 ns | 2221523375 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49165291 ns | 49258137.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1648791.5 ns | 1643291 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1195792 ns | 1172917 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1379625 ns | 1391041.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2436187.5 ns | 2338333 ns | 1.04 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215012 ns | 215612.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12725833.5 ns | 12698542 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9944041.5 ns | 9998999.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9667395.5 ns | 9717041 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18594104.5 ns | 18433792 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2038696 ns | 2039696 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17722000 ns | 17679687.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14694125 ns | 14770854.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14557833 ns | 14602583.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21533833 ns | 21327625 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26250 ns | 26292 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26291 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26750 ns | 26250 ns | 1.02 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26292 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23955 ns | 24225 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66833 ns | 67250 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66625 ns | 66834 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66958 ns | 68166 ns | 0.98 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66833 ns | 66792 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 403690.5 ns | 378162.5 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 202791 ns | 203125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209375 ns | 208500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209667 ns | 208666 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200708 ns | 200125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26177 ns | 26005 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 612146 ns | 646625 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 622334 ns | 628813 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 680520.5 ns | 669895.5 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 634750 ns | 580791.5 ns | 1.09 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 350618 ns | 311381 ns | 1.13 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 650500 ns | 651667 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 542145.5 ns | 638666 ns | 0.85 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 634666 ns | 647417 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 679459 ns | 653083.5 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131917 ns | 131397 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2229542 ns | 2243375 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2231250 ns | 2314937.5 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2251687.5 ns | 2249625 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2330333 ns | 2235375 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1238942 ns | 1114755 ns | 1.11 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16854 ns | 18291 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19500 ns | 17500 ns | 1.11 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19791.5 ns | 20917 ns | 0.95 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17750 ns | 18292 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144506 ns | 143094 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 230625 ns | 223500 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 260583 ns | 226042 ns | 1.15 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 261125 ns | 262917 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 265583.5 ns | 230125 ns | 1.15 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1064679 ns | 943015 ns | 1.13 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 666 ns | 0.94 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23448 ns | 23380 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10125 ns | 10104.5 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9792 ns | 10166 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10000 ns | 10000 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9979 ns | 9583 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 257505.5 ns | 254915.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5084 ns | 1.20 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5625 ns | 5375 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6666 ns | 6791 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6084 ns | 5250 ns | 1.16 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 233944.5 ns | 190346.5 ns | 1.23 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7416 ns | 7250 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7334 ns | 7125 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7834 ns | 7250 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7417 ns | 7083 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 800597 ns | 735734 ns | 1.09 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2209 ns | 2167 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2292 ns | 2208 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2208 ns | 2209 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2250 ns | 2417 ns | 0.93 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17989 ns | 18111 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6541.5 ns | 6750 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6542 ns | 6375 ns | 1.03 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 7125 ns | 6625 ns | 1.08 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6750 ns | 6625 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 330052 ns | 306022.5 ns | 1.08 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 751958.5 ns | 751583.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746604.5 ns | 748875 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 749167 ns | 746812.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 748959 ns | 748500 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21090 ns | 21064 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 791292 ns | 791834 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 792333 ns | 788667 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 773291 ns | 786646.5 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792291.5 ns | 792479 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 299003.5 ns | 294710 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 7417 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5208 ns | 1.14 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6083 ns | 6000 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10791 ns | 10084 ns | 1.07 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33088.5 ns | 33108.5 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 233333 ns | 228645.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229479 ns | 231416 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 269542 ns | 271625 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220958 ns | 225958 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 359587 ns | 351410 ns | 1.02 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10625 ns | 10292 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10375 ns | 10084 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10958 ns | 11166 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10959 ns | 10000 ns | 1.10 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 249563.5 ns | 209596.5 ns | 1.19 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25042 ns | 24709 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24625 ns | 24333 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25375 ns | 24291 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25250 ns | 24437.5 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1114585 ns | 1037550 ns | 1.07 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106488708 ns | 107199542 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 117008645.5 ns | 126347334 ns | 0.93 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120350584 ns | 120468625 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 118085396 ns | 117762042 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2661446 ns | 2637816 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 393399750 ns | 393813416 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 368428125 ns | 380007916 ns | 0.97 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 359138458 ns | 355873375 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 486814000 ns | 484550250 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15211152 ns | 15152772.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 759103375 ns | 939763875 ns | 0.81 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 755373708 ns | 777743792 ns | 0.97 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 744752604 ns | 745742833 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 959286729.5 ns | 767071771.5 ns | 1.25 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6896 ns | 7167 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7791 ns | 6833 ns | 1.14 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8250 ns | 8458 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7583 ns | 7562.5 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 240721 ns | 228024 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14458 ns | 14250 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14208.5 ns | 14042 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14750 ns | 13875 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14312.5 ns | 13333 ns | 1.07 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1072384 ns | 1000779 ns | 1.07 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6292 ns | 6167 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6125 ns | 1 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7291 ns | 8250 ns | 0.88 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6458 ns | 5604.5 ns | 1.15 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 234548 ns | 214266.5 ns | 1.09 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12584 ns | 12417 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12625 ns | 12542 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12959 ns | 12875 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12583 ns | 12541 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 784420 ns | 724930 ns | 1.08 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 347708 ns | 349208 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 386916.5 ns | 326145.5 ns | 1.19 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 398834 ns | 393333 ns | 1.01 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 292375 ns | 314271 ns | 0.93 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16947 ns | 17228 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 708249.5 ns | 706500 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 746000 ns | 739437.5 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1025229 ns | 1020354 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 652416.5 ns | 658541 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 199954 ns | 198297 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23200 ns | 23935.5 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6458 ns | 6500 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6417 ns | 6584 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6750 ns | 6584 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6542 ns | 6250 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 238715 ns | 240134 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 5875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5917 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5917 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5917 ns | 5834 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24219 ns | 24721 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21250 ns | 21500 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 20875 ns | 21333 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21417 ns | 21292 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21896 ns | 21208 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 261648 ns | 262379.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 145458 ns | 144229.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 147521 ns | 144042 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 147770.5 ns | 147292 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147458.5 ns | 145833 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167051 ns | 167351 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1322500 ns | 1320395.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1320041 ns | 1358771 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1325833 ns | 1324084 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1391458 ns | 1329333.5 ns | 1.05 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1346456 ns | 1268788 ns | 1.06 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22250 ns | 24083 ns | 0.92 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24750 ns | 22375 ns | 1.11 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23416 ns | 25104.5 ns | 0.93 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22396 ns | 21917 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 353387 ns | 280502 ns | 1.26 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 178709 ns | 131646 ns | 1.36 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 118687.5 ns | 121334 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 127459 ns | 177687.5 ns | 0.72 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 134041.5 ns | 130209 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1464281 ns | 1380349 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22942 ns | 23199 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6459 ns | 6708 ns | 0.96 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 7083 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6666 ns | 6708 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6750 ns | 6083 ns | 1.11 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 255510 ns | 258254.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5000 ns | 5042 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4729.5 ns | 4500 ns | 1.05 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5292 ns | 4917 ns | 1.08 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4709 ns | 4917 ns | 0.96 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 256450 ns | 243109 ns | 1.05 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 10375 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10167 ns | 10042 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10417 ns | 10125 ns | 1.03 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 10167 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1348843.5 ns | 1338362 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1584 ns | 1667 ns | 0.95 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1667 ns | 1625 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1667 ns | 1542 ns | 1.08 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22876 ns | 23629 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5625 ns | 5875 ns | 0.96 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5625 ns | 5666 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6041 ns | 5958 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5708 ns | 5625 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 272214.5 ns | 278503 ns | 0.98 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6888875 ns | 6825854.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6384792 ns | 6429125 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6514708.5 ns | 6541187.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7555583 ns | 7656375 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214320 ns | 215102 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24087271 ns | 24080834 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21278062.5 ns | 21338208 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21040583 ns | 21079333 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29921333 ns | 29660375 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2106395 ns | 2111008 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37396292 ns | 48564000 ns | 0.77 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45619104.5 ns | 45595770.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45717854 ns | 45721854 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49514208 ns | 38038271 ns | 1.30 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6208 ns | 5687.5 ns | 1.09 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6333 ns | 6041 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6459 ns | 6917 ns | 0.93 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6083 ns | 5375 ns | 1.13 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 236136 ns | 239823 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9125 ns | 8291 ns | 1.10 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8666 ns | 8500 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8416 ns | 8750 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8583 ns | 8750 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1059780 ns | 1069933 ns | 0.99 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1497208 ns | 1555021 ns | 0.96 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1271146 ns | 1235375.5 ns | 1.03 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1623333 ns | 1618375 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2143312.5 ns | 2095209 ns | 1.02 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 273613.5 ns | 285020 ns | 0.96 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7900125 ns | 7898542 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6605479 ns | 6630645.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7156416.5 ns | 7200958 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10528062.5 ns | 10372854.5 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1850752 ns | 1904820 ns | 0.97 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 343000 ns | 342000 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 349166.5 ns | 323833 ns | 1.08 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 383250 ns | 382208 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 325438 ns | 342042 ns | 0.95 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46572 ns | 43080 ns | 1.08 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 746124.5 ns | 725958 ns | 1.03 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 795499.5 ns | 782938 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1076208.5 ns | 1067750 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 753291.5 ns | 737041.5 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 309766 ns | 314201.5 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397375 ns | 397583 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287916 ns | 211916 ns | 1.36 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288000 ns | 288208 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 749125 ns | 750834 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44192 ns | 44587.5 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 666145.5 ns | 670500 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 531062.5 ns | 470708 ns | 1.13 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 529625 ns | 531792 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 975062.5 ns | 974083 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 188202 ns | 192970 ns | 0.98 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 646708 ns | 651646 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 543166.5 ns | 644458.5 ns | 0.84 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 654229 ns | 659271 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 659479 ns | 645333 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132313.5 ns | 132814 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2450208 ns | 2440750 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2447833 ns | 2525916.5 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2404020.5 ns | 2439124.5 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2562667 ns | 2464750 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1598744 ns | 1349058.5 ns | 1.19 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 347208 ns | 344292 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 347542 ns | 326104 ns | 1.07 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 400125 ns | 393875 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 291604 ns | 312896 ns | 0.93 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16522 ns | 16925 ns | 0.98 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 706875 ns | 709938 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 734333 ns | 739917 ns | 0.99 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1028542 ns | 1021708 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 647750 ns | 650083.5 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 199294.5 ns | 202873.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458584 ns | 1458625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1498042 ns | 1490666 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1499666 ns | 1498417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1444167 ns | 1436416 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40454 ns | 41016 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5120438 ns | 5105458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5292292 ns | 5294583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5286000 ns | 5292167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5017937.5 ns | 5007208 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195965.5 ns | 201135.5 ns | 0.97 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3708 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3750 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3708 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3667 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 32802 ns | 33479.5 ns | 0.98 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15125 ns | 15292 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15292 ns | 15125 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15417 ns | 15291 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14875 ns | 15042 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 372915.5 ns | 381756.5 ns | 0.98 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 70917 ns | 71209 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71250 ns | 71250 ns | 1 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 70916 ns | 71125 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71375 ns | 70062.5 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 112608 ns | 114111 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 317750 ns | 318250 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 318417 ns | 329625 ns | 0.97 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 318375 ns | 318708 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 327667 ns | 317958 ns | 1.03 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 192232 ns | 197229.5 ns | 0.97 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1000 ns | 1083 ns | 0.92 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1000 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23208 ns | 24163 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8000 ns | 8167 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8042 ns | 8041 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8250 ns | 8667 ns | 0.95 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 7625 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 259321 ns | 264271.5 ns | 0.98 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 468417 ns | 464166.5 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 479458 ns | 448167 ns | 1.07 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 555416 ns | 553459 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 544792 ns | 548917 ns | 0.99 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 128776.5 ns | 129241.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1386166.5 ns | 1380229 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1391187.5 ns | 1393229 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1623687.5 ns | 1619541 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1644333.5 ns | 1590270.5 ns | 1.03 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 275740 ns | 277974 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 416 ns | 0.80 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31924 ns | 32417 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 5958 ns | 6375 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6167 ns | 6500 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6459 ns | 6542 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6166 ns | 5958 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 262594 ns | 267135 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1733625 ns | 1723834 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1722729.5 ns | 1731042 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1729958 ns | 1722458 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1727000 ns | 1727375 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168805 ns | 168945.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4353667 ns | 4366646 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4366916.5 ns | 4396958.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4362042 ns | 4374416.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4429395.5 ns | 4349500 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1264129.5 ns | 1192401 ns | 1.06 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6959 ns | 6750 ns | 1.03 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6708 ns | 6541 ns | 1.03 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7000 ns | 7292 ns | 0.96 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6833 ns | 6542 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20795 ns | 20406 ns | 1.02 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51500 ns | 81771 ns | 0.63 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 38042 ns | 49083 ns | 0.78 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 47209 ns | 72271 ns | 0.65 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 48666.5 ns | 51334 ns | 0.95 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 295172.5 ns | 213340.5 ns | 1.38 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 355084 ns | 354167 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 350583 ns | 329541.5 ns | 1.06 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 423208.5 ns | 401083 ns | 1.06 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 295000 ns | 321771 ns | 0.92 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18329 ns | 18865 ns | 0.97 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 718562.5 ns | 722646.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 744125 ns | 740500 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1031500 ns | 1030625 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 672625 ns | 673875 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 347666.5 ns | 350549.5 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75042 ns | 75250 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75250 ns | 75250 ns | 1 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75333 ns | 75458 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75584 ns | 75042 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46603 ns | 47823 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324708 ns | 324625 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 327334 ns | 341667 ns | 0.96 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 324375 ns | 324250 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 334062.5 ns | 330833 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 207370 ns | 216202 ns | 0.96 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485208 ns | 1485500 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1526250 ns | 1517334 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1526625 ns | 1526000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1467250 ns | 1463167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51906 ns | 53576 ns | 0.97 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5116396.5 ns | 5124354.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5284312.5 ns | 5278542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5277167 ns | 5287917 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5025562.5 ns | 4986958 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 203896.5 ns | 209445 ns | 0.97 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28333 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28333 ns | 28250 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28625 ns | 28208 ns | 1.01 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28334 ns | 28291 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24422 ns | 25452 ns | 0.96 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66250 ns | 66333 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66250 ns | 66250 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66250 ns | 66250 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66375 ns | 66333 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 519781.5 ns | 539628 ns | 0.96 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1501250 ns | 1483687.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1125791 ns | 859791.5 ns | 1.31 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1125104.5 ns | 1143208 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2259459 ns | 2247229.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 571991 ns | 585407 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3070000 ns | 3085000 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2775000 ns | 2591208 ns | 1.07 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2736500 ns | 2737895.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3899292 ns | 3816250 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2055229 ns | 2035890 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8838896 ns | 8818187.5 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8809083.5 ns | 8953500 ns | 0.98 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8782709 ns | 8776854 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6483958.5 ns | 6365041 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80583 ns | 80791 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81334 ns | 79875 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83645.5 ns | 82792 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 136500 ns | 80708 ns | 1.69 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192157 ns | 194256.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2012958 ns | 2013375 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2009583 ns | 1748958 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2015916.5 ns | 2018500 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2051000 ns | 2022750 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 803108 ns | 809328 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
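
In the rows above, the last value of each entry is the first timing divided by the second, rounded to two decimals. The Python sketch below recomputes that ratio for three entries copied from the results and flags the ones above a cutoff. It assumes the first timing belongs to the current commit and the second to the previous one (the column headers are not shown in this part of the page). The helper names and the `1.5` cutoff are hypothetical, chosen only for illustration, and are not taken from github-action-benchmark's configuration.

```python
# Hypothetical helper: recompute the ratio column from the two timings (in ns).
def ratio(first_ns: float, second_ns: float) -> float:
    return round(first_ns / second_ns, 2)

# Three entries copied from the results: (name, first timing, second timing).
rows = [
    ("bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA", 295172.5, 213340.5),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)", 51500, 81771),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)", 136500, 80708),
]

# Illustrative cutoff only; not the value used by the workflow.
CUTOFF = 1.5

def flag_slowdowns(entries, cutoff=CUTOFF):
    """Return (name, ratio) pairs whose ratio exceeds the cutoff."""
    return [(name, ratio(a, b)) for name, a, b in entries if ratio(a, b) > cutoff]

if __name__ == "__main__":
    for name, a, b in rows:
        print(f"{ratio(a, b):5.2f}  {name}")
    print("flagged:", flag_slowdowns(rows))
```

With a ratio defined this way, values above 1 mean the first timing is slower than the second, and values below 1 mean it is faster. Many of the sub-microsecond CPU entries move by a single timer tick (for example 375 ns against 416 ns), so small deviations from 1 on those rows are likely noise.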