Commit
docs: update exporting_to_jax.md (#1107)
Showing 1 changed file with 2 additions and 8 deletions.
78ad9c9
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4291
ns4042
ns1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
3958
ns4042
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5125
ns5000
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4250
ns3917
ns1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
60770
ns60335
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10250
ns10292
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10125
ns9958
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10333
ns10917
ns0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10334
ns9917
ns1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
423675
ns425045
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1125
ns1250
ns0.90
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1166
ns1125
ns1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1229.5
ns1417
ns0.87
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1250
ns1083
ns1.15
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
17992
ns17905
ns1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4250
ns4083
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4000
ns4000
ns1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4167
ns4375
ns0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
3958
ns3916
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
109284
ns109347
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57417
ns56292
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
38208
ns46833
ns0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46375
ns46229.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
80167
ns81458
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36667.5
ns36705
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2021709
ns2055229.5
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2097000
ns2092146
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2077875
ns2088791.5
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2001000
ns2005459
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
195812
ns195507
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
145166.5
ns175854
ns0.83
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
142666
ns144666
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
146500
ns145708
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
144167
ns141167
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
165803
ns165651
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1104750
ns1150750
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1156062
ns1127354.5
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1104750
ns1114250
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1129458
ns1116458.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
527714
ns529529
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4000
ns3208
ns1.25
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3625
ns3417
ns1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4375
ns4208
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3459
ns3334
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
70555.5
ns70388
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9084
ns8875
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8709
ns9500
ns0.92
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9667
ns9750
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9167
ns9250
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
481518.5
ns494790
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
15416
ns15209
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
16958
ns15000
ns1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
16791.5
ns17209
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
14792
ns14688
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
54315.5
ns54580
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
213958
ns216291.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
214042
ns214167
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
214208
ns213416
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
214334
ns225708.5
ns0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
273628
ns274273
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns709
ns0.71
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
583
ns625
ns0.93
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
667
ns834
ns0.80
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583.5
ns625
ns0.93
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17264
ns17190
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1500
ns1709
ns0.88
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1625
ns1375
ns1.18
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1792
ns1833
ns0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1708
ns1584
ns1.08
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
102318
ns102235
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7000
ns7125
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5084
ns5958
ns0.85
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5958
ns5916
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
9916
ns9958
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23961
ns23722
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
221542
ns222833
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
229708.5
ns227958
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
229667
ns229500
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
226542
ns213417
ns1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
170388
ns169452
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3875
ns4000
ns0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3958
ns3916
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3958
ns3917
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3875
ns3916
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23385
ns23542
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16625
ns16834
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16500
ns16709
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17000
ns16959
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16833
ns16666
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
161544
ns162915
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
581791
ns571709
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
578709
ns574917
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
569958
ns573708
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
572333.5
ns568500
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113621
ns113185.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1428958
ns1427354.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1421292
ns1431625
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1415833
ns1423541
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1420000
ns1422542
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
210533
ns211963
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1081750
ns1046896
ns1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
938708
ns967000
ns0.97
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1353291.5
ns1344687.5
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1296666
ns1304958
ns0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA
269675
ns275060
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5971292
ns5993167
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4530771.5
ns4544458
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4949917
ns4946959
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5624041
ns5568042
ns1.01
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1072622
ns1091420
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns583
ns0.86
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
542
ns541
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23468
ns23913
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2125
ns2209
ns0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2084
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2208
ns2208
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2084
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
169303
ns169337.5
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
4167
ns4417
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4208
ns3833
ns1.10
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
4708
ns4625
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
4125
ns3958.5
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
66233.5
ns65443
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11125
ns11833
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11250
ns11250
ns1
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12000
ns11958
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10792
ns11125
ns0.97
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
452338
ns450871
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6292
ns7208
ns0.87
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6417
ns6958
ns0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7604.5
ns7417
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5833
ns6333
ns0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
52542
ns51992
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
18583
ns18459
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17500
ns17500
ns1
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
18833
ns17833
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16833
ns17459
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
301964.5
ns300918
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
542
ns667
ns0.81
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
583
ns542
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns625
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
583
ns625
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
32911
ns32212
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8625
ns9084
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8542
ns9437
ns0.91
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9125
ns9333
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8917
ns8958
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
160010
ns158990.5
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64500
ns64208
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64666
ns64833
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64500
ns64542
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64500
ns64542
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
112101
ns111823
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
279458
ns282708
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
288583
ns279000
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
273583
ns273166
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
286083
ns281437.5
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
185547.5
ns186218.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3376750.5
ns3136750
ns1.08
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
2898291.5
ns3023208
ns0.96
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3024854
ns3030188
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3941104
ns3954583.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
581323
ns576992
ns1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7603583
ns7597041.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7358750
ns7419792
ns0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7466208
ns7452395.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8146792
ns8186583.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1318419
ns1367306
ns0.96
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
17484792
ns17658333
ns0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17670999.5
ns17553062.5
ns1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17533250
ns17551250
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
9220187.5
ns14310208
ns0.64
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23603916
ns23729167
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
43639208
ns33388291
ns1.31
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37125083
ns37228104.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34980187.5
ns34843354
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1854234
ns1868338
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
188207417
ns192271333
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
251666438
ns232983250
ns1.08
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
194864208
ns191886562.5
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
434287708
ns435397084
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13931919
ns13905970
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
287943833
ns291433625
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
355406479.5
ns336814583
ns1.06
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
297803834
ns297436208
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
400767145.5
ns408923438
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
22458
ns22583
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22208
ns24708
ns0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25041
ns23209
ns1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22270.5
ns21625
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
96107.5
ns99141.5
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
113166.5
ns103334
ns1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
104292
ns103750
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
105083
ns105083
ns1
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103812.5
ns103062.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
502678.5
ns520213.5
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6833
ns6000
ns1.14
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6479.5
ns5958
ns1.09
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7041.5
ns6958
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5958
ns5708
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68593
ns69364
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15000
ns15042
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15479
ns15209
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16333
ns16250
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14708.5
ns15083
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
475032.5
ns484888
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3031167
ns3057208.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2061583
ns2066208
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2253209
ns2260437.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4505270.5
ns4508458
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
586394
ns589772
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23625708.5
ns23926959
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18333062.5
ns18026875
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
17998916.5
ns18022708
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35608125.5
ns35506041.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2764773.5
ns2765084
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33284000
ns33917958
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
28078500
ns27599646
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
28952938
ns28534208
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41446187.5
ns41643583.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
72167
ns74541.5
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
81083
ns74313
ns1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
86562.5
ns74500
ns1.16
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
75479
ns72291
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
104806
ns104269
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
223458.5
ns317750
ns0.70
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
325166
ns208562.5
ns1.56
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
320958
ns322375
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
210500
ns291583.5
ns0.72
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
552193
ns562266.5
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11917
ns11875
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12583
ns11625
ns1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12708
ns13250
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12083
ns12125
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
71752
ns72944
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26667
ns27208
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26583
ns26791.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
28000
ns27833.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26500
ns26750
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
476956.5
ns485353
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11667
ns13458.5
ns0.87
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12333
ns12375
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12917
ns13250
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11834
ns12291
ns0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
53475
ns54559.5
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25792
ns26417
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
25500
ns25959
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26500
ns26209
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26000
ns26000
ns1
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
305905.5
ns311166.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
181458
ns181458
ns1
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
180541
ns179708
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
184604.5
ns183437.5
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
179667
ns181354
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
57257.5
ns58673.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
592917
ns597521
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
587687.5
ns584083
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
595750
ns583958.5
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
582791.5
ns582625
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
291107
ns295518
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8958
ns6125
ns1.46
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6583
ns5958
ns1.10
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8042
ns7333
ns1.10
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6375
ns6166.5
ns1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
71199.5
ns71636.5
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
13916
ns15312.5
ns0.91
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14875
ns14333
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15459
ns15708
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13958.5
ns13958
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
465947
ns473061
ns0.98
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1219708
ns1205708
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1231750
ns1241125
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1269667
ns1286479
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1009666
ns1000208
ns1.01
batchedmm(512, Bsize=4)/forward/GPU/CUDA
300921
ns301351
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4103750
ns4319770.5
ns0.95
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4571833
ns4471334
ns1.02
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4574959
ns4578416
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
3707208
ns3698417
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1038858
ns1037486.5
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1834
ns1916
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1833
ns1792
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1875
ns1875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
ns1917
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23656
ns24166
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4875
ns4917
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4792
ns4834
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4917
ns5083
ns0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
190147.5
ns194650
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5375
ns6583
ns0.82
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5708.5
ns6208
ns0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6917
ns7125
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5437.5
ns5750
ns0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
56411.5
ns56615.5
ns1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10750
ns12209
ns0.88
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11000
ns10895.5
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
11834
ns11667
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10729.5
ns11000
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
336162
ns336343
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
333
ns375
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
292
ns250
ns1.17
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
375
ns292
ns1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
334
ns375
ns0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22819
ns23536
ns0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2750
ns3042
ns0.90
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2750
ns2791
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3042
ns3042
ns1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2792
ns2750
ns1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
159135.5
ns163558.5
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
11458
ns12083
ns0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
11333
ns11375
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
12750
ns14667
ns0.87
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
11208
ns11500
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
58102
ns58066.5
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24750
ns25541
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24334
ns24250
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25084
ns25125
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24750
ns24458
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
298883.5
ns299332
ns1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4209
ns4208
ns1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4209
ns4167
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4291
ns4209
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4167
ns4167
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
24823
ns25749
ns0.96
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16084
ns16125
ns1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
15959
ns16166
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16500
ns16291
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16167
ns16250
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
197271
ns200089.5
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5833
ns5916
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5791
ns5750
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5916
ns5959
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5833
ns5833
ns1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
34115
ns34238
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
20500
ns21125
ns0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
20417
ns20459
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21250
ns21167
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
20708
ns20812.5
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
178582.5
ns179917
ns0.99
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 423708.5 ns | 397270.5 ns | 1.07 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 366416.5 ns | 384187.5 ns | 0.95 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 484917 ns | 478583.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 103541 ns | 103333 ns | 1.00 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67022 ns | 67557 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 943375 ns | 891750 ns | 1.06 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 950687 ns | 972959 ns | 0.98 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1197916.5 ns | 1184041.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 330416.5 ns | 330499.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 193979 ns | 194177 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80541.5 ns | 79812.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81125 ns | 81209 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 81541.5 ns | 84042 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80479.5 ns | 79916.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194031 ns | 194547.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1919833 ns | 1931812.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1936958 ns | 1636646 ns | 1.18 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1930229 ns | 1918646 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1923250 ns | 1926062 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 400084 ns | 403673 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21834 ns | 22738 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1875 ns | 1833 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1750 ns | 1792 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1834 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 168563 ns | 173787 ns | 0.97 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6416 ns | 7041 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6166 ns | 6666 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7667 ns | 7666 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6709 ns | 6666 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61087.5 ns | 61338.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8959 ns | 9459 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8875 ns | 9208 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9250 ns | 9333 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9312.5 ns | 9500 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 309875.5 ns | 310208.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 118672458 ns | 155906937.5 ns | 0.76 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 182326458 ns | 174332958 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148081791.5 ns | 147872625 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 102035042 ns | 105277000 ns | 0.97 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5467326.5 ns | 5483548 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 610447729.5 ns | 669282000 ns | 0.91 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 582022188 ns | 555382333 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 452913708.5 ns | 453291791.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 751418979 ns | 761771979 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34971564 ns | 35124637 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 646694167 ns | 699486584 ns | 0.92 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 688250333 ns | 668241854.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 583281666.5 ns | 612942458.5 ns | 0.95 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 744581417 ns | 744149959 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59000 ns | 56292 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 37792 ns | 47709 ns | 0.79 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47750 ns | 47584 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83417 ns | 83167 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38231 ns | 37949 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1925854 ns | 1925646.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1987562.5 ns | 1981729 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1779021 ns | 1969458.5 ns | 0.90 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1864125 ns | 1898709 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 175192.5 ns | 177394.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 292250 ns | 269708.5 ns | 1.08 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 268916 ns | 268625 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 269500 ns | 287000 ns | 0.94 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 266000 ns | 267041 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 128884 ns | 125253 ns | 1.03 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 686771 ns | 681916.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 702187.5 ns | 693791 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 591083 ns | 682500 ns | 0.87 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 688958 ns | 685958 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 706872 ns | 675851.5 ns | 1.05 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2268958 ns | 2214458 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2245875 ns | 2234583 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2101125 ns | 2206291 ns | 0.95 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2176375 ns | 2191792 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133295.5 ns | 134149 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5521229.5 ns | 5560291 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5587167 ns | 5498375 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5520666.5 ns | 5509646 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5493834 ns | 5459417 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 748599 ns | 719852 ns | 1.04 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 642084 ns | 658625 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 648917 ns | 643125 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 636667 ns | 639042 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 635875 ns | 637666 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46696 ns | 47328 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1822625 ns | 1793125 ns | 1.02 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1670333 ns | 1725000 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1719875 ns | 1723687.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2097416.5 ns | 2098895.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 221082 ns | 225375 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57833 ns | 56875 ns | 1.02 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38500 ns | 47416 ns | 0.81 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46250 ns | 47125 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82750 ns | 83625 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28653 ns | 29103 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020167 ns | 2036833 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2105417 ns | 2094667 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2093958 ns | 2075625 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1999958.5 ns | 2003333 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190261 ns | 192100 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13356563 ns | 13402000 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12441584 ns | 12431542 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12535208 ns | 12506125 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15154375 ns | 14837542 ns | 1.02 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 512188.5 ns | 516101 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47248458 ns | 47711000 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 42098688 ns | 42011395.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40986395.5 ns | 40917708 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58394208 ns | 58129729.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2891115 ns | 2890593.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74033603.5 ns | 97106625 ns | 0.76 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68368417 ns | 68523125 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90690875 ns | 90562125 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76143146 ns | 76819625 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58250 ns | 57208 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38583 ns | 47750 ns | 0.81 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47625 ns | 47250 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79125 ns | 82041 ns | 0.96 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47024 ns | 47330 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1918250 ns | 1935667 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1983396 ns | 1983791 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1965584 ns | 1973041.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1830750 ns | 1878417 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 192100.5 ns | 195219 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 417 ns | 0.70 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 334 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32257 ns | 32542 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6083 ns | 6750 ns | 0.90 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6000 ns | 6125 ns | 0.98 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6416 ns | 6625 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6104.5 ns | 6375 ns | 0.96 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 172267 ns | 170591.5 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 334 ns | 0.75 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31372 ns | 32528 ns | 0.96 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 2958 ns | 0.89 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2625 ns | 2667 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2875 ns | 2917 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2666 ns | 2625 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 158332 ns | 158771.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 283213208 ns | 321043354.5 ns | 0.88 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 347751604 ns | 340532834 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314361479.5 ns | 314151312.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 273430250 ns | 270601541 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7090888 ns | 7107105.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 992205416 ns | 1046677708.5 ns | 0.95 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 964468250 ns | 945289167 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 838327667 ns | 840954313 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1152689375 ns | 1155312792 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34106482 ns | 34104665 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1303968312.5 ns | 1718615541 ns | 0.76 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1327504666.5 ns | 1335253333.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1629886334 ns | 1620256500 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1314925417 ns | 1333409458.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1455709 ns | 1460479.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1463125 ns | 1422584 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1415166.5 ns | 1418083.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1410000 ns | 1412208.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127607 ns | 127814.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5015979 ns | 5051916 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5060792 ns | 5033458.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5051500 ns | 5025999.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5009458 ns | 5025125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 574399.5 ns | 500081 ns | 1.15 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 170351312 ns | 162840083 ns | 1.05 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 167663375 ns | 128019708.5 ns | 1.31 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 130848583.5 ns | 130269666 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 167905166.5 ns | 152687687.5 ns | 1.10 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4881672 ns | 4884899 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 618588292 ns | 844540708 ns | 0.73 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 577882000 ns | 537349833 ns | 1.08 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 497505667 ns | 560583292 ns | 0.89 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 647917125 ns | 649437458 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16266169 ns | 17863022 ns | 0.91 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8910542 ns | 9095833.5 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 9026291.5 ns | 8979250 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7927084 ns | 7868500 ns | 1.01 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9711125 ns | 9713958 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1592738 ns | 1593097 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 35730646 ns | 37599479 ns | 0.95 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 38522375 ns | 37114520.5 ns | 1.04 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33553041 ns | 33537625 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37755625 ns | 37598895.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6512589 ns | 6454775 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47333 ns | 47417 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47333 ns | 47500 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47334 ns | 47666 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47875 ns | 47375 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18035 ns | 18487 ns | 0.98 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 52792 ns | 50417 ns | 1.05 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50292 ns | 50416 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50458 ns | 50500 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50667 ns | 50333.5 ns | 1.01 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 197012 ns | 161534 ns | 1.22 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 7854.5 ns | 0.81 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6250 ns | 6770.5 ns | 0.92 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7417 ns | 7729.5 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6750 ns | 7083 ns | 0.95 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 112280 ns | 73765 ns | 1.52 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9584 ns | 10209 ns | 0.94 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9458 ns | 10375 ns | 0.91 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10042 ns | 1.01 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10209 ns | 10000 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 615930.5 ns | 437389 ns | 1.41 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5416 ns | 6875 ns | 0.79 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5791 ns | 6458 ns | 0.90 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7146 ns | 8250 ns | 0.87 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5959 ns | 5583.5 ns | 1.07 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 123840 ns | 81756 ns | 1.51 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12583 ns | 13791 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12750 ns | 13500 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13208 ns | 13541 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12708 ns | 12895.5 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 529723.5 ns | 408231 ns | 1.30 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1125 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1000 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1083 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32491 ns | 32689 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8000 ns | 8500 ns | 0.94 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7750 ns | 7667 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8209 ns | 8000 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7959 ns | 8042 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 209838 ns | 192936.5 ns | 1.09 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23417 ns | 23459 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23041 ns | 23208 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23584 ns | 23375 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23417 ns | 23542 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18029 ns | 18259 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 54667 ns | 52750 ns | 1.04 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52417 ns | 52625 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52667 ns | 52791.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52458 ns | 52292 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 299710 ns | 228166 ns | 1.31 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1444833 ns | 1407187.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1449584 ns | 1444833 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1399209 ns | 1405083 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1396958.5 ns | 1396895.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195765 ns | 196465 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5000042 ns | 5040708 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5049833 ns | 5018541 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5044562 ns | 5002417 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5015291.5 ns | 5013625 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 612366.5 ns | 546168 ns | 1.12 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3043104 ns | 3079083 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2098583 ns | 2047000 ns | 1.03 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2313209 ns | 2294458.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4606709 ns | 4540917 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580804.5 ns | 582581 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24374458 ns | 24731020.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19110937.5 ns | 18912562.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18926833 ns | 19038249.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36250750 ns | 36828979.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2861963.5 ns | 2836262.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33972875 ns | 34546958.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28642167 ns | 28342834 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28092229 ns | 28021500.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41633541.5 ns | 41446459 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 141888875 ns | 144151542 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 146034209 ns | 148019541 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 126705062.5 ns | 125949729 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173781771 ns | 173005021 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22552094 ns | 22565027 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1227732750 ns | 948587416.5 ns | 1.29 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 839227916.5 ns | 1316893208.5 ns | 0.64 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 739276458 ns | 846166625 ns | 0.87 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 683957250 ns | 681952500 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 117875105 ns | 118678990 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73084 ns | 76499.5 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74479 ns | 80646 ns | 0.92 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75750 ns | 75541.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74958 ns | 72583 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 240665.5 ns | 219501.5 ns | 1.10 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 280208.5 ns | 295875 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 288959 ns | 203584 ns | 1.42 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 193791 ns | 292875 ns | 0.66 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 192583 ns | 288125 ns | 0.67 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1331151 ns | 1030687 ns | 1.29 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35557542 ns | 36242145.5 ns | 0.98 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36592625 ns | 36566979.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32410750 ns | 32367458.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40376458 ns | 40164416.5 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5838475 ns | 5846818 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148073500 ns | 152632458 ns | 0.97 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 158619999.5 ns | 152676896 ns | 1.04 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 139542333.5 ns | 139286062.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 282659625 ns | 283773000 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34873454 ns | 34916870 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120976041.5 ns | 156722375.5 ns | 0.77 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 182674416.5 ns | 173916792 ns | 1.05 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147566209 ns | 148066500 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105641958.5 ns | 102175416 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5456587 ns | 5486669 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 471084687.5 ns | 519305021 ns | 0.91 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 489605103.5 ns | 467283583 ns | 1.05 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 432706750 ns | 441689083 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 737367000 ns | 742430042 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32284178 ns | 32276395 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 707739104.5 ns | 688401084 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 677702687.5 ns | 657912104.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 572041062.5 ns | 573100917 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 735458208 ns | 731550292 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1303791.5 ns | 1195458.5 ns | 1.09 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 778750 ns | 988250 ns | 0.79 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 904854 ns | 987583 ns | 0.92 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 1945625 ns | 2066875 ns | 0.94 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 581135.5 ns | 585359 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2961271 ns | 2919770.5 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2515584 ns | 2614875 ns | 0.96 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2624334 ns | 2611792 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3695417 ns | 3691417 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1838423 ns | 1640515 ns | 1.12 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5788229.5 ns | 5907500 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5903625 ns | 5785541 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5805354.5 ns | 5799666 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2899667 ns | 2887792 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7167 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5250 ns | 6125 ns | 0.86 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6167 ns | 6167 ns | 1 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9916 ns | 10000 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25653 ns | 25666.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212479.5 ns | 213834 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 226833 ns | 221083 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220417 ns | 220958 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206167 ns | 209500 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 275653 ns | 224627 ns | 1.23 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 307447667 ns | 310292438 ns | 0.99 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 279760625 ns | 228430666 ns | 1.22 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 198268687.5 ns | 199615625 ns | 0.99 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 308090500 ns | 310121208 ns | 0.99 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7673335 ns | 7680035.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1074946146 ns | 1101205687.5 ns | 0.98 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 1069981500 ns | 904614354 ns | 1.18 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 801953875 ns | 806439375 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1147606167 ns | 1160007708.5 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26674789 ns | 26999631 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4958 ns | 6458 ns | 0.77 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5208 ns | 6041 ns | 0.86 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5958 ns | 6083 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5042 ns | 4895.5 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 169081.5 ns | 119636.5 ns | 1.41 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6833 ns | 7833 ns | 0.87 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6917 ns | 7292 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7500 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7125 ns | 7166 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 666084 ns | 510149.5 ns | 1.31 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 666 ns | 0.88 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 708 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24582 ns | 24235 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9125 ns | 9625 ns | 0.95 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 8459 ns | 9583 ns | 0.88 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9084 ns | 9291 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9041 ns | 9041 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 231180 ns | 191615 ns | 1.21 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 352416.5 ns | 352542 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351792 ns | 351541.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 354500 ns | 354208 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 352125 ns | 352041 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21300.5 ns | 21082 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 814416 ns | 827625 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 809021 ns | 774417 ns | 1.04 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 782042 ns | 830187 ns | 0.94 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 827334 ns | 822209 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 305499.5 ns | 224458.5 ns | 1.36 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 336479.5 ns | 315667 ns | 1.07 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 321125 ns | 337708 ns | 0.95 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 450500 ns | 448542 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10542 ns | 11375 ns | 0.93 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18195 ns | 18423 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 721208 ns | 705604.5 ns | 1.02 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 733229 ns | 738958.5 ns | 0.99 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1007271 ns | 999000 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26666 ns | 26459 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 274145 ns | 211965.5 ns | 1.29 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 383062 ns | 360167 ns | 1.06 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 329312 ns | 346666 ns | 0.95 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 442417 ns | 437417 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 30792 ns | 30125 ns | 1.02 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22813 ns | 22977 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 737625 ns | 727167 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 785604 ns | 782250 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1032042 ns | 1026667 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 105375 ns | 90000 ns | 1.17 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 222871.5 ns | 196309.5 ns | 1.14 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3625 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3417 ns | 3541 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3666 ns | 3625 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3583 ns | 3458 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17737 ns | 18016 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4417 ns | 4541 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4209 ns | 4375 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4292 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4292 ns | 4250 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 278790 ns | 210663.5 ns | 1.32 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3791 ns | 3750 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3604.5 ns | 3625 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4145.5 ns | 4500 ns | 0.92 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3666.5 ns | 3708 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 207112 ns | 158953 ns | 1.30 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8125 ns | 8750 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8167 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8542 ns | 8875 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8458 ns | 8375 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1220818 ns | 976072 ns | 1.25 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203687.5 ns | 203542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210041 ns | 211375 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210625 ns | 212125 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200708 ns | 200042 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34937 ns | 35273 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 645270.5 ns | 649750 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 631770.5 ns | 622083 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 622458 ns | 673000 ns | 0.92 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630750 ns | 628584 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 343085 ns | 286304.5 ns | 1.20 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1001750 ns | 1006541.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1034729 ns | 1012562.5 ns | 1.02 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 956333 ns | 950084 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 879958 ns | 867374.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207672.5 ns | 208692 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4524208 ns | 4662333 ns | 0.97 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4821708 ns | 4724042 ns | 1.02 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4482250 ns | 4460291 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 5132979 ns | 5133479.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 922465 ns | 931046 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3666 ns | 3750 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3292 ns | 3416 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 3417 ns | 3875 ns | 0.88 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3583 ns | 3000 ns | 1.19 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 232276 ns | 160179 ns | 1.45 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7292 ns | 7708 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6792 ns | 7000 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 7500 ns | 1 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 6917 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1014308 ns | 834512 ns | 1.22 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1651708 ns | 1638021 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1164875 ns | 1178750.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1344708 ns | 1368583 ns | 0.98 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2500875 ns | 2435458 ns | 1.03 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214937 ns | 212757 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12379084 ns | 12417125 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9615125.5 ns | 9573771 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9247041 ns | 9272896 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18054792 ns | 18032250 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1946109 ns | 1947684.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17413000 ns | 17407875.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14415146.5 ns | 14413792 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14339250 ns | 14355521 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21151646 ns | 21131291.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 134917 ns | 89791 ns | 1.50 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 88958 ns | 90333 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 91334 ns | 91667 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 87666 ns | 88604 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126488 ns | 125843 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2026792 ns | 2042500 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2043625 ns | 2024209 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1766792 ns | 2017334 ns | 0.88 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2026459 ns | 2030458 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1034650 ns | 851622 ns | 1.21 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 2770.5 ns | 1500 ns | 1.85 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 1334 ns | 2250 ns | 0.59 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3208 ns | 3833 ns | 0.84 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 3791 ns | 2250 ns | 1.68 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16389 ns | 15376 ns | 1.07 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2584 ns | 2916 ns | 0.89 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2459 ns | 2459 ns | 1 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2709 ns | 2791 ns | 0.97 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2791 ns | 2917 ns | 0.96 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 192723.5 ns | 153882 ns | 1.25 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7250 ns | 1 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5208 ns | 6000 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5959 ns | 6000 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9959 ns | 10000 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34193 ns | 33856.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225250 ns | 221791 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227063 ns | 220646 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220708 ns | 220479.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213333 ns | 241958.5 ns | 0.88 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 312634.5 ns | 266253.5 ns | 1.17 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3750 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3709 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22321 ns | 22475 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14417 ns | 14209 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14250 ns | 14333 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14416 ns | 14459 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14375 ns | 14417 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 475484 ns | 372905 ns | 1.28 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 134292 ns | 96208 ns | 1.40 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 93667 ns | 95604 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 94354.5 ns | 96583.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 91958 ns | 91812.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125921 ns | 125359 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924541.5 ns | 1942458 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1939333 ns | 1923146 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1709625 ns | 1909167 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1925042 ns | 1932625 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 949226.5 ns | 780596.5 ns | 1.22 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 874708 ns | 859584 ns | 1.02 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 796250 ns | 815917 ns | 0.98 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1220958 ns | 1209375 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 963208 ns | 960270.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 277966 ns | 271785 ns | 1.02 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2838542 ns | 2844229 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2538917 ns | 2490542 ns | 1.02 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3341125 ns | 3348000.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3415500 ns | 3404749.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1590492.5 ns | 1487247 ns | 1.07 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17646 ns | 17416.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16500 ns | 17416 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18042 ns | 18333 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17333 ns | 17604.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 142389.5 ns | 140524.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226250 ns | 261583 ns | 0.86 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 239208.5 ns | 215667 ns | 1.11 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215666.5 ns | 257416.5 ns | 0.84 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 227708 ns | 215792 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 648593.5 ns | 572039 ns | 1.13 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222666 ns | 222625 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 220083 ns | 222062.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222792 ns | 222146 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221875 ns | 220833 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 275688.5 ns | 232890 ns | 1.18 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 564542 ns | 507750 ns | 1.11 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 507292 ns | 501667 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 506333 ns | 556500 ns | 0.91 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 559542 ns | 507875 ns | 1.10 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1323540.5 ns | 1207816 ns | 1.10 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4229.5 ns | 4292 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 3958 ns | 4042 ns | 0.98 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 3916 ns | 4541.5 ns | 0.86 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 4333 ns | 3875 ns | 1.12 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16749 ns | 16753 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7187 ns | 7625 ns | 0.94 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 6917 ns | 7125 ns | 0.97 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7292 ns | 7167 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7416 ns | 7270.5 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 193558 ns | 176857.5 ns | 1.09 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19333.5 ns | 18542 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17167 ns | 16958 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19291 ns | 19792 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16959 ns | 16562.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 145420.5 ns | 145193.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223917 ns | 224708 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216437.5 ns | 211854 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215375 ns | 238145.5 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213812.5 ns | 212042 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 914033 ns | 888620 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4958 ns | 4917 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4250 ns | 4208 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4417 ns | 5042 ns | 0.88 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3917 ns | 3667 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 206416 ns | 184696.5 ns | 1.12 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10250 ns | 10875 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10000 ns | 10584 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10958 ns | 11042 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10000 ns | 10458 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1027488.5 ns | 966049 ns | 1.06 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3833 ns | 3645.5 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3459 ns | 3209 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 3416 ns | 3792 ns | 0.90 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3250 ns | 2833 ns | 1.15 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 236791.5 ns | 188943.5 ns | 1.25 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7417 ns | 8000 ns | 0.93 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7125 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7958 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7375 ns | 7208 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1067899 ns | 1007792.5 ns | 1.06 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23463750.5 ns | 24183291.5 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43484791.5 ns | 34946479 ns | 1.24 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37835875 ns | 37338083 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34880875 ns | 34888125 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1833754 ns | 1782868.5 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184463792 ns | 186454375 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 172964124.5 ns | 159896583 ns | 1.08 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146554521 ns | 145990104.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 410369375 ns | 411376042 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16525549 ns | 16457564 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 424815979 ns | 432652834 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 259769792 ns | 247809833 ns | 1.05 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 297288958 ns | 279749334 ns | 1.06 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 478383791 ns | 479974375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183959 ns | 183958.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183375 ns | 182375 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 186187.5 ns | 185500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183187.5 ns | 184062.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 205888.5 ns | 172057.5 ns | 1.20 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 602916.5 ns | 637709 ns | 0.95 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 596416.5 ns | 586041.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 592375 ns | 639084 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 596542 ns | 596416 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1054788 ns | 1002959 ns | 1.05 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3829562.5 ns | 4026750 ns | 0.95 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3998791.5 ns | 3920250 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3564812.5 ns | 3579209 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4550791.5 ns | 4570291.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 532059.5 ns | 532647 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17302667 ns | 17895041.5 ns | 0.97 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 18565313 ns | 17836083 ns | 1.04 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16600312.5 ns | 16489292 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 20208979.5 ns | 20147270.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2631431 ns | 2607011.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 33095 ns | 32522 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9083 ns | 9729 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9042 ns | 9291 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9458.5 ns | 9625 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9125 ns | 9209 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 266296 ns | 258262.5 ns | 1.03 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 498097750 ns | 503041917 ns | 0.99 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 506743916 ns | 424847437.5 ns | 1.19 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 424015542 ns | 425274250 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 594637416 ns | 682175395.5 ns | 0.87 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12483759 ns | 12478951 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1878936437.5 ns | 1889075833 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1662067875 ns | 1625727875 ns | 1.02 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1496755770.5 ns | 1494457604.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2214230167 ns | 2214128083.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49527395 ns | 49385566.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1663166 ns | 1647625 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1177833 ns | 1201312.5 ns | 0.98 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1370041 ns | 1376271 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2349521 ns | 2354042 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 217522 ns | 214603 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12726750 ns | 12810437.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 10036417 ns | 9968417 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9643083 ns | 9702395.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18397833 ns | 18320249.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2037123 ns | 2015837.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17723584 ns | 17772083 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14827916 ns | 14741771 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14555416.5 ns | 14583292 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21415041 ns | 21392208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26250 ns | 26292 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26291 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26209 ns | 26167 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23706 ns | 23824 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67354.5 ns | 67125 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66792 ns | 67500 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68375 ns | 67958 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66875 ns | 67000 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 393355.5 ns | 377030.5 ns | 1.04 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203458 ns | 204125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209417 ns | 209792 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210084 ns | 210750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199125 ns | 200292 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26245.5 ns | 26462 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 647916 ns | 650250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 672375.5 ns | 625708.5 ns | 1.07 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621792 ns | 669874.5 ns | 0.93 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 593542 ns | 629250 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 351878.5 ns | 303651 ns | 1.16 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 679750 ns | 627292 ns | 1.08 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 657291 ns | 671583 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 595709 ns | 598312 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 632771 ns | 639791 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131601.5 ns | 132031 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2238750 ns | 2336625 ns | 0.96 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2300791 ns | 2255375 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2241896 ns | 2235562.5 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2244958 ns | 2236583 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1242570.5 ns | 1129126 ns | 1.10 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18625 ns | 18437 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17979 ns | 18354.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18375 ns | 20062.5 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17104 ns | 17104.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 144244 ns | 144037 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 256458 ns | 265000 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 245646 ns | 230729 ns | 1.06 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221750 ns | 231875 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 230416 ns | 258875 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1056298 ns | 929000 ns | 1.14 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 708 ns | 0.88 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 666 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23741 ns | 23483 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9208 ns | 10125 ns | 0.91 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 9541 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9458 ns | 10208 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9333 ns | 9291 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 257592.5 ns | 253535 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5125 ns | 5958 ns | 0.86 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5500 ns | 5417 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6395.5 ns | 6166 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5458 ns | 4959 ns | 1.10 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 231821.5 ns | 177041 ns | 1.31 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6833 ns | 7770.5 ns | 0.88 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6792 ns | 7250 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7458 ns | 7875 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6917 ns | 6917 ns | 1 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 801589.5 ns | 724899.5 ns | 1.11 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2167 ns | 2334 ns | 0.93 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2000 ns | 2042 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2208 ns | 2208 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2375 ns | 2292 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17797 ns | 17786 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6375 ns | 6667 ns | 0.96 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6542 ns | 6958 ns | 0.94 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6667 ns | 6625 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6375 ns | 6500 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 330267.5 ns | 316064.5 ns | 1.04 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s)
748708
ns752459
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s)
756208
ns746750
ns1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s)
752750
ns750791
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s)
753542
ns746917
ns1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA
20724
ns21186
ns0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s)
792417
ns794041.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s)
796875
ns787583
ns1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s)
786834
ns810166.5
ns0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s)
808000
ns777749.5
ns1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA
297689.5
ns292715.5
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7250
ns7125
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5250
ns6000
ns0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6042
ns6042
ns1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10125
ns10125
ns1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
33074
ns33031.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
228604.5
ns260583
ns0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
251041
ns266771
ns0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
227708
ns240125
ns0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
226000
ns213791
ns1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
362298.5
ns347920
ns1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10209
ns10417
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
10209
ns10083
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
10458
ns10666
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
9750
ns9770.5
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
252317
ns236152.5
ns1.07
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
25334
ns24958
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24312.5
ns24125
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25959
ns25625
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24395.5
ns24625
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
1133104
ns1075060
ns1.05
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
106928354
ns106687708
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
126898666
ns118577083.5
ns1.07
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
121692334
ns120497312.5
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
117598792
ns118064771
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
2629460
ns2612121
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
390743083
ns394040917
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
379904750
ns367160584
ns1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
361277959
ns357048666
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
481946125
ns483172291
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
15184946
ns15226002.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
754771020.5
ns944093812.5
ns0.80
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
597861750
ns581088583
ns1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
748681771
ns744439291.5
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
760209125
ns770449312.5
ns0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6500
ns7833.5
ns0.83
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6667
ns6584
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8333
ns7667
ns1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6667
ns6584
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
239111
ns231298
ns1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14125
ns14833.5
ns0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14125
ns13833
ns1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14437.5
ns14333
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13667
ns13667
ns1
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
1073718
ns1030746
ns1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5542
ns6812.5
ns0.81
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5542
ns6250
ns0.89
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
6395.5
ns8625
ns0.74
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5792
ns5458
ns1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
235877.5
ns228035.5
ns1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12208
ns13541
ns0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
12542
ns12250
ns1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12750
ns13417
ns0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12166
ns12375
ns0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
781667
ns749909.5
ns1.04
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
5709
ns5562.5
ns1.03
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
5437.5
ns5917
ns0.92
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
5750
ns5959
ns0.96
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
5833
ns5625
ns1.04
batchedmm(2, Bsize=128)/forward/GPU/CUDA
16760
ns17374
ns0.96
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
15417
ns15979.5
ns0.96
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
15333
ns15291
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
15500
ns15666.5
ns0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
15625
ns15791
ns0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA
199275.5
ns198865.5
ns1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
375
ns416
ns0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
292
ns417
ns0.70
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
416
ns375
ns1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
333
ns334
ns1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
23515
ns23594
ns1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6333
ns6770.5
ns0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6167
ns6416
ns0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6417
ns6875
ns0.93
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6333
ns6416
ns0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
240257
ns238325.5
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5833
ns5958
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5875
ns5917
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6083
ns5958
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5875
ns5875
ns1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
24789
ns24848
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
20958
ns22291.5
ns0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
20958.5
ns21375
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21334
ns21750
ns0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21000
ns21833
ns0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
263523
ns262151
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
188417
ns145041
ns1.30
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
162166
ns179792
ns0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
146708.5
ns147000
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
149625
ns145833
ns1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
167166
ns167939
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1323812.5
ns1367292
ns0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1371958
ns1334375
ns1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1317937.5
ns1330499.5
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1325562.5
ns1319209
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1350174
ns1299116
ns1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
25292
ns23000
ns1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22500
ns24062.5
ns0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
23146.5
ns24875
ns0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
22979.5
ns21542
ns1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
352259
ns285873
ns1.23
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
173645.5
ns181458
ns0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
180041
ns142020.5
ns1.27
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
119500
ns130312
ns0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
126334
ns166291
ns0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1470411
ns1432985
ns1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
334
ns375
ns0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
416
ns375
ns1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
292
ns333
ns0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
23380
ns24013
ns0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6125
ns6667
ns0.92
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6229.5
ns6292
ns0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6708
ns6708
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6167
ns6208
ns0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
256300
ns257668.5
ns0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5084
ns4875
ns1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5083
ns4541
ns1.12
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5083
ns4917
ns1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4292
ns4334
ns0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
256465.5
ns248170.5
ns1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10209
ns10541
ns0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9750
ns9500
ns1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10750
ns10583
ns1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10208
ns10000
ns1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
1354750
ns1315251.5
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1625
ns1667
ns0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1583
ns1584
ns1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1708
ns1625
ns1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1625
ns1625
ns1
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA
22916
ns23770.5
ns0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
5750
ns6083
ns0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
5667
ns5625
ns1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
6167
ns6000
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
5750
ns5625
ns1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA
272343
ns277301
ns0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
6820375
ns6853687
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
6368417
ns6416292
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
6567000
ns6504750
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
7648166
ns7620312.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA
214879
ns214867.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
24083333.5
ns24153125
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
21351687.5
ns21320542
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
21140875
ns21047708.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
29752125.5
ns29760542
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
2100360
ns2095640.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
37299645.5
ns48863062.5
ns0.76
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
34217771
ns34327709
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
45700125
ns45697437.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
38021000
ns38239917
ns0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5750
ns6708
ns0.86
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5583.5
ns5666
ns0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6395.5
ns6459
ns0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5292
ns5770.5
ns0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
235350
ns232386
ns1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8167
ns9062.5
ns0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8416.5
ns8375
ns1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8542
ns8375
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8500
ns8291
ns1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
1060836
ns1027676.5
ns1.03
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s)
1566292
ns1539229.5
ns1.02
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s)
1237250
ns1264500
ns0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s)
1619208
ns1616916
ns1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s)
2132958
ns2152625
ns0.99
lenet(28, 28, 1, 128)/forward/GPU/CUDA
278998
ns281859
ns0.99
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s)
7937625
ns7990000
ns0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s)
6656917
ns6612375
ns1.01
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s)
7130604.5
ns7167458
ns0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s)
10453333.5
ns10472916.5
ns1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA
1878437
ns1870517
ns1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s)
370292
ns359666
ns1.03
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s)
353124.5
ns372896
ns0.95
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s)
459083
ns456458
ns1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s)
23666
ns22396
ns1.06
batchedmm(128, Bsize=4)/forward/GPU/CUDA
42541.5
ns47625
ns0.89
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s)
753083
ns739666
ns1.02
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)
809125
ns822937.5
ns0.98
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)
1063125
ns1053333
ns1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)
116979.5
ns109291
ns1.07
batchedmm(128, Bsize=4)/zygote/GPU/CUDA
239130.5
ns240230
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s)
397291
ns396792
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)
212417
ns288042
ns0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s)
288125
ns287917
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s)
752000
ns755250
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA
44180
ns45350
ns0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
667583
ns639083
ns1.04
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
474167
ns531000
ns0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
531812.5
ns531625
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
973083
ns973083
ns1
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA
194058
ns194303
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
678250
ns636645.5
ns1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
667145.5
ns636021
ns1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
621709
ns652063
ns0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
646959
ns654042
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
133035
ns133147
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2484229
ns2499458
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2543916.5
ns2456708
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2480312.5
ns2459542
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2471875
ns2452854
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1215811
ns1214588
ns1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)
2791
ns2209
ns1.26
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s)
2084
ns3041
ns0.69
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s)
4333
ns4667
ns0.93
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)
3354
ns2792
ns1.20
batchedmm(2, Bsize=32)/forward/GPU/CUDA
16281.5
ns16731
ns0.97
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s)
5375
ns5625
ns0.96
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s)
5209
ns5333
ns0.98
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s)
5500
ns5625
ns0.98
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s)
5584
ns5584
ns1
batchedmm(2, Bsize=32)/zygote/GPU/CUDA
201076.5
ns199833.5
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1457583
ns1461916.5
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1497084
ns1505708
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1498833
ns1503458
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1436500
ns1437083
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
41204
ns41276
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5117834
ns5154479
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5304542
ns5307146
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5300500
ns5288209
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4807333
ns5001917
ns0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
199725
ns200453
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3708
ns3750
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3709
ns3708
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3709
ns3708
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3708
ns3667
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
32858
ns34571
ns0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15250
ns15250
ns1
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15000
ns15250
ns0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
15292
ns15375
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15083
ns15125
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
377713
ns372573
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
70792
ns71375
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
71417
ns71583
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
71125
ns71208
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
70000
ns71250
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
113374.5
ns114012
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
318333
ns325917
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
334916
ns325167
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
318083
ns318375
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
318209
ns317750
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
193117.5
ns199225
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
1000
ns1083
ns0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
1000
ns1000
ns1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
1084
ns1083
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
959
ns1000
ns0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
23866.5
ns24050
ns0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7833
ns8584
ns0.91
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
7875
ns8084
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8125
ns8375
ns0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
7875
ns8000
ns0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
261797
ns262017.5
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
512646
ns497584
ns1.03
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
479541
ns490208.5
ns0.98
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
566104
ns559959
ns1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
216667
ns148209
ns1.46
batchedmm(128, Bsize=32)/forward/GPU/CUDA
130101
ns129838.5
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
1405541
ns1405375
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1481750
ns1471875
ns1.01
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1758666
ns1758791.5
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
872625
ns869583
ns1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
274250.5
ns274551
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
375
ns417
ns0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
333
ns334
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
417
ns375
ns1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
292
ns333
ns0.88
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
31596
ns32490
ns0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6375
ns6875
ns0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
5854.5
ns6208
ns0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6500
ns6541
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6042
ns6333
ns0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
263141.5
ns265808
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1731916.5
ns1723271
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1768000
ns1751146
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1725583
ns1734270.5
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1724459
ns1724083
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168363
ns169537.5
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4401542
ns4419270.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4406313
ns4365292
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4361083
ns4351792
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4360083
ns4357792
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1173884.5
ns1171701
ns1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
6583
ns6833.5
ns0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
6791
ns7062.5
ns0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
7062.5
ns7833
ns0.90
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
6791
ns6833
ns0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
20597
ns20938
ns0.98
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
32792
ns72249.5
ns0.45
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
62083
ns51291.5
ns1.21
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
33292
ns52875
ns0.63
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
51084
ns51333
ns1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
293465.5
ns211685.5
ns1.39
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
18000
ns17709
ns1.02
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
17458
ns18333
ns0.95
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
17916
ns18312.5
ns0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
18042
ns17625
ns1.02
batchedmm(2, Bsize=512)/forward/GPU/CUDA
18220
ns18852
ns0.97
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
53250
ns53583
ns0.99
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
53292
ns52958
ns1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
53583
ns53625
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
53416.5
ns53417
ns1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
340467.5
ns337333.5
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
75333
ns75417
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
75417
ns75375
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
75292
ns75334
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
74833
ns75333
ns0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
46370
ns47609
ns0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
324292
ns339875
ns0.95
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
342291.5
ns332958
ns1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
336708
ns325791
ns1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
324667
ns324042
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
208689
ns215842
ns0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1483500
ns1486125
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1520542
ns1530958
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1528333
ns1527584
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1461958
ns1463416
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
51330
ns52815
ns0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5116916.5
ns5149209
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5306417
ns5312291.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4956417
ns5298250
ns0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4985125.5
ns4995000
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
204511
ns207728
ns0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
28250
ns28250
ns1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
28250
ns28208
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
28292
ns28291
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
28167
ns28292
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA
24159
ns24971.5
ns0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
66584
ns66292
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
66208
ns66375
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
67583
ns66209
ns1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
66208
ns66500
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA
518001
ns510271
ns1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)
1500667
ns1349333.5
ns1.11
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)
935916
ns1135833
ns0.82
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s)
1063395.5
ns1132458
ns0.94
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s)
2253583
ns2196062.5
ns1.03
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA
585024
ns589889
ns0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s)
3089125
ns3042333
ns1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s)
2661333
ns2731792
ns0.97
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s)
2581104
ns2726167
ns0.95
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s)
3818625
ns3811625
ns1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA
1992242
ns2004374
ns0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s)
7906625
ns8038292
ns0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s)
8031000
ns7942499.5
ns1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s)
7927541.5
ns7931979.5
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)
4820333
ns4817250
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
134041
ns80499.5
ns1.67
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
81459
ns82250
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
82833
ns82500
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
81833
ns80479.5
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
194356
ns194209
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2010167
ns2050042
ns0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2043167
ns2034333.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2009750
ns2017875
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2026792
ns2018854
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
794414
ns768336
ns1.03
This comment was automatically generated by workflow using github-action-benchmark.