Commit
Showing 2 changed files with 11 additions and 8 deletions.
fb901ea
Lux Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4125
ns3875
ns1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4083.5
ns4208
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5167
ns5250
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4250
ns4333
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
60836
ns61892.5
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10458
ns10542
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10208.5
ns10209
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10333
ns10459
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10292
ns10417
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
426426
ns433097
ns0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1000
ns1084
ns0.92
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1291
ns1291
ns1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1437.5
ns1292
ns1.11
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1208
ns1209
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
17928
ns18531
ns0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4125
ns4167
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4084
ns3917
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4167
ns4250
ns0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
3958
ns4083
ns0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
109688.5
ns111975
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57625
ns57583
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
38333
ns46292
ns0.83
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
46792
ns38042
ns1.23
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
81167
ns83125
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37191
ns37370
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2025916.5
ns2031625
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2084833.5
ns2085958
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2091333
ns2088333.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1993604
ns2005041
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
194623
ns198108
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
144416
ns143750
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
147520.5
ns146063
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
144062.5
ns145209
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
144041
ns144583.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
165620
ns166112.5
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1116375.5
ns1118042
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1135458
ns1114250
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1116021
ns1153000
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1117250
ns1068770.5
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
525200
ns533468
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3583
ns3584
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3416
ns3750
ns0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
4417
ns4417
ns1
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3750
ns3958
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
67680
ns72081
ns0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
9083
ns9000
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9042
ns8542
ns1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9291
ns9041
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8750
ns8916
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
488913
ns503190.5
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
16583.5
ns15000
ns1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
15000
ns15250
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
16937.5
ns16708
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
14521
ns15542
ns0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
55104
ns55903
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
215166.5
ns214187.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
213375
ns213604.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
212833
ns215395.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
213208
ns212917
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
272083
ns278881
ns0.98
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
542
ns500
ns1.08
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
625
ns542
ns1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
687.5
ns750
ns0.92
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
542
ns583
ns0.93
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17338
ns17733
ns0.98
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1583
ns1625
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1666
ns1500
ns1.11
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1708
ns1625
ns1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1625
ns1583
ns1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
102756.5
ns105125.5
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7083
ns7250
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5292
ns5833
ns0.91
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5875
ns5250
ns1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10083
ns10084
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23408
ns24106
ns0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
221750
ns220750
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
231917
ns228084
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228875
ns230459
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
214167
ns213708.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
169815.5
ns169707.5
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3875
ns3875
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3917
ns3875
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23411
ns23637
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16583.5
ns16708
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16459
ns16834
ns0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16709
ns16875
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16791
ns16625
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
162393
ns161602
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
569208
ns578416.5
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
569667
ns569958
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
570125
ns579292
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
578750
ns578291
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113197
ns113009
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1418708
ns1417979.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1421583
ns1419167
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1420834
ns1424875
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1432291
ns1426416
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
211123.5
ns210883
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1076625
ns1067000
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
938625
ns958417
ns0.98
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1353166
ns1336917
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1298500
ns1304396
ns1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA
277930.5
ns271759
ns1.02
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5845333
ns5795104.5
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4593146
ns4601125
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4960354
ns4929084
ns1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5524145.5
ns5750083
ns0.96
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1090079
ns1068932
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns542
ns0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23601.5
ns23274
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2083
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2125
ns2166
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2209
ns2167
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2083
ns2208
ns0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
169946.5
ns171283
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
3666
ns4333
ns0.85
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
4417
ns4125
ns1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
4709
ns5083
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
4500
ns4292
ns1.05
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
65407
ns66130
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
10834
ns11625
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11292
ns11458
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
11667
ns12458
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10958
ns11709
ns0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
453534
ns452684.5
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6167
ns6375
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7479.5
ns6959
ns1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8500
ns8229.5
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6375
ns6916
ns0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
52550.5
ns52019
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16583
ns16875
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17500
ns17000
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
19833
ns18166
ns1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
16625
ns17542
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
303262
ns301500.5
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
542
ns542
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
667
ns542
ns1.23
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
625
ns666
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
542
ns667
ns0.81
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
31843
ns32512
ns0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8542
ns8500
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8875
ns8750
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9250
ns9500
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
8208
ns8959
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
159642
ns157915
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64792
ns64542
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64542
ns64625
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64542
ns64750
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64375
ns64875
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111120
ns111658.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
280042
ns279708
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
291791
ns283750
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
279250
ns293250
ns0.95
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
277208
ns284521
ns0.97
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
184735.5
ns185586.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3278875
ns3282500
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
2813375
ns3076875
ns0.91
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
3029687.5
ns2795834
ns1.08
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
3938209
ns4063541.5
ns0.97
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
578907.5
ns567714
ns1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7620083
ns7638583
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7352417
ns7366000
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7457271
ns7289042
ns1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8189500
ns8172916
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1328385
ns1335450
ns0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
17561125
ns17555833
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
17648625
ns17413291.5
ns1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
17534459
ns17640417
ns0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
14095167
ns14085667
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23588417
ns23644667
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
44459541
ns33391375
ns1.33
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
37064416.5
ns40912708
ns0.91
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34977333.5
ns35048479
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1845684
ns1855237.5
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
189659041
ns189754584
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
250146875
ns232353000
ns1.08
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
193409375
ns201284750
ns0.96
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
434181959
ns435226125
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
18049039.5
ns13860033
ns1.30
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
290672125
ns290571042
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
356317062.5
ns334832916
ns1.06
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
296289666.5
ns303703583
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
392800437.5
ns393811604
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
22875
ns21541
ns1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
22938
ns22375
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24562.5
ns23354
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24416
ns24500
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
96194.5
ns95582
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103875
ns103250
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
103416
ns115312.5
ns0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104292
ns104625
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
103125.5
ns102667
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
506291.5
ns503695.5
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5917
ns5750
ns1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
6000
ns5791
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
6584
ns7666
ns0.86
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6209
ns6250
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68552.5
ns68642
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15166.5
ns14875
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15500
ns14625
ns1.06
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15542
ns16250
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14958
ns14833
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
480464
ns478112.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
2996875
ns3019792
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2072750
ns2069896
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2257667
ns2279000
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4838583
ns4750917
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
584192
ns583001
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23549437
ns23604770.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
18342167
ns18003875
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
17896791
ns18293125
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35570625
ns35919729.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
2764116
ns3106744
ns0.89
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33587937.5
ns33297687
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
28029333
ns27474958
ns1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
28377209
ns29070229.5
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
41334187.5
ns41830959
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
75479
ns73396
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
73958.5
ns75125
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
74125
ns74875
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
72166
ns72959
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
104339
ns103514
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
203458.5
ns274208
ns0.74
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
280916.5
ns205959
ns1.36
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
209583
ns255333
ns0.82
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
216291.5
ns296916
ns0.73
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
562778.5
ns554316
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11708
ns11167
ns1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12833
ns11875
ns1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13042
ns13458
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11917
ns12458
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
72705
ns72256.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26645.5
ns26583.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26458
ns26833
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27458
ns28084
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26792
ns26708
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
488247
ns483481.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12000
ns11520.5
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
13750
ns13041
ns1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
14000
ns13750
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
12500
ns12875
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
55166
ns52959.5
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
25583
ns25500
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26416
ns25542
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26375
ns26375
ns1
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
28167
ns26542
ns1.06
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
313572.5
ns310926
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
181541.5
ns179125
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
181104
ns182625
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
181895.5
ns183958
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
181916
ns182416
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
59339.5
ns58111
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
612417
ns582958
ns1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
590459
ns583209
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
583541
ns610042
ns0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
582416
ns582000
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
294347
ns286370
ns1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5854.5
ns5729.5
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7000
ns6334
ns1.11
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7167
ns7500
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6042
ns6083
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
72861
ns71136.5
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14208.5
ns14167
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14333
ns14500
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15084
ns15667
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14208
ns14667
ns0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
476457
ns468005
ns1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1198334
ns1186749.5
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1236458
ns1247334
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1270167
ns1282666.5
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1009834
ns841729
ns1.20
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301349
ns301667
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4121104
ns4101771
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4571459
ns4417458
ns1.03
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4583146
ns4790916
ns0.96
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
3708333
ns3731833.5
ns0.99
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1054428
ns1043818
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1833
ns1792
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1875
ns1792
ns1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1875
ns1875
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1875
ns1834
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
24401
ns23460
ns1.04
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4792
ns4875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4875
ns4834
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
5042
ns4917
ns1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4875
ns4958
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
192852.5
ns189873
ns1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5916.5
ns5792
ns1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6625
ns6125
ns1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7625
ns7187.5
ns1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
5916
ns6208
ns0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
57663
ns55970.5
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10562.5
ns10625
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
11417
ns11083
ns1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
12083
ns11584
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10459
ns11500
ns0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
339260
ns332298.5
ns1.02
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
333
ns292
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
375
ns292
ns1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
375
ns375
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
292
ns292
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
23460
ns22660
ns1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2791
ns2708
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2792
ns2750
ns1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2709
ns3000
ns0.90
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2791
ns2709
ns1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
162941.5
ns159360
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
11542
ns11292
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
12209
ns11792
ns1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
13875
ns13250
ns1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
11583
ns12229.5
ns0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
59011.5
ns57130.5
ns1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24375
ns24708
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24583
ns24167
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25208
ns25854
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24792
ns24916.5
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
303188
ns300198
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4167
ns4208
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4208
ns4125
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4208
ns4250
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4208
ns4208
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
25111
ns24574
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16042
ns16166
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
15917
ns16000
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16291
ns16042
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16291
ns16375
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
202144.5
ns201392
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5875
ns5750
ns1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5833
ns5750
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5916
ns5875
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5750
ns5916
ns0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
34056
ns33153
ns1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
20520.5
ns20333
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
21000
ns20792
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21167
ns20917
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
21333
ns21375
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
179609.5
ns175780
ns1.02
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 425458.5 ns | 417417 ns | 1.02 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 364854.5 ns | 378854.5 ns | 0.96 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 482520.5 ns | 487270.5 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 103125 ns | 103917 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67737 ns | 66399.5 ns | 1.02 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 906625 ns | 877583 ns | 1.03 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 982042 ns | 949562.5 ns | 1.03 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1181333 ns | 1206625 ns | 0.98 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 377458 ns | 469167 ns | 0.80 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 194135 ns | 191112 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 81333 ns | 85417 ns | 0.95 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82041 ns | 81083 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84291 ns | 84625 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81813 ns | 85417 ns | 0.96 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194522 ns | 193239.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1927625 ns | 1913750 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1941000 ns | 1913542 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1930917 ns | 1943083.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1842062 ns | 1906896 ns | 0.97 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 390656 ns | 406558 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22388 ns | 22047.5 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1875 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1875 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 171479 ns | 171306.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6542 ns | 6209 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7083.5 ns | 6625 ns | 1.07 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8020.5 ns | 8542 ns | 0.94 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6500 ns | 7125 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60274 ns | 60422 ns | 1.00 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 9000 ns | 0.99 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9417 ns | 8958 ns | 1.05 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9916 ns | 9584 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9208 ns | 9416 ns | 0.98 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 311149 ns | 313100.5 ns | 0.99 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120884833.5 ns | 119013624.5 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181722750 ns | 174073709 ns | 1.04 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148231625 ns | 154836458 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 108144417 ns | 106465208 ns | 1.02 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5478841 ns | 5473107.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 615355583.5 ns | 615549000 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 581447666.5 ns | 555627500 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 451634708.5 ns | 469486625 ns | 0.96 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 757933250.5 ns | 758488604 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34994190 ns | 34956527 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 649420209 ns | 650955333 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 687787021 ns | 665997520.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 584232000.5 ns | 596311875 ns | 0.98 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 744942000 ns | 746344250 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59500 ns | 59041 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39125 ns | 47750 ns | 0.82 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 48020.5 ns | 39041 ns | 1.23 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83458 ns | 84708.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38331 ns | 36941 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1946625 ns | 1922166 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1985458 ns | 1978041 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1983521 ns | 1990167 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1887334 ns | 1920167 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 176268 ns | 173728 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 265750 ns | 282041.5 ns | 0.94 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 268104.5 ns | 266458 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 269291.5 ns | 273853.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 265125 ns | 270333 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 125359 ns | 135453.5 ns | 0.93 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 690208 ns | 674666 ns | 1.02 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 658417 ns | 684354 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 603125 ns | 676145.5 ns | 0.89 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 594458 ns | 596375 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 701612 ns | 752272.5 ns | 0.93 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2169417 ns | 2253417 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2237833 ns | 2217895.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2188625 ns | 2190479 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2203000 ns | 2202416.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133751 ns | 133169 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5513083.5 ns | 5479500 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5572520.5 ns | 5506916 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5508208 ns | 5588312.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5485271 ns | 5564021 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 720574 ns | 794371.5 ns | 0.91 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 638458 ns | 646958 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 640250 ns | 656500 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 640416 ns | 640416 ns | 1 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 642666.5 ns | 657291 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46893.5 ns | 47817 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1824209 ns | 1822375 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1666417 ns | 1719708 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1728208 ns | 1665541 ns | 1.04 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2102708 ns | 2108083 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 220656.5 ns | 227850 ns | 0.97 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58500 ns | 58458 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38584 ns | 45083 ns | 0.86 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46208 ns | 38041 ns | 1.21 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83042 ns | 84958 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28530.5 ns | 28842 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2056084 ns | 2030375 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2102729.5 ns | 2084312.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2102270.5 ns | 1787459 ns | 1.18 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1992792 ns | 2014583.5 ns | 0.99 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 189031.5 ns | 192397.5 ns | 0.98 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13396167 ns | 13382625 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12488625 ns | 12433458.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12567208 ns | 12571375 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 14924083 ns | 15143562.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 512412.5 ns | 514602 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47267416.5 ns | 47546916 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 42078000 ns | 41875708 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 40824125 ns | 41161020.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58451854 ns | 58396167 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2895350 ns | 3251545 ns | 0.89 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74360062.5 ns | 75047125 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91413375 ns | 67897459 ns | 1.35 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90659959 ns | 90940166.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76716041 ns | 99460667 ns | 0.77 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59208 ns | 58750 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38833 ns | 46875 ns | 0.83 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47125 ns | 38333 ns | 1.23 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 78625 ns | 80334 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48139.5 ns | 46475 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1938145.5 ns | 1921416 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1984167 ns | 1976416 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1977812.5 ns | 1721708.5 ns | 1.15 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1877083 ns | 1905000 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195830.5 ns | 190253.5 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 333 ns | 1.13 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 417 ns | 0.80 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32688 ns | 31709.5 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6083 ns | 6125 ns | 0.99 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6334 ns | 6208 ns | 1.02 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6666 ns | 6583 ns | 1.01 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6000 ns | 6854.5 ns | 0.88 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 173538 ns | 176344 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 291 ns | 1.14 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32105 ns | 31144 ns | 1.03 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2584 ns | 2625 ns | 0.98 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 2625 ns | 1.06 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2791 ns | 2833 ns | 0.99 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2584 ns | 2750 ns | 0.94 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 160748.5 ns | 164923.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 287049250 ns | 285479083.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 347795687.5 ns | 340672292 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314367979.5 ns | 320528833.5 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 271524458 ns | 267627833 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7120410.5 ns | 7061953.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1003307875 ns | 1000752000 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 964885125 ns | 941508917 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 835293000 ns | 849741542 ns | 0.98 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1152976875 ns | 1162624583 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34058870 ns | 33972568.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1312833396 ns | 1314224145.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1706336084 ns | 1312834041.5 ns | 1.30 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1599191959 ns | 1621294583 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1309056604.5 ns | 1681368042 ns | 0.78 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1408791 ns | 1461562.5 ns | 0.96 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1452791.5 ns | 1416958 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1449625 ns | 1414750 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1407209 ns | 1412375 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128282.5 ns | 127713.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5034917 ns | 5020125 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5065916.5 ns | 5027042 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5035937.5 ns | 4740833 ns | 1.06 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5012729 ns | 5044042 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 483777.5 ns | 510137 ns | 0.95 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 171224875 ns | 171071812.5 ns | 1.00 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 167755167 ns | 126739625 ns | 1.32 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 128923708 ns | 146147041 ns | 0.88 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 154904187 ns | 168329334 ns | 0.92 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4889428.5 ns | 4881506 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 621337542 ns | 622612209 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 581831583 ns | 538980667 ns | 1.08 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 460212833 ns | 504257334 ns | 0.91 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 643084792 ns | 656863250 ns | 0.98 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16318390 ns | 16684647 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8919875 ns | 8964583 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 9050687.5 ns | 8900333 ns | 1.02 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7921583 ns | 7993333 ns | 0.99 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9747084 ns | 9790312.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1600463.5 ns | 1594468.5 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36566209 ns | 36115750.5 ns | 1.01 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 38511167 ns | 36971083.5 ns | 1.04 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33595375 ns | 34444208 ns | 0.98 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37796583 ns | 37794834 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6471792 ns | 6465190.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47291 ns | 47292 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47479.5 ns | 47542 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47729.5 ns | 47584 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47334 ns | 47500 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18559 ns | 18793 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50416 ns | 50291.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50417 ns | 50417 ns | 1 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50417 ns | 50833 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50375 ns | 50750 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 167009.5 ns | 231220 ns | 0.72 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6459 ns | 6291 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7770.5 ns | 7084 ns | 1.10 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8041 ns | 7792 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7000 ns | 7542 ns | 0.93 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 76373.5 ns | 106604.5 ns | 0.72 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10000 ns | 10209 ns | 0.98 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10458 ns | 9833 ns | 1.06 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10270.5 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10084 ns | 10459 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 456260 ns | 619990 ns | 0.74 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5708 ns | 5792 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6708 ns | 6416 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7458 ns | 7958 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5917 ns | 6042 ns | 0.98 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 91945.5 ns | 121725 ns | 0.76 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12917 ns | 13375 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13625 ns | 13000 ns | 1.05 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13416 ns | 13584 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13292 ns | 13375 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 417439.5 ns | 528027 ns | 0.79 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 1000 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 959 ns | 1.09 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1125 ns | 0.93 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32442 ns | 31705 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7542 ns | 7792 ns | 0.97 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7875 ns | 7667 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8291 ns | 8209 ns | 1.01 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7834 ns | 8666 ns | 0.90 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 192614 ns | 204125.5 ns | 0.94 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23250 ns | 23000 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23250 ns | 23084 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23416 ns | 23584 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23292 ns | 23500 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18706.5 ns | 18461 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52417 ns | 52458 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52625 ns | 52291 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52959 ns | 52791 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52875 ns | 52458 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 226057.5 ns | 286087.5 ns | 0.79 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1403937.5 ns | 1397209 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1409291.5 ns | 1395917 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1405208 ns | 1400209 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1402896 ns | 1398500 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196688.5 ns | 195540.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5027625 ns | 5008458.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5036500.5 ns | 5018750 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5008875 ns | 4722750 ns | 1.06 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5003083.5 ns | 4703042 ns | 1.06 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 565308 ns | 626852.5 ns | 0.90 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3058166 ns | 3063416 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2060229 ns | 2063875 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2301833 ns | 2311417 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4897625 ns | 4823500 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 586278 ns | 580360 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24473708.5 ns | 24332959 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 19098958 ns | 18875458 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18981042 ns | 18989334 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 37019125 ns | 36748479.5 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2831934 ns | 3188758 ns | 0.89 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34098417 ns | 34048562.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28724166.5 ns | 28257854 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28239458 ns | 28468541.5 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41378063 ns | 41851021 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 146235958 ns | 144123292 ns | 1.01 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 147965500 ns | 147912291 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 127304667 ns | 128219729 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 172673353.5 ns | 175666645.5 ns | 0.98 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22564119 ns | 22797470 ns | 0.99 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1235304437.5 ns | 1274551333 ns | 0.97 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 869077229.5 ns | 1209986250 ns | 0.72 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 769904041 ns | 717258459 ns | 1.07 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 666199333 ns | 669341542 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118146881 ns | 118134658 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73812 ns | 75042 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73875 ns | 73833 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75687.5 ns | 75813 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76416 ns | 74125 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 208579 ns | 248024.5 ns | 0.84 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 295500 ns | 202750 ns | 1.46 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 193958 ns | 283250 ns | 0.68 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 287395.5 ns | 194000 ns | 1.48 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 282729 ns | 189583 ns | 1.49 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1165959 ns | 1272660.5 ns | 0.92 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35776083 ns | 35542000 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36529041 ns | 36428479 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32581292 ns | 32734792 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40338396 ns | 40941958 ns | 0.99 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5849817 ns | 5852888 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148302541 ns | 147574354 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 158881084 ns | 154842271 ns | 1.03 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 138956354.5 ns | 142249771 ns | 0.98 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 284123584 ns | 285430916 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34596502 ns | 34907859 ns | 0.99 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120211625 ns | 119543458.5 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 182136458 ns | 173916625 ns | 1.05 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148062084 ns | 155928584 ns | 0.95 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105814875 ns | 103545938 ns | 1.02 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5475710.5 ns | 5470774 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 469150645.5 ns | 471171395.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 486184250 ns | 467366000 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 437949792 ns | 456719729 ns | 0.96 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 739059333 ns | 738831458 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32333012 ns | 32277660 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 712730687.5 ns | 709159062 ns | 1.01 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 678064125 ns | 654555208.5 ns | 1.04 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 570651646 ns | 585803354.5 ns | 0.97 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 732192500 ns | 726547959 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1338854 ns | 1242646 ns | 1.08 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 764333 ns | 968625.5 ns | 0.79 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 971166 ns | 674709 ns | 1.44 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2047291 ns | 1941770.5 ns | 1.05 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 582645.5 ns | 569058 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2995792 ns | 2969916 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2516000 ns | 2603708 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2623541.5 ns | 1985166.5 ns | 1.32 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3683208 ns | 3729625 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1752698 ns | 1762089 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5821709 ns | 5801458 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5892750 ns | 5780958 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5806979 ns | 5645834 ns | 1.03 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2887229 ns | 2921042 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7500 ns | 7250 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 5958 ns | 0.90 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 5333 ns | 1.13 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10041 ns | 10083 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25775 ns | 25119 ns | 1.03 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225958.5 ns | 215750 ns | 1.05 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220750 ns | 258458 ns | 0.85 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220625 ns | 221291.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206167 ns | 207146 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 259112 ns | 264756 ns | 0.98 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 308668791.5 ns | 308377104 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 282575646 ns | 231656291 ns | 1.22 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 199775042 ns | 224042396 ns | 0.89 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 309205458 ns | 307881333 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7688394 ns | 7678620 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1093080750 ns | 1097604312.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 1075916375 ns | 920148521 ns | 1.17 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 810723875 ns | 858485833.5 ns | 0.94 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1146255478.5 ns | 1150798750 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26478179 ns | 26497955 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5042 ns | 4958.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6250 ns | 5583 ns | 1.12 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6584 ns | 6916.5 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5458 ns | 5541 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 170923.5 ns | 171524 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7333 ns | 7542 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7416 ns | 6750 ns | 1.10 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7417 ns | 7458 ns | 0.99 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7041 ns | 7875 ns | 0.89 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 648059.5 ns | 670577.5 ns | 0.97 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 541 ns | 1.16 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 541 ns | 1.08 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 625 ns | 0.87 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24468 ns | 23778 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9333 ns | 8708 ns | 1.07 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9000 ns | 8541.5 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9729.5 ns | 9458 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8792 ns | 9541.5 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 223281 ns | 233071 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 351708 ns | 353250 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 352583 ns | 353208 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352708 ns | 352667 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 351416.5 ns | 352125 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21843 ns | 21348 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 811563 ns | 822333 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 793583.5 ns | 774854 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 812375 ns | 777042 ns | 1.05 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 804291 ns | 825999.5 ns | 0.97 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 279114.5 ns | 286748 ns | 0.97 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 338875 ns | 336833 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 321459 ns | 335917 ns | 0.96 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 450271 ns | 445708 ns | 1.01 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10750 ns | 10917 ns | 0.98 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18538 ns | 17559 ns | 1.06 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 712021 ns | 713499.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 730333 ns | 730834 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1002270.5 ns | 1027167 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26708 ns | 26500 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 261073.5 ns | 260521.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 381875 ns | 371375 ns | 1.03 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 326167 ns | 346250 ns | 0.94 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 443625 ns | 445812.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 30417 ns | 30479 ns | 1.00 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 23393 ns | 22136 ns | 1.06 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 731937.5 ns | 734062.5 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 784187.5 ns | 773750.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1027875 ns | 1061729 ns | 0.97 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 89584 ns | 98521 ns | 0.91 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 220484 ns | 220018.5 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3375 ns | 3375 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3542 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3833 ns | 3687.5 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3458 ns | 3583 ns | 0.97 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17892 ns | 17780 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4292 ns | 4125 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4250 ns | 4167 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4375 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4417 ns | 4500 ns | 0.98 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 288266.5 ns | 258504 ns | 1.12 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4083 ns | 3750 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4062.5 ns | 3500 ns | 1.16 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4334 ns | 4917 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3833 ns | 4083 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 243078.5 ns | 200777 ns | 1.21 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8417 ns | 8417 ns | 1 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8208 ns | 8000 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 8625 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8604.5 ns | 0.99 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1294141 ns | 1183716 ns | 1.09 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203583 ns | 205708 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209750 ns | 210125 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209750 ns | 210375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199542 ns | 200375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35748 ns | 34375 ns | 1.04 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 610959 ns | 650916 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 629979 ns | 666959 ns | 0.94 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 632042 ns | 624167 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 624312.5 ns | 632458 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 366873 ns | 343648 ns | 1.07 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1020270.5 ns | 1000479 ns | 1.02 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1019375 ns | 1007958 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 956541 ns | 974396 ns | 0.98 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 862917 ns | 894770.5 ns | 0.96 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 208035 ns | 207021.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4555583 ns | 4512146 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4847250 ns | 4708729.5 ns | 1.03 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4461541 ns | 4609875 ns | 0.97 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 5174375 ns | 5171208.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 927061 ns | 947853.5 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4042 ns | 3333 ns | 1.21 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3500 ns | 3083 ns | 1.14 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4250 ns | 4333 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3375 ns | 3917 ns | 0.86 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 241039.5 ns | 218377.5 ns | 1.10 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7375 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7062.5 ns | 6833 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7333 ns | 7458 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6916 ns | 7459 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1063926.5 ns | 1012916 ns | 1.05 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1524958 ns | 1641584 ns | 0.93 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1178854.5 ns | 1193979 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1368709 ns | 1342687.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2362167 ns | 2486625.5 ns | 0.95 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 218600.5 ns | 214048 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12347875 ns | 12366291.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9603708 ns | 9556958 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9285208.5 ns | 9332500 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17994500 ns | 18065166.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1959865.5 ns | 1946882 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17343125 ns | 17346750 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14424146 ns | 14347000 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14365583 ns | 14486917 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21176708 ns | 21148167 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 90520.5 ns | 134750 ns | 0.67 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90208 ns | 88584 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 94500 ns | 92042 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 133292 ns | 89042 ns | 1.50 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126385 ns | 126624 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2059229.5 ns | 2031958 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2014083.5 ns | 2023083.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2030292 ns | 1756000 ns | 1.16 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2020416.5 ns | 2029583 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1061374.5 ns | 1029084 ns | 1.03 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 2375 ns | 1750 ns | 1.36 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 1834 ns | 2833 ns | 0.65 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3542 ns | 2458 ns | 1.44 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 2167 ns | 2166.5 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16672 ns | 16055 ns | 1.04 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2541 ns | 2583 ns | 0.98 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2917 ns | 2500 ns | 1.17 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 2750 ns | 2750 ns | 1 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2792 ns | 2750 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 197485.5 ns | 191618 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7416 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5416 ns | 5917 ns | 0.92 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5958 ns | 5125 ns | 1.16 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9916 ns | 10166 ns | 0.98 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34400.5 ns | 33917 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213812.5 ns | 226396.5 ns | 0.94 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221000 ns | 222521 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231917 ns | 221584 ns | 1.05 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 208604 ns | 207458 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 352524 ns | 311723.5 ns | 1.13 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3667 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3750 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3667 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22677 ns | 22860 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14416 ns | 14458 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14125 ns | 14291 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14500 ns | 14250 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14417 ns | 14667 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 511650.5 ns | 472859.5 ns | 1.08 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 93854 ns | 137417 ns | 0.68 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 97145.5 ns | 96458.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 98417 ns | 95833 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 140083 ns | 93125 ns | 1.50 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125784 ns | 125940 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1964729 ns | 1921458.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1938562.5 ns | 1918166.5 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1927041.5 ns | 1817687.5 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1920667 ns | 1914458 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1039090 ns | 951464 ns | 1.09 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 877500 ns | 869042 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 800812.5 ns | 815167 ns | 0.98 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1223937 ns | 1175833 ns | 1.04 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 969958 ns | 967562.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 285567 ns | 276671 ns | 1.03 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2803854 ns | 2830583 ns | 0.99 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2511750 ns | 2508062.5 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3356541.5 ns | 3332875 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3428708 ns | 3328000 ns | 1.03 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1675606 ns | 1576106.5 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15958 ns | 16000 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16562.5 ns | 15625 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17041.5 ns | 16458 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17375 ns | 16417 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 145484 ns | 143900.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223104 ns | 255875.5 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222896 ns | 254271 ns | 0.88 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 226708 ns | 216250 ns | 1.05 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 253167 ns | 258021 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 664599 ns | 637843.5 ns | 1.04 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221146 ns | 220792 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221500 ns | 220667 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 221666.5 ns | 221208 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221042 ns | 222208.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 276464 ns | 270997 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 551791 ns | 504458 ns | 1.09 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 505375 ns | 507416.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 509750 ns | 499833.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 508666.5 ns | 498875.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1493627 ns | 1304306.5 ns | 1.15 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4000 ns | 3459 ns | 1.16 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4104.5 ns | 3854.5 ns | 1.06 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 4667 ns | 5375 ns | 0.87 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 4042 ns | 4042 ns | 1 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17326 ns | 16660 ns | 1.04 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7042 ns | 7166 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7417 ns | 6458 ns | 1.15 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7250 ns | 7209 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7458 ns | 7541.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 198652.5 ns | 194930.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17875 ns | 17666 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18333 ns | 17125 ns | 1.07 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19750 ns | 19729 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17146 ns | 18000 ns | 0.95 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 230076 ns | 146357.5 ns | 1.57 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219250 ns | 244562 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216020.5 ns | 237417 ns | 0.91 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212500 ns | 214500 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212479.5 ns | 225208 ns | 0.94 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1050719 ns | 894981 ns | 1.17 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4500 ns | 4416 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4583 ns | 3917 ns | 1.17 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4667 ns | 5334 ns | 0.87 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4583 ns | 4833 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 252077 ns | 187684 ns | 1.34 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10833 ns | 10500 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10500 ns | 9708 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 11167 ns | 0.92 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 11250 ns | 0.91 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1102570 ns | 1024651 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3312.5 ns | 3209 ns | 1.03 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3708 ns | 3250 ns | 1.14 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 3959 ns | 4687.5 ns | 0.84 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3125 ns | 3791 ns | 0.82 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 243703 ns | 218725.5 ns | 1.11 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7229.5 ns | 7833 ns | 0.92 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7333 ns | 7291 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7417 ns | 7625 ns | 0.97 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7209 ns | 7917 ns | 0.91 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1111590.5 ns | 1043721.5 ns | 1.07 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23487541.5 ns | 23437104.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 43971125 ns | 35045979.5 ns | 1.25 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37463166.5 ns | 41490500 ns | 0.90 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34877416 ns | 34913479 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1842834.5 ns | 2126334.5 ns | 0.87 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184200958 ns | 184798459 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 173422437.5 ns | 159330000 ns | 1.09 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146460271 ns | 151477459 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 410950833 ns | 411547250 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16526176 ns | 16524151 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 425975000 ns | 427197208 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 259298209 ns | 252723645.5 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 296349208.5 ns | 305721250 ns | 0.97 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 479307000 ns | 481095166 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183167 ns | 182854.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183917 ns | 182791.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185291.5 ns | 185292 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183708.5 ns | 185750 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 232992 ns | 173677.5 ns | 1.34 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 588709 ns | 629833 ns | 0.93 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 595709 ns | 631375 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 596042 ns | 590542 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 597500 ns | 630770.5 ns | 0.95 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1113560 ns | 1010062 ns | 1.10 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 4043292 ns | 3848041.5 ns | 1.05 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 4012396 ns | 4009000 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3557000 ns | 3525583 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4569124.5 ns | 4614917 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 531536 ns | 536882 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17494562.5 ns | 17371917 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 18560917 ns | 17740624.5 ns | 1.05 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16622646 ns | 16856312.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 20213416.5 ns | 20403334 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2619803.5 ns | 2613028 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 500 ns | 1.25 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 667 ns | 0.81 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32024.5 ns | 31917 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9334 ns | 9334 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9291 ns | 8708 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9666.5 ns | 9875 ns | 0.98 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9000 ns | 9417 ns | 0.96 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 264542.5 ns | 260614 ns | 1.02 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 496971791 ns | 503086958 ns | 0.99 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 509285541 ns | 424620083.5 ns | 1.20 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 421912146 ns | 462339520.5 ns | 0.91 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 672227417 ns | 673052062 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12489793.5 ns | 12478664.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1883911021 ns | 1872018104.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1668824291 ns | 1625413500 ns | 1.03 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1489797958.5 ns | 1546440125 ns | 0.96 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2201017208.5 ns | 2200566458.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49197806.5 ns | 49139909 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1600645.5 ns | 1647791.5 ns | 0.97 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1172708 ns | 1202542 ns | 0.98 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1388125 ns | 1365999.5 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2344958.5 ns | 2393042 ns | 0.98 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 218458 ns | 215162 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12685750 ns | 12703083.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9976000 ns | 9880000 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9656709 ns | 9761146 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18427396 ns | 18559417 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2044469 ns | 2005712 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17712834 ns | 17693854 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14779375 ns | 14669187.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14604916 ns | 14767500 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21383042 ns | 21469542 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26250 ns | 26250 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26250 ns | 26208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26208 ns | 26292 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26208 ns | 26292 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24118 ns | 23799 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67000 ns | 66666 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66833 ns | 66750 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67500 ns | 67209 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66834 ns | 67500 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 410737.5 ns | 380551.5 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203917 ns | 203917 ns | 1 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 208625 ns | 209750 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209084 ns | 210000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199500 ns | 199958 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27195 ns | 25800 ns | 1.05 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 625958.5 ns | 648229.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 629916 ns | 661271 ns | 0.95 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 632125 ns | 622750 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 600062.5 ns | 586375 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 358637.5 ns | 308724.5 ns | 1.16 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 658417 ns | 600291 ns | 1.10 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 641625 ns | 594125 ns | 1.08 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 647542 ns | 544666 ns | 1.19 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 666291.5 ns | 652208 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132681.5 ns | 131751 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2274708 ns | 2235000 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2300125 ns | 2235625 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2238125 ns | 2300854 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2241291 ns | 2253125 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1242340 ns | 1127758 ns | 1.10 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18020.5 ns | 17541 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18292 ns | 16958 ns | 1.08 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20250 ns | 19917 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17500 ns | 17958 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146876.5 ns | 145385 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 231458 ns | 261583 ns | 0.88 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227333.5 ns | 260812.5 ns | 0.87 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227500 ns | 220937.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 229792 ns | 230896 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1067171 ns | 982925 ns | 1.09 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 667 ns | 542 ns | 1.23 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 542 ns | 1.15 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 625 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 667 ns | 0.81 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23878 ns | 23015 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9833 ns | 9479.5 ns | 1.04 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9875 ns | 9042 ns | 1.09 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10000 ns | 10292 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9541 ns | 9625 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 263281 ns | 257388 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5687.5 ns | 5458 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6208 ns | 5417 ns | 1.15 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7125 ns | 6625 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5417 ns | 6083 ns | 0.89 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 235834 ns | 233603.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7250 ns | 7083 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8042 ns | 7041 ns | 1.14 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7541.5 ns | 7833 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6979.5 ns | 7375 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 811982.5 ns | 800650 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2125 ns | 2000 ns | 1.06 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2312.5 ns | 2125 ns | 1.09 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2500 ns | 2458 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2125 ns | 2459 ns | 0.86 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18261 ns | 17988 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6375 ns | 6500 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6520.5 ns | 6291 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6708 ns | 6708 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6375 ns | 6542 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 336632.5 ns | 330671 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s)
749209
ns749709
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s)
748895.5
ns747104
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s)
749542
ns749208
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s)
754083
ns751791.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA
21329
ns21045
ns1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s)
818750
ns791000
ns1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s)
788167
ns791062.5
ns1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s)
791584
ns775875
ns1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s)
790584
ns775250
ns1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA
299791
ns294695
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7500
ns7208
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5334
ns5958
ns0.90
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5916
ns5291
ns1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10208
ns10208
ns1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
33718
ns32534
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
256167
ns233291
ns1.10
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
235520.5
ns267375
ns0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
240500
ns227812.5
ns1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
250875
ns213583
ns1.17
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
365654
ns361573
ns1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
10312.5
ns10020.5
ns1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
10416
ns10042
ns1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
10812.5
ns11625
ns0.93
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
10166.5
ns10208
ns1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
245731
ns248981.5
ns0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
25083
ns26791
ns0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24667
ns24292
ns1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
24125
ns24750
ns0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24500
ns25000
ns0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
1139764
ns1132389
ns1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
106439229
ns107227250
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
127176500
ns117058791.5
ns1.09
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
120453645.5
ns124034229
ns0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
117602312.5
ns117545541.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
2646453
ns2659866
ns0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
394264417
ns393155000
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
380211666
ns366597250
ns1.04
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
421708312.5
ns357674666
ns1.18
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
479818917
ns490403667
ns0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
15158878
ns15157994
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
756832624.5
ns758865499.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
775894292
ns580033084
ns1.34
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
748243271.5
ns748265062.5
ns1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
761933208.5
ns948608916.5
ns0.80
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7145.5
ns6916.5
ns1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7834
ns7000
ns1.12
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
9541
ns8042
ns1.19
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7417
ns7625
ns0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
241749
ns242461.5
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14291.5
ns14084
ns1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14166
ns13500
ns1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14167
ns14208
ns1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
13708
ns14333
ns0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
1098247
ns1085062
ns1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6042
ns5541
ns1.09
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6750
ns6563
ns1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7083
ns7666
ns0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5834
ns6291
ns0.93
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
240471.5
ns235371.5
ns1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
12667
ns12542
ns1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
13333
ns12104.5
ns1.10
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
13354.5
ns13042
ns1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
12334
ns12750
ns0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
800476.5
ns793450.5
ns1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s)
5333
ns5125
ns1.04
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s)
5875
ns5750
ns1.02
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s)
6000
ns6333
ns0.95
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s)
5500
ns5625
ns0.98
batchedmm(2, Bsize=128)/forward/GPU/CUDA
17559
ns16571
ns1.06
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s)
15459
ns15792
ns0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s)
15437.5
ns15417
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s)
15667
ns15625
ns1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s)
15750
ns15750
ns1
batchedmm(2, Bsize=128)/zygote/GPU/CUDA
202574
ns200110.5
ns1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
417
ns292
ns1.43
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
333
ns292
ns1.14
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
416
ns416
ns1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
292
ns417
ns0.70
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
24102
ns23594.5
ns1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6333
ns5959
ns1.06
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6209
ns6083
ns1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
6750
ns6666
ns1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6333
ns6834
ns0.93
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
242831.5
ns242427.5
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
5916
ns5833
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
5917
ns5834
ns1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
5958
ns6000
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
5792
ns6041
ns0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
25033
ns24342.5
ns1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
21375
ns20875
ns1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
21125
ns21042
ns1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
21375
ns21666
ns0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21020.5
ns21875
ns0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
267836
ns262727.5
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
144833
ns185833
ns0.78
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
145250
ns144916.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
150083.5
ns146875
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
188375
ns144416.5
ns1.30
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168310
ns167734
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1351833
ns1323750
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1369333
ns1312209
ns1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1322041
ns1332875
ns0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1327250
ns1333770.5
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1368007
ns1339118
ns1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
23042
ns24041.5
ns0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
24041
ns22312.5
ns1.08
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
24917
ns24833
ns1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
21833
ns24667
ns0.89
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
356401.5
ns351890.5
ns1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
126958
ns170708
ns0.74
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
120333
ns177875
ns0.68
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
180250
ns118625
ns1.52
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
180749.5
ns120020.5
ns1.51
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1484885
ns1461877
ns1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
375
ns292
ns1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
333
ns292
ns1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
292
ns416
ns0.70
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
23370
ns22590
ns1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
6479.5
ns6250
ns1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
6416
ns6250
ns1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
7042
ns6750
ns1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
6333
ns6583
ns0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
260419.5
ns255552.5
ns1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4333
ns4291
ns1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5041.5
ns4417
ns1.14
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5459
ns5708
ns0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4583.5
ns5292
ns0.87
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
255220
ns256272
ns1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10166.5
ns10042
ns1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10167
ns9833
ns1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
10250
ns10417
ns0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
10208
ns10333
ns0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
1368092
ns1354208
ns1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1625
ns1583
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1666
ns1625
ns1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1625
ns1666
ns0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1583
ns1625
ns0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA
23227
ns22798
ns1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
5750
ns5833
ns0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
5750
ns5709
ns1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
5750
ns6000
ns0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
5583
ns5916
ns0.94
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA
278026
ns274328
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s)
6781854.5
ns6866624.5
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s)
6363854.5
ns6433708
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s)
6534166
ns6554499.5
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s)
7654958.5
ns7548875
ns1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA
216771
ns213149
ns1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s)
24093667
ns24100417
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s)
21335604
ns21294521
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s)
21037958
ns21070125
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s)
29730292
ns29826667
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA
2100300
ns2116806
ns0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s)
37311042
ns37336834
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s)
45649479
ns34197292
ns1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s)
45692458
ns45794042
ns1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s)
38098959
ns49624208
ns0.77
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5520.5
ns5750
ns0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s)
6708.5
ns5625
ns1.19
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s)
7250
ns6791
ns1.07
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s)
6125
ns6667
ns0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA
240533.5
ns236202.5
ns1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8083
ns8084
ns1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
9083
ns7875
ns1.15
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8417
ns8667
ns0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8250
ns9167
ns0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA
1077102
ns1060405
ns1.02
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s)
1489187.5
ns1553542
ns0.96
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s)
1236771
ns1263041.5
ns0.98
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s)
1617916
ns1622041
ns1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s)
2170020.5
ns2175916
ns1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA
282849
ns272178
ns1.04
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s)
7909229.5
ns7902375
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s)
6634750
ns6258292
ns1.06
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s)
7161708
ns7165958
ns1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s)
10483708.5
ns10478104.5
ns1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA
1903700.5
ns1852121.5
ns1.03
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s)
367625
ns361584
ns1.02
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s)
349896
ns370750
ns0.94
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s)
453917
ns456417
ns0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s)
24459
ns24999.5
ns0.98
batchedmm(128, Bsize=4)/forward/GPU/CUDA
43502
ns46439.5
ns0.94
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s)
727167
ns738895.5
ns0.98
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s)
803167
ns809958
ns0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s)
1057604
ns1082542
ns0.98
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)
121792
ns76708
ns1.59
batchedmm(128, Bsize=4)/zygote/GPU/CUDA
307546.5
ns301861.5
ns1.02
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s)
397583
ns397459
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s)
213333
ns288084
ns0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s)
288209
ns212208
ns1.36
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s)
751125
ns755209
ns0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA
44141
ns43701
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
675500
ns665625
ns1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
475667
ns530417
ns0.90
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
531375
ns473750
ns1.12
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
972666.5
ns974458
ns1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA
191213
ns189749
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
658208.5
ns649583
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
643834
ns641833
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
655125
ns545458.5
ns1.20
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
681792
ns653167
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
132164.5
ns131877
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2526833
ns2454834
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2530541
ns2460271
ns1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2451667
ns2500666
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2454146
ns2518479
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1206173
ns1202049
ns1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)
2604
ns3000
ns0.87
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s)
2459
ns3500
ns0.70
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s)
4375
ns3500
ns1.25
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s)
2583
ns2708
ns0.95
batchedmm(2, Bsize=32)/forward/GPU/CUDA
16766
ns15904
ns1.05
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s)
5333
ns5375
ns0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s)
5542
ns5292
ns1.05
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s)
5583
ns5666
ns0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s)
5542
ns5750
ns0.96
batchedmm(2, Bsize=32)/zygote/GPU/CUDA
199467
ns196388
ns1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1459833
ns1465625
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1490334
ns1502708
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1497791
ns1496875
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1439750
ns1444792
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA
41167
ns40558
ns1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5155562
ns5125396
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5314187.5
ns5286583
ns1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5282833
ns5312375
ns0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4979791
ns4974792
ns1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
198405.5
ns195790.5
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3708
ns3708
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3708
ns3708
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3708
ns3709
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3750
ns3708
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
33352
ns32748
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15208
ns15083
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15000
ns15083
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
15209
ns15167
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15291
ns15375
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
379437.5
ns375651.5
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
71625
ns71125
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
71416
ns71167
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
71208
ns71208
ns1
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
71083
ns71083
ns1
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
113188.5
ns112958
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
321770.5
ns323791
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
330770.5
ns320458
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
319333
ns326875
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
326458
ns323000
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
194877
ns193747
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
1000
ns1000
ns1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
1083
ns958
ns1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
1083
ns1042
ns1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
959
ns1084
ns0.88
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
23702
ns23358
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7917
ns7875
ns1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8125
ns7834
ns1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
8125
ns8458
ns0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
7916
ns8833
ns0.90
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
263485
ns259209
ns1.02
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
497624.5
ns505375
ns0.98
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
471604
ns484292
ns0.97
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
563708
ns564542
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
218208
ns215062.5
ns1.01
batchedmm(128, Bsize=32)/forward/GPU/CUDA
129739
ns128754
ns1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
1355292
ns1371334
ns0.99
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1470187.5
ns1393812.5
ns1.05
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
1719583.5
ns1732333
ns0.99
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
867375
ns870083.5
ns1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
275487
ns276302
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
417
ns333
ns1.25
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
375
ns292
ns1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375
ns375
ns1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
333
ns375
ns0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
31436
ns31400
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6208
ns6167
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6333
ns6000
ns1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6458
ns6500
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6333
ns6958
ns0.91
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
262275
ns263074.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1727063
ns1767042
ns0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1729458
ns1725208
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1725417
ns1727292
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1768875
ns1726271
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
168537
ns168554
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4367874.5
ns4357521
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4385375
ns4359541
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4367104
ns4379875
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4357459
ns4377583
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
1262273
ns1157059
ns1.09
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
6708
ns6666
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
6541
ns6666
ns0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
7000
ns6916
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
6875
ns7041.5
ns0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
20525
ns20567
ns1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
33063
ns32834
ns1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
33083
ns51229.5
ns0.65
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
48041.5
ns33541.5
ns1.43
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
53792
ns51062.5
ns1.05
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
291536.5
ns209739.5
ns1.39
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
17333.5
ns17250
ns1.00
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
17792
ns17812.5
ns1.00
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
18209
ns18292
ns1.00
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
17666
ns17708
ns1.00
batchedmm(2, Bsize=512)/forward/GPU/CUDA
18396
ns17907
ns1.03
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
53209
ns53208
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
53417
ns52959
ns1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
53292
ns53541
ns1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
53375
ns53291
ns1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
338706.5
ns344400
ns0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
75500
ns75333
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
75417
ns74959
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
75292
ns75292
ns1
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
75292
ns75000
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
46489
ns47022
ns0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
329084
ns325292
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
336667
ns324417
ns1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
328958
ns343042
ns0.96
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
323917
ns327084
ns0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
209091.5
ns210359
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
1486166
ns1488333
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
1517709
ns1527917
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
1525792
ns1521042
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
1464375
ns1466167
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
52406
ns51138
ns1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
5153729.5
ns5120375
ns1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
5303250
ns5285750
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
5257500
ns5309459
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4990145.5
ns4973917
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
203681
ns202631
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
28250
ns28167
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
28208
ns28125
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
28250
ns28208
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
28375
ns28209
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA
24536
ns24478
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
66708
ns66208
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
66125
ns66167
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
66250
ns66250
ns1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
66416
ns66959
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA
535849
ns533201
ns1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)
1468041
ns1463833
ns1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)
912854
ns1144583
ns0.80
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s)
1130187.5
ns832188
ns1.36
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s)
2251604
ns2217792
ns1.02
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA
583084
ns576305
ns1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s)
3113959
ns3077958.5
ns1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s)
2660771
ns2733167
ns0.97
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s)
2734000
ns2620334
ns1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s)
3802646
ns3782000
ns1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA
2002672
ns2001343
ns1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s)
7929500
ns7887749.5
ns1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s)
8011167
ns7887771
ns1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s)
7911791.5
ns7989000
ns0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s)
4826833
ns4832458
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
81437.5
ns134958
ns0.60
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
83395.5
ns78917
ns1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
84437.5
ns82625
ns1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
136500
ns81250
ns1.68
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
193251.5
ns193237.5
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2033479
ns2017354.5
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2014584
ns2006750
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2016000
ns2041167
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2013958
ns2018875
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
792396
ns797402
ns0.99
This comment was automatically generated by workflow using github-action-benchmark.
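
The page does not say how the last figure in each entry is derived, but the numbers are consistent with the first timing divided by the second (for example 542 / 667 ≈ 0.81). The Python sketch below reproduces that arithmetic for two entries taken from the results above. It assumes the first timing is the current commit and the second is the previous one; the `ratio` helper and the `THRESHOLD` value are hypothetical names chosen for illustration, not settings read from this workflow.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current timing divided by previous timing; above 1 means slower."""
    return current_ns / previous_ns


# (benchmark name, first timing in ns, second timing in ns), copied from the results
rows = [
    ("batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)", 542, 667),
    ("batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s)", 121792, 76708),
]

THRESHOLD = 1.5  # hypothetical cut-off, not taken from the workflow configuration

for name, current_ns, previous_ns in rows:
    r = ratio(current_ns, previous_ns)
    status = "possible regression" if r > THRESHOLD else "ok"
    print(f"{name}: {r:.2f} ({status})")
```

Many of the sub-microsecond CPU entries swing by 20% or more in both directions between runs, so a single ratio above 1 is weak evidence of a regression on its own.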