fix: update default rng for reactant #1152
Merged
Lux Benchmarks
Benchmark suite | Current: 5db43a7 | Previous: 63d3434 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4166 ns | 4083.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3979.5 ns | 4042 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5083 ns | 4917 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3812.5 ns | 3833 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 62377 ns | 59941 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10458 ns | 11250 ns | 0.93 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10542 ns | 10500 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10917 ns | 11541 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 10958 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 423465.5 ns | 421187 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1125 ns | 1167 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1334 ns | 1250 ns | 1.07 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1417 ns | 1417 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1167 ns | 1167 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 17673 ns | 17939 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4167 ns | 4125 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4042 ns | 3958 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4458 ns | 4292 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3916 ns | 4062.5 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 108782.5 ns | 108432 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58666 ns | 57333 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46208 ns | 46250 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46500 ns | 47041 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82000 ns | 82125 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37058 ns | 36736 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2057542 ns | 1991000.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2088812.5 ns | 2094313 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2085708 ns | 2094167 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2000917 ns | 1997041.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196173 ns | 194384.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 188166.5 ns | 143854.5 ns | 1.31 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 146417 ns | 143125 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 168667 ns | 147041 ns | 1.15 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144583 ns | 144750 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166189 ns | 165602 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1152084 ns | 1114896 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1114937 ns | 1128937.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1113917 ns | 1128792 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1124792 ns | 1114542 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 526984 ns | 526049 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3958 ns | 3458 ns | 1.14 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3791 ns | 3416 ns | 1.11 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4292 ns | 4145.5 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3209 ns | 3584 ns | 0.90 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 66944 ns | 70040 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9292 ns | 8917 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10458 ns | 9042 ns | 1.16 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9625 ns | 9459 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9000 ns | 8917 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 476984 ns | 447136 ns | 1.07 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15666.5 ns | 15041 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15250 ns | 17541.5 ns | 0.87 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17375 ns | 17625 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15500 ns | 15917 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 54731 ns | 54471 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226333 ns | 217417 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215770.5 ns | 213417 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214541.5 ns | 214979.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214771 ns | 225771 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 273572 ns | 270355 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 750 ns | 791 ns | 0.95 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 750 ns | 625 ns | 1.20 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 750 ns | 708 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 708 ns | 667 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17271.5 ns | 17190 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1417 ns | 1500 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1458 ns | 1500 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1750 ns | 1666 ns | 1.05 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1417 ns | 1500 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 101757 ns | 101385 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7208 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5916 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5959 ns | 5917 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 9875 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23740 ns | 23163 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 233042 ns | 223083 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229417 ns | 228500 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230042 ns | 230208 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214167 ns | 217000 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 168563 ns | 166961 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23692 ns | 23600 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16875 ns | 16792 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 17167 ns | 16750 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16958 ns | 17041 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16750 ns | 17000 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 161637.5 ns | 161078 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 604292 ns | 577750 ns | 1.05 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 572625 ns | 572709 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 576208 ns | 574833 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 574666 ns | 575625 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113368 ns | 112893 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1449125 ns | 1420292 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1425458 ns | 1425209 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1417250 ns | 1426583 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1422625 ns | 1429020.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 212042 ns | 211317.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1083896 ns | 1077500 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 955958 ns | 960792 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1346958 ns | 1350854.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1303396 ns | 1298750 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 270063.5 ns | 273506 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5906562.5 ns | 6004937.5 ns | 0.98 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4521396 ns | 4547292 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4939270.5 ns | 4929708.5 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5518541 ns | 5555333 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1072796 ns | 1074648 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 583 ns | 0.93 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23647 ns | 23430 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2250 ns | 2167 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2167 ns | 2084 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2250 ns | 2167 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2166 ns | 2084 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 170300.5 ns | 173597 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4583 ns | 4292 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 3833 ns | 3750 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5000 ns | 4917 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3854.5 ns | 3958 ns | 0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 65114 ns | 65160 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12000 ns | 11209 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11416 ns | 11250 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12166 ns | 12208 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11042 ns | 11125 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 449470 ns | 447745.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7270.5 ns | 6166 ns | 1.18 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6084 ns | 6375 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7709 ns | 8125 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6041 ns | 6583 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 52133.5 ns | 52163 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17709 ns | 16750 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17750 ns | 18209 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 19167 ns | 18500 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17459 ns | 17000 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 299326.5 ns | 298259.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 667 ns | 583 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32714 ns | 32532 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9083 ns | 8208 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8709 ns | 8667 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9584 ns | 9333 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8292 ns | 8083 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 160130 ns | 158900.5 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 65125 ns | 64500 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64333 ns | 64500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64416 ns | 64458 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64500 ns | 64375 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112090.5 ns | 111633.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 284417 ns | 274542 ns | 1.04 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 283229.5 ns | 287042 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 276375 ns | 274708 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 274250 ns | 280292 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 186688.5 ns | 186083 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3285312 ns | 3329333 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3024209 ns | 3017229 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3022833 ns | 3024687.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4057542 ns | 3956250 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 576630.5 ns | 577429 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7680958 ns | 7623958 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7452250 ns | 7210334 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7441167 ns | 7453270.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8201292 ns | 8209375 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1341272 ns | 1359043.5 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 17502166.5 ns | 17513124.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 17538625 ns | 17530146 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 17554667 ns | 17518395.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 14123291 ns | 14128813 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23611750.5 ns | 23645979.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34023937.5 ns | 33821104.5 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37089000 ns | 37080041 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34892958.5 ns | 34888834 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1859854 ns | 1866294 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 188302208 ns | 189046208 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 163879813 ns | 164619624.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 152774791.5 ns | 152711479 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 433871375 ns | 436948083 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13921725.5 ns | 13894254.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 289641667 ns | 289373791 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 250562000 ns | 251042625 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 297502250 ns | 296809167 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 473866146 ns | 474994229.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24208 ns | 22250 ns | 1.09 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21875 ns | 24542 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 24834 ns | 23188 ns | 1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21625 ns | 22417 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 97906 ns | 96027 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 104333.5 ns | 116584 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 105000 ns | 113125 ns | 0.93 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 105041.5 ns | 117833 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 102667 ns | 103854 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 513627.5 ns | 510213 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6708 ns | 5833 ns | 1.15 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5708 ns | 5917 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6667 ns | 6812.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5937.5 ns | 6292 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 68769 ns | 68158.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14959 ns | 14875 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15459 ns | 14812.5 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15667 ns | 14875 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14750 ns | 15042 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 486570.5 ns | 478636.5 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3028666.5 ns | 3009146 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2062834 ns | 2061334 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2293375 ns | 2279208 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4537375 ns | 4871541.5 ns | 0.93 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 588519 ns | 589315.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23569083 ns | 23547375 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 17970166 ns | 17982875.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 16984875 ns | 16893209 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 34970875 ns | 34849958 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2769808.5 ns | 2772744 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33513292 ns | 33314834 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27545208.5 ns | 27464208 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27453833 ns | 27410208 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 40974459 ns | 41078500 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74167 ns | 72375 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 71979.5 ns | 74375 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75167 ns | 75166 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72292 ns | 75167 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104951 ns | 102682 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 308750 ns | 286145.5 ns | 1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 321771 ns | 210021.5 ns | 1.53 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 321167 ns | 315000 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219499.5 ns | 218458 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 547081 ns | 553543 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12083 ns | 11875 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12041 ns | 11708 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12292 ns | 13334 ns | 0.92 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11645.5 ns | 13125 ns | 0.89 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 69950.5 ns | 71259 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26708 ns | 26833.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27417 ns | 26375 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27333 ns | 27417 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26750 ns | 25854.5 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 473806 ns | 477064.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12958 ns | 12041.5 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12125 ns | 12229.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13542 ns | 13958 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12083 ns | 12584 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 52821 ns | 53895.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26208 ns | 25875 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26417 ns | 25834 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26167 ns | 26125 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26083 ns | 25667 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 301072.5 ns | 305285 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 179709 ns | 179417 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182000 ns | 179417 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 182875 ns | 181041 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 179708 ns | 180042 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 57140 ns | 58113 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 584541.5 ns | 590084 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 585375 ns | 585083 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 594791 ns | 591062.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 583104 ns | 584333 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 289093 ns | 289662.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6541 ns | 6083 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5500 ns | 1.10 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6667 ns | 7542 ns | 0.88 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5625 ns | 6604.5 ns | 0.85 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 70191 ns | 70599 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14334 ns | 14291 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14958 ns | 14209 ns | 1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14770.5 ns | 14917 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14166 ns | 13062.5 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 463731.5 ns | 466681.5 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1233354 ns | 1223541.5 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1241354 ns | 1236625 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1307584 ns | 1285666.5 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1016334 ns | 1007959 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 302150 ns | 301986 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4139000 ns | 4226959 ns | 0.98 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4384875 ns | 4384249.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4566250 ns | 4572312.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 3703750 ns | 3695104.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1038963 ns | 1047036 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1833 ns | 1833 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1833 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1875 ns | 1833 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1834 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23482 ns | 24200 ns | 0.97 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5000 ns | 4875 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4917 ns | 4833 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4959 ns | 4875 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4875 ns | 4875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 188186.5 ns | 192268.5 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6084 ns | 5458 ns | 1.11 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5542 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7000 ns | 6791.5 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5458 ns | 5792 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 54827 ns | 56595.5 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11541 ns | 10500 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11042 ns | 10416 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11541 ns | 11375 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10708 ns | 10875 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 328929.5 ns | 335979.5 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 334 ns | 0.87 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 375 ns | 333 ns | 1.13 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 334 ns | 334 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23065 ns | 23172 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2834 ns | 2833 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2875 ns | 2709 ns | 1.06 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3041 ns | 3042 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2791 ns | 2791 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 159599.5 ns | 162255.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11875 ns | 11084 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11270.5 ns | 11000 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12895.5 ns | 13563 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11292 ns | 11458 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 56968 ns | 58685.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25333 ns | 24542 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25062.5 ns | 24542 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25458 ns | 25167 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24750 ns | 25000 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 287900 ns | 298266 ns | 0.97 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4208 ns | 4208 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4250 ns | 4208 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4250 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4250 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24679 ns | 25307 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16625 ns | 16166 ns | 1.03 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16125 ns | 16292 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16625 ns | 16334 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16208 ns | 16084 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
195350.5 ns |
199542 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5833 ns |
5709 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5791 ns |
5917 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5833 ns |
5792 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5791 ns |
5834 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
33389 ns |
33833 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
21000 ns |
20292 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
20834 ns |
20375 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
21083 ns |
20875 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
21000 ns |
20250 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
175063 ns |
178083 ns |
0.98 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
424229.5 ns |
420500 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
382042 ns |
372625 ns |
1.03 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
477708 ns |
482833 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
104666.5 ns |
103292 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66983 ns |
67723.5 ns |
0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
882187.5 ns |
922417 ns |
0.96 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
977667 ns |
955208.5 ns |
1.02 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
1179166.5 ns |
1180875 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
383458 ns |
379083 ns |
1.01 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
188560 ns |
192988 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
80167 ns |
136917 ns |
0.59 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
80458 ns |
79854.5 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
83104 ns |
82750 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
80812.5 ns |
81167 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193407.5 ns |
194081 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1950292 ns |
1915042 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1903208 ns |
1919750 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1922208 ns |
1926125 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1919104 ns |
1915750 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
398056.5 ns |
401908.5 ns |
0.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
21900 ns |
22364 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1875 ns |
1833 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1834 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1875 ns |
1834 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1834 ns |
1834 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
169251 ns |
174295 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
7000 ns |
6042 ns |
1.16 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6000 ns |
6500 ns |
0.92 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7645.5 ns |
7812.5 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6541 ns |
6541 ns |
1 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
58682 ns |
61489.5 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9584 ns |
9000 ns |
1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9250 ns |
8792 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9542 ns |
9375 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9000 ns |
9459 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
299860 ns |
308375 ns |
0.97 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
120365479 ns |
118419979.5 ns |
1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
174058791 ns |
173770000 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
148250542 ns |
148397083 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
106407666 ns |
104919541 ns |
1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5471253 ns |
5493586 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
611069708.5 ns |
611739750.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
551988959 ns |
553521958 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
449773604.5 ns |
449841709 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
629581291.5 ns |
631089333.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
38220940.5 ns |
38209825 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
662619541 ns |
652096250 ns |
1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
664048916.5 ns |
661126562.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
590023604 ns |
580970687.5 ns |
1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
859382500 ns |
848782167 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
60333 ns |
58667 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47375 ns |
47500 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47459 ns |
48250 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
84416 ns |
83625 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
37873 ns |
37628 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1951062.5 ns |
1919312.5 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1961417 ns |
1980333.5 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1976854.5 ns |
1982541.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1887041.5 ns |
1895625 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
175346 ns |
176341 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
266583 ns |
266208 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
265916 ns |
265334 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
288687.5 ns |
288604 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
267875 ns |
268167 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
130934 ns |
130454.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
694458 ns |
664646 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
699271 ns |
671062.5 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
693292 ns |
665875 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
595500 ns |
597542 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
689357.5 ns |
690208 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2189917 ns |
2192312.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2219750 ns |
2179542 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2218583 ns |
2181333.5 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2179271 ns |
2207146 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
134898 ns |
134808 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5543688 ns |
5469791 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5493750 ns |
5472958.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5499792 ns |
5499916 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5500125 ns |
5442583.5 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
773264.5 ns |
720984 ns |
1.07 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
648250 ns |
644667 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
636583 ns |
644084 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
639917 ns |
642042 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
643375 ns |
644167 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
47646 ns |
47636.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1859250 ns |
1819917 ns |
1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1718062 ns |
1720500 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1731958 ns |
1721792 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
2102291 ns |
2100000 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
227022 ns |
224071 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
59208 ns |
57667 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46917 ns |
46666 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47187.5 ns |
46583 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
84750 ns |
83750 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
29453 ns |
28795 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2049209 ns |
2029583 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2082750 ns |
2087375 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2084729.5 ns |
2087791.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2003209 ns |
1991416.5 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
193498.5 ns |
190320 ns |
1.02 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
13383917 ns |
13371041.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
12429000 ns |
12439187.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
12553500 ns |
12491875 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
15134250 ns |
15195833.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
511588 ns |
516777 ns |
0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
47363459 ns |
47119104.5 ns |
1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
41713041 ns |
41727062.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
40957334 ns |
41051417 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
58315270.5 ns |
58599458 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
2895504 ns |
2892052.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
97529916.5 ns |
74212666 ns |
1.31 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
90590375 ns |
67877750 ns |
1.33 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
90429374.5 ns |
90536499.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
76443542 ns |
98549792 ns |
0.78 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
59583 ns |
58375 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46334 ns |
46459 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47375 ns |
47708 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82666 ns |
83958 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
47291 ns |
47165 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1947979.5 ns |
1919583.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1961250 ns |
1980791 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1973271 ns |
1979229.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1897250 ns |
1886958 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
192219 ns |
193816.5 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
417 ns |
333 ns |
1.25 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
333 ns |
1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
292 ns |
333 ns |
0.88 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
32079 ns |
32624 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6625 ns |
5833 ns |
1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6333 ns |
6083 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6750 ns |
6416.5 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
5937.5 ns |
5833 ns |
1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
170833.5 ns |
171378.5 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
291 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
333 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
250 ns |
292 ns |
0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
32116 ns |
32204 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2875 ns |
2583 ns |
1.11 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2833 ns |
2625 ns |
1.08 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
2875 ns |
2875 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2667 ns |
2625 ns |
1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
160054.5 ns |
159764 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
283749812 ns |
286393770.5 ns |
0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
341288375 ns |
340253500 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
314539291.5 ns |
313806270.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
271510208 ns |
268566520.5 ns |
1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
7104228 ns |
7103110 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
1010627500 ns |
1012043792 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
956683750 ns |
955581708 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
852193146 ns |
855297583 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
1261954792 ns |
1259239875 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
33827664 ns |
33847341 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
1667661584 ns |
1418325958.5 ns |
1.18 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
1666216916 ns |
1338395020.5 ns |
1.24 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
1606436083 ns |
1636087292 ns |
0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
1366973687.5 ns |
1775858125 ns |
0.77 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1403291 ns |
1409833 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1427917 ns |
1414458.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1420291.5 ns |
1465562.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1407083 ns |
1413458.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
127686 ns |
127951 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5077374.5 ns |
5027250 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5020479 ns |
5036354 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5013208.5 ns |
5030437.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5011896 ns |
5027250.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
590954 ns |
479205.5 ns |
1.23 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) |
169989042 ns |
170869291 ns |
0.99 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) |
124556708 ns |
128735708 ns |
0.97 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) |
117703542 ns |
105431542 ns |
1.12 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) |
167773041 ns |
167706958 ns |
1.00 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA |
4847276 ns |
4877746.5 ns |
0.99 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) |
628606625 ns |
511068334 ns |
1.23 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) |
495985459 ns |
490911792 ns |
1.01 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) |
384929709 ns |
385742875 ns |
1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) |
647293458 ns |
650161000 ns |
1.00 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA |
16634773 ns |
16340937 ns |
1.02 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
8966583.5 ns |
9003042 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
8908167 ns |
8983042 ns |
0.99 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
7918042 ns |
7909375 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
9735042 ns |
9604229.5 ns |
1.01 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1592278 ns |
1611438.5 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
36520208 ns |
36334167 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
37244833 ns |
37265291.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
33588209 ns |
33553354 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
37756875 ns |
37555333 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6456339 ns |
6454550 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
47520.5 ns |
47333 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
47625 ns |
47500 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
47875 ns |
47625 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
47666 ns |
47417 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
18617 ns |
18252 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
50459 ns |
50417 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
50500 ns |
50666 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
50500 ns |
50625 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
50458 ns |
50250 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
207525 ns |
164880 ns |
1.26 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7084 ns |
6417 ns |
1.10 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6291 ns |
6792 ns |
0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7667 ns |
7583.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7083 ns |
6792 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
102649 ns |
76692.5 ns |
1.34 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10375 ns |
10125 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9917 ns |
9750 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10459 ns |
10250 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10042 ns |
9875 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
608896.5 ns |
448214.5 ns |
1.36 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6229.5 ns |
5666 ns |
1.10 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5750 ns |
5791 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7292 ns |
7583 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6250 ns |
6042 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
127225 ns |
81735 ns |
1.56 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13625 ns |
13208 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13208 ns |
12709 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
14291 ns |
13375 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
13333 ns |
13417 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
504128.5 ns |
399198.5 ns |
1.26 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
1084 ns |
959 ns |
1.13 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
1042 ns |
1000 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1083 ns |
1042 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1083 ns |
1083 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
33015 ns |
32447 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8292 ns |
7666 ns |
1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7875 ns |
7708 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8500 ns |
7958 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7958 ns |
8166 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
206728 ns |
187787.5 ns |
1.10 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
23583 ns |
23167 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
23500 ns |
23209 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
23792 ns |
23250 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
23584 ns |
23292 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
18598 ns |
18320.5 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
52667 ns |
52917 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
52625 ns |
52167 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
52875 ns |
52917 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
52792 ns |
52875 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
277051.5 ns |
214503.5 ns |
1.29 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1457958 ns |
1398125 ns |
1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1399791 ns |
1402146 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1407542 ns |
1406437.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1404458.5 ns |
1448937.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
196597 ns |
196187.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5037333 ns |
5003458 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4999291.5 ns |
5029708 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5017000 ns |
5015042 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5011541.5 ns |
5005729.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
588570.5 ns |
509817 ns |
1.15 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
3068291.5 ns |
3051834 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2088000 ns |
2076520.5 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2298792 ns |
2302500 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
4816145.5 ns |
4658291.5 ns |
1.03 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
579682.5 ns |
581685 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
24422833.5 ns |
24315708 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
18831541.5 ns |
18877250 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
17767708 ns |
17822166 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
35768979 ns |
35790999.5 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
2840747 ns |
2842698 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
34254979.5 ns |
33982916.5 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
28263375 ns |
28228208.5 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
28014875.5 ns |
27940958 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
41850458 ns |
41757334 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
144995833 ns |
143078500 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
146762458.5 ns |
146668125 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
126155437.5 ns |
127355624.5 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
172892333 ns |
171841729.5 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22763599 ns |
22550146 ns |
1.01 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
1302781041.5 ns |
1234730083.5 ns |
1.06 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1105540874.5 ns |
1060723417 ns |
1.04 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
818915959 ns |
1027004875 ns |
0.80 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
672266208 ns |
674561583 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
118087389 ns |
117659213 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72667 ns | 74125 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 72541 ns | 73146 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 81041 ns | 76000 ns | 1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 85000 ns | 85834 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 215492.5 ns | 175925 ns | 1.22 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 297062.5 ns | 215750 ns | 1.38 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 285541.5 ns | 192541.5 ns | 1.48 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 287041.5 ns | 284542 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 287958 ns | 285708 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1143307 ns | 952026.5 ns | 1.20 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35739792 ns | 35486000 ns | 1.01 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36309625 ns | 36428646.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32471312.5 ns | 32475229 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40331146 ns | 40408041.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5837713 ns | 5831517 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 148519833 ns | 146000771 ns | 1.02 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 151197291.5 ns | 154808750 ns | 0.98 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 135647708.5 ns | 137043083.5 ns | 0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 285564667 ns | 285556542 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34868506.5 ns | 34852076.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 121927770.5 ns | 121592083 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173726125 ns | 174639125 ns | 0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147938542 ns | 148027541 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106465209 ns | 105917833 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5469946 ns | 5344344 ns | 1.02 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 469829834 ns | 468650958 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 466404000 ns | 466713000 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 440435500 ns | 437158458 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 742141416 ns | 744371959 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35183973.5 ns | 35992005 ns | 0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 708884520.5 ns | 712765167 ns | 0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 638944062.5 ns | 641204167 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 620145875.5 ns | 624084979.5 ns | 0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 853299834 ns | 856208084 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1349459 ns | 1270583 ns | 1.06 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 987708 ns | 995709 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 972125 ns | 995875 ns | 0.98 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2058666.5 ns | 2037625 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 571992 ns | 569478 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 3012354.5 ns | 2961229.5 ns | 1.02 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2612541.5 ns | 2647792 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2622125 ns | 2621500 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3696542 ns | 3709750 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1700920.5 ns | 1587708.5 ns | 1.07 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5829708 ns | 5785812.5 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5779417 ns | 5824083 ns | 0.99 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5792500 ns | 5785375 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2892874.5 ns | 2904896 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7416 ns | 7250 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6167 ns | 6125 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 6042 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10042 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25317 ns | 24479.5 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225041 ns | 223812.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 231875 ns | 222667 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220625 ns | 220792 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214333 ns | 240666 ns | 0.89 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 240148.5 ns | 212315.5 ns | 1.13 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 297383209 ns | 296229125 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 218983500 ns | 216728584 ns | 1.01 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 195262458 ns | 190254604.5 ns | 1.03 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 306663354 ns | 304954521 ns | 1.01 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7666475 ns | 7671461.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1233887896 ns | 1229817167 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 895083125 ns | 902846291.5 ns | 0.99 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 815769750 ns | 824304209 ns | 0.99 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1143449395.5 ns | 1157856750.5 ns | 0.99 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26736681 ns | 26996841 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5791 ns | 5292 ns | 1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4917 ns | 5291.5 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6583 ns | 6375 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5542 ns | 5250 ns | 1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 147831.5 ns | 112898 ns | 1.31 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7791 ns | 6875 ns | 1.13 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7166 ns | 6958 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7792 ns | 7583 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7292 ns | 7125 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 563806 ns | 535221.5 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 584 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 584 ns | 1.14 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 541 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23941 ns | 23660 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9459 ns | 8625 ns | 1.10 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9209 ns | 9084 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10208 ns | 9417 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9000 ns | 8708 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 201384.5 ns | 195936.5 ns | 1.03 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 356459 ns | 352958.5 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 353125 ns | 352792 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352041 ns | 351479 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 354041 ns | 356708.5 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21182 ns | 20962 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 834521 ns | 775625 ns | 1.08 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 777562.5 ns | 825833 ns | 0.94 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 812209 ns | 812229.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 830208 ns | 834959 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 253346 ns | 234827 ns | 1.08 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 341437.5 ns | 341562.5 ns | 1.00 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 346041 ns | 341958 ns | 1.01 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 451667 ns | 455917 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 11104.5 ns | 11083 ns | 1.00 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17955 ns | 17699 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 718625 ns | 712500 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 732125 ns | 739896 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1003979 ns | 1007854 ns | 1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 27625 ns | 26459 ns | 1.04 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 227406.5 ns | 214680.5 ns | 1.06 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 378875 ns | 381042 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 350708 ns | 346750 ns | 1.01 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 442208 ns | 449187.5 ns | 0.98 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 31583 ns | 39042 ns | 0.81 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22686 ns | 22537 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 737208 ns | 733792 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 780083 ns | 788958 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1026792 ns | 1032500 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 104333 ns | 105583 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 206514.5 ns | 200835.5 ns | 1.03 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3500 ns | 3791 ns | 0.92 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3541 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3792 ns | 3708 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3541.5 ns | 3708 ns | 0.96 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17783 ns | 17542 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4250 ns | 4250 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4209 ns | 4167 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4375 ns | 4250 ns | 1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4167 ns | 4250 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 234795 ns | 204574.5 ns | 1.15 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3875 ns | 3834 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3541 ns | 3667 ns | 0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4375 ns | 4250 ns | 1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3625 ns | 3625 ns | 1 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 190263 ns | 160115.5 ns | 1.19 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8667 ns | 8292 ns | 1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8667 ns | 8166 ns | 1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8604.5 ns | 8458 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8208.5 ns | 8333 ns | 0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1137360 ns | 989699 ns | 1.15 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 205416 ns | 203375 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 211792 ns | 212791 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 215417 ns | 210666 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 201417 ns | 200834 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34517 ns | 34428 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 645937.5 ns | 652624.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 629292 ns | 622667 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 631916.5 ns | 631604.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 629958 ns | 632750 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 324473.5 ns | 280400.5 ns | 1.16 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1023208 ns | 994229.5 ns | 1.03 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1019708 ns | 1040292 ns | 0.98 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 953687.5 ns | 956020.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 859854 ns | 853917 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 206050 ns | 208023.5 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4569708 ns | 4502437.5 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4685750 ns | 4668229.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4464417 ns | 4455084 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 4253187.5 ns | 4280937 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 925934 ns | 935555 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4000 ns | 3292 ns | 1.22 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3125 ns | 3458 ns | 0.90 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 3958 ns | 4042 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3167 ns | 3209 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 204899.5 ns | 159049 ns | 1.29 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 7291 ns | 1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7458 ns | 7333 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 7334 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 6833 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 966567 ns | 850635.5 ns | 1.14 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1646958 ns | 1640041 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1166750 ns | 1196604.5 ns | 0.98 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1367791.5 ns | 1383250 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2480916 ns | 2417500 ns | 1.03 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 212410 ns | 215018 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12414979.5 ns | 12333396 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9580459 ns | 9592791.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9268479.5 ns | 9267625 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18065125 ns | 18011459 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1952065 ns | 1959459 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17401166 ns | 17332937.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14384792 ns | 14386792 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14373584 ns | 14369396.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21058416 ns | 21112291.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 136959 ns | 87708 ns | 1.56 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 88187.5 ns | 88542 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 93625 ns | 92833 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 132479 ns | 116000 ns | 1.14 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126700 ns | 126352.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2048041.5 ns | 2022959 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1852917 ns | 2049666 ns | 0.90 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2026645.5 ns | 2035562.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2026917 ns | 2025938 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 986444.5 ns | 878938 ns | 1.12 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 4083.5 ns | 2750 ns | 1.48 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 2917 ns | 3209 ns | 0.91 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3000 ns | 3417 ns | 0.88 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 2208 ns | 2792 ns | 0.79 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16657 ns | 16283 ns | 1.02 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2875 ns | 2542 ns | 1.13 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2708 ns | 2708 ns | 1 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3000 ns | 2875 ns | 1.04 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2792 ns | 2834 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 184699.5 ns | 176848 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7500 ns | 7083 ns | 1.06 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5959 ns | 6000 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5958 ns | 6041 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10042 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34237 ns | 34134 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226625 ns | 221583 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229917 ns | 220000 ns | 1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220083 ns | 220417 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 216458 ns | 215333 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 327734.5 ns | 285763.5 ns | 1.15 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3750 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3750 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3750 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3750 ns | 3709 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22866 ns | 22875 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14709 ns | 14500 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14459 ns | 14375 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14666 ns | 14458 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14458 ns | 14500 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 457315.5 ns | 410580 ns | 1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 138584 ns | 92125 ns | 1.50 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 91416.5 ns | 92916 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 97042 ns | 96979 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 140937 ns | 138000 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126314 ns | 125660 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1957979.5 ns | 1923792 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1909167 ns | 1935291 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1924104 ns | 1932916.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1923208 ns | 1920500 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 889594 ns | 861874.5 ns | 1.03 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 882875 ns | 873916 ns | 1.01 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 825333.5 ns | 826583 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1222459 ns | 1222000 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 968812 ns | 963750 ns | 1.01 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 275152 ns | 276546 ns | 0.99 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2832250 ns | 2791083 ns | 1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2436875 ns | 2445687.5 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3346084 ns | 3347916 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3408479 ns | 3371375 ns | 1.01 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1574637 ns | 1487194.5 ns | 1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17917 ns | 17250 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 14708 ns | 17959 ns | 0.82 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 16292 ns | 17875 ns | 0.91 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15416 ns | 17417 ns | 0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 132847.5 ns | 130892 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 261666 ns | 218625 ns | 1.20 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215875 ns | 260667 ns | 0.83 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227834 ns | 227792 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255708 ns | 256083 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 611860.5 ns | 584591.5 ns | 1.05 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222416 ns | 222000 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221458.5 ns | 222667 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 220979 ns | 222312.5 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 221583 ns | 220833 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 248634.5 ns | 243596.5 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 577583 ns | 501417 ns | 1.15 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 510958.5 ns | 496084 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 528041.5 ns | 508541.5 ns | 1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 518834 ns | 561833 ns | 0.92 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1309411.5 ns | 1202534 ns | 1.09 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4125 ns | 3895.5 ns | 1.06 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 3459 ns | 4270.5 ns | 0.81 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5041 ns | 5708 ns | 0.88 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 4084 ns | 4458.5 ns | 0.92 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 17359 ns | 16584 ns | 1.05 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7500 ns | 7208.5 ns | 1.04 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7333 ns | 7000 ns | 1.05 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7333 ns | 7625 ns | 0.96 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7417 ns | 7500 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 183645 ns | 179332 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20500 ns | 17687 ns | 1.16 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16479.5 ns | 17917 ns | 0.92 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19063 ns | 18625 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19833 ns | 18729 ns | 1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 145474 ns | 135434 ns | 1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221208 ns | 211041 ns | 1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212375 ns | 220417 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214166.5 ns | 212542 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 224041.5 ns | 212271 ns | 1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 906684.5 ns | 847267 ns | 1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4312.5 ns | 3959 ns | 1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 3916 ns | 4209 ns | 0.93 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4708 ns | 4875 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3833 ns | 4291 ns | 0.89 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 198063 ns | 187480.5 ns | 1.06 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10750 ns | 10459 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11020.5 ns | 10541.5 ns | 1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11042 ns | 10042 ns | 1.10 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10167 ns | 10125 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 997044 ns | 955985 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3708 ns | 3145.5 ns | 1.18 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3250 ns | 2937.5 ns | 1.11 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4375 ns | 4000 ns | 1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3166 ns | 3167 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 217373.5 ns | 188520.5 ns | 1.15 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7834 ns | 7375 ns | 1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7834 ns | 7209 ns | 1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7625 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7250 ns | 7333 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1011885.5 ns | 987324 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23840875 ns | 23406938 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 35027854 ns | 35765125 ns | 0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37123250 ns | 37705500 ns | 0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34879542 ns | 34946604 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1843048 ns | 1830206.5 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184325375 ns | 183995333 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 162219583 ns | 165575375 ns | 0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 145745916.5 ns | 146468292 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 274832375 ns | 274483625 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16494818.5 ns | 16521685 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 274529917 ns | 276817937 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 253641458.5 ns | 246377395.5 ns | 1.03 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 231405354 ns | 231576042 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 324348208.5 ns | 325032833.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 184625 ns | 182896.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182292 ns | 184292 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185312.5 ns | 184958 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 184083 ns | 183167 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 202083.5 ns | 200810.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 645750 ns | 635333 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 603542 ns | 633354.5 ns | 0.95 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 599333 ns | 600291 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631333 ns | 597271 ns | 1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 985206 ns | 958799 ns | 1.03 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3919209 ns | 3842750 ns | 1.02 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3919667 ns | 3997500 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3545875 ns | 3542792 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4578250 ns | 4556625 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 536849 ns | 532425 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17586375 ns | 17396104 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17890250 ns | 18078958 ns | 0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16507834 ns | 16589917 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 19939000 ns | 19981167 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2632687 ns | 2633170 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
584 ns |
542 ns |
1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
625 ns |
1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
583 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
32135 ns |
32094 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9625 ns |
8917 ns |
1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9458 ns |
8750 ns |
1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9583 ns |
9041 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
8917 ns |
9042 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
248690.5 ns |
249030 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) |
653687729.5 ns |
652464437.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) |
392525041.5 ns |
394034604 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) |
328748959 ns |
326393417 ns |
1.01 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) |
745842709 ns |
748745833 ns |
1.00 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA |
12472241 ns |
12466975 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) |
1889224479.5 ns |
1885107791.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) |
1635009041 ns |
1638827875 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) |
1509526291.5 ns |
1512914354 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) |
2196907667 ns |
2208603583.5 ns |
0.99 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA |
49240921 ns |
49231175.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1653562.5 ns |
1616792 ns |
1.02 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1162084 ns |
1200917 ns |
0.97 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1386645.5 ns |
1389625 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2484916 ns |
2477916.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215030.5 ns |
215338 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12723500 ns |
12691834 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9937604 ns |
9979354.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9655333 ns |
9689896 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18319042 ns |
18371271 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2044976 ns |
1985308 ns |
1.03 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17717375 ns |
17676916 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14639896 ns |
14722000 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14564292 ns |
14613667 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21414771 ns |
21413395.5 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
26500 ns |
26292 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
26292 ns |
26250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
26291 ns |
26291 ns |
1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
26250 ns |
26250 ns |
1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
24435 ns |
23721 ns |
1.03 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
67542 ns |
67333 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
67208 ns |
67333 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
67792 ns |
67209 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
67083 ns |
67333 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
375943.5 ns |
367128.5 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
203958 ns |
203542 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
210917 ns |
208625 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
211667 ns |
209584 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199917 ns |
199792 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
26886 ns |
25494 ns |
1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
665791.5 ns |
604625 ns |
1.10 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
634228.5 ns |
670666.5 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
663125 ns |
632166.5 ns |
1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
631979.5 ns |
630000 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
330848.5 ns |
321975.5 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
660812.5 ns |
639021 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
641541.5 ns |
643458 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
644229 ns |
658750 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
672667 ns |
632750 ns |
1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
132735.5 ns |
131332 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2283708 ns |
2244229 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2228458 ns |
2277708.5 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2242479.5 ns |
2240167 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2245854 ns |
2235458.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1161748 ns |
1075922 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
18458 ns |
17167 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19041.5 ns |
17916 ns |
1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19583 ns |
18167 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
20042 ns |
18208 ns |
1.10 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
133846.5 ns |
130720.5 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
263333 ns |
258584 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
232125 ns |
227459 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
266084 ns |
232750 ns |
1.14 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
260771 ns |
230791 ns |
1.13 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
976903 ns |
887768.5 ns |
1.10 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
666 ns |
625 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
584 ns |
625 ns |
0.93 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
708 ns |
666 ns |
1.06 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
542 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
24021 ns |
23104 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9979.5 ns |
9750 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9875 ns |
9250 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10083 ns |
9208 ns |
1.10 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9625 ns |
9417 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
250362 ns |
242418 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5792 ns |
5208 ns |
1.11 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5250 ns |
5125 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6667 ns |
6375 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5583 ns |
5375 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
210534.5 ns |
193804 ns |
1.09 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8041 ns |
7167 ns |
1.12 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7458 ns |
7250 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7583 ns |
7375 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7500 ns |
7042 ns |
1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
742239 ns |
706410 ns |
1.05 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
2167 ns |
2125 ns |
1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
2125 ns |
2250 ns |
0.94 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2375 ns |
2209 ns |
1.08 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
2375 ns |
2208 ns |
1.08 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
18253.5 ns |
17672 ns |
1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
6875 ns |
6458 ns |
1.06 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
6375 ns |
6291 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6708 ns |
6709 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
6500 ns |
6500 ns |
1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
309540.5 ns |
300575 ns |
1.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
752500 ns |
749459 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
746584 ns |
748959 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
751083.5 ns |
750854 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
749166.5 ns |
749167 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
21684 ns |
20805 ns |
1.04 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
791542 ns |
775208 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
775083 ns |
795916.5 ns |
0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
795500 ns |
792791 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
796417 ns |
792792 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
276627 ns |
274546.5 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7167 ns |
7208 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6000 ns |
5917 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5959 ns |
5959 ns |
1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10208 ns |
10250 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33842.5 ns |
33244 ns |
1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
267708 ns |
219625 ns |
1.22 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
241125 ns |
240291 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
233083 ns |
237583 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
254500 ns |
260042 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
334966 ns |
337443 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10709 ns |
10084 ns |
1.06 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
10125 ns |
9583 ns |
1.06 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
11208 ns |
10750 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9875 ns |
10167 ns |
0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
221448.5 ns |
223296.5 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
24625 ns |
25125 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24854.5 ns |
24312.5 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25125 ns |
24917 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
24833 ns |
24667 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
1054466 ns |
1047460.5 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
106442250 ns |
106018062.5 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
118292812 ns |
118144520.5 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
120274792 ns |
120409292 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
118694500 ns |
117468833 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
2661687 ns |
2652084 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
393734750 ns |
373672500 ns |
1.05 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
365034583 ns |
359102771.5 ns |
1.02 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
354260000.5 ns |
356068521.5 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
543249000 ns |
543525042 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
15218732.5 ns |
15230726 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
784688333 ns |
605345333 ns |
1.30 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
760961125 ns |
584604208 ns |
1.30 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
748037208.5 ns |
744606604.5 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
614191479.5 ns |
793208583.5 ns |
0.77 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
8292 ns |
6500 ns |
1.28 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6459 ns |
6375 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
8625 ns |
8062 ns |
1.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6875 ns |
7146 ns |
0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
215167.5 ns |
216878 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14208 ns |
13625 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14334 ns |
13625 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14708 ns |
14125 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14167 ns |
14084 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
1007614.5 ns |
1010131 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6166 ns |
5625 ns |
1.10 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5375 ns |
6000 ns |
0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7542 ns |
7895.5 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5959 ns |
5958 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
209886 ns |
211472.5 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13250 ns |
12583 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12334 ns |
12333 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13375 ns |
12708 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12625 ns |
12709 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
719093 ns |
725788 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
5541 ns |
5583 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
5208 ns |
5875 ns |
0.89 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
6500 ns |
6583.5 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
5625 ns |
6167 ns |
0.91 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
16998 ns |
17002 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
15792 ns |
15916 ns |
0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
15458 ns |
15250 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
15667 ns |
16125 ns |
0.97 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
15375 ns |
15834 ns |
0.97 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
185360 ns |
187784.5 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
417 ns |
292 ns |
1.43 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
416 ns |
375 ns |
1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
333 ns |
334 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
23196 ns |
23531 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6625 ns |
6167 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6375 ns |
6292 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6791 ns |
6459 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6062.5 ns |
6084 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
225872.5 ns |
228744 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
5958 ns |
5834 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5875 ns |
5916 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
5959 ns |
5959 ns |
1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5833 ns |
5959 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
24541.5 ns |
24273 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
21416 ns |
20833 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
21500 ns |
20750 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
21750 ns |
21292 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
21708 ns |
21041 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
248924 ns |
251207.5 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
144166.5 ns |
185375 ns |
0.78 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
143812.5 ns |
144625 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
147187.5 ns |
147917 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
191416 ns |
144417 ns |
1.33 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
167441 ns |
166909.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1359562.5 ns |
1321833 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1324500 ns |
1350479 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1321625 ns |
1337166 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1323416 ns |
1323625 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1302768.5 ns |
1251196 ns |
1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
24208 ns |
24833 ns |
0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
21916 ns |
25041 ns |
0.88 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
25000 ns |
23958 ns |
1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
24834 ns |
24271 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
324085 ns |
315591 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
174729 ns |
131292 ns |
1.33 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
119604 ns |
118396 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
179041.5 ns |
176916 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
178833 ns |
129458 ns |
1.38 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1333498 ns |
1353120 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
333 ns |
1.13 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
417 ns |
0.90 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
416 ns |
375 ns |
1.11 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
334 ns |
292 ns |
1.14 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
23177 ns |
23127 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6584 ns |
6125 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6541 ns |
6459 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6916 ns |
6333 ns |
1.09 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6459 ns |
6125 ns |
1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
242684 ns |
245064.5 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4771 ns |
4208 ns |
1.13 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4583 ns |
4875 ns |
0.94 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
5250 ns |
5125 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4583 ns |
4667 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
227724.5 ns |
228957.5 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10417 ns |
9875 ns |
1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10416 ns |
9875 ns |
1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10417 ns |
10334 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10125 ns |
10208 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
1265317 ns |
1285818.5 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1584 ns |
1584 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1625 ns |
1625 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1667 ns |
1625 ns |
1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1625 ns |
1625 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
23073 ns |
23344 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
5750 ns |
5750 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
5875 ns |
5709 ns |
1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6000 ns |
6000 ns |
1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
5667 ns |
5666 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
260330.5 ns |
264086.5 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
6829416.5 ns |
6807541.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
6373750.5 ns |
6433375 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
6530583 ns |
6489875 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
7628125 ns |
7649521 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
214315 ns |
214938 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
24071417 ns |
24073959 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
21308167 ns |
21296000 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
20999125 ns |
21044062.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
29763520.5 ns |
29805771 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2117367 ns |
2104181 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
48706709 ns |
37247625 ns |
1.31 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
45307041 ns |
34089791 ns |
1.33 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
45597625 ns |
45725979.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
38094041.5 ns |
49397750 ns |
0.77 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6542 ns |
5500 ns |
1.19 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
5666 ns |
5708 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6834 ns |
6541 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5979.5 ns |
5708 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
209381.5 ns |
208256 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8458 ns |
8084 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8375 ns |
8125 ns |
1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8916 ns |
8375 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8625 ns |
8375 ns |
1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
976901 ns |
991485 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) |
1572791 ns |
1509000 ns |
1.04 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) |
1254791 ns |
1282542 ns |
0.98 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) |
1624459 ns |
1634916.5 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) |
2178625 ns |
2162000.5 ns |
1.01 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA |
273695 ns |
271116.5 ns |
1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) |
7953208 ns |
7902209 ns |
1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) |
6560458 ns |
6449312.5 ns |
1.02 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) |
7149104 ns |
7195708 ns |
0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) |
10472292 ns |
10462229 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA |
1791343 ns |
1752716.5 ns |
1.02 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
369500 ns |
371187.5 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
374937.5 ns |
374208 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
455625 ns |
461250 ns |
0.99 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
23937.5 ns |
22208 ns |
1.08 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
47162 ns |
42428.5 ns |
1.11 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
737042 ns |
745437.5 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
809667 ns |
815833 ns |
0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
1061750 ns |
1062958 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
118833 ns |
117396 ns |
1.01 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
286422.5 ns |
283256.5 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
397500 ns |
397208 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
288042 ns |
288667 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
287958 ns |
287875 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
750083 ns |
750917 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
44452 ns |
43636 ns |
1.02 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
676958 ns |
667000 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
532500 ns |
531375 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
530042 ns |
531417 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
974250 ns |
974083 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
193188.5 ns |
188745 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
683375 ns |
644833 ns |
1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
641167 ns |
648750 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
649375 ns |
644479 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
673812 ns |
652458.5 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
132770 ns |
131347.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2487917 ns |
2445334 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2450875 ns |
2500021 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2451584 ns |
2463250 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2463958 ns |
2463375 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1259284 ns |
1238313 ns |
1.02 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
4209 ns |
3417 ns |
1.23 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
3750 ns |
3625 ns |
1.03 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
4270.5 ns |
4250 ns |
1.00 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
4667 ns |
3437.5 ns |
1.36 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
16229 ns |
16066 ns |
1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
5583 ns |
5375 ns |
1.04 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
5375 ns |
5292 ns |
1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
5625 ns |
5750 ns |
0.98 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
5583 ns |
5583 ns |
1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
186384.5 ns |
182995 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1459666 ns |
1458042 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1502500 ns |
1499750 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1501291 ns |
1503250 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1437500 ns |
1437708 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
40589 ns |
40191 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5161542 ns |
5113291 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5284458 ns |
5287958 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5290208 ns |
5307041.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4991020.5 ns |
4985125 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
198827.5 ns |
196599 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3750 ns |
3709 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3708 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3709 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3750 ns |
3709 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
34039 ns |
33557 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15416 ns |
15125 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15417 ns |
15167 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15500 ns |
15416 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15209 ns |
15208 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
356561 ns |
349206 ns |
1.02 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
71667 ns |
71125 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
71208 ns |
71542 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
71167 ns |
71209 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
71250 ns |
71041 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
114103.5 ns |
113114 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
321500 ns |
317667 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
318209 ns |
324125 ns |
0.98 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
318791 ns |
318292 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
318042 ns |
317625 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
196364 ns |
193277 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
1125 ns |
958 ns |
1.17 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
1083 ns |
1041 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
1084 ns |
1083 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
1000 ns |
1125 ns |
0.89 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
24058 ns |
23048 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8458 ns |
7750 ns |
1.09 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8041 ns |
8270.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8375 ns |
8250 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8000 ns |
8041 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
250219.5 ns |
245757.5 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
511249.5 ns |
502770.5 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
491875 ns |
484500 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
563375 ns |
561750 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
213208 ns |
219917 ns |
0.97 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
129570 ns |
129178 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
1422313 ns |
1387645.5 ns |
1.02 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1464542 ns |
1473958 ns |
0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1723104.5 ns |
1779041.5 ns |
0.97 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
871458 ns |
862917 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
278287 ns |
273950 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
417 ns |
333 ns |
1.25 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
334 ns |
1.12 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
416 ns |
0.90 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
333 ns |
333 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
32706 ns |
31657.5 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6791 ns |
6125 ns |
1.11 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6375 ns |
6208 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6334 ns |
6541 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6062.5 ns |
6042 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
254343.5 ns |
251419 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1757021 ns |
1733792 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1724500 ns |
1721208 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1725541.5 ns |
1724250 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1778395.5 ns |
1773541 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
169310.5 ns |
168671 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4406167 ns |
4114542 ns |
1.07 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4294542 ns |
4392834 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4359458.5 ns |
4368208.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4362750 ns |
4369208.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1097195 ns |
1291475.5 ns |
0.85 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6917 ns |
6834 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
6667 ns |
6667 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
6917 ns |
7999.5 ns |
0.86 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
6895.5 ns |
7041 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
20814 ns |
20138.5 ns |
1.03 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
51833 ns |
51250 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
51187.5 ns |
32625 ns |
1.57 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
48833 ns |
73833 ns |
0.66 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
51875 ns |
51084 ns |
1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
199815.5 ns |
340107 ns |
0.59 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
18125 ns |
17833 ns |
1.02 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
17625 ns |
18083 ns |
0.97 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
18833 ns |
18875 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
18042 ns |
18208 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
18986 ns |
18400 ns |
1.03 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
53167 ns |
53250 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
53375 ns |
53041 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
53417 ns |
53375 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
53583 ns |
53542 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
321788.5 ns |
319083.5 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
75708 ns |
75166 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
75292 ns |
75625 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
74958 ns |
75291.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
75333 ns |
75083 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
47411 ns |
47469 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
330916.5 ns |
324958 ns |
1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
331041.5 ns |
342000 ns |
0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
324375 ns |
325000 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
324541 ns |
324542 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
214495.5 ns |
211595 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1486458 ns |
1484959 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1529333 ns |
1526854.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1528292 ns |
1527250 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1463917 ns |
1462542 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
52412 ns |
51799 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5147625 ns |
5111083.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5283458 ns |
5312417 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5285042 ns |
5299333.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4987292 ns |
4982354 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
208778.5 ns |
204934 ns |
1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
28250 ns |
28208 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
28209 ns |
28250 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
28333 ns |
28187.5 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
28208 ns |
28250 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
25229 ns |
24742 ns |
1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66750 ns |
66500 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
66583 ns |
66709 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
66500 ns |
66500 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66750 ns |
66541 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
504373.5 ns |
484630.5 ns |
1.04 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) |
1502708.5 ns |
1480583.5 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) |
1140542 ns |
1136563 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) |
1124125 ns |
1136750 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) |
2213416 ns |
2265937.5 ns |
0.98 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA |
588762 ns |
579622.5 ns |
1.02 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) |
3119667 ns |
3074562.5 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) |
2731437.5 ns |
2788145.5 ns |
0.98 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) |
2740375 ns |
2743021 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) |
3821208 ns |
3819500.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA |
2025801 ns |
1931643 ns |
1.05 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) |
7950666 ns |
7902458 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) |
7904604 ns |
7834062.5 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) |
7893208 ns |
7920375 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) |
4813854.5 ns |
4826312.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
81708 ns |
77625 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
80167 ns |
81167 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
83042 ns |
84041.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
130291.5 ns |
111396 ns |
1.17 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193963.5 ns |
193746 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2045417 ns |
2012875 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2014334 ns |
2046292 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2016958 ns |
2031354 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2019542 ns |
2015417 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
751114 ns |
746361.5 ns |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
fixes #1131
needs EnzymeAD/Reactant.jl#448