chore: use [sources] in Project.toml #1090
Merged
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Lux Benchmarks
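The Ratio column in the table below appears to be the Current timing divided by the Previous timing (for example, 4084 ns / 3792 ns ≈ 1.08), so values above 1 would mean the current commit ran slower. The following is a minimal Python sketch under that assumption. The function names and the 1.2 threshold are hypothetical and are not part of the benchmark tooling used by this PR.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / Previous, rounded to two decimals like the table."""
    return round(current_ns / previous_ns, 2)


def flag(current_ns: float, previous_ns: float, threshold: float = 1.2) -> str:
    """Classify a row. `threshold` is a hypothetical cut-off, not from the PR."""
    r = current_ns / previous_ns
    if r >= threshold:
        return "possible regression"
    if r <= 1 / threshold:
        return "possible improvement"
    return "within threshold"


# Three rows copied from the table: (name, current ns, previous ns).
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 4084, 3792),
    ("layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)", 1161895.5, 805500),
    ("mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)", 13838542, 19121334),
]

for name, cur, prev in rows:
    print(f"{ratio(cur, prev):.2f}  {flag(cur, prev):<20}  {name}")
```

Single-run timings on shared CI machines can vary, so a flagged row is a prompt to re-run or inspect, not proof of a regression.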
Benchmark suite | Current: 855ff5b | Previous: 3986545 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4084 ns | 3792 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4125 ns | 4084 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5083.5 ns | 4834 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4000 ns | 3959 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60614.5 ns | 61509.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10458 ns | 10500 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10666 ns | 10541 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11125 ns | 10250 ns | 1.09 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10583 ns | 10250 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 424464.5 ns | 431498.5 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1250 ns | 1062.5 ns | 1.18 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1250 ns | 1167 ns | 1.07 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1375 ns | 1417 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1167 ns | 1208 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18114 ns | 18573 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4083 ns | 4000 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3792 ns | 4000 ns | 0.95 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4375 ns | 4209 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3875 ns | 3750 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 109410 ns | 111184 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57958 ns | 57750 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46375 ns | 38542 ns | 1.20 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38292 ns | 46583 ns | 0.82 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81375 ns | 82208 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37166 ns | 37503.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2025083 ns | 2037645.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2089000 ns | 2095625 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2116042 ns | 1844375 ns | 1.15 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1997083 ns | 2001375 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196065 ns | 196039 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144167 ns | 145583 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144459 ns | 143584 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 145167 ns | 146458 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 145041 ns | 145000 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165775.5 ns | 168190 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1117334 ns | 1114291 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1113209 ns | 1150292 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1161895.5 ns | 805500 ns | 1.44 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1119208 ns | 1122750 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 529787.5 ns | 526921 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3500 ns | 3292 ns | 1.06 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3645.5 ns | 3666 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4125 ns | 4167 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3291 ns | 3500 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 68193 ns | 72235.5 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 10125 ns | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8708 ns | 8375 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 8792 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9083 ns | 8833 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 490128 ns | 480020 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14875 ns | 14875 ns | 1 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15417 ns | 15000 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18042 ns | 17520.5 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14958 ns | 14583 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 54649 ns | 53914 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213375 ns | 214792 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213541 ns | 214875 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214167 ns | 214750 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215500 ns | 226813 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 275612.5 ns | 272785 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 750 ns | 917 ns | 0.82 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 459 ns | 1.36 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17498 ns | 17774 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1667 ns | 1792 ns | 0.93 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1750 ns | 1417 ns | 1.24 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1916 ns | 1709 ns | 1.12 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1500 ns | 1417 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 103477.5 ns | 102929.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7167 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5792 ns | 5250 ns | 1.10 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5375 ns | 6000 ns | 0.90 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9917 ns | 10000 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24119 ns | 23666 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222875 ns | 225187.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 240875 ns | 237479.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230208 ns | 229334 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213854.5 ns | 226709 ns | 0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 171815 ns | 168739 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3916 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3959 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3875 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3875 ns | 3917 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23882 ns | 23839 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16875 ns | 16792 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16708 ns | 16833 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16917 ns | 16958 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16875 ns | 16750 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 163766 ns | 161365 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 573417 ns | 571458 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 577417 ns | 576000 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 573666 ns | 574041 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 576291 ns | 571458 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113520 ns | 113559.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1423416 ns | 1425375 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1417833 ns | 1418875 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1427333.5 ns | 1418958 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1419833 ns | 1422750 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 212938 ns | 210833 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1073958 ns | 1076645.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 970834 ns | 934291 ns | 1.04 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1321166 ns | 1340187.5 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1305000 ns | 1294270.5 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 275497.5 ns | 271656 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5791125 ns | 5796417 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4570250 ns | 4651792 ns | 0.98 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4945833.5 ns | 4918209 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5716292 ns | 5515938 ns | 1.04 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1090457 ns | 1071316.5 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 583 ns | 542 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 541 ns | 583 ns | 0.93 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 583 ns | 500 ns | 1.17 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23889 ns | 23948.5 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2167 ns | 2167 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2209 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2208 ns | 2125 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2084 ns | 2125 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 171739 ns | 169153 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4500 ns | 3625 ns | 1.24 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4208 ns | 4084 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5000 ns | 4687.5 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 3625 ns | 3709 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 66260 ns | 66303.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11625 ns | 11270.5 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11750 ns | 11417 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12333 ns | 11625 ns | 1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11375 ns | 10667 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 453873.5 ns | 456550 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7604 ns | 6312.5 ns | 1.20 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6792 ns | 6770.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8250 ns | 7792 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5916 ns | 7083 ns | 0.84 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 53506 ns | 52528 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17375 ns | 18375 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16958 ns | 17833 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18042 ns | 17791 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16667 ns | 16833 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 305015.5 ns | 301396 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 625 ns | 0.87 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 584 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 33251 ns | 32972 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9208 ns | 9020.5 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 8459 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9333 ns | 9041 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8542 ns | 8708 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 161099 ns | 159042.5 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64458 ns | 64542 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64583 ns | 64895.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64875 ns | 64292 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64583 ns | 64542 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112304 ns | 110877 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 279208.5 ns | 284875 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 284708 ns | 297937.5 ns | 0.96 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 293834 ns | 282333 ns | 1.04 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 285833 ns | 274104.5 ns | 1.04 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 187730 ns | 184904.5 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3285249.5 ns | 3295541 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3021125 ns | 2811062.5 ns | 1.07 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 2778812.5 ns | 3016125 ns | 0.92 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 3955167 ns | 3935209 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 579536 ns | 572132 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7633895.5 ns | 7478250 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7435145.5 ns | 7348937.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7391584 ns | 7339479.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8220833 ns | 8212959 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1366510.5 ns | 1367334 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 18924416 ns | 18775625 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 13838542 ns | 19121334 ns | 0.72 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19308875 ns | 19108667 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 15678458 ns | 15653542 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23458500 ns | 23560250 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33511625 ns | 42472875 ns | 0.79 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 41458083 ns | 37127771 ns | 1.12 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35090271 ns | 34865500 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1856909 ns | 1862818 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 189151500 ns | 188025167 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 164154750 ns | 176960479.5 ns | 0.93 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 160445604.5 ns | 152823708 ns | 1.05 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 441242500 ns | 441336000 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13896361 ns | 13912250 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 289233042 ns | 290589750 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 337563458 ns | 276449542 ns | 1.22 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 307343396 ns | 296753875 ns | 1.04 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 412183687 ns | 333259041 ns | 1.24 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24334 ns | 22875 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24416.5 ns | 23333 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25167 ns | 24125 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23917 ns | 23542 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 97448 ns | 98041.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 103042 ns | 103625 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 103520.5 ns | 135834 ns | 0.76 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104750 ns | 105084 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103083 ns | 103250 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 508014 ns | 518052 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 6209 ns | 0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6917 ns | 6500 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6833.5 ns | 7041.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5750 ns | 5959 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 68441 ns | 70884 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14250 ns | 15084 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15875 ns | 15708 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16750 ns | 16250 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15041 ns | 14770.5 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 473840 ns | 492747 ns | 0.96 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3011021 ns | 3001020.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2060083.5 ns | 2085333 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2299209 ns | 2274000 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4865083 ns | 4550083 ns | 1.07 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 586467.5 ns | 589071 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23471917 ns | 23511750 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18065458 ns | 18279542 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18230125 ns | 16979209 ns | 1.07 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35635667 ns | 35598583 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2895396 ns | 3111231 ns | 0.93 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33250875 ns | 33266500 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27738541 ns | 28064750 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28088500 ns | 27365500 ns | 1.03 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41784729 ns | 41824541.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 71875 ns | 71750 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 84916 ns | 74021 ns | 1.15 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75083 ns | 74875 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72729 ns | 73458 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104782 ns | 104698 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 204229 ns | 314125.5 ns | 0.65 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 207333.5 ns | 212229 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 285666 ns | 323000 ns | 0.88 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 232875 ns | 218042 ns | 1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 561543 ns | 559024 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11708 ns | 11625 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12417 ns | 12292 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13000 ns | 12500 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11750 ns | 11875 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 72717 ns | 73943 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26958 ns | 26583 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27417 ns | 26667 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28416 ns | 27708 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27229.5 ns | 26666 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 481917.5 ns | 493150 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12625 ns | 12208 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12958 ns | 12896 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14000 ns | 13916 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12167 ns | 12500 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 54368 ns | 54608 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26334 ns | 26125 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26375 ns | 26000 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26375 ns | 25916.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27708 ns | 26000 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 309823 ns | 315887.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182042 ns | 179208 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182250 ns | 183145.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 183604 ns | 183166 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 181500 ns | 180125 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 57447.5 ns | 58575 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 585499.5 ns | 582958.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 594896 ns | 596541.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 586042 ns | 583833 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 582125 ns | 582834 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 293208 ns | 294599.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 6292 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6500 ns | 6459 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6854 ns | 6750 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6041 ns | 6041 ns | 1 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 72078 ns | 72806 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14541 ns | 14542 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14917 ns | 13333 ns | 1.12 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16125 ns | 15667 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14625 ns | 14333 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 474091.5 ns | 482192.5 ns | 0.98 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1281167 ns | 1177728.5 ns | 1.09 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1208125 ns | 1356208.5 ns | 0.89 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1280625 ns | 1250750 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1322708 ns | 1317541 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301459 ns | 301448 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4101146 ns | 4117688 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4348917 ns | 4491417 ns | 0.97 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4741500 ns | 4696854.5 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 4439333 ns | 4452542 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1050882 ns | 1051206.5 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1875 ns |
1875 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1917 ns |
1875 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1833 ns |
1833 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1875 ns |
1875 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
23723 ns |
24165 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4958 ns |
5000 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4958 ns |
4958 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4875 ns |
4917 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4875 ns |
4875 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
189158.5 ns |
194564.5 ns |
0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5792 ns |
6041 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6333 ns |
6000 ns |
1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7083 ns |
6145.5 ns |
1.15 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5209 ns |
5958 ns |
0.87 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
56549.5 ns |
57313.5 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
11875 ns |
11979.5 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
11041 ns |
11854.5 ns |
0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
11895.5 ns |
11042 ns |
1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10229.5 ns |
11292 ns |
0.91 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
333873.5 ns |
342366 ns |
0.98 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 334 ns | 333 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 334 ns | 333 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 334 ns | 333 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22961 ns | 23004 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3000 ns | 3000 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 2750 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3083 ns | 3000 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2792 ns | 2750 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 158273 ns | 159207 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11604.5 ns | 11583 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11875 ns | 11292 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12958 ns | 13437.5 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11208 ns | 11708.5 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 57126.5 ns | 57286.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25000 ns | 25312.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25167 ns | 25083 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25250 ns | 25334 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24917 ns | 25167 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 295873 ns | 296722 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4208 ns | 4208 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4167 ns | 4208 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4250 ns | 4167 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4167 ns | 4167 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24965 ns | 25099 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16250 ns | 16125 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16166 ns | 16041 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16125 ns | 16166 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16042 ns | 16042 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 196566.5 ns | 199370.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5833 ns | 5833 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5833 ns | 5833 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5792 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 5833 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34029 ns | 33986 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20917 ns | 21083 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21250 ns | 21125 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 20709 ns | 21208 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21000 ns | 20667 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 177138.5 ns | 176941.5 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 397334 ns | 396792 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 382166.5 ns | 354313 ns | 1.08 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 474896 ns | 489167 ns | 0.97 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 521625 ns | 521584 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67116 ns | 66831 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 985000 ns | 1005417 ns | 0.98 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 855645.5 ns | 876583 ns | 0.98 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1221812.5 ns | 1235667 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1321458.5 ns | 1420854 ns | 0.93 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 190549 ns | 191762.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80250 ns | 80250 ns | 1 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80625 ns | 80209 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 81833 ns | 84167 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80458.5 ns | 81125 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192943 ns | 193433 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924750 ns | 1916083 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1908104 ns | 1933854 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1708583 ns | 1917917 ns | 0.89 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1920583.5 ns | 1923708.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 399820 ns | 409629 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22084 ns | 22197 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1834 ns | 1834 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1875 ns | 0.96 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1833 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 169250 ns | 170854.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6083 ns | 6791 ns | 0.90 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7084 ns | 6417 ns | 1.10 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7937.5 ns | 7375 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6333 ns | 6959 ns | 0.91 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60491 ns | 61202 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 9291.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9500 ns | 9166.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9541 ns | 9375 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9042 ns | 9334 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 310109.5 ns | 313492.5 ns | 0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 118286646 ns | 120748834 ns | 0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174064625 ns | 181703729 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 156462125 ns | 148437750 ns | 1.05 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106578500 ns | 104851584 ns | 1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5473382.5 ns | 5474996 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 614571333 ns | 616853125 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 555423417 ns | 579539270.5 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 470293000 ns | 451846854.5 ns | 1.04 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 760588104.5 ns | 757165312.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35131353 ns | 34944567 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 650061834 ns | 649889209 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 665653562.5 ns | 688661771 ns | 0.97 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 595687063 ns | 592710229 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 744750500 ns | 741917708 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59167 ns | 59750 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46750 ns | 38959 ns | 1.20 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38750 ns | 48000 ns | 0.81 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83459 ns | 83416 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37444 ns | 37459 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1930479 ns | 1922792 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1983354 ns | 1985083 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1816458.5 ns | 1978104 ns | 0.92 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1895083 ns | 1893917 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 175426.5 ns | 174160 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 285562 ns | 290625 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 280875 ns | 266708 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 268125 ns | 271521 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268209 ns | 268167 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130965 ns | 132776.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 592333 ns | 657229.5 ns | 0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 692458 ns | 681187.5 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 693521 ns | 691583 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 601042 ns | 597417 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 701811.5 ns | 713916 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2222375.5 ns | 2243937 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2219833 ns | 2191895.5 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2137416 ns | 2213542 ns | 0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2174875 ns | 2180437.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133582 ns | 133381 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5523042 ns | 5496875 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5512333 ns | 5583292 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5594937 ns | 5498250 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5490208 ns | 5492750.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 738932 ns | 753967 ns | 0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 642209 ns | 636833 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 637875 ns | 644417 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 637625 ns | 645333 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 639000 ns | 637292 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47189.5 ns | 46993.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1826916 ns | 1826042 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1718042 ns | 1667083 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1663958 ns | 1726542 ns | 0.96 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2104375 ns | 2105854.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 220556 ns | 222295 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58875 ns | 58500 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 45167 ns | 38708 ns | 1.17 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38500 ns | 47250 ns | 0.81 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83584 ns | 84292 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28746 ns | 28598 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2027229 ns | 2031041 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2081437.5 ns | 2099020.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2087875 ns | 2091916.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1994833 ns | 1856417 ns | 1.07 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 190232.5 ns | 190652 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13337625 ns | 13391395.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12443750 ns | 12453250 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12593792 ns | 12557375.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15201937 ns | 15140541 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 517591 ns | 514312 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47226542 ns | 47481750 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41816250 ns | 41986250 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41295625 ns | 40944792 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58300709 ns | 57945917 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3048320 ns | 3259544 ns | 0.94 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97261354 ns | 96867229.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68631000 ns | 91436187.5 ns | 0.75 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 91621375 ns | 90591917 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99015250 ns | 76381625 ns | 1.30 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59125 ns | 59083.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47250 ns | 38750 ns | 1.22 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38500 ns | 47417 ns | 0.81 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81750 ns | 84000 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47183 ns | 46955 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1926041 ns | 1925125 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1963020.5 ns | 1979250 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1903354.5 ns | 1970729.5 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1890791.5 ns | 1897750 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 192410.5 ns | 191790.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32519 ns | 32566 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6417 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6250 ns | 6458 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6520.5 ns | 6459 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6209 ns | 6083 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 172802 ns | 174123.5 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31540 ns | 31409 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2834 ns | 2833 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2667 ns | 2791 ns | 0.96 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2916 ns | 2834 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2625 ns | 2583 ns | 1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 160546.5 ns | 161269 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 285419125.5 ns | 286258979.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340019750 ns | 346927270.5 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 321915833.5 ns | 313997291.5 ns | 1.03 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 276611333 ns | 270108416 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7104026 ns | 7104986 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 983693667 ns | 998016667 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 937864666 ns | 959348209 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 869633271 ns | 851652541.5 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1164063416 ns | 1162498166 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33938699 ns | 33999768 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1680134083 ns | 1672427541 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1330764145.5 ns | 1705785000 ns | 0.78 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1621211875 ns | 1631619209 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1681717458 ns | 1314128542 ns | 1.28 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1405437.5 ns | 1406813 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1406833 ns | 1416875 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1410083 ns | 1459625 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1408667 ns | 1407750 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127501 ns | 127789 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5022916.5 ns | 5022896 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5014916.5 ns | 5051333 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5074667 ns | 5029542 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5020083.5 ns | 5031875 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 549193 ns | 559312.5 ns | 0.98 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 174892687.5 ns | 169600250 ns | 1.03 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 127003959 ns | 180340396 ns | 0.70 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 149797333.5 ns | 130036124.5 ns | 1.15 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 155817792 ns | 169790708.5 ns | 0.92 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4879989.5 ns | 5056885.5 ns | 0.97 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 621407125 ns | 669854958 ns | 0.93 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 648841042 ns | 604244667 ns | 1.07 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 575120333 ns | 501867209 ns | 1.15 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 681424083 ns | 684062709 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 15898555 ns | 16520518 ns | 0.96 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8876291.5 ns | 8950666 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8737625 ns | 8876958.5 ns | 0.98 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 8213750.5 ns | 7849458.5 ns | 1.05 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10160729 ns | 10185417 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1593806 ns | 1594436 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 35769916 ns | 36026541.5 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36822375.5 ns | 38047792 ns | 0.97 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 34657916 ns | 33343417 ns | 1.04 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38863625 ns | 38792000 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6473421 ns | 6457988 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47500 ns | 47417 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47250 ns | 47375 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47583 ns | 47584 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47292 ns | 47333 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18826 ns | 18535 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50291 ns | 50291 ns | 1 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50458 ns | 50375 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50854.5 ns | 50417 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50333 ns | 50083 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 198991.5 ns | 191873 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7083 ns | 6458 ns | 1.10 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7584 ns | 6917 ns | 1.10 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7917 ns | 7750 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6458 ns | 6958 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 97827 ns | 91345 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10416 ns | 10458 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10292 ns | 9916 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9958 ns | 10084 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10042 ns | 10208 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 550910 ns | 527140.5 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5812.5 ns | 5625 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6333 ns | 5917 ns | 1.07 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8125 ns | 6958 ns | 1.17 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5541 ns | 5750 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 105892 ns | 120543 ns | 0.88 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13792 ns | 13583 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13042 ns | 13354.5 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13584 ns | 13458 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13375 ns | 13000 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 499431 ns | 537999 ns | 0.93 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1125 ns | 1083 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1083 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1125 ns | 1042 ns | 1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 33256 ns | 32473 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8333 ns | 7917 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7958 ns | 7917 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8167 ns | 7959 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8084 ns | 8167 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 209965 ns | 206314.5 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23125 ns | 23437.5 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23083 ns | 23167 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23250 ns | 23584 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23167 ns | 23542 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18915 ns | 18671 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52625 ns | 52458 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52584 ns | 52541 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 53000 ns | 53458 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52854.5 ns | 52062.5 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 277479.5 ns | 291832.5 ns | 0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458417 ns | 1458937 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1400208.5 ns | 1401583 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1405167 ns | 1403833.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1426958.5 ns | 1459708.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196215 ns | 195968 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5014875 ns | 5008771 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4947750 ns | 5044104 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4722875 ns | 5017250 ns | 0.94 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5007813 ns | 5011916 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 607506 ns | 599687 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3062917 ns | 3061000 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2059667 ns | 2086750 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2288458.5 ns | 2304917 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4925875 ns | 4539041 ns | 1.09 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 577404 ns | 581670 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24414833.5 ns | 24376958 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18846271 ns | 19122667 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19151917 ns | 19181062.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36770687.5 ns | 36163041 ns | 1.02 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2986349 ns | 3185287.5 ns | 0.94 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34017271 ns | 34039875 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28311000 ns | 28717291.5 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28446000 ns | 28156000 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41721770.5 ns | 41614584 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144322334 ns | 144831583 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 142519708 ns | 143542708 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 125442687.5 ns | 124983229.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173816042 ns | 173618479 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22556374 ns | 22558463 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1298683979 ns | 1247182979 ns | 1.04 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 886917541 ns | 836595146 ns | 1.06 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 716672958 ns | 738893583 ns | 0.97 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 736020312.5 ns | 672803125 ns | 1.09 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 117170580 ns | 118329511 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 83729 ns | 84666 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74583.5 ns | 73666 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76584 ns | 76146 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74125 ns | 75688 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 239252.5 ns | 240753.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 290521 ns | 287042 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 277645.5 ns | 212354 ns | 1.31 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 200875 ns | 296854 ns | 0.68 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 288958.5 ns | 284250 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1205034 ns | 1238105 ns | 0.97 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
35458291 ns |
35497979 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
35474646 ns |
35870917 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
32477292 ns |
32110833 ns |
1.01 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
40963750 ns |
40961896 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5849344 ns |
5843453.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
148280375 ns |
149169500 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
153499458.5 ns |
155980437.5 ns |
0.98 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
141027625 ns |
134845625 ns |
1.05 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
288738292 ns |
287434667 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
34899034 ns |
34879809 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
121978521.5 ns |
121767709 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
174111417 ns |
181613625 ns |
0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
157126854 ns |
148039291 ns |
1.06 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
104267125.5 ns |
104612333.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5468756 ns |
5485164 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
470236083 ns |
472118833 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
467564916 ns |
486130458.5 ns |
0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
457529645.5 ns |
440650208 ns |
1.04 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
742702979.5 ns |
746192375 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
32271141 ns |
32245076 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
708676479.5 ns |
643396416 ns |
1.10 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
654437437.5 ns |
675303249.5 ns |
0.97 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
584295500 ns |
575492166 ns |
1.02 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
854241125 ns |
856961334 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) |
1340625 ns |
1312541 ns |
1.02 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) |
955312.5 ns |
677667 ns |
1.41 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) |
751500 ns |
963459 ns |
0.78 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) |
2049958 ns |
2093375 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA |
573022 ns |
580070.5 ns |
0.99 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) |
2963229 ns |
2966541.5 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) |
2574541 ns |
2496854 ns |
1.03 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) |
2393666.5 ns |
2623959 ns |
0.91 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) |
3690375 ns |
3704083 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA |
1694080 ns |
1730505 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) |
6647083 ns |
6656375 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) |
6495583 ns |
6477624.5 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) |
6493333 ns |
6431167 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) |
4435562.5 ns |
4450479.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7458 ns |
7375 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5959 ns |
5417 ns |
1.10 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5250 ns |
6084 ns |
0.86 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10125 ns |
9917 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
25179 ns |
25252 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
212417 ns |
212583 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
219708 ns |
229770.5 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221958 ns |
220500 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
206542 ns |
206083 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
241960 ns |
251646.5 ns |
0.96 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) |
312753979 ns |
301644020.5 ns |
1.04 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) |
215397229 ns |
280942354.5 ns |
0.77 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) |
218945999.5 ns |
189363792 ns |
1.16 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) |
313831042 ns |
305392479 ns |
1.03 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA |
7904498 ns |
7676597 ns |
1.03 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) |
1087078708 ns |
1087372208.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) |
909298979 ns |
980974208 ns |
0.93 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) |
871782083.5 ns |
865965209 ns |
1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) |
1162821625 ns |
1158600916.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA |
27057892 ns |
26533591 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5687.5 ns |
5354.5 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5458 ns |
5375 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7084 ns |
6917 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5417 ns |
4958 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
133527.5 ns |
146657 ns |
0.91 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7500 ns |
7395.5 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7084 ns |
7375 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7500 ns |
7250 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7584 ns |
7250 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
563771.5 ns |
596011.5 ns |
0.95 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
625 ns |
584 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
666 ns |
625 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
667 ns |
625 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
542 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
23871 ns |
24031 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9333 ns |
8917 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9291 ns |
9708 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9292 ns |
9583 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
8667 ns |
8833 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
208765 ns |
216620.5 ns |
0.96 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
352083 ns |
353333 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
352083 ns |
352041 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
353791.5 ns |
352666.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
351833.5 ns |
352417 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
21220 ns |
21463 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
824895.5 ns |
820625 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
777334 ns |
828917 ns |
0.94 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
831875 ns |
774875 ns |
1.07 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
824062.5 ns |
778729 ns |
1.06 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
254482.5 ns |
269469 ns |
0.94 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
332771 ns |
337187.5 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
342541 ns |
313687.5 ns |
1.09 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
441895.5 ns |
444709 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
330125 ns |
334500 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
18025 ns |
17922 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
690541 ns |
689958 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
744041.5 ns |
746333 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
1045312.5 ns |
1025042 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
688708 ns |
694854.5 ns |
0.99 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
229447 ns |
242950 ns |
0.94 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
347291 ns |
351417 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
349333.5 ns |
327270.5 ns |
1.07 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
436667 ns |
414729.5 ns |
1.05 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
371000 ns |
371750 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
22527 ns |
22559 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
750021 ns |
747208 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
746646 ns |
749416 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
1076187.5 ns |
1069374.5 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
818562.5 ns |
815937.5 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
208987 ns |
224503 ns |
0.93 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3792 ns |
3708 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3542 ns |
3625 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3750 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3583 ns |
3291 ns |
1.09 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
17734 ns |
17855 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4459 ns |
4208 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4625 ns |
4208 ns |
1.10 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4250 ns |
4333 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4250 ns |
4208 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
233318 ns |
248489.5 ns |
0.94 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4042 ns |
3708 ns |
1.09 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
3958 ns |
4167 ns |
0.95 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4208 ns |
4791 ns |
0.88 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3667 ns |
3792 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
176017 ns |
203806 ns |
0.86 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8792 ns |
8667 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8541 ns |
8250 ns |
1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8750 ns |
8458 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8625 ns |
8667 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1103718 ns |
1166315.5 ns |
0.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204375 ns |
204875 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
209958 ns |
209750 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
209791 ns |
209834 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199042 ns |
200000 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34881 ns |
34893 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
647417 ns |
602917 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
667104.5 ns |
628833 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
648250 ns |
621584 ns |
1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
630437.5 ns |
592041 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
347074.5 ns |
321942.5 ns |
1.08 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
963666 ns |
978791 ns |
0.98 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
935292 ns |
937250.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
967916 ns |
960250 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
1291958 ns |
1307271 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
208614 ns |
207418 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
4493917 ns |
4504084 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
4486500 ns |
4619604.5 ns |
0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
4473646 ns |
4294917 ns |
1.04 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
6242041.5 ns |
6229292 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
938155.5 ns |
936037 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3125 ns |
3354 ns |
0.93 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3500 ns |
3583 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4375 ns |
4417 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3000 ns |
3333 ns |
0.90 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
231776.5 ns |
196464 ns |
1.18 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7708 ns |
7334 ns |
1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7375 ns |
7417 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7583 ns |
7291 ns |
1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7333 ns |
6917 ns |
1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
1010539 ns |
985634 ns |
1.03 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1636750 ns |
1640792 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1187333 ns |
1171541.5 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1380354 ns |
1327125 ns |
1.04 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2361520.5 ns |
2384666 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
214275 ns |
216205.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12321562.5 ns |
12345499.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9589792 ns |
9603042 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9486250 ns |
9259895.5 ns |
1.02 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18012208.5 ns |
18032958.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
1948219 ns |
1950941 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17356062.5 ns |
17348083 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14365000 ns |
14444583.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14592667 ns |
14302167 ns |
1.02 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21122541 ns |
21057645.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
88041 ns |
87666.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
89542 ns |
89562 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
98792 ns |
90292 ns |
1.09 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
133292 ns |
88875 ns |
1.50 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
126258 ns |
126565 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2030917 ns |
2024000 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2007375 ns |
2030958.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2054917 ns |
1707583 ns |
1.20 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2027125 ns |
2030042 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1023675.5 ns |
999913 ns |
1.02 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
339083.5 ns |
343750 ns |
0.99 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
346896 ns |
326145.5 ns |
1.06 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
419438 ns |
396833 ns |
1.06 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
306854 ns |
309896 ns |
0.99 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
16039.5 ns |
16654 ns |
0.96 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
704834 ns |
702666 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
734708 ns |
733666 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
1027250 ns |
1020166 ns |
1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
642979 ns |
652500 ns |
0.99 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
190568 ns |
190386.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7375 ns |
7416 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6000 ns |
5291 ns |
1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5250 ns |
6000 ns |
0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10000 ns |
10041 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34186 ns |
34743 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
224958 ns |
224334 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220812.5 ns |
229333 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
228020.5 ns |
220959 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
218645.5 ns |
206292 ns |
1.06 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
310606 ns |
296926 ns |
1.05 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3708 ns |
3750 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3708 ns |
3792 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3667 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3708 ns |
3708 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22651 ns |
23083 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14459 ns |
14416 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14416 ns |
14209 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14208 ns |
14292 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14500 ns |
14458 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
462091 ns |
448235 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
137521 ns |
92854 ns |
1.48 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
92958 ns |
99583 ns |
0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
98583 ns |
94542 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
135083 ns |
96042 ns |
1.41 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
125583 ns |
125978 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1927479 ns |
1920562.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1907708.5 ns |
1914937.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1953208 ns |
1653792 ns |
1.18 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1921667 ns |
1928541 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
948359 ns |
893203 ns |
1.06 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) |
864916 ns |
878750 ns |
0.98 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) |
824896 ns |
800021 ns |
1.03 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) |
1172959 ns |
1221729 ns |
0.96 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) |
959583 ns |
963792 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA |
277617.5 ns |
277692.5 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) |
2822000 ns |
2824834 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) |
2517479 ns |
2464958 ns |
1.02 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) |
3358584 ns |
3323271 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) |
3372271 ns |
3398958 ns |
0.99 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA |
1588161 ns |
1565101.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17708 ns |
17667 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17729 ns |
15458.5 ns |
1.15 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
18125 ns |
17250.5 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
16770.5 ns |
14645.5 ns |
1.15 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
145282 ns |
142432.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
226562.5 ns |
218209 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
264687 ns |
222958.5 ns |
1.19 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
224583.5 ns |
216334 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
260834 ns |
215062.5 ns |
1.21 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
640324 ns |
637432 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
220083.5 ns |
221145.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
222125 ns |
222375 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
222541.5 ns |
220917 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
223209 ns |
220333 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
272584.5 ns |
280530 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
557208.5 ns |
510354 ns |
1.09 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
495917 ns |
499375 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
508083 ns |
500021 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
523021 ns |
507041 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1311116 ns |
1281236 ns |
1.02 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
324084 ns |
332250 ns |
0.98 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
333875 ns |
316000 ns |
1.06 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
441292 ns |
364333 ns |
1.21 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
320124.5 ns |
323834 ns |
0.99 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
17097 ns |
17441 ns |
0.98 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
707729 ns |
715833.5 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
729728.5 ns |
735083 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
1025875 ns |
1022959 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
661375 ns |
667041 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
194989.5 ns |
193588.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
18250 ns |
18666 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19479 ns |
17375 ns |
1.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19687.5 ns |
19167 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19354 ns |
17083.5 ns |
1.13 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
149984.5 ns |
147781 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221000 ns |
212542 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
250875 ns |
214146 ns |
1.17 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
253604 ns |
213834 ns |
1.19 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
239125.5 ns |
211354.5 ns |
1.13 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
972417 ns |
877964 ns |
1.11 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
4000 ns |
4083 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4583 ns |
4291.5 ns |
1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5417 ns |
5375 ns |
1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4334 ns |
3958 ns |
1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
242942 ns |
169898 ns |
1.43 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
11000 ns |
10834 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10458 ns |
10542 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
11042 ns |
10583 ns |
1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10500 ns |
10459 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
1104793 ns |
993411.5 ns |
1.11 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3000 ns |
3417 ns |
0.88 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3667 ns |
3167 ns |
1.16 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4166.5 ns |
4375 ns |
0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3375 ns |
3062.5 ns |
1.10 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
245420.5 ns |
203556.5 ns |
1.21 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7708 ns |
7791 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7584 ns |
7458 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7750 ns |
7250 ns |
1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7417 ns |
7541 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
1110479 ns |
1041955 ns |
1.07 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23377292 ns |
23557729 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
34613667 ns |
43140979 ns |
0.80 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
41416917 ns |
37880833 ns |
1.09 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34914458 ns |
34954917 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1848549 ns |
1859678 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
184280166 ns |
184630708 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
159478458 ns |
172192624.5 ns |
0.93 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
152113791 ns |
146314396 ns |
1.04 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
414042208 ns |
415449708 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
16517300 ns |
16494786 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
425300833 ns |
428781042 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
256639521 ns |
259710791 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
234680125.5 ns |
231751208 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
486551125 ns |
484878833 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
184291 ns |
183625 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
184583 ns |
183375 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
184625 ns |
184417 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
184312.5 ns |
182667 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
230607.5 ns |
177771.5 ns |
1.30 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
631437.5 ns |
590604 ns |
1.07 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
637750 ns |
588083 ns |
1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
639709 ns |
586792 ns |
1.09 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
638791 ns |
586958 ns |
1.09 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1098424 ns |
1015783.5 ns |
1.08 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3847292 ns | 3860917 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3628167 ns | 3732375 ns | 0.97 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3558812.5 ns | 3478062.5 ns | 1.02 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5353375 ns | 5358854.5 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 532230.5 ns | 533317.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17384021 ns | 17452375 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17275646 ns | 17779209 ns | 0.97 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 17134208 ns | 16551750 ns | 1.04 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22096708 ns | 22184000 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2624383 ns | 2614491.5 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 625 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 584 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32858 ns | 32765 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 9625 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 9542 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9625 ns | 9625 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9250 ns | 8917 ns | 1.04 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 268928.5 ns | 263711.5 ns | 1.02 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 496244541 ns | 501494042 ns | 0.99 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 428631687.5 ns | 411555459 ns | 1.04 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 471025146 ns | 374781084 ns | 1.26 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 672608229 ns | 672198042 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12480226 ns | 12477100 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2045534291.5 ns | 2044775145.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1629126042 ns | 1660536667 ns | 0.98 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1545195479.5 ns | 1495631604 ns | 1.03 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2203923687.5 ns | 2221523375 ns | 0.99 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49226660 ns | 49258137.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1635291.5 ns | 1643291 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1186250 ns | 1172917 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1395042 ns | 1391041.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2479729 ns | 2338333 ns | 1.06 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 217632.5 ns | 215612.5 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12692396 ns | 12698542 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9939208 ns | 9998999.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9850125 ns | 9717041 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18465125 ns | 18433792 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2037012 ns | 2039696 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17664667 ns | 17679687.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14690937.5 ns | 14770854.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14825583.5 ns | 14602583.5 ns | 1.02 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21424812.5 ns | 21327625 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26334 ns | 26292 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26209 ns | 26291 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26292 ns | 26250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26209 ns | 26208 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24221 ns | 24225 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67209 ns | 67250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66709 ns | 66834 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67000 ns | 68166 ns | 0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66917 ns | 66792 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 406228 ns | 378162.5 ns | 1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204083 ns | 203125 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209542 ns | 208500 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209375 ns | 208666 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199750 ns | 200125 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26259 ns | 26005 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 613416 ns | 646625 ns | 0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 667562.5 ns | 628813 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 632875 ns | 669895.5 ns | 0.94 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631229 ns | 580791.5 ns | 1.09 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 355484.5 ns | 311381 ns | 1.14 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 657792 ns | 651667 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 610750 ns | 638666 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 664458 ns | 647417 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 646729 ns | 653083.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132238 ns | 131397 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2245583.5 ns | 2243375 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2218750 ns | 2314937.5 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2323646 ns | 2249625 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2243666 ns | 2235375 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1254388 ns | 1114755 ns | 1.13 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17334 ns | 18291 ns | 0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18979 ns | 17500 ns | 1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19562.5 ns | 20917 ns | 0.94 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17875 ns | 18292 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146014 ns | 143094 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 230583.5 ns | 223500 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 233541 ns | 226042 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228625 ns | 262917 ns | 0.87 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 231208 ns | 230125 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1062578.5 ns | 943015 ns | 1.13 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 708 ns | 625 ns | 1.13 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 541 ns | 625 ns | 0.87 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 666 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 583 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23616 ns | 23380 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10458 ns | 10104.5 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 10166 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10000 ns | 10000 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10167 ns | 9583 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 261064 ns | 254915.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5334 ns | 5084 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5458 ns | 5375 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6854.5 ns | 6791 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5375 ns | 5250 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 236061.5 ns | 190346.5 ns | 1.24 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7541 ns | 7250 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 7125 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7792 ns | 7250 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7291 ns | 7083 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 805015.5 ns | 735734 ns | 1.09 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2209 ns | 2167 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2208 ns | 2208 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2292 ns | 2209 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2250 ns | 2417 ns | 0.93 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18184 ns | 18111 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6584 ns | 6750 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6584 ns | 6375 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6645.5 ns | 6625 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6750 ns | 6625 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 332908 ns | 306022.5 ns | 1.09 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 749687 ns | 751583.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746583 ns | 748875 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 748687.5 ns | 746812.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 749166.5 ns | 748500 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21336 ns | 21064 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 793250 ns | 791834 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 776437.5 ns | 788667 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 791542 ns | 786646.5 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792167 ns | 792479 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 297866.5 ns | 294710 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7166 ns | 7417 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5208 ns | 1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5334 ns | 6000 ns | 0.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10084 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33133 ns | 33108.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 260542 ns | 228645.5 ns | 1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 270770.5 ns | 231416 ns | 1.17 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 236334 ns | 271625 ns | 0.87 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257584 ns | 225958 ns | 1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 360496.5 ns | 351410 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9625 ns | 10292 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10125 ns | 10084 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11083 ns | 11166 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10791 ns | 10000 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 247606.5 ns | 209596.5 ns | 1.18 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24583 ns | 24709 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24666 ns | 24333 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25209 ns | 24291 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24812.5 ns | 24437.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1126313 ns | 1037550 ns | 1.09 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106238125 ns | 107199542 ns | 0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 118978937.5 ns | 126347334 ns | 0.94 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 124037917 ns | 120468625 ns | 1.03 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117504750 ns | 117762042 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2653510 ns | 2637816 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 392967833 ns | 393813416 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 366459917 ns | 380007916 ns | 0.96 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 362566708.5 ns | 355873375 ns | 1.02 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 484255125 ns | 484550250 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15300329 ns | 15152772.5 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 937284708 ns | 939763875 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 591951208 ns | 777743792 ns | 0.76 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 756812791 ns | 745742833 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 951623104 ns | 767071771.5 ns | 1.24 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7167 ns | 1 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7375 ns | 6833 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8542 ns | 8458 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7084 ns | 7562.5 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 237393.5 ns | 228024 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14625 ns | 14250 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14375 ns | 14042 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14875 ns | 13875 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14292 ns | 13333 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1084284.5 ns | 1000779 ns | 1.08 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6083 ns | 6167 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6000 ns | 6125 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7458 ns | 8250 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5959 ns | 5604.5 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 233786.5 ns | 214266.5 ns | 1.09 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13542 ns | 12417 ns | 1.09 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12583 ns | 12542 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13042 ns | 12875 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12958 ns | 12541 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 788004.5 ns | 724930 ns | 1.09 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 340292 ns | 349208 ns | 0.97 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 345625 ns | 326145.5 ns | 1.06 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 426020.5 ns | 393333 ns | 1.08 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 310917 ns | 314271 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16775 ns | 17228 ns | 0.97 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 703458.5 ns | 706500 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 735146 ns | 739437.5 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1029625 ns | 1020354 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 653041.5 ns | 658541 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 198474 ns | 198297 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 417 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23667 ns | 23935.5 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6833 ns | 6500 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6333 ns | 6584 ns | 0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6541 ns | 6584 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6458 ns | 6250 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 239358 ns | 240134 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 5875 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5917 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5958 ns | 5917 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5834 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24385 ns | 24721 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22000 ns | 21500 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21083 ns | 21333 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21917 ns | 21292 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21375 ns | 21208 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 262229.5 ns | 262379.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144104.5 ns | 144229.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144708 ns | 144042 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 156875 ns | 147292 ns | 1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147646 ns | 145833 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167190 ns | 167351 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1330708 ns | 1320395.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1308979 ns | 1358771 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1381125 ns | 1324084 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1321999.5 ns | 1329333.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1349304 ns | 1268788 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24041 ns | 24083 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22166 ns | 22375 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25083 ns | 25104.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23375 ns | 21917 ns | 1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 354283 ns | 280502 ns | 1.26 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 180791 ns | 131646 ns | 1.37 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 129000 ns | 121334 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 185792 ns | 177687.5 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 131125 ns | 130209 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1467747 ns | 1380349 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 416 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 416 ns | 0.90 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23032 ns | 23199 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6792 ns | 6708 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 7083 ns | 0.91 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6708 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6708 ns | 6083 ns | 1.10 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 255727 ns | 258254.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4166 ns | 5042 ns | 0.83 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4542 ns | 4500 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5208.5 ns | 4917 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5083 ns | 4917 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 256155 ns | 243109 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10542 ns | 10375 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10416 ns | 10042 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10583 ns | 10125 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10500 ns | 10167 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1350723 ns | 1338362 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1667 ns | 1667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1542 ns | 1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22939 ns | 23629 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6000 ns | 5875 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5667 ns | 5666 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6000 ns | 5958 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5625 ns | 5625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 272831 ns | 278503 ns | 0.98 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6857937.5 ns | 6825854.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6411896 ns | 6429125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6553167 ns | 6541187.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7666479 ns | 7656375 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213835 ns | 215102 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24054084 ns | 24080834 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21280000 ns | 21338208 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21185791.5 ns | 21079333 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29741959 ns | 29660375 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2122169 ns | 2111008 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48688792 ns | 48564000 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34555896 ns | 45595770.5 ns | 0.76 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 46213562.5 ns | 45721854 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49348041.5 ns | 38038271 ns | 1.30 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5687.5 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6583 ns | 6041 ns | 1.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7375 ns | 6917 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5792 ns | 5375 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 235412.5 ns | 239823 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9208 ns | 8291 ns | 1.11 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8166 ns | 8500 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8250 ns | 8750 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8750 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1059126.5 ns | 1069933 ns | 0.99 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1547792 ns | 1555021 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1268875 ns | 1235375.5 ns | 1.03 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1642125 ns | 1618375 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2104375 ns | 2095209 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 272083 ns | 285020 ns | 0.95 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7919145.5 ns | 7898542 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6557000 ns | 6630645.5 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7291375 ns | 7200958 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10450458 ns | 10372854.5 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1842888 ns | 1904820 ns | 0.97 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 337250.5 ns | 342000 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 349333 ns | 323833 ns | 1.08 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 435459 ns | 382208 ns | 1.14 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 338979.5 ns | 342042 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46473 ns | 43080 ns | 1.08 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 749250 ns | 725958 ns | 1.03 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 785041.5 ns | 782938 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1073959 ns | 1067750 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 758166 ns | 737041.5 ns | 1.03 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 310254 ns | 314201.5 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397250 ns | 397583 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287958 ns | 211916 ns | 1.36 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 212208 ns | 288208 ns | 0.74 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 751500 ns | 750834 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44386.5 ns | 44587.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 665041 ns | 670500 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 531625 ns | 470708 ns | 1.13 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 475375 ns | 531792 ns | 0.89 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973541 ns | 974083 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 190915 ns | 192970 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 675770.5 ns | 651646 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 542104.5 ns | 644458.5 ns | 0.84 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 678062.5 ns | 659271 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 664542 ns | 645333 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132463.5 ns | 132814 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2463958 ns | 2440750 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2450646 ns | 2525916.5 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2558312.5 ns | 2439124.5 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2453333 ns | 2464750 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1360044 ns | 1349058.5 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 341187.5 ns | 344292 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 346250.5 ns | 326104 ns | 1.06 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 423125 ns | 393875 ns | 1.07 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 308083 ns | 312896 ns | 0.98 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16649 ns | 16925 ns | 0.98 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 701917 ns | 709938 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 730292 ns | 739917 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1026459 ns | 1021708 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 645479.5 ns | 650083.5 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 200069 ns | 202873.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1462750 ns | 1458625 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1503917 ns | 1490666 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1495500 ns | 1498417 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1440500 ns | 1436416 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41326 ns | 41016 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5124542 ns | 5105458 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5186000 ns | 5294583 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5326708 ns | 5292167 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4970396 ns | 5007208 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198810.5 ns | 201135.5 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3709 ns | 3708 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3708 ns | 3750 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3708 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3667 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33150 ns | 33479.5 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15458 ns | 15292 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15083 ns | 15125 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15167 ns | 15291 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15167 ns | 15042 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 380554 ns | 381756.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 70833 ns | 71209 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 70958 ns | 71250 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71708 ns | 71125 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 70959 ns | 70062.5 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113758.5 ns | 114111 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 317833 ns | 318250 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 332729.5 ns | 329625 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 328583 ns | 318708 ns | 1.03 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 317875 ns | 317958 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 195020 ns | 197229.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1125 ns | 1083 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1083 ns | 0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1083 ns | 0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 1000 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24076 ns | 24163 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8500 ns | 8167 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 8041 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8459 ns | 8667 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7958 ns | 7625 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 263092.5 ns | 264271.5 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 465833.5 ns | 464166.5 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 464583 ns | 448167 ns | 1.04 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 558958 ns | 553459 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 545792 ns | 548917 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130415.5 ns | 129241.5 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1385791 ns | 1380229 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1372375 ns | 1393229 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1649167 ns | 1619541 ns | 1.02 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1576791.5 ns | 1590270.5 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 275691.5 ns | 277974 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 416 ns | 0.90 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32361 ns | 32417 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6375 ns | 1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6417 ns | 6500 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6541 ns | 6542 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 5958 ns | 5958 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 266723 ns | 267135 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1725542 ns | 1723834 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1722333 ns | 1731042 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1740249.5 ns | 1722458 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1775209 ns | 1727375 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169600.5 ns | 168945.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4369709 ns | 4366646 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4361042 ns | 4396958.5 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4418959 ns | 4374416.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4359417 ns | 4349500 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1195476 ns | 1192401 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6709 ns | 6750 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6667 ns | 6541 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7708 ns | 7292 ns | 1.06 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6958 ns | 6542 ns | 1.06 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 19965 ns | 20406 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 67750 ns | 81771 ns | 0.83 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 32833 ns | 49083 ns | 0.67 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 69833.5 ns | 72271 ns | 0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51750 ns | 51334 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 210322.5 ns | 213340.5 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 348792 ns | 354167 ns | 0.98 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 347104 ns | 329541.5 ns | 1.05 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 438584 ns | 401083 ns | 1.09 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 319333 ns | 321771 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18398 ns | 18865 ns | 0.98 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 722167 ns | 722646.5 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 735291 ns | 740500 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1042208 ns | 1030625 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 674229 ns | 673875 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 347323.5 ns | 350549.5 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 74875 ns | 75250 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75208 ns | 75250 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75875 ns | 75458 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75083 ns | 75042 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47177 ns | 47823 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324167 ns | 324625 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 326833 ns | 341667 ns | 0.96 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 334375 ns | 324250 ns | 1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 324000 ns | 330833 ns | 0.98 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 210646 ns | 216202 ns | 0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1484084 ns | 1485500 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1529667 ns | 1517334 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1521625 ns | 1526000 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1466416 ns | 1463167 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52455 ns | 53576 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5102666 ns | 5124354.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5285791 ns | 5278542 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5324479 ns | 5287917 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4984666 ns | 4986958 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 207886 ns | 209445 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28292 ns | 28250 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28291 ns | 28250 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28208 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28292 ns | 28291 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24996 ns | 25452 ns | 0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66541 ns | 66333 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66250 ns | 66250 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66250 ns | 66250 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66667 ns | 66333 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 535846 ns | 539628 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1500625 ns | 1483687.5 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1152708 ns | 859791.5 ns | 1.34 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 946334 ns | 1143208 ns | 0.83 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2247854 ns | 2247229.5 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 585255 ns | 585407 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3068333.5 ns | 3085000 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2645417 ns | 2591208 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2631750 ns | 2737895.5 ns | 0.96 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3815708.5 ns | 3816250 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2083950 ns | 2035890 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8841959 ns | 8818187.5 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8418083 ns | 8953500 ns | 0.94 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8767584 ns | 8776854 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6380750 ns | 6365041 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 133083 ns | 80791 ns | 1.65 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81375 ns | 79875 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82520.5 ns | 82792 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80666 ns | 80708 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193555.5 ns | 194256.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020770.5 ns | 2013375 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1954167 ns | 1748958 ns | 1.12 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2050958 ns | 2018500 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2013083 ns | 2022750 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 805558 ns | 809328 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
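For readers skimming the table, the Ratio column appears to be the Current time divided by the Previous time, rounded to two decimals, so values above 1 mean the current commit was slower on that benchmark. That reading is inferred from the numbers in the table, not from the benchmark tool's documentation. A minimal Python sketch that recomputes three of the larger deviations (the `ratio` helper and `rows` list are illustrative names, not part of the workflow):

```python
# Rows copied from the benchmark table: (benchmark name, current ns, previous ns).
rows = [
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)", 133083, 80791),
    ("mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)", 1152708, 859791.5),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)", 32833, 49083),
]


def ratio(current_ns, previous_ns):
    """Assumed definition of the Ratio column: current time over previous time."""
    return current_ns / previous_ns


# Format to two decimals, matching how most ratios are printed in the table.
formatted = [f"{ratio(cur, prev):.2f}" for _, cur, prev in rows]

for (name, _, _), r in zip(rows, formatted):
    print(f"{r}  {name}")
```

Under that assumption the recomputed values match the table's 1.65, 1.34, and 0.67 for these rows. Several of the larger swings occur only at particular CPU thread counts while the 1-thread and GPU rows for the same benchmark stay near 1.00, so they may reflect run-to-run noise rather than a real change.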