This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
feat: auto-training mode and strict checks #145
Merged
avik-pal force-pushed the ap/warn_no_train branch from 6805ccf to fb000d0 on August 29, 2024 18:55
LuxLib Benchmarks
Benchmark suite | Current: fb000d0 | Previous: 56e40d8 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5874.5 ns |
6083.5 ns |
0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5791 ns |
5729 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7334 ns |
8208 ns |
0.89 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5771 ns |
7417 ns |
0.78 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
115557 ns |
119536 ns |
0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
2721197 ns |
2858698 ns |
0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
751875 ns |
774000 ns |
0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
414784 ns |
413554 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
10250 ns |
9750 ns |
1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9708 ns |
9541.5 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9791 ns |
9833 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
10042 ns |
10000 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
535648 ns |
548421 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
6522514 ns |
6391891 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
2497541 ns |
13032833 ns |
0.19 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
682006 ns |
680216 ns |
1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
1542 ns |
1333 ns |
1.16 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
1667 ns |
3125 ns |
0.53 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
2000 ns |
2750 ns |
0.73 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
1458 ns |
1646 ns |
0.89 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA |
21274 ns |
21670 ns |
0.98 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI |
1314672 ns |
1360152 ns |
0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal |
202792 ns |
200500 ns |
1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU |
31600 ns |
30925.5 ns |
1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
3666 ns |
4000.5 ns |
0.92 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
4000 ns |
3542 ns |
1.13 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
4083.5 ns |
4250 ns |
0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
3667 ns |
3979 ns |
0.92 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA |
142562 ns |
146351 ns |
0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI |
8818919 ns |
9417207 ns |
0.94 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal |
1444500 ns |
1465833.5 ns |
0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
150661.5 ns |
148801 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57417 ns |
57959 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47062.5 ns |
46875 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46958 ns |
46666 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82666 ns |
82958 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
36368 ns |
37604.5 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
573767 ns |
581086 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1032834 ns |
1034333.5 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
79390 ns |
79736 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2308000 ns |
2032625 ns |
1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2219209 ns |
2089666 ns |
1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2265792 ns |
2087125 ns |
1.09 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2271083 ns |
1994500 ns |
1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
458763 ns |
234123 ns |
1.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
7651123 ns |
7389099 ns |
1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
7033375 ns |
5422500 ns |
1.30 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1308453 ns |
1219571 ns |
1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
147833 ns |
164187.5 ns |
0.90 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
173437.5 ns |
153833 ns |
1.13 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
151270.5 ns |
150458 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
146208.5 ns |
167666.5 ns |
0.87 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
165164 ns |
166286 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
7366575 ns |
7718471.5 ns |
0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1388500 ns |
1555229.5 ns |
0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
191773 ns |
190051 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1118292 ns |
1111646 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1109000 ns |
1113250 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1120333 ns |
1116271 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1118750 ns |
1115771 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
681370.5 ns |
700098 ns |
0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
36156481 ns |
33735820.5 ns |
1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6091666 ns |
6479708 ns |
0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
922886 ns |
1025985 ns |
0.90 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5666.5 ns |
5708.5 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4229 ns |
4292 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6416.5 ns |
5708 ns |
1.12 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5312.5 ns |
6395.5 ns |
0.83 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
89522 ns |
93278.5 ns |
0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
5362277 ns |
5349506.5 ns |
1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
454125 ns |
453417 ns |
1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
72051 ns |
59270 ns |
1.22 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9208 ns |
8625 ns |
1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8584 ns |
8667 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9000 ns |
9000 ns |
1 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9083 ns |
8792 ns |
1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
589330 ns |
614663 ns |
0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
37693204.5 ns |
34665842.5 ns |
1.09 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
5442709 ns |
5535062.5 ns |
0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
391815 ns |
384114 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17937.5 ns |
19000 ns |
0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18250 ns |
18708.5 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21500 ns |
20209 ns |
1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
18291.5 ns |
18542 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
65529 ns |
66269 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
2847915 ns |
2900988 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1318875 ns |
1296084 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
73415.5 ns |
73291 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213209 ns |
222000 ns |
0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
224292 ns |
211750 ns |
1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
216083.5 ns |
223125 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
223875 ns |
221021 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
344520 ns |
354565.5 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
12592357.5 ns |
13542856.5 ns |
0.93 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
5804917 ns |
6042354 ns |
0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
477606 ns |
480655 ns |
0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
625 ns |
667 ns |
0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
625 ns |
625 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
1042 ns |
916 ns |
1.14 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
625 ns |
791 ns |
0.79 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA |
20212 ns |
20668 ns |
0.98 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI |
1171756.5 ns |
1159361.5 ns |
1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal |
286125 ns |
278000 ns |
1.03 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU |
34560 ns |
34450 ns |
1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
1417 ns |
1417 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
1458 ns |
1520.5 ns |
0.96 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
1667 ns |
1583 ns |
1.05 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
1375 ns |
1375 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA |
122479 ns |
125954.5 ns |
0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI |
8995739 ns |
8769872 ns |
1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal |
1460625 ns |
1445458 ns |
1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
130722 ns |
128201.5 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7333 ns |
7417 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6125 ns |
6125 ns |
1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6166 ns |
6167 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10333 ns |
10292 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
23423.5 ns |
24236 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1238741 ns |
1337008 ns |
0.93 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
621896.5 ns |
513333 ns |
1.21 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
49041 ns |
48301 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
371104 ns |
220959 ns |
1.68 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
413916 ns |
269208.5 ns |
1.54 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
422208 ns |
263750 ns |
1.60 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
363833.5 ns |
225583 ns |
1.61 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
407257 ns |
192759 ns |
2.11 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
31598972 ns |
29975947 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9262166.5 ns |
8958750 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
730489 ns |
608296 ns |
1.20 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4084 ns |
4083 ns |
1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4084 ns |
4083 ns |
1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4125 ns |
4083 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4083 ns |
4042 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA |
23414 ns |
23710 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI |
2073024 ns |
1967682 ns |
1.05 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal |
222041.5 ns |
220708 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU |
52231 ns |
48820 ns |
1.07 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16834 ns |
16625 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
17333 ns |
16667 ns |
1.04 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
17125 ns |
17125 ns |
1 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16917 ns |
16750 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA |
188995.5 ns |
195325 ns |
0.97 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI |
11366884 ns |
9992661.5 ns |
1.14 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal |
932458.5 ns |
956125 ns |
0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
178432 ns |
177522 ns |
1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
510041 ns |
509750 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
405334 ns |
404666 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
404000 ns |
405500 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
865041 ns |
864500 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA |
113309 ns |
113934 ns |
0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI |
393832 ns |
399968.5 ns |
0.98 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal |
385354 ns |
452750 ns |
0.85 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
249783 ns |
248142 ns |
1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2327625 ns |
2323875 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
2029083 ns |
2027687 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
2034187 ns |
2035333 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3282959 ns |
3278166 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA |
236546 ns |
240558 ns |
0.98 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11904169 ns |
9061829 ns |
1.31 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal |
1917833 ns |
1864375 ns |
1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
762089 ns |
762112 ns |
1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6500 ns |
6791.5 ns |
0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6500 ns |
6354.5 ns |
1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
8229.5 ns |
8021 ns |
1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6333 ns |
7520.5 ns |
0.84 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
88687.5 ns |
92014.5 ns |
0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
5523378 ns |
5475275 ns |
1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
719916 ns |
726667 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
62181 ns |
60220 ns |
1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12375 ns |
11083.5 ns |
1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
11417 ns |
11729 ns |
0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12500 ns |
12750 ns |
0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12458.5 ns |
11291.5 ns |
1.10 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
611796 ns |
656827.5 ns |
0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
37468397.5 ns |
40222366 ns |
0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
5362250 ns |
5480792 ns |
0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
418230 ns |
413864 ns |
1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
500 ns |
500 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
500 ns |
542 ns |
0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA |
22927 ns |
23122 ns |
0.99 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI |
2183765 ns |
2175700 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal |
222437.5 ns |
322625 ns |
0.69 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU |
53110 ns |
53980 ns |
0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2125 ns |
2083 ns |
1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2167 ns |
2083 ns |
1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
2167 ns |
2208 ns |
0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2125 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA |
210803 ns |
223205.5 ns |
0.94 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI |
11049931 ns |
11597723.5 ns |
0.95 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal |
1947417 ns |
1948500 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
176312 ns |
183602 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
9042 ns |
8812 ns |
1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
8083.5 ns |
8583.5 ns |
0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
11895.5 ns |
10667 ns |
1.12 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
8875 ns |
8604.5 ns |
1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
91893.5 ns |
99076.5 ns |
0.93 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI |
3103648 ns |
2921578.5 ns |
1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal |
777541.5 ns |
808625 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
78651 ns |
78481 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
16958 ns |
17750.5 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
18145.5 ns |
18334 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
19000 ns |
18792 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
17750 ns |
18104.5 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
534997 ns |
609459 ns |
0.88 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
15733339.5 ns |
16546891.5 ns |
0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal |
4951333.5 ns |
5201875 ns |
0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
394905 ns |
392264 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
541 ns |
542 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
541 ns |
1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
583 ns |
625 ns |
0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
541 ns |
0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
34529 ns |
35893 ns |
0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
1235951 ns |
1234257.5 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
287292 ns |
308542 ns |
0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
48301 ns |
45990.5 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
147000 ns |
10208 ns |
14.40 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
148417 ns |
9042 ns |
16.41 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
161208.5 ns |
11208 ns |
14.38 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
148333 ns |
9666 ns |
15.35 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
451818 ns |
268063.5 ns |
1.69 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
18323518 ns |
18168061 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
4753042 ns |
4946250 ns |
0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
537806 ns |
374863 ns |
1.43 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
396958 ns |
397125 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
288125 ns |
287584 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
287979.5 ns |
288125 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
755542 ns |
756000 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA |
111265.5 ns |
112334 ns |
0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI |
328662 ns |
331349 ns |
0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal |
367000 ns |
453166 ns |
0.81 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU |
78721 ns |
78321 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
1454062.5 ns |
1442312.5 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
1136416 ns |
1128583 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
1132354.5 ns |
1136375 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
2439916 ns |
2441021 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA |
204364.5 ns |
207111 ns |
0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI |
9824415 ns |
10702686 ns |
0.92 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal |
1570541.5 ns |
1560271 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU |
325414 ns |
324973 ns |
1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7395.5 ns |
7208.5 ns |
1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7666.5 ns |
7166.5 ns |
1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
9792 ns |
8250 ns |
1.19 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7145.5 ns |
7750 ns |
0.92 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
129489 ns |
148537.5 ns |
0.87 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI |
5955624 ns |
5887503 ns |
1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal |
439000 ns |
464209 ns |
0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
60971 ns |
59820 ns |
1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
17125 ns |
16500 ns |
1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15771 ns |
15041.5 ns |
1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
16520.5 ns |
15708 ns |
1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
15375 ns |
15041 ns |
1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
855248.5 ns |
975911 ns |
0.88 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
44381652.5 ns |
46239281 ns |
0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal |
5527750 ns |
5635271 ns |
0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
435840 ns |
439474 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
30500 ns |
25083 ns |
1.22 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
27834 ns |
26354.5 ns |
1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
31292 ns |
29833 ns |
1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
25771 ns |
25291 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
191688 ns |
200872.5 ns |
0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7437045.5 ns |
7942712.5 ns |
0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
586292 ns |
976813 ns |
0.60 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
118032 ns |
117671 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
154208 ns |
103917 ns |
1.48 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
150125 ns |
154250 ns |
0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
153709 ns |
143979 ns |
1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
104021 ns |
112208 ns |
0.93 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1029587 ns |
1080618 ns |
0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43318293.5 ns |
45956117 ns |
0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
5712583.5 ns |
5734166.5 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
600207 ns |
598555 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
74500 ns |
77125 ns |
0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
85875 ns |
74229 ns |
1.16 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
84000 ns |
79687.5 ns |
1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
75208 ns |
75770.5 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
203652.5 ns |
207270 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7714505.5 ns |
7792417.5 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
526209 ns |
522646 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
125851 ns |
123966 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221270.5 ns |
287541.5 ns |
0.77 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
295000 ns |
301792 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
302375 ns |
295041 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
221417 ns |
218208 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1101887 ns |
1107042 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
42953193 ns |
43388234 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6252458.5 ns |
6243958 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
702428 ns |
701281.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
16375 ns |
17834 ns |
0.92 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
18000 ns |
16625 ns |
1.08 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
18374.5 ns |
18104.5 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
17083 ns |
16729.5 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
148672.5 ns |
150748.5 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI |
5794282 ns |
5427549 ns |
1.07 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal |
443708 ns |
452333 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
240983 ns |
237672 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
27270.5 ns |
27500.5 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
26687.5 ns |
27583 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
27333 ns |
28875 ns |
0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
27167 ns |
25146 ns |
1.08 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
956079 ns |
981795 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
42186646 ns |
43049751.5 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal |
5738708 ns |
5608208 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
716559 ns |
715207 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11291 ns | 12313 ns | 0.92 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10812.5 ns | 10020.5 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13646 ns | 12417 ns | 1.10 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11833 ns | 11208.5 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 121242 ns | 122999 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3475987.5 ns | 3856591 ns | 0.90 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 857542 ns | 783354.5 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 243332.5 ns | 244302 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22416 ns | 22250 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22458 ns | 21396 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21833 ns | 23000 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 22333 ns | 21687.5 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 685433 ns | 704827 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21324187 ns | 19822430 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5433000 ns | 5200895.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 684518 ns | 687056 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 64291 ns | 63625.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 62291 ns | 62583 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 68000 ns | 66646 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 63458 ns | 66937.5 ns | 0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 106200.5 ns | 105671 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3278706 ns | 3419840 ns | 0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1343125 ns | 1336188 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 240213 ns | 238667 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 449625 ns | 475666 ns | 0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 450041.5 ns | 448750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 452875 ns | 446208 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 473708 ns | 478625 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 506804 ns | 516873.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20711437 ns | 20193366 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6340770.5 ns | 6184938 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 734073.5 ns | 717327 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6750 ns | 7271 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8333.5 ns | 7083 ns | 1.18 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9583 ns | 8250 ns | 1.16 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8084 ns | 7521 ns | 1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 144433.5 ns | 146807 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5559405 ns | 5447467.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 442333 ns | 462959 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59460.5 ns | 61520 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15271 ns | 15854.5 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15042 ns | 13895.5 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16687.5 ns | 15458 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17292 ns | 14041 ns | 1.23 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 925112 ns | 952735 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 38208450 ns | 39171319.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5445417 ns | 5387667 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 403345 ns | 406764 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6148375 ns | 6150187.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6378583 ns | 6375084 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6377979.5 ns | 6377896 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11916021 ns | 11916958 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 347485 ns | 345906.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 294433.5 ns | 293393 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19132250 ns | 19109896 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19973479 ns | 19969688 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 20014417 ns | 19911667 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36557250 ns | 36665438 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1025341.5 ns | 1011944.5 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1151734 ns | 1168811 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 958 ns | 958 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1000 ns | 917 ns | 1.09 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 959 ns | 959 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 959 ns | 917 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23037 ns | 23377 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2116275 ns | 2088445 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 288334 ns | 218062.5 ns | 1.32 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 215612 ns | 214272.5 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3625 ns | 3667 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3750 ns | 3667 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3875 ns | 3791 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3667 ns | 3667 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 275557 ns | 284240 ns | 0.97 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11382415 ns | 10831129 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2104917 ns | 2013458 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 651302.5 ns | 642396 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8167 ns | 7834 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7792 ns | 8250 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9542 ns | 9208.5 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8563 ns | 9125 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 119002 ns | 120248 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3379292 ns | 3419283 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 784000 ns | 777875.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 72261 ns | 68341 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11041 ns | 11875 ns | 0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11209 ns | 12167 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12958 ns | 12625 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10812.5 ns | 12709 ns | 0.85 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 626600.5 ns | 645406 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22838448 ns | 20750949 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 4545250 ns | 4833541 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 371774 ns | 362853 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22136 ns | 22453 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2088688 ns | 2036768.5 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 221666.5 ns | 218625 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 53240 ns | 51501 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2833 ns | 2833 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2792 ns | 2833 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3166 ns | 3209 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2792 ns | 2834 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 198602.5 ns | 203537.5 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9486294 ns | 9796958 ns | 0.97 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1514917 ns | 1523062.5 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 161262 ns | 161481 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12084 ns | 11875 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11792 ns | 10875 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 14292 ns | 13208 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12000 ns | 11875 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 119093 ns | 120852 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3487542 ns | 3578611 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 826000 ns | 824000 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 243253 ns | 240102 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20416.5 ns | 21792 ns | 0.94 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20479.5 ns | 21834 ns | 0.94 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 20729 ns | 22271 ns | 0.93 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 19479.5 ns | 20584 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 579311 ns | 600222 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20554218 ns | 20377991.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4772875 ns | 4668250 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 652368 ns | 663226 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4375 ns | 4375 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4375 ns | 4375 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4417 ns | 4416 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4375 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24066.5 ns | 24569 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2256365 ns | 2264680 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 223000 ns | 222854.5 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 52711 ns | 52551 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16667 ns | 16542 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16417 ns | 16458 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16500 ns | 16750 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16625 ns | 16584 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 325746.5 ns | 329740.5 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12327891 ns | 12210269.5 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1081354.5 ns | 1074708 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 211922.5 ns | 212612 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 1958 ns | 2083 ns | 0.94 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2083 ns | 2084 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2041 ns | 2167 ns | 0.94 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 1958 ns | 1958 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 35040 ns | 36693 ns | 0.95 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1241357 ns | 1172885.5 ns | 1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 286125 ns | 289959 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 207902 ns | 206982 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 148896 ns | 17541.5 ns | 8.49 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 144771 ns | 19584 ns | 7.39 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 149292 ns | 18875 ns | 7.91 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 145000 ns | 19896 ns | 7.29 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 484165.5 ns | 291009.5 ns | 1.66 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 10024099 ns | 19691709 ns | 0.51 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4663146 ns | 4873604 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 819939 ns | 691816 ns | 1.19 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 58833.5 ns | 59750 ns | 0.98 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 66541.5 ns | 65125 ns | 1.02 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 66854 ns | 66229 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51250 ns | 51250 ns | 1 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66577 ns | 66341 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 101381 ns | 96856 ns | 1.05 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 134562.5 ns | 149084 ns | 0.90 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 166750 ns | 109437.5 ns | 1.52 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 164792 ns | 142625 ns | 1.16 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 233729 ns | 252625 ns | 0.93 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 213970 ns | 218082 ns | 0.98 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 586847 ns | 579290.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 83583 ns | 128229.5 ns | 0.65 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 86166 ns | 124458 ns | 0.69 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86791 ns | 121520.5 ns | 0.71 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84666 ns | 84354 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193431 ns | 193150.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5266325 ns | 5581378 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1961875 ns | 1913292 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 170922 ns | 170532 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1910750 ns | 1825541.5 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1913479.5 ns | 1917500 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1930458.5 ns | 1726708 ns | 1.12 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1911667 ns | 1896375 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 526485 ns | 531416 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 25811415 ns | 25804121 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8890041 ns | 9091041.5 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1086682 ns | 1081700 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 291 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21239 ns | 21564 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2029526 ns | 2150170.5 ns | 0.94 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 329625 ns | 322646 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 45115.5 ns | 44940 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1833 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1791 ns | 1.05 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 249589.5 ns | 253017.5 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 10044931 ns | 9954512 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1052750 ns | 1489959 ns | 0.71 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 183272 ns | 184662 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10083 ns | 8208 ns | 1.23 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8854.5 ns | 8354 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11041.5 ns | 11062.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9000 ns | 11375 ns | 0.79 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 116531 ns | 117709 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3670049 ns | 3377279 ns | 1.09 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 818083.5 ns | 841167 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 239873 ns | 237972 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8541 ns | 10750 ns | 0.79 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8479.5 ns | 9625 ns | 0.88 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8854.5 ns | 10333 ns | 0.86 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8084 ns | 9437.5 ns | 0.86 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 513528 ns | 528567.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20228071.5 ns | 20481875 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 3987521 ns | 4066875 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 650677.5 ns | 650956 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57958 ns | 58458 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46500 ns | 46834 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46417 ns | 46541 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83000 ns | 82770.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38759 ns | 40116 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1415113 ns | 1353318 ns | 1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1142604 ns | 1107646 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76641 ns | 75891 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2016958 ns | 1830958 ns | 1.10 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2108125 ns | 1987709 ns | 1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2126125 ns | 1806000 ns | 1.18 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2031604.5 ns | 1902167 ns | 1.07 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 403273 ns | 224182 ns | 1.80 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 12592175 ns | 33930875 ns | 0.37 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11201375 ns | 11292291.5 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1027411 ns | 1025890 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 418375 ns | 418083 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 418875 ns | 418854.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 421396 ns | 419624.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 431167 ns | 418083 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 207671.5 ns | 210311 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7803352.5 ns | 7920822 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 527542 ns | 525521 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 287424 ns | 284163 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 757041.5 ns | 669416.5 ns | 1.13 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 668625 ns | 671291.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 677167 ns | 684750 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 669208 ns | 684021 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1038754 ns | 1058312 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45709188 ns | 44385100.5 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6421375 ns | 6341125 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 924830 ns | 918153 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3445500 ns | 3455395.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3467104 ns | 3437542 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3440104 ns | 3456500 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3397583 ns | 3441812 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169978 ns | 173936 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8429258 ns | 8236547 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1333667 ns | 1383541.5 ns | 0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 431205 ns | 408024 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6187042 ns | 6212292 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6193833.5 ns | 6192374.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6188917 ns | 6230104.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6203437.5 ns | 6210542 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 985127.5 ns | 1001699 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50910191.5 ns | 52343757.5 ns | 0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7089083 ns | 7314167 ns | 0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1562022.5 ns | 1560500 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 471645.5 ns | 471667 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 341416 ns | 341500 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 341958 ns | 341250 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 903541 ns | 901083.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46127 ns | 46237 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 834768.5 ns | 841979 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 398208 ns | 403541.5 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 251432 ns | 251513 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2328124.5 ns | 2304875 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2034625 ns | 2036291 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2036333 ns | 2035208 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3318312.5 ns | 3278208.5 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 265130.5 ns | 256609 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 16240147 ns | 13028144 ns | 1.25 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2194667 ns | 2192084 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 792439 ns | 788718 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57583 ns | 57833 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46500 ns | 46584 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46291 ns | 46083 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82542 ns | 83709 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 27969 ns | 28664 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1424400 ns | 1387212 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1133750 ns | 1120333 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76011 ns | 77001 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2193084 ns | 1999146 ns | 1.10 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2214979 ns | 2075834 ns | 1.07 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2249042 ns | 1881917 ns | 1.20 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2140854 ns | 1993250 ns | 1.07 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 445719.5 ns | 229523 ns | 1.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 38789932 ns | 36882455.5 ns | 1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11702791.5 ns | 11806542 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1094522 ns | 1046160 ns | 1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57750 ns | 57917 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47083 ns | 47250 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46500 ns | 46750 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82833.5 ns | 83250 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47723 ns | 49927 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 785704 ns | 787820 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1095166 ns | 1080583 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77990.5 ns | 73870 ns | 1.06 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2086062.5 ns | 1891083.5 ns | 1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2116208 ns | 1970208 ns | 1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2144666.5 ns | 1955187.5 ns | 1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2046292 ns | 1904291 ns | 1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 487919.5 ns | 234662 ns | 2.08 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28265363.5 ns | 18211943.5 ns | 1.55 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10193083.5 ns | 10103542 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1010551 ns | 933389 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 333 ns | 0.87 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 33775 ns | 34917 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1145972 ns | 1226168.5 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 269646 ns | 272625 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 50361 ns | 47950 ns | 1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 143208 ns | 7479.5 ns | 19.15 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 141625 ns | 6792 ns | 20.85 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 144500 ns | 8375 ns | 17.25 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 140292 ns | 7291 ns | 19.24 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 430404 ns | 203469.5 ns | 2.12 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21186553 ns | 20240791 ns | 1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4566833 ns | 4583500 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 517175 ns | 374783 ns | 1.38 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
291 ns |
250 ns |
1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
291 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
250 ns |
250 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
32093 ns |
31986 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI |
1228680 ns |
1276838 ns |
0.96 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal |
251125 ns |
250958.5 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU |
40461 ns |
39251 ns |
1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2667 ns |
3000 ns |
0.89 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2834 ns |
2666 ns |
1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
2875 ns |
3000 ns |
0.96 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2709 ns |
3250 ns |
0.83 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
184759.5 ns |
193112 ns |
0.96 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI |
7793763 ns |
7648888.5 ns |
1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal |
963604.5 ns |
1228042 ns |
0.78 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
159696.5 ns |
155301 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
443666 ns |
423250 ns |
1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
426750 ns |
422000 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
427375 ns |
426584 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
425854.5 ns |
433042 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
137307.5 ns |
138742 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
4108067 ns |
6023505.5 ns |
0.68 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2074313 ns |
2150209 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
326613 ns |
350454 ns |
0.93 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3803750 ns |
3765146 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3802666 ns |
3779584 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3822250 ns |
3801667 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3801875 ns |
3781770.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
703504 ns |
710296 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
33026409 ns |
31895528 ns |
1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10878208 ns |
10614458 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1505796 ns |
1323602 ns |
1.14 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49864187 ns |
49864000 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
35504062.5 ns |
35497062 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
35538333.5 ns |
35537125 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
96916667 ns |
96997520.5 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1592214 ns |
1604687 ns |
0.99 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
998606.5 ns |
1017349 ns |
0.98 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154562395.5 ns |
154531062.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
112345958 ns |
112258062 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
112532584 ns |
112366667 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
297051812.5 ns |
299279978.5 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6520428 ns |
6477003 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5676402.5 ns |
5749519.5 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
18229 ns |
19666.5 ns |
0.93 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
17083.5 ns |
18542 ns |
0.92 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
16042 ns |
17562.5 ns |
0.91 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
15625 ns |
15437.5 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
21334 ns |
21582 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1139016 ns |
1137074 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal |
217520.5 ns |
219625 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
26211 ns |
27981 ns |
0.94 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
10834 ns |
10854.5 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
9042 ns |
8916.5 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
9291 ns |
9292 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
17521 ns |
17417 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
257401 ns |
261948.5 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
10553951.5 ns |
9733148 ns |
1.08 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal |
1502875 ns |
1502000 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
154702 ns |
152941 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
8937.5 ns |
8021 ns |
1.11 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
9459 ns |
8458 ns |
1.12 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10208 ns |
10375 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8520.5 ns |
9583 ns |
0.89 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
122970.5 ns |
125031 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3704089 ns |
3572138 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
708166 ns |
766396 ns |
0.92 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
242933 ns |
239027.5 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9270.5 ns |
10270.5 ns |
0.90 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9791 ns |
9125 ns |
1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9792 ns |
9833 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9229.5 ns |
9562 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
608318 ns |
626181 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
25004839 ns |
23291818.5 ns |
1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
4935416.5 ns |
5110500 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
655337 ns |
669926 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
8812.5 ns |
9021 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9791.5 ns |
9292 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
12083 ns |
10792 ns |
1.12 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
8771 ns |
9624.5 ns |
0.91 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
118050 ns |
119634.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3451608 ns |
3516934 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
875291.5 ns |
854750 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
72861 ns |
69771 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13250 ns |
16583 ns |
0.80 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13375 ns |
12583 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13916.5 ns |
14020.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12958 ns |
14104 ns |
0.92 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
580235.5 ns |
597754 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19330798 ns |
19866015 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
4607833.5 ns |
4399958 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
354624 ns |
354973.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
459 ns |
1.09 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
583 ns |
584 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
459 ns |
459 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
34114 ns |
35591 ns |
0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1260842 ns |
1301172 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
274542 ns |
273021 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
209992 ns |
208092 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
138958 ns |
9667 ns |
14.37 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
136542 ns |
7917 ns |
17.25 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
140000 ns |
8667 ns |
16.15 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
136083 ns |
10542 ns |
12.91 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
438919.5 ns |
228879.5 ns |
1.92 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
22996988 ns |
22093342 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
4752042 ns |
4715584 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
792008 ns |
665037 ns |
1.19 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
16250 ns |
16792 ns |
0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
15333 ns |
18042 ns |
0.85 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
14625 ns |
15104 ns |
0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
11792 ns |
10520.5 ns |
1.12 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
21202 ns |
21410 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1151785 ns |
1212361 ns |
0.95 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal |
211187.5 ns |
204104 ns |
1.03 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
188407 ns |
189022 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
31708 ns |
31875 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
31958 ns |
31709 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
32375 ns |
32312.5 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
32166 ns |
32000 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
271317 ns |
276685 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
10945554 ns |
10782012.5 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal |
1602812.5 ns |
1597917 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
607056 ns |
603936 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
479875 ns |
444417 ns |
1.08 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
483521 ns |
440729.5 ns |
1.10 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
446874.5 ns |
483875 ns |
0.92 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
484500 ns |
487833 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
195239 ns |
194859 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6117630 ns |
6150501 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1972500 ns |
1973354.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
353204 ns |
352013 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3824750 ns |
3829083 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3813709 ns |
3817542 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3827041 ns |
3807333.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3827375 ns |
3833750 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
535235 ns |
543447.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
29684309 ns |
29268457 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9492709 ns |
9074479 ns |
1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1391296 ns |
1381293 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
783309416 ns |
782808250 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
542819541 ns |
542955458 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
544218417 ns |
543245416 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1569457250 ns |
1526913187.5 ns |
1.03 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22543017 ns |
22538913 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14103431 ns |
14166095 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
3007940334 ns |
2518672041 ns |
1.19 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1800866291 ns |
2247031041 ns |
0.80 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
1792982875 ns |
2268043292 ns |
0.79 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
5315494791 ns |
4817775208 ns |
1.10 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
379420999 ns |
370296484 ns |
1.02 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
89218848 ns |
89108951 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
76770.5 ns |
78291.5 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
80583 ns |
76292 ns |
1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79646 ns |
78708.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
76625 ns |
75666.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
205479.5 ns |
209649 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
8543893 ns |
7907666.5 ns |
1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
527542 ns |
527271 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
109941 ns |
110221 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
289687.5 ns |
267125 ns |
1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
277688 ns |
192750 ns |
1.44 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
193750 ns |
228291 ns |
0.85 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
264166.5 ns |
274042 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1027978 ns |
1049164 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
44758239 ns |
42798073 ns |
1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6106375 ns |
5942583 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
645427 ns |
646896 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
199655375 ns |
199999187.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
138887125 ns |
139287958 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
139073250 ns |
139251125 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
394615000 ns |
388390459 ns |
1.02 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5834346 ns |
5842600.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3411391.5 ns |
3422748 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
619240375 ns |
618321291.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
440309458 ns |
440516458 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
440130812.5 ns |
441449562.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1195765667 ns |
1184281125 ns |
1.01 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26511719.5 ns |
26535363 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
22284299 ns |
22224253 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7250 ns |
7292 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6209 ns |
6041 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6166 ns |
6042 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10000 ns |
9959 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
27372 ns |
28005 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1220411 ns |
1278165.5 ns |
0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
513792 ns |
585916.5 ns |
0.88 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
48370 ns |
47711 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
380229.5 ns |
214417 ns |
1.77 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
359459 ns |
220791 ns |
1.63 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
363875 ns |
221750 ns |
1.64 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
335125 ns |
208354 ns |
1.61 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
418348.5 ns |
227388 ns |
1.84 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
31583019 ns |
32450442 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9289000 ns |
9078291.5 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
668577 ns |
531855 ns |
1.26 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
10187.5 ns |
9708.5 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
8667 ns |
7875 ns |
1.10 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10459 ns |
10334 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
8166.5 ns |
8896 ns |
0.92 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
117261.5 ns |
116991 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3477087 ns |
3379932 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
755208 ns |
844125 ns |
0.89 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
68860 ns |
78200 ns |
0.88 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7541.5 ns |
10500 ns |
0.72 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7667 ns |
7209 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8292 ns |
7958 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7521 ns |
8125 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
510936.5 ns |
524925 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
19179630 ns |
19844305 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
4013292 ns |
4066333.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
321554 ns |
322173 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
459 ns |
584 ns |
0.79 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
541 ns |
500 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
542 ns |
625 ns |
0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
625 ns |
0.80 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
25702 ns |
26471.5 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1237540.5 ns |
1215499 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
303395.5 ns |
366750 ns |
0.83 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
49140 ns |
48170.5 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
148125 ns |
12709 ns |
11.66 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
145187.5 ns |
9291 ns |
15.63 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
150208.5 ns |
10083.5 ns |
14.90 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
145416.5 ns |
9750 ns |
14.91 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
461372 ns |
258485 ns |
1.78 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
23030274 ns |
22618575 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5062875 ns |
5040750 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
565486 ns |
393458.5 ns |
1.44 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
115125 ns |
108458 ns |
1.06 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
99688 ns |
98875 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
101270.5 ns |
100521 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
146604 ns |
146417 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
24185 ns |
24425.5 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
1199345 ns |
1223728 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal |
261959 ns |
258208 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
191147 ns |
191276.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
498209 ns |
478583 ns |
1.04 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
517875 ns |
480437 ns |
1.08 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
478937.5 ns |
482104 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
477958 ns |
478875 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
229118 ns |
234461 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11518633.5 ns |
11870954.5 ns |
0.97 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal |
2113625 ns |
2153250 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
624696.5 ns |
620366 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
4833 ns |
5416 ns |
0.89 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
6979 ns |
5021 ns |
1.39 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
7416.5 ns |
7000 ns |
1.06 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
4646 ns |
4375 ns |
1.06 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
15972 ns |
16254 ns |
0.98 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
78941 ns |
79120 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
11916 ns |
13209 ns |
0.90 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
10833.5 ns |
10333 ns |
1.05 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11125 ns |
11187.5 ns |
0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
16562.5 ns |
16875 ns |
0.98 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
211125 ns |
214352 ns |
0.98 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
370104 ns |
369103.5 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
40292 ns |
39084 ns |
1.03 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
50709 ns |
51604 ns |
0.98 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
53437.5 ns |
52875 ns |
1.01 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
13395.5 ns |
13500 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
20045 ns |
21418 ns |
0.94 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
86141 ns |
78891 ns |
1.09 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
41708 ns |
37875 ns |
1.10 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
32375 ns |
31625 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
30895.5 ns |
31125 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
56916.5 ns |
57625 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
188929 ns |
194392 ns |
0.97 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
417859.5 ns |
418224 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1979.5 ns |
1854.5 ns |
1.07 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1833.5 ns |
1709 ns |
1.07 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2375 ns |
2209 ns |
1.08 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1791.5 ns |
1958 ns |
0.91 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
20678 ns |
21178 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1163936 ns |
1109464 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal |
291541 ns |
296916 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
28881 ns |
28730 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2104.5 ns |
2208 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2125 ns |
2084 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2334 ns |
2270.5 ns |
1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2167 ns |
2083 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
201809 ns |
204626 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
9295907.5 ns |
9185930.5 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal |
1432187.5 ns |
1373917 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
137651 ns |
136741 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5229.5 ns |
5187.5 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
5687.5 ns |
5021 ns |
1.13 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6375 ns |
6708 ns |
0.95 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4375.5 ns |
5292 ns |
0.83 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
142965.5 ns |
146489.5 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5606105 ns |
5824136 ns |
0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
449020.5 ns |
461083 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
62570 ns |
63631 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8271 ns |
9208.5 ns |
0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8417 ns |
8000 ns |
1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8625 ns |
8667 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8604.5 ns |
8708 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
858818 ns |
883915 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
40092469.5 ns |
38448137 ns |
1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
5355812.5 ns |
5404750 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
384874 ns |
389893 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56834 ns |
56750 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57708 ns |
57625 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57666 ns |
57750 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
58125 ns |
58292 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
37255 ns |
37898 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1223472.5 ns |
1225239 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
533354.5 ns |
482042 ns |
1.11 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
208377.5 ns |
206812 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
605750 ns |
461021.5 ns |
1.31 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
623562.5 ns |
464812.5 ns |
1.34 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
628313 ns |
465375 ns |
1.35 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
582834 ns |
443771 ns |
1.31 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
532888 ns |
262541 ns |
2.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
28144268 ns |
26994597 ns |
1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8798458.5 ns |
8201187.5 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
928810 ns |
814918 ns |
1.14 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3332667 ns |
3312833 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
2330333 ns |
2337166.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
2339292 ns |
2336896 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6320375 ns |
6300021 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
205868 ns |
204708 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
204322.5 ns |
210502 ns |
0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11449916 ns |
11472333 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
8300833.5 ns |
8296521 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
8348000 ns |
8328708 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21108459 ns |
21128979.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
742474 ns |
742043 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
1071261 ns |
1071150 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
7041 ns |
6750 ns |
1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6125 ns |
4709 ns |
1.30 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6750 ns |
6667 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5791 ns |
7083 ns |
0.82 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
136817.5 ns |
140008 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
5779963 ns |
5632229.5 ns |
1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
723333 ns |
742458 ns |
0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
58111 ns |
58641 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7166 ns |
7209 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7334 ns |
6959 ns |
1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7292 ns |
7541 ns |
0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7250 ns |
7042 ns |
1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
740847.5 ns |
764718 ns |
0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
37404171.5 ns |
36130678 ns |
1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5084958 ns |
5226958.5 ns |
0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
381784 ns |
383774 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
124250 ns |
138354.5 ns |
0.90 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
124166.5 ns |
98145.5 ns |
1.27 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
101708.5 ns |
101167 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
113520.5 ns |
106375 ns |
1.07 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
150861 ns |
151797 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6135447 ns |
5992457 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2029958 ns |
2019062.5 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
210087 ns |
168952 ns |
1.24 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1997854.5 ns |
1834167 ns |
1.09 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1999916 ns |
2017208 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2027166 ns |
2009167 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2018708 ns |
2029979.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
701304.5 ns |
712230.5 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32626383 ns |
31045366 ns |
1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10780312.5 ns |
10914875 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1123801.5 ns |
1252337 ns |
0.90 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
33041 ns |
34437.5 ns |
0.96 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
35959 ns |
37312.5 ns |
0.96 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
35604.5 ns |
35812 ns |
0.99 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
708 ns |
667 ns |
1.06 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15322 ns |
15573 ns |
0.98 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU |
81061 ns |
72020 ns |
1.13 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2542 ns |
2604.5 ns |
0.98 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2875 ns |
2792 ns |
1.03 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
2937.5 ns |
2959 ns |
0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2125 ns |
2291 ns |
0.93 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
136167 ns |
141728 ns |
0.96 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU |
365774 ns |
348083 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7292 ns |
7000 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6083 ns |
5916 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6000 ns |
6000 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10125 ns |
10333 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
35823 ns |
36891 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1228555 ns |
1194445.5 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
389833 ns |
487520.5 ns |
0.80 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
51200 ns |
49011 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
377104 ns |
241166.5 ns |
1.56 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
385167 ns |
221000 ns |
1.74 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
381167 ns |
221542 ns |
1.72 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
351916 ns |
206542 ns |
1.70 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
512269.5 ns |
240362 ns |
2.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27785411 ns |
25649976.5 ns |
1.08 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8478687.5 ns |
7897958.5 ns |
1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
666837 ns |
521174 ns |
1.28 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3917 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3917 ns |
3917 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
21501 ns |
21897 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI |
2120479 ns |
2185292 ns |
0.97 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal |
241959 ns |
242250 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU |
45910 ns |
47251 ns |
0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14959 ns |
14917 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15000 ns |
14875 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14875 ns |
15042 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14917 ns |
14875 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
305862 ns |
311422 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI |
11568964.5 ns |
12486721 ns |
0.93 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal |
990500 ns |
997375 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
194662 ns |
204526.5 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
151292 ns |
109666.5 ns |
1.38 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
105542 ns |
103749.5 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
105666 ns |
105500 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
108334 ns |
120375 ns |
0.90 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
135171.5 ns |
152552 ns |
0.89 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5948811 ns |
5849524.5 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2056750 ns |
2042500 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
187802 ns |
185802 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1926208 ns |
1782959 ns |
1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1919708 ns |
1919667 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1926500 ns |
1893875 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1919333 ns |
1914750 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
684579 ns |
695036 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
30815788.5 ns |
30814201.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10672792 ns |
10692646 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1239603 ns |
1072840 ns |
1.16 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
19584 ns |
19917 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18583 ns |
17584 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21292 ns |
21125 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17833.5 ns |
19291 ns |
0.92 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
107607.5 ns |
109095.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3724684 ns |
3459371.5 ns |
1.08 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1354625 ns |
1363791.5 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
75895.5 ns |
81501 ns |
0.93 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
218833 ns |
216083 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
216895.5 ns |
249854 ns |
0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
217146 ns |
216667 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215166.5 ns |
215958.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
512344 ns |
521547.5 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
21928657 ns |
20634945 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6352541.5 ns |
6272667 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
494060.5 ns |
476034 ns |
1.04 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
25000 ns |
24875 ns |
1.01 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
28708 ns |
30916.5 ns |
0.93 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
28000 ns |
30375 ns |
0.92 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
1187.5 ns |
1250 ns |
0.95 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
15742 ns |
16240 ns |
0.97 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU |
83041 ns |
82701 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
4791 ns |
4375.5 ns |
1.09 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
5167 ns |
4500 ns |
1.15 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
5208.5 ns |
5271 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
4750 ns |
4750 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
204557 ns |
208732 ns |
0.98 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU |
380439 ns |
382674 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
308208 ns |
306583 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
315750 ns |
306500 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
308333 ns |
309375 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
305833.5 ns |
307083 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
227475.5 ns |
229206.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7965121 ns |
7783651 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
593750 ns |
1169270.5 ns |
0.51 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
278443 ns |
276543 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
559583 ns |
537021 ns |
1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
539500 ns |
531791.5 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
533625 ns |
547104.5 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
529812.5 ns |
535708 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1062119 ns |
1083213 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43579112.5 ns |
45992002.5 ns |
0.95 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6188375 ns |
6107583.5 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
881479 ns |
867099 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
22708 ns |
21292 ns |
1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
20083 ns |
20083 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21417 ns |
21458 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19542 ns |
20416 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
112201 ns |
113930 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3652323 ns |
3636526.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1495604 ns |
1444708.5 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
79055.5 ns |
77785.5 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
212750 ns |
215625 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
213167 ns |
215500 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
215021 ns |
213812 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215542 ns |
216500 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
752835.5 ns |
748686 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
24398027 ns |
25547600 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7403000 ns |
7444708 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
550166 ns |
542025 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
7458 ns |
6792 ns |
1.10 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6583 ns |
6875 ns |
0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
8271 ns |
8250 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
7104 ns |
6750 ns |
1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
139804.5 ns |
141039.5 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
5773128 ns |
5493009 ns |
1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
730937 ns |
712166.5 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
70340 ns |
70750 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10042 ns |
10291 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10167 ns |
9708 ns |
1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10542 ns |
10625 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10499.5 ns |
10333 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
813031 ns |
833353 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
37615031 ns |
38773821 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
5138875 ns |
5198104 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
384474 ns |
387004 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6708 ns |
6292 ns |
1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5229 ns |
5125 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6708.5 ns |
6958 ns |
0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7125 ns |
6875 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
141719 ns |
144971.5 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
5586545 ns |
5893830.5 ns |
0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
717396 ns |
727542 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
62111 ns |
60830 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7542 ns |
7479 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7292 ns |
7333 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7542 ns |
7875 ns |
0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7520.5 ns |
7292 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
771432 ns |
792524 ns |
0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
41795223 ns |
40455917 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5408708 ns |
5364583.5 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
397734 ns |
399744 ns |
0.99 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
14505854.5 ns |
14468708 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
10125354.5 ns |
10147583 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
10158937 ns |
10085542 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
27749437.5 ns |
27811041 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
540274 ns |
542278 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU |
399624 ns |
384534 ns |
1.04 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
46276833 ns |
46218541.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
33361542 ns |
33417104.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
33503958 ns |
33420958 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
85348583 ns |
85450792 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2829546.5 ns |
2643632 ns |
1.07 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU |
3291354.5 ns |
3285991 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
66667 ns |
67625 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
65750 ns |
67500 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
68833 ns |
69791 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
66541 ns |
68750 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
105401.5 ns |
119783 ns |
0.88 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3426262.5 ns |
3660768 ns |
0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1470791.5 ns |
1428437.5 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
234912 ns |
230052 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
446541 ns |
439833 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
445937.5 ns |
443166.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
442354 ns |
444771 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
446375 ns |
454646 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
728201 ns |
731518 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26874397 ns |
27861313 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7885708 ns |
8299458.5 ns |
0.95 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
812439 ns |
775003 ns |
1.05 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
541 ns |
583 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
666 ns |
0.94 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
31896 ns |
33044 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1266980.5 ns |
1125170 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
284750 ns |
329250 ns |
0.86 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
51781 ns |
48860 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
140375 ns |
9104 ns |
15.42 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
137583 ns |
8625 ns |
15.95 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
142625 ns |
10250 ns |
13.91 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
138042 ns |
9896.5 ns |
13.95 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
486965 ns |
284435 ns |
1.71 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
14162730 ns |
21375944 ns |
0.66 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
4861750 ns |
5566187.5 ns |
0.87 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
517885 ns |
386843 ns |
1.34 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
9833 ns |
9792 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
9792 ns |
9792 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
9833 ns |
9875 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
9792 ns |
9833 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
22840 ns |
23403 ns |
0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2189392 ns |
2092475 ns |
1.05 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal |
219750 ns |
221416 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
217022 ns |
215732 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
46167 ns |
46167 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
45916 ns |
45875 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
46542 ns |
46458 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
46042 ns |
46125 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
285352 ns |
290850 ns |
0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11520821.5 ns |
11312959 ns |
1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal |
946416.5 ns |
1043959 ns |
0.91 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
624726.5 ns |
616456 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56250 ns |
56333 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57125 ns |
57042 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57125 ns |
57167 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
57709 ns |
57958 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
28529 ns |
29487 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1225122 ns |
1284760 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
577958 ns |
609396 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
206332 ns |
217177.5 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
598458 ns |
459916.5 ns |
1.30 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
604375.5 ns |
465375 ns |
1.30 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
610292 ns |
498229 ns |
1.22 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
609166.5 ns |
449000 ns |
1.36 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
450153 ns |
242456 ns |
1.86 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
19947300 ns |
33776271 ns |
0.59 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9553437.5 ns |
9662458 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
984880.5 ns |
842873 ns |
1.17 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
648625 ns |
647916 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
647375 ns |
650791.5 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
617542 ns |
652979 ns |
0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
636312 ns |
664625 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
204431.5 ns |
206160 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8163653 ns |
8473221 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1348000 ns |
1347124.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
232182 ns |
237013 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2233874.5 ns |
2259250 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2235250 ns |
2232542 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2243542 ns |
2224833 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2238083.5 ns |
2241083 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
956768 ns |
980993 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
49121505 ns |
46835859 ns |
1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
7001041 ns |
7206958 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1380304 ns |
1391854 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
19792 ns |
20083.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19833 ns |
20916.5 ns |
0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21896 ns |
22625 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
20833 ns |
21291.5 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
111733 ns |
113434.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3625769.5 ns |
3471029 ns |
1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1464666.5 ns |
1349084 ns |
1.09 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
77731 ns |
75101 ns |
1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
231708.5 ns |
220895.5 ns |
1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
219791.5 ns |
228042 ns |
0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220833 ns |
238875 ns |
0.92 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
242250 ns |
219500 ns |
1.10 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
722782 ns |
734488 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26022488 ns |
26435758 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7764979.5 ns |
7569709 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
566586 ns |
566315 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
541 ns |
583 ns |
0.93 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
500 ns |
1.25 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
666 ns |
0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
22661 ns |
23420 ns |
0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI |
1248244 ns |
1232395 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal |
419333 ns |
304916.5 ns |
1.38 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
50031 ns |
49271 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
142416 ns |
9166.5 ns |
15.54 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
137875 ns |
9833 ns |
14.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
142583 ns |
10833 ns |
13.16 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
138500 ns |
9896 ns |
14.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
445482 ns |
263097 ns |
1.69 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
15914244 ns |
24623499 ns |
0.65 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal |
5263708 ns |
5512291 ns |
0.95 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
546865 ns |
408514 ns |
1.34 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
10000 ns |
10083 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
8792 ns |
8209 ns |
1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10709 ns |
10542 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
9416 ns |
10541 ns |
0.89 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
118544.5 ns |
118922 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
3536986 ns |
3616112 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
813750 ns |
828896 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
71570 ns |
72900.5 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7417 ns |
7500 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7792 ns |
7333 ns |
1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8000 ns |
7959 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7542 ns |
7667 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
494002 ns |
513625 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
17713765 ns |
17225102 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
3925291.5 ns |
3863708 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
335028.5 ns |
334143 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1584 ns |
1458.5 ns |
1.09 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1708 ns |
1541 ns |
1.11 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2125 ns |
2000 ns |
1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1459 ns |
1354.5 ns |
1.08 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
20784 ns |
20983 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI |
1154727.5 ns |
1163032 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal |
294792 ns |
299208 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
195392 ns |
188466.5 ns |
1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
3334 ns |
3333 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
3291 ns |
3375 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
3500 ns |
3458 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3292 ns |
3334 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
218874.5 ns |
222731 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
10553778.5 ns |
10495351.5 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal |
1550604.5 ns |
1564958 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
595936 ns |
593390.5 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
147416.5 ns |
149500 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
127875 ns |
128375 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
130167 ns |
129542 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
225084 ns |
225104 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
23928 ns |
24535 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI |
1233082 ns |
1174771 ns |
1.05 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal |
272167 ns |
264416 ns |
1.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU |
34770 ns |
36880 ns |
0.94 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
159708 ns |
159541 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
127375 ns |
138583 ns |
0.92 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
110458 ns |
138979.5 ns |
0.79 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
284500 ns |
266500 ns |
1.07 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
214034 ns |
218895 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI |
10830578 ns |
10544510 ns |
1.03 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal |
2013958.5 ns |
1984187.5 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU |
240127.5 ns |
220122.5 ns |
1.09 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7291 ns |
7292 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6000 ns |
6041 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5959 ns |
6000 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9917 ns |
10208 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
32360 ns |
33604 ns |
0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1256519 ns |
1212259 ns |
1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
343458 ns |
349062.5 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
50810 ns |
52470 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
384687.5 ns |
256959 ns |
1.50 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
399708 ns |
230729 ns |
1.73 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
388770.5 ns |
238125 ns |
1.63 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
399666 ns |
223021 ns |
1.79 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
526360 ns |
258353 ns |
2.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
29342862 ns |
29336585 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8734166.5 ns |
8308500 ns |
1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
685707 ns |
529375 ns |
1.30 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
14500 ns |
15979.5 ns |
0.91 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
15292 ns |
14667 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
17084 ns |
16895.5 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
14666 ns |
15625 ns |
0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
138947.5 ns |
140165.5 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
5806482 ns |
5484385 ns |
1.06 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
722000 ns |
722958.5 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
238282 ns |
238723 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
23667 ns |
23250 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
23916 ns |
23479 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
23958 ns |
24125 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
23125 ns |
23604 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
853569.5 ns |
877758 ns |
0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
40236184 ns |
39144107 ns |
1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
5415250 ns |
5343687.5 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
693447 ns |
692857 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
9584 ns |
9375 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
9416 ns |
9084 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
12687.5 ns |
10979.5 ns |
1.16 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
9500 ns |
9396 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
121436.5 ns |
122868.5 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
3573217 ns |
3589561 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
809542 ns |
743916 ns |
1.09 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
73381 ns |
70681 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
13334 ns |
14583 ns |
0.91 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14208 ns |
13709 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14438 ns |
14125 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
13708.5 ns |
13083 ns |
1.05 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
655124 ns |
673546 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
21439559 ns |
21471684.5 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5342125 ns |
5194375 ns |
1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
370548.5 ns |
370953 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
9250 ns |
8833.5 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
8959 ns |
8917 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
12042 ns |
11500 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9375 ns |
9875 ns |
0.95 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
120539 ns |
121948.5 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
3310613 ns |
4256003 ns |
0.78 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
875959 ns |
844125 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
70721 ns |
69331 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12709 ns |
12729 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12958 ns |
12416 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13125 ns |
12792 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12541 ns |
12604.5 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
540883.5 ns |
557458 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
18769186 ns |
20700383 ns |
0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
4285771 ns |
4280750 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
342373 ns |
345693 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
30417 ns |
30875.5 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
33937.5 ns |
34167 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
29917 ns |
31542 ns |
0.95 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
1792 ns |
1792 ns |
1.00 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
15983 ns |
16044 ns |
1.00 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU |
79550 ns |
74080 ns |
1.07 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
5333.5 ns |
5375 ns |
0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
5188 ns |
5084 ns |
1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
5417 ns |
5459 ns |
0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
6666 ns |
6833.5 ns |
0.98 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
137115 ns |
140705.5 ns |
0.97 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU |
386874 ns |
367554 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
250 ns |
1.17 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
291 ns |
1.29 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
250 ns |
250 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
25197 ns |
26129 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
1365775.5 ns |
1277003 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
399583 ns |
290458 ns |
1.38 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
50531 ns |
48125.5 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
143958 ns |
7125 ns |
20.20 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
141958 ns |
6833 ns |
20.78 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
145771 ns |
7542 ns |
19.33 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
141666.5 ns |
7041 ns |
20.12 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
405976 ns |
192022.5 ns |
2.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
24082892 ns |
22542195.5 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
5201396 ns |
4967750 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
549066 ns |
392884 ns |
1.40 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
1958 ns |
2000 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
2083 ns |
1958 ns |
1.06 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
2042 ns |
2084 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
1958 ns |
1959 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
25964 ns |
27301 ns |
0.95 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI |
1226664.5 ns |
1311935 ns |
0.94 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal |
440479.5 ns |
441312.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
210742.5 ns |
207332 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
147458 ns |
17271 ns |
8.54 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
145583.5 ns |
16729 ns |
8.70 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
150146 ns |
17521 ns |
8.57 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
145208 ns |
16896 ns |
8.59 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
451191 ns |
269988 ns |
1.67 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
15538748.5 ns |
25410174 ns |
0.61 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal |
5521833 ns |
5897125 ns |
0.94 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
847204 ns |
716817 ns |
1.18 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
178770.5 ns |
152500 ns |
1.17 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
176292 ns |
178084 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
153833 ns |
176437 ns |
0.87 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
150375 ns |
173584 ns |
0.87 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
199142.5 ns |
203100 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
7652852.5 ns |
7890476 ns |
0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1399625 ns |
1351292 ns |
1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
198312 ns |
177192 ns |
1.12 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1326625 ns |
1312125 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1320083.5 ns |
1329791 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1328187.5 ns |
1323916 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1320541 ns |
1330917 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
894206 ns |
913760 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
46708540.5 ns |
46233546.5 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6150333 ns |
6497521 ns |
0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1134836.5 ns |
1120096 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
25709 ns |
24729 ns |
1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
25916 ns |
24959 ns |
1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
27625 ns |
27000 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
25875 ns |
25187.5 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
233752 ns |
236630.5 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
8257548 ns |
7478324 ns |
1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1008250 ns |
992375 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
120881 ns |
118741 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
158437.5 ns |
118687.5 ns |
1.33 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
182896 ns |
118229 ns |
1.55 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
118875 ns |
119041 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
117416 ns |
162729 ns |
0.72 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1061825.5 ns |
1083875 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
46743020 ns |
46702412 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
5873084 ns |
6050937.5 ns |
0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
609097 ns |
602125 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
250 ns |
291 ns |
0.86 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
250 ns |
1.50 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
291 ns |
250 ns |
1.16 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
22335 ns |
23359 ns |
0.96 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1169342 ns |
1296199 ns |
0.90 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
440167 ns |
438834 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
51530 ns |
48171 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
138000 ns |
7187.5 ns |
19.20 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
135458 ns |
7083 ns |
19.12 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
139250 ns |
7792 ns |
17.87 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
134896 ns |
7292 ns |
18.50 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
393303 ns |
198494 ns |
1.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
25179341 ns |
24144346 ns |
1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
5137437.5 ns |
5445583 ns |
0.94 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
528076 ns |
399424 ns |
1.32 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6625 ns |
6562 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6166 ns |
5916 ns |
1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7833 ns |
7875 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5542 ns |
7312.5 ns |
0.76 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
148231 ns |
151846 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
5722106.5 ns |
5647891 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
441646 ns |
439083 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
238152 ns |
235952 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9917 ns |
9833.5 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10166 ns |
9812.5 ns |
1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10375 ns |
10333 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10042 ns |
10000 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
888256.5 ns |
914543.5 ns |
0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
41828927 ns |
41915714.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
5636292 ns |
5632541.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
678427 ns |
676221 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
625 ns |
666 ns |
0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
666 ns |
625 ns |
1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
666 ns |
667 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
666 ns |
625 ns |
1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
22088 ns |
22836 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI |
2108789.5 ns |
2080700 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal |
222959 ns |
220167 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
215243 ns |
216072 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4542 ns |
4625 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4584 ns |
4500 ns |
1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4791 ns |
4833 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4542 ns |
4625 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
222119.5 ns |
226643.5 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
9884248 ns |
10356297.5 ns |
0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal |
1551583 ns |
1566604.5 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
600866 ns |
602506 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
8292 ns |
7833.5 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8167 ns |
7833 ns |
1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10167 ns |
10187.5 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7917 ns |
8375 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
119558 ns |
121209 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
3445450 ns |
3579394 ns |
0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
667562.5 ns |
718583.5 ns |
0.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
69441 ns |
68421 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8209 ns |
8500 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8833 ns |
8250 ns |
1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8959 ns |
8895.5 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8208 ns |
8250 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
581371.5 ns |
596725 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
20320168.5 ns |
21444108.5 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
4316250 ns |
4307083 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
349214 ns |
349143.5 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
126917 ns |
126334 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
132875 ns |
129959 ns |
1.02 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
130583 ns |
130375 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
183042 ns |
183500 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
45635.5 ns |
46170 ns |
0.99 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU |
101126 ns |
98026 ns |
1.03 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
338166.5 ns |
339041 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
346812.5 ns |
314708.5 ns |
1.10 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
313875 ns |
331646 ns |
0.95 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
583145.5 ns |
568833.5 ns |
1.03 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
188975 ns |
193689 ns |
0.98 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU |
486375 ns |
486014.5 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
397541 ns |
396875 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
288312.5 ns |
288542 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
288000 ns |
288416 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
756125 ns |
756291 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
43307.5 ns |
43814 ns |
0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI |
1369209 ns |
1380119 ns |
0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal |
417000 ns |
406708 ns |
1.03 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU |
83400 ns |
83471 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
1456666.5 ns |
1458250 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
1135812 ns |
1132042 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
1136958 ns |
1134667 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
2446667 ns |
2445396 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
242723 ns |
253213.5 ns |
0.96 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI |
11709621 ns |
11703870 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal |
1775792 ns |
1847229.5 ns |
0.96 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU |
355854 ns |
352553 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
644646 ns |
649083 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
657333.5 ns |
647625 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
642583.5 ns |
653666 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
611437.5 ns |
653416 ns |
0.94 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
198699.5 ns |
195489.5 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8946325 ns |
8194184 ns |
1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1346479.5 ns |
1355708 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
257148 ns |
256777.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2437416 ns |
2416000 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2444333 ns |
2443646 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2459917 ns |
2452084 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2446646 ns |
2458000 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
975703 ns |
1002651.5 ns |
0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
53584726.5 ns |
52361708 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
7227375 ns |
7303334 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1475180.5 ns |
1493424.5 ns |
0.99 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
32917 ns |
32291.5 ns |
1.02 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
34958 ns |
37125 ns |
0.94 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
34750 ns |
34959 ns |
0.99 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
875 ns |
958 ns |
0.91 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
15505 ns |
15494 ns |
1.00 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU |
73541 ns |
78911 ns |
0.93 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
3167 ns |
3125 ns |
1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
3271 ns |
3084 ns |
1.06 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
3375 ns |
3333 ns |
1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
3041 ns |
3083.5 ns |
0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
135878 ns |
140049.5 ns |
0.97 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU |
341823 ns |
357403 ns |
0.96 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406167 ns | 406729.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408250 ns | 409167 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 408750 ns | 408916 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 420042 ns | 421291.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42722 ns | 43852 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1392129 ns | 1426671 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1132750 ns | 1145916 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 240492.5 ns | 242907 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4010917 ns | 3884583 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4140708 ns | 3997395.5 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4151999.5 ns | 3997583.5 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3932958.5 ns | 3773396 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 429590.5 ns | 238021 ns | 1.80 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 40011947.5 ns | 37654037 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11642250 ns | 11658833 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1280768 ns | 1239152 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3916 ns | 3958 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3875 ns | 3917 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33393 ns | 33239.5 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1245305.5 ns | 1292975 ns | 0.96 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 176917 ns | 180000 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 41210 ns | 40870 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15708 ns | 15709 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 16000 ns | 15666 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15959 ns | 15958 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15583 ns | 15708 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 251876 ns | 255521 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8767093.5 ns | 9126745 ns | 0.96 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 864834 ns | 866125 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 169502 ns | 168612 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404584 ns | 404125 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295542 ns | 295459 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295458 ns | 295917 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760375 ns | 760584 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113075 ns | 113603 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 999782 ns | 1047585 ns | 0.95 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 389917 ns | 403542 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89371 ns | 89361 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1490875 ns | 1484917 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1158708 ns | 1156042 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1162750 ns | 1159666 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2503542 ns | 2467083.5 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 233635 ns | 253490 ns | 0.92 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 12058492.5 ns | 11945518 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1844333 ns | 1868708 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 357394 ns | 357078 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 459 ns | 458 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 25186 ns | 26377 ns | 0.95 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1362627.5 ns | 1231417 ns | 1.11 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 285833 ns | 289375 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 210112 ns | 208532 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 139125.5 ns | 8083 ns | 17.21 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 136708 ns | 8084 ns | 16.91 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 140166 ns | 8916 ns | 15.72 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 135875 ns | 8292 ns | 16.39 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 399319 ns | 205980.5 ns | 1.94 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25498054 ns | 25622754 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5342541.5 ns | 5187020.5 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 816958 ns | 699842 ns | 1.17 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 832125 ns | 831083.5 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 616208 ns | 618375 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 621417 ns | 620896 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1542583 ns | 1545833 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 132827 ns | 129305 ns | 1.03 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 170262 ns | 168801 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2690792 ns | 2685437.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1998875.5 ns | 1995875.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2004042 ns | 2004125 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4930500 ns | 4920541.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 239311 ns | 242604 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 865799 ns | 856188.5 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 292 ns | 1.14 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31389 ns | 32780 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1262909 ns | 1223978 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 299167 ns | 272646 ns | 1.10 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 49080.5 ns | 48940 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 137834 ns | 7042 ns | 19.57 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 134834 ns | 6959 ns | 19.38 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 138834 ns | 7750 ns | 17.91 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 135000 ns | 7000 ns | 19.29 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 434137 ns | 221584 ns | 1.96 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21203255 ns | 21530894 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4873458 ns | 4625916 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 507435 ns | 370743 ns | 1.37 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2432250 ns | 2406167 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2409167 ns | 2385438 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2387833 ns | 2416791.5 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2396292 ns | 2423834 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 189831 ns | 195765 ns | 0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8179796 ns | 7988442 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1483000 ns | 1576520.5 ns | 0.94 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 362144 ns | 358903 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4633541.5 ns | 4642937.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4653062.5 ns | 4645584 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4675166.5 ns | 4665146 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4555521 ns | 4650250 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 886700 ns | 908920 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48864880.5 ns | 47384300 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6240563 ns | 6194875 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1267073 ns | 1416134 ns | 0.89 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7208 ns | 6812.5 ns | 1.06 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 14083.5 ns | 7187 ns | 1.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7500 ns | 8125 ns | 0.92 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6916 ns | 6979 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 23218 ns | 23336 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1159818.5 ns | 1143574.5 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 262937.5 ns | 254646 ns | 1.03 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 40390 ns | 33330 ns | 1.21 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51479 ns | 50833.5 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 48791.5 ns | 48375 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 50416.5 ns | 64916.5 ns | 0.78 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 47000 ns | 63979.5 ns | 0.73 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 214489 ns | 218640 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10649119.5 ns | 10406518.5 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2003958.5 ns | 2003333 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 223847 ns | 237612 ns | 0.94 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 21459 ns | 21500 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 25062.5 ns | 26146 ns | 0.96 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 24375 ns | 25583 ns | 0.95 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 6000 ns | 5250 ns | 1.14 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 16873 ns | 16818.5 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 86560 ns | 85311 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 11917 ns | 12167 ns | 0.98 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10416 ns | 10250 ns | 1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10417 ns | 10792 ns | 0.97 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 17750 ns | 17916.5 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 225550 ns | 229003 ns | 0.98 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 374944 ns | 374873 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406166 ns | 406333 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 297084 ns | 297083 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296875 ns | 297291 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762250 ns | 762375 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46316 ns | 46714 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1373094 ns | 1399660 ns | 0.98 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 410792 ns | 487958 ns | 0.84 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89721 ns | 91141 ns | 0.98 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1485562.5 ns | 1491333 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1165333 ns | 1167145.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1166208 ns | 1166417 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2514500 ns | 2472271 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 287030.5 ns | 283154 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 11586937 ns | 13584197 ns | 0.85 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2073208 ns | 2090083.5 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 379943 ns | 380044 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 434375 ns | 434000 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 436958 ns | 437208 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 436708 ns | 436875 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 447167 ns | 446459 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 53910.5 ns | 55157 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1015096 ns | 1013588 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1130542 ns | 1079041.5 ns | 1.05 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 237292 ns | 236567.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4070791.5 ns | 3895271 ns | 1.05 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4204896 ns | 3933625.5 ns | 1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4209333 ns | 4028229.5 ns | 1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3891208 ns | 3807292 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 512126 ns | 259800 ns | 1.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31246857.5 ns | 36848551 ns | 0.85 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10831209 ns | 10417062.5 ns | 1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1294643 ns | 1238382.5 ns | 1.05 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8791 ns | 8708 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7709 ns | 7625 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7667 ns | 7666 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12375 ns | 12375 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 23822 ns | 24111 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2157287 ns | 2073642 ns | 1.04 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 224666.5 ns | 222645.5 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 217992 ns | 216692 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45375 ns | 45375 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45334 ns | 45125 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45250 ns | 45250 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45209 ns | 45292 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 342231 ns | 347040 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 13553832.5 ns | 13931322.5 ns | 0.97 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1724458.5 ns | 1692667 ns | 1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 676147 ns | 670407 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 123500 ns | 89750 ns | 1.38 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90167 ns | 147041 ns | 0.61 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 89500 ns | 124792 ns | 0.72 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80833 ns | 126979.5 ns | 0.64 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190984 ns | 189741 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5773034.5 ns | 5754517.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1967042 ns | 1948042 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 186732 ns | 184172 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2013292 ns | 2022000 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2014792 ns | 2017708 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2023604.5 ns | 2021875 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1964083 ns | 2017979.5 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 532292 ns | 539246.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31086820 ns | 28808272 ns | 1.08 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9598208.5 ns | 9259396 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 978310 ns | 1110411 ns | 0.88 |
This comment was automatically generated by workflow using github-action-benchmark.
avik-pal force-pushed the ap/warn_no_train branch from fb000d0 to bed2c87 (August 29, 2024 21:51)
Initial part of #98. We start with a warning; in 1.0 this will become an error.
autodiff
EnzymeAD/Enzyme.jl#1761