This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
test: add tests comparing the fused op with unfused op #157
Merged
avik-pal force-pushed the ap/testing_dense branch from f592a66 to f11e57d (September 10, 2024 19:48)
avik-pal force-pushed the ap/testing_dense branch from f11e57d to dba63b9 (September 10, 2024 20:32)
LuxLib Benchmarks
Benchmark suite | Current: f11e57d | Previous: 40d9192 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6750 ns | 5666 ns | 1.19 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 7459 ns | 0.80 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7250 ns | 8458 ns | 0.86 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5750 ns | 7291 ns | 0.79 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 119059 ns | 119078 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2735279 ns | 2538616 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 772709 ns | 702792 ns | 1.10 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 591165 ns | 427074 ns | 1.38 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9875.5 ns | 10020.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 9750 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 10250 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9708 ns | 9895.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 543283 ns | 551531 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17381908 ns | 18148603 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 2589875 ns | 2222000 ns | 1.17 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 890074 ns | 679576 ns | 1.31 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 8437.5 ns | 1271 ns | 6.64 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 8083 ns | 2729 ns | 2.96 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 8792 ns | 1708.5 ns | 5.15 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 8042 ns | 1708.5 ns | 4.71 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 23722 ns | 21712 ns | 1.09 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1332230 ns | 1291875 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 214125 ns | 183666 ns | 1.17 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 30537 ns | 31345.5 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 5250 ns | 3500 ns | 1.50 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4917 ns | 3333 ns | 1.48 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4833 ns | 4208.5 ns | 1.15 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4959 ns | 4375 ns | 1.13 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 148252.5 ns | 146456.5 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 9011509.5 ns | 8037303.5 ns | 1.12 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1295209 ns | 1510917 ns | 0.86 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 140623 ns | 146682 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58500 ns | 56500 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46937.5 ns | 46875 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46875 ns | 46833 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84250 ns | 83459 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 39503 ns | 36990 ns | 1.07 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 620600 ns | 664843 ns | 0.93 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1515416.5 ns | 1340625 ns | 1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 73577 ns | 80736 ns | 0.91 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2011875 ns | 2031000 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2063813 ns | 2086333.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2090208 ns | 2089292 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2004354 ns | 1995354 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 222303 ns | 232927.5 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 8058610 ns | 7734526 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7562916 ns | 4323958 ns | 1.75 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1047899 ns | 1581446 ns | 0.66 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 173479.5 ns | 147042 ns | 1.18 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 176187.5 ns | 144625 ns | 1.22 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 155396 ns | 149833 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147500 ns | 151895.5 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165814 ns | 166087 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7631325.5 ns | 7754863 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1699333.5 ns | 1479250 ns | 1.15 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 159930 ns | 198942 ns | 0.80 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1118083 ns | 1120063 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1103083 ns | 1117666 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1108333 ns | 1115750 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1119646 ns | 1124875 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 701220.5 ns | 721156.5 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 37668765 ns | 33562933.5 ns | 1.12 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6581625 ns | 6149062.5 ns | 1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 911594 ns | 1022579 ns | 0.89 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4333 ns | 4166 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4646 ns | 5041.5 ns | 0.92 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5729 ns | 6042 ns | 0.95 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4312.5 ns | 6250 ns | 0.69 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 92104.5 ns | 95202.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5286174 ns | 5313078 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 597770.5 ns | 416333.5 ns | 1.44 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 60844 ns | 65661 ns | 0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8834 ns | 9000 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8917 ns | 8709 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9292 ns | 9375 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8583 ns | 8417 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 597427 ns | 618225 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 35089178 ns | 31699887 ns | 1.11 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5972625 ns | 5433375 ns | 1.10 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 373409 ns | 388724 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17958.5 ns | 16229.5 ns | 1.11 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18208.5 ns | 17500 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21958 ns | 21916 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17500 ns | 18542 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 66685 ns | 68340 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3162792 ns | 3114761 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1280875 ns | 455354.5 ns | 2.81 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 75982 ns | 75821 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222292 ns | 213125 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 225458 ns | 212125 ns | 1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 215042 ns | 214749.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218916.5 ns | 223791 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 355209 ns | 361191 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 13835720 ns | 13957207 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5582750 ns | 5399125 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 424704 ns | 468614 ns | 0.91 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 10541.5 ns | 625 ns | 16.87 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 8667 ns | 667 ns | 12.99 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 10167 ns | 875 ns | 11.62 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 7812.5 ns | 708 ns | 11.03 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 22860 ns | 20782 ns | 1.10 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1259883 ns | 1176905 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 303958 ns | 179000 ns | 1.70 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 28744 ns | 31201 ns | 0.92 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2625 ns | 1458 ns | 1.80 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2333 ns | 1500 ns | 1.56 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2375 ns | 1541 ns | 1.54 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2209 ns | 1333.5 ns | 1.66 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 127050.5 ns | 128010.5 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9494797.5 ns | 9057994 ns | 1.05 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1576500 ns | 1474521 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 117880.5 ns | 136491 ns | 0.86 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14834 ns | 7333 ns | 2.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 13333 ns | 6166 ns | 2.16 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 14375 ns | 6166 ns | 2.33 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16542 ns | 10291 ns | 1.61 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33557 ns | 24318 ns | 1.38 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1327266 ns | 1193537 ns | 1.11 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 572917 ns | 341583 ns | 1.68 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 64270 ns | 47631 ns | 1.35 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 239208 ns | 231125 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 278959 ns | 270583 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 271521.5 ns | 270375 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257834 ns | 213167 ns | 1.21 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 201499.5 ns | 195209.5 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31139677 ns | 31467862 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9587292 ns | 9233666 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 601625 ns | 645516 ns | 0.93 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 ns | 4125 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4084 ns | 4125 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4209 ns | 4125 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4125 ns | 4125 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 22857 ns | 23938.5 ns | 0.95 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 2143984 ns | 2014824 ns | 1.06 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 223771 ns | 210750 ns | 1.06 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 42640 ns | 48021 ns | 0.89 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 21667 ns | 16916 ns | 1.28 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 21542 ns | 17417 ns | 1.24 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 21833 ns | 17208 ns | 1.27 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 20750 ns | 16667 ns | 1.24 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 205962 ns | 198962 ns | 1.04 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 11331316 ns | 10294946 ns | 1.10 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 947084 ns | 900625 ns | 1.05 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 180478 ns | 172967 ns | 1.04 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 512771 ns | 508125 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 405875 ns | 404416 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 406583 ns | 404792 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 865042 ns | 865375 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113394 ns | 113291 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 411956 ns | 429336 ns | 0.96 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 437042 ns | 432708 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 396972 ns | 242113 ns | 1.64 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2313666.5 ns | 2329437 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2033125 ns | 2034750 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2042917 ns | 2031750 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3198833 ns | 3193375 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 255204.5 ns | 246406 ns | 1.04 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11104817 ns | 12521873.5 ns | 0.89 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 1952000 ns | 1893250 ns | 1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 608372.5 ns | 744268 ns | 0.82 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6979 ns | 5187.5 ns | 1.35 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6562.5 ns | 7083 ns | 0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7875 ns | 7354 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6771 ns | 7542 ns | 0.90 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 87915.5 ns | 93165 ns | 0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5601042.5 ns | 5491281 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 765292 ns | 752833 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 60874 ns | 65211 ns | 0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11208 ns | 12167 ns | 0.92 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12500 ns | 11792 ns | 1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12375 ns | 12374.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10958 ns | 11396 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 612766 ns | 647871 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38641140 ns | 39284056 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5719500 ns | 5190667 ns | 1.10 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 393075.5 ns | 411409 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 3125 ns | 500 ns | 6.25 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 2917 ns | 541 ns | 5.39 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 3084 ns | 500 ns | 6.17 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 2708 ns | 500 ns | 5.42 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 30967 ns | 23724 ns | 1.31 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2285306 ns | 2212056 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 231333 ns | 204584 ns | 1.13 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 48601 ns | 47141 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 12583 ns | 2125 ns | 5.92 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 11875 ns | 2125 ns | 5.59 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 11834 ns | 2167 ns | 5.46 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 10584 ns | 2125 ns | 4.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 241976 ns | 227021 ns | 1.07 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 12085430.5 ns | 11087876.5 ns | 1.09 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 1942916.5 ns | 1921834 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 190887 ns | 172882 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 30125 ns | 8208 ns | 3.67 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 30250 ns | 9146 ns | 3.31 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 35999.5 ns | 9959 ns | 3.61 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 27500 ns | 8375 ns | 3.28 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 107239.5 ns | 104776 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3168249.5 ns | 3291769.5 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 906417 ns | 468500 ns | 1.93 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 78697 ns | 72700.5 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25312.5 ns | 17374.5 ns | 1.46 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 24334 ns | 18625 ns | 1.31 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 25062 ns | 18250 ns | 1.37 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 24354 ns | 18125 ns | 1.34 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 574746 ns | 580515 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 19314204 ns | 17620571 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5293792 ns | 4970938 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 372506.5 ns | 381279 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2167 ns | 459 ns | 4.72 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 1959 ns | 584 ns | 3.35 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2125 ns | 625 ns | 3.40 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 1791 ns | 458 ns | 3.91 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 40056 ns | 35839 ns | 1.12 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1228474 ns | 1218575 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 289687.5 ns | 423541 ns | 0.68 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 45445 ns | 46311 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11792 ns | 9104 ns | 1.30 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11917 ns | 9333 ns | 1.28 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11520.5 ns | 9083 ns | 1.27 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11000 ns | 9208 ns | 1.19 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 258486.5 ns | 261166 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18348508 ns | 18752145 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4878041.5 ns | 4335125 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 364537 ns | 367929 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396667 ns | 395708 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287229 ns | 288375 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288479.5 ns | 288375 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756334 ns | 756292 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111990 ns | 111964.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 336277 ns | 329610 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 367729.5 ns | 303771 ns | 1.21 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 77224 ns | 75611 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1472083.5 ns | 1445541 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1139895.5 ns | 1129292 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1142896 ns | 1133875 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2361791.5 ns | 2356333 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 221976.5 ns | 210839 ns | 1.05 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11123008 ns | 10091107 ns | 1.10 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1662645.5 ns | 1639416 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 305927.5 ns | 322414 ns | 0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7125 ns | 7042 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7770.5 ns | 8000 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8520.5 ns | 8833.5 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7792 ns | 7520.5 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 138464.5 ns | 142989 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5761431 ns | 5929780 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 724416 ns | 470791.5 ns | 1.54 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 63489 ns | 66011 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14417 ns | 16208 ns | 0.89 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15229 ns | 14250 ns | 1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15167 ns | 16000 ns | 0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15042 ns | 15354.5 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 957624 ns | 963872.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 44047156 ns | 42665593.5 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5683459 ns | 5541125 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 414450 ns | 426829 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26208 ns | 24458 ns | 1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 26208.5 ns | 26062.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 30041 ns | 29916.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25458 ns | 25708.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 197573 ns | 202495.5 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7876804.5 ns | 8124671 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 975021 ns | 985584 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 157760 ns | 114461 ns | 1.38 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 105062.5 ns | 109083 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 150834 ns | 152250 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 112167 ns | 152854 ns | 0.73 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 146375 ns | 142750 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1069177 ns | 1066908 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42327752 ns | 41393438 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5806479 ns | 5472042 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 553606 ns | 588251 ns | 0.94 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 81687.5 ns | 75167 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 78792 ns | 74583 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 80771 ns | 84375 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74979.5 ns | 74125 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 204985 ns | 208606 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8119657 ns | 7473638 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 533312.5 ns | 500875 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 124813.5 ns | 129022 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 251458 ns | 304417 ns | 0.83 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 298520.5 ns | 302145.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 300354.5 ns | 267604 ns | 1.12 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 318041 ns | 221146.5 ns | 1.44 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1109883 ns | 1119561.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44130956.5 ns | 40462234 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6429916 ns | 6061271 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 642250 ns | 695387 ns | 0.92 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 17125.5 ns | 15729.5 ns | 1.09 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 16916 ns | 17541 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18354.5 ns | 18000 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 17187.5 ns | 17000 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 144430.5 ns | 148248.5 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 6119082 ns | 5730909.5 ns | 1.07 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 583667 ns | 745333 ns | 0.78 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 225772 ns | 232902 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25292 ns | 26937 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27708.5 ns | 26291.5 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27792 ns | 27291 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 28167 ns | 26833.5 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 969396 ns | 995021 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41222701.5 ns | 39941943 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5774021 ns | 5463292 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 641800 ns | 692327 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 37375 ns | 10375 ns | 3.60 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 33937.5 ns | 11875 ns | 2.86 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 43000 ns | 12562 ns | 3.42 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 30521 ns | 11625 ns | 2.63 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 138411.5 ns | 125968 ns | 1.10 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3833959.5 ns | 3534875 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 850833 ns | 849958 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 247072 ns | 236132 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 24291 ns | 22292 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 24125 ns | 21542 ns | 1.12 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 25458 ns | 23416 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 23750 ns | 22459 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 701835.5 ns | 709781 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21171059 ns | 21081902.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5379104 ns | 5312812.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 578882 ns | 671626 ns | 0.86 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
64104.5 ns |
63000 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
64208 ns |
64875 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
67187.5 ns |
67624.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
65667 ns |
70792 ns |
0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
105929.5 ns |
108732 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3410310 ns |
3570568 ns |
0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1311250 ns |
463166.5 ns |
2.83 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
238446 ns |
233653 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
442604 ns |
437250 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
485687 ns |
448250 ns |
1.08 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
444375 ns |
451208 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
450229 ns |
443667 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
511443 ns |
523839.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
21511920 ns |
20377781.5 ns |
1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
5894125 ns |
6056791 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
645166 ns |
715783 ns |
0.90 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7625 ns |
7104.5 ns |
1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7479.5 ns |
8125 ns |
0.92 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
8750 ns |
8333 ns |
1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7417 ns |
7729.5 ns |
0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
143471 ns |
147799 ns |
0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
5455985 ns |
5614298 ns |
0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
694542 ns |
704750 ns |
0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
61425 ns |
65321 ns |
0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
13708 ns |
14500 ns |
0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
13812.5 ns |
15437.5 ns |
0.89 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
15042 ns |
14833 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
15834 ns |
14146 ns |
1.12 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
936407 ns |
966324 ns |
0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
39926329 ns |
36660688 ns |
1.09 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5583959 ns |
5256874.5 ns |
1.06 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
394929 ns |
400984 ns |
0.98 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) |
6153667 ns |
6153708 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) |
6372917 ns |
6380458 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) |
6368479 ns |
6380979.5 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) |
11917041 ns |
11947959 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA |
350729 ns |
301662 ns |
1.16 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU |
394713.5 ns |
322583 ns |
1.22 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) |
19134500 ns |
19056521 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) |
19938458 ns |
19941000 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) |
19938625 ns |
19981146 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) |
36508062.5 ns |
36490833.5 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA |
1138227 ns |
1026590 ns |
1.11 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU |
1139856 ns |
1153502 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
4041 ns |
917 ns |
4.41 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
3416 ns |
959 ns |
3.56 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
3750 ns |
1000 ns |
3.75 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
3292 ns |
958 ns |
3.44 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
30429 ns |
23570 ns |
1.29 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI |
2237968 ns |
2101433 ns |
1.06 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal |
233792 ns |
203000 ns |
1.15 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
209867.5 ns |
207632 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
11208 ns |
3708 ns |
3.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
12167 ns |
3791 ns |
3.21 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
13041 ns |
3792 ns |
3.44 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
11375 ns |
3750 ns |
3.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
313879 ns |
284692.5 ns |
1.10 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
11584761 ns |
11502827.5 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal |
2157229.5 ns |
2063354 ns |
1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
488504 ns |
625846 ns |
0.78 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
25999.5 ns |
7208 ns |
3.61 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
33479 ns |
8500 ns |
3.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
43250 ns |
9292 ns |
4.65 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
26333 ns |
8250 ns |
3.19 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
137461 ns |
122668.5 ns |
1.12 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3438795 ns |
3715127.5 ns |
0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
799584 ns |
787166 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
80640 ns |
72740 ns |
1.11 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
17354 ns |
11875 ns |
1.46 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
19291 ns |
12750 ns |
1.51 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
19875 ns |
12583 ns |
1.58 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
18125 ns |
12500 ns |
1.45 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
666548 ns |
651999 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
20922356.5 ns |
22144306 ns |
0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
4784875 ns |
4276208 ns |
1.12 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
359342.5 ns |
359014 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
333 ns |
334 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
22472 ns |
22720.5 ns |
0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI |
2158627.5 ns |
2075647.5 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal |
227854.5 ns |
205083 ns |
1.11 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU |
42420 ns |
47440 ns |
0.89 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
6209 ns |
2875 ns |
2.16 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
7209 ns |
3500 ns |
2.06 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
7792 ns |
3333 ns |
2.34 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
6584 ns |
3208 ns |
2.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
220697 ns |
206663 ns |
1.07 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI |
9979049 ns |
9232071 ns |
1.08 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal |
1643104.5 ns |
1552875 ns |
1.06 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
160089.5 ns |
156172 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
12354.5 ns |
10083 ns |
1.23 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
11791.5 ns |
11083 ns |
1.06 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
13312.5 ns |
12458 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10958 ns |
11708 ns |
0.94 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
123366 ns |
123476 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3431504 ns |
3456473.5 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
889208 ns |
861479.5 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
234880 ns |
236062 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
20542 ns |
20604 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
21333 ns |
23187.5 ns |
0.92 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
21750 ns |
23333 ns |
0.93 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
21167 ns |
21042 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
598497.5 ns |
607311 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19557859 ns |
20290582.5 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
4742292 ns |
4254667 ns |
1.11 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
574605 ns |
645431.5 ns |
0.89 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
7125 ns |
4458 ns |
1.60 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
6917 ns |
4500 ns |
1.54 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
7375 ns |
4417 ns |
1.67 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
6833 ns |
4500 ns |
1.52 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
32096 ns |
24732 ns |
1.30 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI |
2367486 ns |
2177168 ns |
1.09 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal |
223083.5 ns |
211459 ns |
1.05 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU |
52959 ns |
47591 ns |
1.11 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
25667 ns |
16375 ns |
1.57 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
27542 ns |
16834 ns |
1.64 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
27500 ns |
16458 ns |
1.67 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
25833 ns |
16083 ns |
1.61 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
359637.5 ns |
332546.5 ns |
1.08 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI |
13316554 ns |
12988178 ns |
1.03 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal |
1347791 ns |
1511750 ns |
0.89 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
213669 ns |
208322 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
3375 ns |
2084 ns |
1.62 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
3500 ns |
2041 ns |
1.71 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
3708 ns |
2167 ns |
1.71 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
3417 ns |
2209 ns |
1.55 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
41395 ns |
36551 ns |
1.13 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1212620.5 ns |
1147028 ns |
1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
296083 ns |
268042 ns |
1.10 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
205695 ns |
204212 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
18291.5 ns |
17396 ns |
1.05 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
19041.5 ns |
17250 ns |
1.10 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
20354 ns |
17812.5 ns |
1.14 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
20791 ns |
19479 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
292894 ns |
297836 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
20954498 ns |
21470855.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
5026958 ns |
5022375 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
614799.5 ns |
686617 ns |
0.90 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
60167 ns |
56395.5 ns |
1.07 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
65750 ns |
65083 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
65458 ns |
66250 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
53646 ns |
51333 ns |
1.05 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66510 ns |
66767.5 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU |
108798.5 ns |
115211 ns |
0.94 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
188417 ns |
197187.5 ns |
0.96 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
113292 ns |
163417 ns |
0.69 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
126854 ns |
163937.5 ns |
0.77 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
305229 ns |
315500 ns |
0.97 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
220047 ns |
219712.5 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU |
527527 ns |
611147 ns |
0.86 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
143208 ns |
105333 ns |
1.36 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
83187.5 ns |
81834 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
84875 ns |
86959 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82416.5 ns |
86750 ns |
0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
190604 ns |
191740.5 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5436076.5 ns |
5593567.5 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1857292 ns |
2535645.5 ns |
0.73 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
206837 ns |
204172 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1890208 ns |
1915521 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1880729.5 ns |
1914333 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1667062.5 ns |
1911750 ns |
0.87 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1763312.5 ns |
1879292 ns |
0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
534577 ns |
538609 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
26696643 ns |
24792062.5 ns |
1.08 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9225937.5 ns |
8911395.5 ns |
1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
879505 ns |
1067201 ns |
0.82 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
2625 ns |
292 ns |
8.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
2708 ns |
292 ns |
9.27 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
3084 ns |
292 ns |
10.56 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
2542 ns |
333 ns |
7.63 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
28896 ns |
22127 ns |
1.31 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI |
2115362 ns |
2111782 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal |
336500 ns |
320417 ns |
1.05 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU |
43972 ns |
41970 ns |
1.05 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
10125 ns |
1792 ns |
5.65 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
12083 ns |
1875 ns |
6.44 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
13584 ns |
1875 ns |
7.24 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
11541 ns |
1875 ns |
6.16 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
280701 ns |
255417.5 ns |
1.10 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI |
10783143 ns |
10493115 ns |
1.03 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal |
1321833 ns |
1487041 ns |
0.89 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
196998 ns |
183032 ns |
1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
10187.5 ns |
7375 ns |
1.38 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
10208.5 ns |
9562.5 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
11084 ns |
11250 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
10500 ns |
11333 ns |
0.93 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
119766 ns |
121634 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3544552 ns |
3330370 ns |
1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
859750.5 ns |
831000 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
236352 ns |
235863 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9021 ns |
8958 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10167 ns |
10917 ns |
0.93 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9708 ns |
11542 ns |
0.84 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9292 ns |
9250 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
528594 ns |
536196 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20308774 ns |
20906072 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
4428479.5 ns |
3661104.5 ns |
1.21 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
556110 ns |
620146.5 ns |
0.90 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
65208 ns |
56833 ns |
1.15 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
55625 ns |
46333 ns |
1.20 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
58792 ns |
47000 ns |
1.25 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
89875 ns |
83417 ns |
1.08 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
50195 ns |
40185 ns |
1.25 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1487606.5 ns |
1391043 ns |
1.07 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1138750 ns |
1150167 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
82043 ns |
77886 ns |
1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1861958 ns |
1925959 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1984667 ns |
1932875 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1968666.5 ns |
1975666 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1901542 ns |
1853417 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
235593 ns |
224336 ns |
1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
34412214 ns |
33169959 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11173167 ns |
11254125 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1019646 ns |
1176553 ns |
0.87 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
419250.5 ns |
416209 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
418458 ns |
418021.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
421792 ns |
423500 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
418166 ns |
417709 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
211672.5 ns |
212391.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
8780508 ns |
7928224 ns |
1.11 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
532833 ns |
501042 ns |
1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
269003 ns |
283733 ns |
0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
670000 ns |
689875.5 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
683417 ns |
744770.5 ns |
0.92 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
674958 ns |
684250 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
674541 ns |
683020.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1054744 ns |
1071393 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
45045495 ns |
45538634 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6450000 ns |
6134687.5 ns |
1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
862543 ns |
911264.5 ns |
0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
3451396 ns |
3426041.5 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
3392854.5 ns |
3415458.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
3412042 ns |
3440084 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
3434750 ns |
3459083 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
172559 ns |
174794 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8346126 ns |
8045126 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1390000 ns |
1391250 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
666837 ns |
426850 ns |
1.56 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
6190917 ns |
6168667 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
6183937.5 ns |
6210416 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
6169479.5 ns |
6205709 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
6221625 ns |
6247562.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1001720 ns |
1017240 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
48855796 ns |
50293396 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
7250416 ns |
7732791.5 ns |
0.94 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1506478.5 ns |
1542501 ns |
0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
474333 ns |
473291 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
343083 ns |
342875 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
344062.5 ns |
341396 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
906521 ns |
901791 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
54961 ns |
46836 ns |
1.17 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI |
405228 ns |
381391 ns |
1.06 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal |
420833.5 ns |
354270.5 ns |
1.19 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
406260 ns |
243143 ns |
1.67 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2312792 ns |
2332208 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
2037666 ns |
2034354.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
2040458 ns |
2036500 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3206833 ns |
3194416 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
303345 ns |
273644.5 ns |
1.11 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
15525931 ns |
15628377 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal |
2177250 ns |
2136645.5 ns |
1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
644946 ns |
772838 ns |
0.83 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
63292 ns |
56292 ns |
1.12 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
55084 ns |
45834 ns |
1.20 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
52812.5 ns |
46125 ns |
1.14 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
90084 ns |
83209 ns |
1.08 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
38580 ns |
28601 ns |
1.35 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1404175.5 ns |
1335147 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1161541 ns |
1124979 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
82675 ns |
74305.5 ns |
1.11 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1921959 ns |
2016104.5 ns |
0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2011209 ns |
2087291 ns |
0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1782958 ns |
2087917 ns |
0.85 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2014334 ns |
1975958.5 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
248887 ns |
240545 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
38486330 ns |
37474096 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11361792 ns |
11883709 ns |
0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1052699 ns |
1048951 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58250 ns |
56542 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
48250 ns |
46354.5 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
48125 ns |
46666.5 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
84709 ns |
83750 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
55882 ns |
50752 ns |
1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
833022 ns |
835807 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1106333 ns |
1048667 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
73938 ns |
78556 ns |
0.94 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1895166 ns |
1921000 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1977250 ns |
1952958.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1934667 ns |
1973000 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1896166.5 ns |
1862417 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
249033 ns |
246729 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
17897206 ns |
16959227 ns |
1.06 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9818416 ns |
9957875 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
961798.5 ns |
1034211 ns |
0.93 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
1542 ns |
292 ns |
5.28 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
1875 ns |
416 ns |
4.51 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1542 ns |
416 ns |
3.71 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1458 ns |
292 ns |
4.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
40187 ns |
35694 ns |
1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
1298014 ns |
1211794.5 ns |
1.07 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
436354 ns |
311771 ns |
1.40 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
47208 ns |
46570 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8479.5 ns |
6604.5 ns |
1.28 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9083 ns |
7291.5 ns |
1.25 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8083 ns |
6666 ns |
1.21 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8000 ns |
6709 ns |
1.19 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
216055 ns |
213644 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
21272220.5 ns |
21642370 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5159042 ns |
4349083.5 ns |
1.19 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
361170.5 ns |
366543.5 ns |
0.99 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
291 ns |
250 ns |
1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
291 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
291 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
32391 ns |
32948 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI |
1244615 ns |
1191915 ns |
1.04 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal |
251500 ns |
153792 ns |
1.64 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU |
36889 ns |
39081 ns |
0.94 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
6209 ns |
3208 ns |
1.94 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
7792 ns |
3041 ns |
2.56 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
6667 ns |
3083 ns |
2.16 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
6541 ns |
3083 ns |
2.12 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
206420 ns |
193915 ns |
1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI |
9181253 ns |
7217530 ns |
1.27 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal |
939167 ns |
894250 ns |
1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
154709 ns |
158472 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
445041.5 ns |
420583.5 ns |
1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
455500 ns |
420833.5 ns |
1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
447667 ns |
456166.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
448542 ns |
426229 ns |
1.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
143863 ns |
140216.5 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6251430 ns |
6258248 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2006875 ns |
2682604 ns |
0.75 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
465741 ns |
367294 ns |
1.27 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3792292 ns |
3811479 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3712500 ns |
3798000 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3740645.5 ns |
3806125 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3681542 ns |
3813437.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
718109 ns |
724543 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
33048296.5 ns |
32785400 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10430124.5 ns |
10852833 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1140363 ns |
1313993.5 ns |
0.87 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49865562.5 ns |
49807062.5 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
35500000 ns |
35521583 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
35527834 ns |
35517479 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
97086500 ns |
97112834 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1593681 ns |
1611615 ns |
0.99 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
1577160 ns |
1049140 ns |
1.50 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154458541.5 ns |
153740041.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
112337791.5 ns |
112306083 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
112343542 ns |
112476667 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
295487729 ns |
295356541 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6504453 ns |
6485483 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5864859 ns |
5555702 ns |
1.06 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
21250 ns |
15041.5 ns |
1.41 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
21375 ns |
18375 ns |
1.16 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
20270.5 ns |
16083 ns |
1.26 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
23000 ns |
15646 ns |
1.47 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
22923.5 ns |
21271 ns |
1.08 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1127379 ns |
1120492.5 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal |
223667 ns |
200000 ns |
1.12 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
28513 ns |
27480 ns |
1.04 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
11791 ns |
10666.5 ns |
1.11 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
10000 ns |
9042 ns |
1.11 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
10188 ns |
9437.5 ns |
1.08 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
18333.5 ns |
17042 ns |
1.08 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
261557 ns |
267724 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
9750936.5 ns |
10072145 ns |
0.97 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal |
1567125 ns |
1541750 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
141680 ns |
148171 ns |
0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
26958 ns |
7709 ns |
3.50 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
26041.5 ns |
8709 ns |
2.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
29833 ns |
10708 ns |
2.79 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
27917 ns |
9708.5 ns |
2.88 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
140387.5 ns |
129031 ns |
1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3651619.5 ns |
3486446 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
803646 ns |
797791 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
244217 ns |
234732 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
11584 ns |
10458.5 ns |
1.11 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
11625 ns |
9833 ns |
1.18 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
11541.5 ns |
11333.5 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
11416 ns |
9125 ns |
1.25 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
629099.5 ns |
638866 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
23887109 ns |
21816663 ns |
1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
4918854 ns |
4208187.5 ns |
1.17 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
580556 ns |
651461.5 ns |
0.89 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
9417 ns |
8625.5 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9979 ns |
9729 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
11583 ns |
11521 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
9334 ns |
11042 ns |
0.85 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
121984.5 ns |
123974 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3546531 ns |
3315044 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
896062.5 ns |
859750 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
71303 ns |
72471 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
14187.5 ns |
17583 ns |
0.81 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13750 ns |
13458 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13584 ns |
15166 ns |
0.90 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
14125 ns |
13083 ns |
1.08 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
595508 ns |
608117 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
21472632.5 ns |
18976850.5 ns |
1.13 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
4571041 ns |
3989167 ns |
1.15 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
338302 ns |
346933 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
1791 ns |
541 ns |
3.31 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
1750 ns |
625 ns |
2.80 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1791 ns |
625 ns |
2.87 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1667 ns |
584 ns |
2.85 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
40005 ns |
35726 ns |
1.12 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1233960 ns |
1170850 ns |
1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
275083.5 ns |
255917 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
207819 ns |
204512 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9500 ns |
8604.5 ns |
1.10 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9437.5 ns |
7625 ns |
1.24 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9500 ns |
9250 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8958 ns |
7584 ns |
1.18 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
230455 ns |
237837 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
23652437 ns |
23133813.5 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
4903187.5 ns |
4454021 ns |
1.10 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
597027 ns |
654907 ns |
0.91 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
17937 ns |
12208 ns |
1.47 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
18604 ns |
16208 ns |
1.15 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
18792 ns |
15542 ns |
1.21 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
18542 ns |
10229 ns |
1.81 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
24371 ns |
22887 ns |
1.06 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1228812.5 ns |
1146280 ns |
1.07 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal |
209292 ns |
183250 ns |
1.14 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
190396.5 ns |
190602 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
32875 ns |
31917 ns |
1.03 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
32958 ns |
32334 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
33125 ns |
32334 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
32875 ns |
31792 ns |
1.03 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
280116 ns |
282370 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11861919 ns |
12675054 ns |
0.94 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal |
1676666.5 ns |
1664375 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
547659 ns |
592261 ns |
0.92 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
441854.5 ns |
445708 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
475312.5 ns |
440416 ns |
1.08 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
444000 ns |
446125 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
443166 ns |
462250 ns |
0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194868 ns |
194079.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6147280 ns |
6009981 ns |
1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1962250 ns |
1948750 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
365368.5 ns |
368473 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3714333 ns |
3828708 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3747417 ns |
3827249.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3455458.5 ns |
3829459 ns |
0.90 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3835083 ns |
3834708 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
543251.5 ns |
555671 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
29951646 ns |
28291601.5 ns |
1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
8981208.5 ns |
9332833 ns |
0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1128911 ns |
1362449 ns |
0.83 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
831451229 ns |
836902583.5 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
543434250 ns |
545812333 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
542523083 ns |
552742958 ns |
0.98 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1509475541 ns |
1515431791 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22745955.5 ns |
22773250.5 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
10453226 ns |
14681704 ns |
0.71 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
3556189000 ns |
3618929167 ns |
0.98 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
2979880625 ns |
1786520209 ns |
1.67 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
1804545917 ns |
1811380625 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
4761240750 ns |
4749890834 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
370586856 ns |
371829328 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
67278654 ns |
89064682 ns |
0.76 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
75875 ns |
75813 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
77521 ns |
76708 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79125 ns |
79437 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
78709 ns |
76979 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
209735 ns |
213831.5 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7790930 ns |
7889207 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
525125 ns |
504291 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
141925.5 ns |
107541 ns |
1.32 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
193708 ns |
268729 ns |
0.72 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
280750 ns |
283625 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
197542 ns |
204145.5 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
197375 ns |
192875 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1052688.5 ns |
1071904.5 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
47347171 ns |
42887765 ns |
1.10 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6146625 ns |
5838812.5 ns |
1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
618917 ns |
632041 ns |
0.98 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
199650333.5 ns |
199435500 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
139206333 ns |
139086375 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
139463625 ns |
139238083 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
389189291 ns |
389003125 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5834182 ns |
5834940 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
2538622.5 ns |
3577266 ns |
0.71 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
619678062.5 ns |
616747896 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
441556083 ns |
438910291 ns |
1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
439833958.5 ns |
439344770.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1179733292 ns |
1178749375 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26679227 ns |
26592537.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
16136881 ns |
22013573 ns |
0.73 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
14208 ns |
7292 ns |
1.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
12792 ns |
6291 ns |
2.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
13292 ns |
6250 ns |
2.13 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
16708 ns |
9959 ns |
1.68 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
37547.5 ns |
28590.5 ns |
1.31 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1347839 ns |
1242816 ns |
1.08 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
632375 ns |
342708 ns |
1.85 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
65412.5 ns |
46790 ns |
1.40 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
225229.5 ns |
214875 ns |
1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
228229.5 ns |
220542 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
229375 ns |
223250 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
221792 ns |
207000 ns |
1.07 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
235566.5 ns |
227888 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
34095690 ns |
32088566 ns |
1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9069291.5 ns |
9056958 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
583156 ns |
532636 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
10709 ns |
7500 ns |
1.43 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9667 ns |
8459 ns |
1.14 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10812.5 ns |
11166 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
10000 ns |
10125 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
119996.5 ns |
120432.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3386801 ns |
3400864 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
865208 ns |
833917 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
70562 ns |
69170 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8208 ns |
11687 ns |
0.70 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8250 ns |
7875 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7834 ns |
9083 ns |
0.86 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8042 ns |
7791.5 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
522508.5 ns |
540200 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
19977756 ns |
19905821.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
4494666 ns |
3738000 ns |
1.20 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
328765 ns |
316443 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7208 ns |
500 ns |
14.42 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7084 ns |
500 ns |
14.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7167 ns |
583 ns |
12.29 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7125 ns |
500 ns |
14.25 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
35632 ns |
26859 ns |
1.33 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1285621 ns |
1218948 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
313959 ns |
487291.5 ns |
0.64 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
53540 ns |
46600 ns |
1.15 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
16833 ns |
12042 ns |
1.40 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
16417 ns |
9500 ns |
1.73 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
16812.5 ns |
10666 ns |
1.58 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
17417 ns |
9375 ns |
1.86 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
267006.5 ns |
259067.5 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
24406042 ns |
22720833.5 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5634958 ns |
5032208 ns |
1.12 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
418588 ns |
388914 ns |
1.08 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
115145.5 ns |
105209 ns |
1.09 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
106812 ns |
98958.5 ns |
1.08 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
110417 ns |
100666 ns |
1.10 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
154333 ns |
146584 ns |
1.05 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
27538 ns |
26010 ns |
1.06 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
1258504.5 ns |
1202311.5 ns |
1.05 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal |
264333.5 ns |
239416 ns |
1.10 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
210464 ns |
191122 ns |
1.10 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
519625 ns |
478959 ns |
1.08 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
480604 ns |
490458 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
480083 ns |
483458 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
479833.5 ns |
519792 ns |
0.92 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
236269 ns |
238157 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11815219 ns |
11712742 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal |
2113625 ns |
2063166.5 ns |
1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
580731 ns |
609226.5 ns |
0.95 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
5312.5 ns |
5459 ns |
0.97 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
6520.5 ns |
6937.5 ns |
0.94 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
7375 ns |
6708 ns |
1.10 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
4417 ns |
4479 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
16222 ns |
17171 ns |
0.94 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
73207 ns |
84830 ns |
0.86 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
13167 ns |
12709 ns |
1.04 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
11688 ns |
11208.5 ns |
1.04 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
12000 ns |
11979.5 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
17708 ns |
16792 ns |
1.05 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
219499 ns |
219500 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
331220 ns |
367374 ns |
0.90 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
40250 ns |
35250 ns |
1.14 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
51562.5 ns |
51958 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
52334 ns |
53333 ns |
0.98 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
13625 ns |
13792 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
19967 ns |
22473 ns |
0.89 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
93615 ns |
87211 ns |
1.07 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
38042 ns |
37208 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
32479 ns |
30979 ns |
1.05 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
32479 ns |
32729.5 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
58542 ns |
57375 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
198501 ns |
198883 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
372867 ns |
411165 ns |
0.91 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
9708 ns |
1708 ns |
5.68 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
9791.5 ns |
1917 ns |
5.11 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
10667 ns |
2208 ns |
4.83 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
10792 ns |
2020.5 ns |
5.34 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
22678 ns |
20890 ns |
1.09 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1185948.5 ns |
1182894 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal |
306541.5 ns |
198895.5 ns |
1.54 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
29910.5 ns |
34491 ns |
0.87 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
3208 ns |
2250 ns |
1.43 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
3208 ns |
2125 ns |
1.51 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
3208 ns |
2541 ns |
1.26 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
3250 ns |
2375 ns |
1.37 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
206737.5 ns |
209350.5 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
8767043 ns |
9223044 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal |
1506229.5 ns |
1571458 ns |
0.96 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
126807 ns |
137241 ns |
0.92 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5000 ns |
3979.5 ns |
1.26 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4833.5 ns |
4916 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6333 ns |
6167 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5875 ns |
5562.5 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
145367 ns |
148854.5 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5734563 ns |
5416916 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
444458 ns |
433541 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
62777 ns |
69351 ns |
0.91 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8562.5 ns |
8958 ns |
0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8791 ns |
8584 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8791.5 ns |
9375 ns |
0.94 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8209 ns |
8208 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
879051 ns |
901778 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
39241108 ns |
39101068.5 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
5552875 ns |
5296271 ns |
1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
374441 ns |
390164 ns |
0.96 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
58292 ns |
56792 ns |
1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
59125 ns |
57792 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
58917 ns |
57667 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
60042 ns |
58625 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
43183 ns |
38676 ns |
1.12 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1223886 ns |
1256024 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
430916.5 ns |
328000 ns |
1.31 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
207588 ns |
204982 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
451104 ns |
454396 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
468729.5 ns |
464875 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
466667 ns |
465042 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
435917 ns |
433750 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
266055.5 ns |
274516.5 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27322606 ns |
27766998 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8068750 ns |
7963542 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
777504 ns |
840618 ns |
0.92 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3310708 ns |
3290875 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
2333646 ns |
2340916.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
2333854 ns |
2344208.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6332209 ns |
6314083.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
204884.5 ns |
205766 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
460746.5 ns |
213542 ns |
2.16 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11462875 ns |
11352771 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
8327520.5 ns |
8308208 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
8340084 ns |
8331229.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21238333.5 ns |
21159458.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
730082 ns |
735602 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
2032366 ns |
1058910.5 ns |
1.92 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5833.5 ns |
3542 ns |
1.65 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5917 ns |
6646 ns |
0.89 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7375 ns |
7333 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6750 ns |
6875 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
138042.5 ns |
141882 ns |
0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
5746357.5 ns |
5384644 ns |
1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
742000 ns |
792000 ns |
0.94 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
62888 ns |
56381 ns |
1.12 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7625 ns |
9458 ns |
0.81 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7458 ns |
7583.5 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7666 ns |
7250 ns |
1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7791.5 ns |
7458 ns |
1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
759439.5 ns |
774451.5 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
37720178 ns |
37102116 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5234834 ns |
5116062.5 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
366936 ns |
368734 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
114708.5 ns |
95500 ns |
1.20 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
119667 ns |
95041 ns |
1.26 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
121250 ns |
101334 ns |
1.20 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
116791.5 ns |
96958 ns |
1.20 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
149606 ns |
153183 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5876370 ns |
5925151 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2026958 ns |
2007167 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
244046.5 ns |
218112 ns |
1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1934334 ns |
2021874.5 ns |
0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2021750 ns |
2010334 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2017250 ns |
2025458 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2026313 ns |
2005917 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
716402 ns |
723141 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32736584 ns |
33170321 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10561729 ns |
10803562.5 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
973261 ns |
1255352 ns |
0.78 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
33458 ns |
29750 ns |
1.12 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
37667 ns |
36291.5 ns |
1.04 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
35333 ns |
35000 ns |
1.01 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
625 ns |
708 ns |
0.88 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15397 ns |
15831 ns |
0.97 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU |
71093 ns |
80041 ns |
0.89 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
3770.5 ns |
3417 ns |
1.10 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
3791 ns |
3000 ns |
1.26 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
3916.5 ns |
2958 ns |
1.32 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
3084 ns |
2292 ns |
1.35 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
143671.5 ns |
144997 ns |
0.99 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU |
305020.5 ns |
345563 ns |
0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
8583 ns |
7167 ns |
1.20 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
7292 ns |
6208 ns |
1.17 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
7459 ns |
6042 ns |
1.23 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
11250 ns |
10458 ns |
1.08 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
42708 ns |
37804.5 ns |
1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1188962 ns |
1127358 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
350459 ns |
324750 ns |
1.08 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
58610 ns |
48830 ns |
1.20 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
215271 ns |
213833 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
222250 ns |
221229 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
222770.5 ns |
220667 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
208791.5 ns |
206167 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
253600 ns |
251783 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27159107 ns |
25462835 ns |
1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7909354 ns |
7855917 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
541753 ns |
579016 ns |
0.94 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
6417 ns |
3917 ns |
1.64 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
6375 ns |
3958 ns |
1.61 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
6375 ns |
3917 ns |
1.63 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
6167 ns |
4167 ns |
1.48 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
29010 ns |
22588 ns |
1.28 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI |
2194926 ns |
2083671 ns |
1.05 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal |
248395.5 ns |
226542 ns |
1.10 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU |
41658 ns |
42771 ns |
0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
23917 ns |
14916 ns |
1.60 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
24083 ns |
15083 ns |
1.60 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
24250 ns |
14916 ns |
1.63 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
23583 ns |
14792 ns |
1.59 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
343876 ns |
316521 ns |
1.09 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI |
11696705 ns |
11265875.5 ns |
1.04 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal |
1015292 ns |
963479.5 ns |
1.05 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
205739.5 ns |
193022 ns |
1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
130417 ns |
101709 ns |
1.28 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
122291 ns |
99958 ns |
1.22 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
125625 ns |
106041 ns |
1.18 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
120145.5 ns |
102208 ns |
1.18 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
148950 ns |
142614 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5929635 ns |
5689078 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2027208.5 ns |
2045292 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
277129 ns |
214192 ns |
1.29 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1923458.5 ns |
1924667 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1906812.5 ns |
1842979 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1677187 ns |
1918292 ns |
0.87 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1889625 ns |
1901125 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
697015 ns |
707209 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31755954.5 ns |
31631954.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10391833.5 ns |
10461667 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
945188 ns |
1220282 ns |
0.77 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
18625 ns |
16604 ns |
1.12 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18958 ns |
18813 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21500 ns |
21271 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
18291 ns |
18291 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
109595.5 ns |
111618 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3368268 ns |
3369345 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1330250 ns |
464208 ns |
2.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
92573 ns |
80435.5 ns |
1.15 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
253416.5 ns |
216042 ns |
1.17 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
248958.5 ns |
217458 ns |
1.14 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
217250 ns |
216708.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
224708 ns |
216395.5 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
523249 ns |
534644 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
19863537 ns |
19551285.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6108125 ns |
6104084 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
441706 ns |
481515 ns |
0.92 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
24770.5 ns |
23416.5 ns |
1.06 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
31416.5 ns |
30395.5 ns |
1.03 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
29458 ns |
28583 ns |
1.03 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
1167 ns |
1250 ns |
0.93 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
15968.5 ns |
16607 ns |
0.96 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU |
72855 ns |
81651 ns |
0.89 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
6083 ns |
4729.5 ns |
1.29 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
5645.5 ns |
4916.5 ns |
1.15 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
6229 ns |
5104.5 ns |
1.22 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
5750 ns |
4875 ns |
1.18 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
213748.5 ns |
212757 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU |
331398 ns |
378384 ns |
0.88 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
305458.5 ns |
303792 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
306500 ns |
306416.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
308437.5 ns |
308125 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
307958 ns |
306917 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
230416.5 ns |
235352.5 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7810401.5 ns |
7753901 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
910291 ns |
895000 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
265765 ns |
273893 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
532083.5 ns |
532500 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
590291 ns |
561375 ns |
1.05 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
543916 ns |
533875 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
533520.5 ns |
538042 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1084806 ns |
1115910 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
46721471 ns |
43545460 ns |
1.07 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6198229 ns |
5736646 ns |
1.08 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
811042 ns |
855458 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
39270.5 ns |
18500 ns |
2.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
39125 ns |
23125 ns |
1.69 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
41125 ns |
20875 ns |
1.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
38000 ns |
20250 ns |
1.88 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
130774 ns |
117298.5 ns |
1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3722419 ns |
3644245 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1485167 ns |
475438 ns |
3.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
97071 ns |
79291 ns |
1.22 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
215563 ns |
213125 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
228187.5 ns |
227959 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
216125 ns |
214479.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
214104.5 ns |
212750 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
750448 ns |
769273 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
25517860 ns |
26817998 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7039083 ns |
7163750 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
494376.5 ns |
536785 ns |
0.92 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6437.5 ns |
5292 ns |
1.22 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
7041 ns |
6979 ns |
1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7729 ns |
8458.5 ns |
0.91 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
7667 ns |
6958 ns |
1.10 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
140363.5 ns |
144689.5 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
5733917 ns |
5674338 ns |
1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
775625 ns |
763958 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
64415.5 ns |
65951 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
11146 ns |
9833 ns |
1.13 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10333.5 ns |
10395.5 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10625 ns |
9875 ns |
1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10334 ns |
10166 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
826886 ns |
843305.5 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
38205214 ns |
40229475 ns |
0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
5258333 ns |
5021354 ns |
1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
381612 ns |
388453.5 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5812.5 ns |
5083 ns |
1.14 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4896 ns |
5645.5 ns |
0.87 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7146 ns |
7354 ns |
0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6999.5 ns |
7459 ns |
0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
144336 ns |
148525.5 ns |
0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
5913376 ns |
5807141 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
776229 ns |
768729 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
63198 ns |
67441 ns |
0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7917 ns |
7459 ns |
1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7834 ns |
7750 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8084 ns |
7583 ns |
1.07 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7583.5 ns |
7291 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
789077 ns |
806597 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
39034761 ns |
38873703 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5581917 ns |
5499042 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
386691 ns |
394693 ns |
0.98 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
14474084 ns |
14393541 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
10170208 ns |
10086042 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
10149541 ns |
10132625 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
27806083 ns |
27847083 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
528374 ns |
531501 ns |
0.99 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU |
882024 ns |
400094 ns |
2.20 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
46387208 ns |
45837667 ns |
1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
33540770.5 ns |
33412125 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
33424500 ns |
33550792 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
85835875 ns |
85694750 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2614141 ns |
2655274 ns |
0.98 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU |
4371261 ns |
3296132 ns |
1.33 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
86729 ns |
65750 ns |
1.32 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
85375 ns |
69354 ns |
1.23 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
90167 ns |
68834 ns |
1.31 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
86875 ns |
67708 ns |
1.28 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
134910 ns |
125224.5 ns |
1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3737313 ns |
3321446 ns |
1.13 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1480854 ns |
478792 ns |
3.09 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
249264 ns |
228082 ns |
1.09 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
443229.5 ns |
442083 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
443333 ns |
452104 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
443291.5 ns |
442208 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
446979 ns |
444791 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
743129 ns |
744155 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26210474.5 ns |
26781484 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7557333 ns |
7548250 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
712153 ns |
785568 ns |
0.91 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
2000 ns |
500 ns |
4.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
1916 ns |
584 ns |
3.28 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
2000 ns |
583 ns |
3.43 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
1833 ns |
541 ns |
3.39 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
37906 ns |
33459 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1219006 ns |
1181669 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
295145.5 ns |
266750 ns |
1.11 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
53069 ns |
47690 ns |
1.11 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10708 ns |
9104.5 ns |
1.18 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
11208 ns |
8958 ns |
1.25 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
11354.5 ns |
9375 ns |
1.21 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
11354 ns |
8333 ns |
1.36 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
285548.5 ns |
292729 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
22239786.5 ns |
21877451 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
5375687 ns |
4421083 ns |
1.22 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
378846 ns |
376084 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
9875 ns |
9834 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
9833 ns |
9792 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
9834 ns |
9833 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
9834 ns |
9834 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
23103 ns |
23819 ns |
0.97 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2142525 ns |
1943243 ns |
1.10 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal |
221500 ns |
211083 ns |
1.05 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
204166 ns |
209072 ns |
0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
50334 ns |
45958 ns |
1.10 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
50292 ns |
46375 ns |
1.08 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
50625 ns |
46167 ns |
1.10 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
49459 ns |
45542 ns |
1.09 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
308895 ns |
297740 ns |
1.04 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
13063091 ns |
13019378 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal |
983146 ns |
1008520.5 ns |
0.97 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
497502.5 ns |
610991 ns |
0.81 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
63375 ns |
56250 ns |
1.13 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
64208 ns |
57125 ns |
1.12 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
62750 ns |
57125 ns |
1.10 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
64167 ns |
57708.5 ns |
1.11 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
38397 ns |
29558.5 ns |
1.30 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1358037 ns |
1212552 ns |
1.12 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
623167 ns |
345084 ns |
1.81 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
219088 ns |
204882 ns |
1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
457000 ns |
449291.5 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
475625 ns |
482958 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
473458 ns |
465791 ns |
1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
454667 ns |
434625 ns |
1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
261525 ns |
253081.5 ns |
1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
36262740 ns |
31946764 ns |
1.14 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9115270.5 ns |
9299875.5 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
818947.5 ns |
887358.5 ns |
0.92 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
636625 ns |
639500 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
656458.5 ns |
610791 ns |
1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
588125 ns |
650021 ns |
0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
576916 ns |
613396 ns |
0.94 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
206500 ns |
213054.5 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
9251545 ns |
8304459 ns |
1.11 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1351250 ns |
1377667 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
356344.5 ns |
314248 ns |
1.13 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2240812.5 ns |
2230375 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2226896 ns |
2241083 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2193750 ns |
2226458 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2249750 ns |
2044000 ns |
1.10 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
984855.5 ns |
1009323.5 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
53834185.5 ns |
48595808 ns |
1.11 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
12082667 ns |
10250250 ns |
1.18 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1278034 ns |
1209503 ns |
1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
38958 ns |
18583 ns |
2.10 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
40333 ns |
21500 ns |
1.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
41959 ns |
22084 ns |
1.90 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
39500 ns |
20333 ns |
1.94 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
128340.5 ns |
115629.5 ns |
1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3938890 ns |
3530676 ns |
1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1412687 ns |
529396 ns |
2.67 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
96189 ns |
79871 ns |
1.20 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
222396.5 ns |
219583.5 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
235062.5 ns |
228750 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
222312 ns |
221395.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
234292 ns |
219500 ns |
1.07 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
746419 ns |
743488 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27274809 ns |
26086313.5 ns |
1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7501229 ns |
7436521 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
509240 ns |
556135 ns |
0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7333 ns | 500 ns | 14.67 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7167 ns | 584 ns | 12.27 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7292 ns | 584 ns | 12.49 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7250 ns | 500 ns | 14.50 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 32772 ns | 24005 ns | 1.37 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1343594 ns | 1194343 ns | 1.12 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 375375.5 ns | 283521 ns | 1.32 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59561 ns | 47860 ns | 1.24 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17250.5 ns | 9979 ns | 1.73 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17146 ns | 10542 ns | 1.63 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17041.5 ns | 9687.5 ns | 1.76 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17104.5 ns | 9916.5 ns | 1.72 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 281216 ns | 274665.5 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24180264 ns | 25054245 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5904229.5 ns | 4901583 ns | 1.20 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 396149 ns | 403794 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10208.5 ns | 7750 ns | 1.32 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9041 ns | 8541 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10583 ns | 9458 ns | 1.12 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 10479 ns | 10041 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 120981.5 ns | 122963.5 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3539882 ns | 3342683 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 875625 ns | 828959 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69419 ns | 70460 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7958 ns | 7583 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7666 ns | 7875 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8187.5 ns | 7917 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7833 ns | 7208 ns | 1.09 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 510474 ns | 521824.5 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17725223 ns | 17096205 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4291812.5 ns | 3622437.5 ns | 1.18 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 315298 ns | 323444 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 10437 ns | 1375 ns | 7.59 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 9125 ns | 1708 ns | 5.34 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 10250 ns | 1875 ns | 5.47 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 9437.5 ns | 1584 ns | 5.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 23754 ns | 22394 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1177268 ns | 1154621 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 302750 ns | 310833 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 191221.5 ns | 190371.5 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4250 ns | 3209 ns | 1.32 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4291 ns | 3333 ns | 1.29 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4437.5 ns | 3583 ns | 1.24 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4209 ns | 3500 ns | 1.20 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 226224 ns | 224060 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10957018 ns | 9920013 ns | 1.10 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1683833 ns | 1731417 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 537002 ns | 581006 ns | 0.92 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 155333 ns | 145687 ns | 1.07 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 135542 ns | 128584 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 139833 ns | 129625 ns | 1.08 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 235312 ns | 226167 ns | 1.04 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 26871.5 ns | 25004 ns | 1.07 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1241545 ns | 1165561.5 ns | 1.07 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 301000 ns | 248959 ns | 1.21 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 48781 ns | 40870 ns | 1.19 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 144583 ns | 143604 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 123500 ns | 130083 ns | 0.95 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 111854.5 ns | 111208 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 265500.5 ns | 251937.5 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 221031 ns | 224391 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10748388 ns | 10232573 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2036542 ns | 1955250 ns | 1.04 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 240869 ns | 267492 ns | 0.90 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 8541 ns | 7208 ns | 1.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 7250 ns | 6083 ns | 1.19 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6875 ns | 6000 ns | 1.15 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 11417 ns | 10458 ns | 1.09 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 38877 ns | 34049 ns | 1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1252063 ns | 1180224 ns | 1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 351625 ns | 325584 ns | 1.08 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 59491 ns | 50630 ns | 1.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221625 ns | 219688 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232708.5 ns | 237125 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230250 ns | 228500 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215395.5 ns | 212875 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 263879 ns | 270641 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27911566 ns | 29882407 ns | 0.93 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8205708 ns | 8193250 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 553398 ns | 592361 ns | 0.93 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15396 ns | 14125 ns | 1.09 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15583 ns | 15291.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 17000 ns | 16792 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15896 ns | 16000 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 139189.5 ns | 143262 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5507241.5 ns | 5352196.5 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 783146 ns | 756916.5 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 225460 ns | 233592 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23833.5 ns | 23895.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23229 ns | 24041.5 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23792 ns | 23542 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23500 ns | 23667 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 871309.5 ns | 888831 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 40866393 ns | 38279760.5 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5525209 ns | 5301166.5 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 627300 ns | 679602 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 26229.5 ns | 8875 ns | 2.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 28208 ns | 9250 ns | 3.05 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 30875 ns | 11313 ns | 2.73 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 27104.5 ns | 9834 ns | 2.76 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 140173.5 ns | 126441 ns | 1.11 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3972092.5 ns | 3425975 ns | 1.16 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 837333 ns | 886021 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 84467 ns | 73581 ns | 1.15 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15542 ns | 14000 ns | 1.11 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16000 ns | 14166.5 ns | 1.13 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15959 ns | 14541 ns | 1.10 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15812.5 ns | 13875 ns | 1.14 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 677783 ns | 686454 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21242694.5 ns | 21159530.5 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5255020.5 ns | 5057854 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 357266 ns | 368623 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9166.5 ns | 6833 ns | 1.34 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9250 ns | 9645.5 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11521 ns | 10959 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9437.5 ns | 9125 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 124109.5 ns | 125289 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3514817 ns | 3340336.5 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 901187.5 ns | 858667 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 75430.5 ns | 73441 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13249.5 ns | 12750 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12708 ns | 12875 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13333 ns | 12959 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12895.5 ns | 12584 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 556362.5 ns | 568824 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20572384 ns | 20335817 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4649292 ns | 4008167 ns | 1.16 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 339794 ns | 341833 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 30083 ns | 26604 ns | 1.13 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 33458 ns | 35042 ns | 0.95 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31458 ns | 31437.5 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1854.5 ns | 1958 ns | 0.95 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16188 ns | 16488 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 73227 ns | 80881 ns | 0.91 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 6709 ns | 5354 ns | 1.25 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 6083 ns | 5271 ns | 1.15 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 6292 ns | 5375 ns | 1.17 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 7333 ns | 6417 ns | 1.14 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 145426 ns | 144829.5 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 351952 ns | 371354 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6833 ns | 250 ns | 27.33 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6708 ns | 417 ns | 16.09 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6958 ns | 375 ns | 18.55 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6792 ns | 334 ns | 20.34 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 34946 ns | 26201 ns | 1.33 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1263721 ns | 1213684 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 306583 ns | 435084 ns | 0.70 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 57747 ns | 47131 ns | 1.23 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 13916.5 ns | 6417 ns | 2.17 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 13667 ns | 6666 ns | 2.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 13917 ns | 6708 ns | 2.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 14083.5 ns | 6541 ns | 2.15 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 201275 ns | 192082.5 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23279815 ns | 23595307 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5587416.5 ns | 4957208 ns | 1.13 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 391916 ns | 388663.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8750 ns | 1917 ns | 4.56 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8667 ns | 2000 ns | 4.33 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8709 ns | 2042 ns | 4.26 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8708 ns | 1959 ns | 4.45 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 35887 ns | 26999 ns | 1.33 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1236622.5 ns | 1208214.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 324104.5 ns | 281958 ns | 1.15 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 221524 ns | 206222 ns | 1.07 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 23854 ns | 16312.5 ns | 1.46 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 24291.5 ns | 17020.5 ns | 1.43 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 24291.5 ns | 16562.5 ns | 1.47 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 23979 ns | 16437.5 ns | 1.46 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 289525 ns | 281291 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 26815785 ns | 25314200 ns | 1.06 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6199395.5 ns | 5387167 ns | 1.15 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 645064 ns | 705642 ns | 0.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 156000 ns | 148250 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 176458 ns | 175104 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149333 ns | 154500 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 148750 ns | 148375 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 202436 ns | 210020 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7935297 ns | 7920169 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1474875 ns | 1553375 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 189013 ns | 236022 ns | 0.80 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1044146 ns | 1326125 ns | 0.79 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1311542 ns | 1317625 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1192416.5 ns | 1267583 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1333875.5 ns | 1330208 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 917103 ns | 941055 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49231222 ns | 46042204 ns | 1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6709937.5 ns | 9797270.5 ns | 0.68 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1050591 ns | 1107606 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24771 ns | 23542 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25854.5 ns | 25167 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27042 ns | 28437.5 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 26104.5 ns | 24917 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 236772.5 ns | 241297.5 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7699269.5 ns | 7644187.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1068000 ns | 558625 ns | 1.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 123260 ns | 114946.5 ns | 1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 118500 ns | 174646 ns | 0.68 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 129021 ns | 167916 ns | 0.77 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 119729 ns | 119708.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 176750 ns | 126750 ns | 1.39 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1082699 ns | 1108737 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45816693 ns | 45003191 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6477521 ns | 5870834 ns | 1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 580755 ns | 610886 ns | 0.95 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6792 ns | 250 ns | 27.17 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6708 ns | 417 ns | 16.09 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6959 ns | 375 ns | 18.56 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6792 ns | 250 ns | 27.17 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 32423 ns | 23373.5 ns | 1.39 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1227496 ns | 1207385.5 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 440125 ns | 274541 ns | 1.60 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 57898 ns | 47321 ns | 1.22 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 13979.5 ns | 6458 ns | 2.16 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 14000 ns | 6708 ns | 2.09 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 14041.5 ns | 6625 ns | 2.12 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 14479.5 ns | 6521 ns | 2.22 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 218337.5 ns | 207930.5 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 26179966 ns | 24020738 ns | 1.09 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6079833 ns | 5321979 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 393670 ns | 394454 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7208.5 ns | 5125 ns | 1.41 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6000 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7167 ns | 7375 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6958 ns | 5500 ns | 1.27 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 145657.5 ns | 148415.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5681117 ns | 5743209.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 482812 ns | 438042 ns | 1.10 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 233405.5 ns | 233753 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10167 ns | 9708.5 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10396 ns | 10500 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10583.5 ns | 10292 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9916.5 ns | 10000 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 910519.5 ns | 921993 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 45597451 ns | 40800221 ns | 1.12 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6444395.5 ns | 5516833 ns | 1.17 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 630427 ns | 673881.5 ns | 0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 666 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22751 ns | 22961 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2127785.5 ns | 2040345 ns | 1.04 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 224292 ns | 205708 ns | 1.09 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 213458 ns | 207722.5 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 8458 ns | 4625 ns | 1.83 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 8625 ns | 4958 ns | 1.74 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 8750 ns | 4792 ns | 1.83 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 8167 ns | 4625 ns | 1.77 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 245078 ns | 232829.5 ns | 1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9946027.5 ns | 11262701.5 ns | 0.88 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1678792 ns | 1643083.5 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 466702 ns | 580356 ns | 0.80 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 26125 ns | 8166 ns | 3.20 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 25062 ns | 8250 ns | 3.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 28687.5 ns | 9458 ns | 3.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 26312.5 ns | 8979.5 ns | 2.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 138390.5 ns | 124075.5 ns | 1.12 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3609992 ns | 3484097 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 802229 ns | 848979 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 82243 ns | 73621 ns | 1.12 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10542 ns | 8396 ns | 1.26 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10416.5 ns | 8584 ns | 1.21 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 9084 ns | 1.17 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10145.5 ns | 8334 ns | 1.22 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 601697 ns | 601403 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 20934991 ns | 21381887.5 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5069229 ns | 4049604 ns | 1.25 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 341518 ns | 345603 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 124062.5 ns | 123354 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129667 ns | 130833 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130625 ns | 130292 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 182895.5 ns | 183083 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46031 ns | 46276 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 102792 ns | 100861 ns | 1.02 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 312959 ns | 331291 ns | 0.94 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 315270.5 ns | 336312.5 ns | 0.94 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 315645.5 ns | 332416.5 ns | 0.95 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 569500 ns | 584792 ns | 0.97 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 197211.5 ns | 195249 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 437317 ns | 504285 ns | 0.87 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 400541.5 ns | 396500 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 290708 ns | 287958 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 290750 ns | 288167 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 759167 ns | 756292 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 51447 ns | 43813 ns | 1.17 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1442664 ns | 1397680 ns | 1.03 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 430208 ns | 359646 ns | 1.20 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 90178 ns | 81271 ns | 1.11 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1469166 ns | 1447584 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1136083 ns | 1133917 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1148791.5 ns | 1135166.5 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2371458 ns | 2356062 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 288834.5 ns | 251976 ns | 1.15 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11609366 ns | 10628240 ns | 1.09 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1837458.5 ns | 1770646 ns | 1.04 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 272018 ns | 350644 ns | 0.78 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 654292 ns | 641750 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 656208 ns | 660333 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 643771 ns | 656625 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 660000 ns | 541646 ns | 1.22 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 202007.5 ns | 206977 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8216678 ns | 8394592 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1374958 ns | 1331770.5 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 347449 ns | 313564 ns | 1.11 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2445250 ns | 2445250 ns | 1 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2448125 ns | 2456229 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2446833 ns | 2446833.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2449583 ns | 2483750 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1007847 ns | 1018661.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51269988 ns | 53769994.5 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10195042 ns | 9019125 ns | 1.13 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1373375 ns | 1436974 ns | 0.96 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 33084 ns | 28875 ns | 1.15 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36250 ns | 36438 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34750 ns | 34354 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 854.5 ns | 833 ns | 1.03 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15337.5 ns | 15679 ns | 0.98 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 72977 ns | 79081 ns | 0.92 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 4333 ns | 3125 ns | 1.39 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 4208 ns | 3333 ns | 1.26 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 4312.5 ns | 3542 ns | 1.22 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 4041.5 ns | 3042 ns | 1.33 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 144791.5 ns | 141592 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 307384 ns | 340828.5 ns | 0.90 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 415000 ns | 404000 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 416625 ns | 408458 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 416875 ns | 407958 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 429041 ns | 420750 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 53909.5 ns | 44015 ns | 1.22 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1567275.5 ns | 1346061 ns | 1.16 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1167771 ns | 1099750 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 360648.5 ns | 240182 ns | 1.50 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3815583 ns | 3854416 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4001187.5 ns | 3977416.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4004792 ns | 3995708.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3808625 ns | 3786812.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 258808 ns | 247915 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 39426137.5 ns | 38628061.5 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11450771 ns | 11941666 ns | 0.96 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1225804 ns | 1249207.5 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3958 ns |
4000 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33782 ns |
34055 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI |
1303530 ns |
1242873 ns |
1.05 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal |
181771 ns |
160875 ns |
1.13 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
36288 ns |
38220 ns |
0.95 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
19875 ns |
15625 ns |
1.27 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
20333 ns |
15958 ns |
1.27 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
20250 ns |
15958 ns |
1.27 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
19375 ns |
15625 ns |
1.24 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
270893 ns |
257530 ns |
1.05 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI |
8770612 ns |
8798187 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal |
878395.5 ns |
839395.5 ns |
1.05 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
171309.5 ns |
167922 ns |
1.02 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
404916 ns |
403667 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
294792 ns |
295750 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
296167 ns |
295750 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
760459 ns |
760166 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113227 ns |
113514 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI |
1035873.5 ns |
1017055 ns |
1.02 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal |
465708.5 ns |
326291.5 ns |
1.43 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
92718.5 ns |
87391 ns |
1.06 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1467604 ns |
1472208 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1163084 ns |
1161500 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1164667 ns |
1160625 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2387167 ns |
2378291 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
261443 ns |
245391 ns |
1.07 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI |
12224045 ns |
10232371 ns |
1.19 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal |
1895167 ns |
1858625 ns |
1.02 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
312063 ns |
356813 ns |
0.87 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7041 ns |
500 ns |
14.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
7000 ns |
583 ns |
12.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7000 ns |
583 ns |
12.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
6959 ns |
500 ns |
13.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
34960 ns |
26329.5 ns |
1.33 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1143512 ns |
1165109.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
304770.5 ns |
458750 ns |
0.66 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
218463 ns |
207592 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
15208.5 ns |
7458 ns |
2.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
15250 ns |
7958 ns |
1.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
15125 ns |
7833 ns |
1.93 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
15250 ns |
7500 ns |
2.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
231132 ns |
220362.5 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
25193745 ns |
24956286.5 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
6197875 ns |
4949916.5 ns |
1.25 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
622512 ns |
695677 ns |
0.89 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
829416.5 ns |
824979 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
619250 ns |
619166 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
618624.5 ns |
619291 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1546958.5 ns |
1521750 ns |
1.02 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
131374 ns |
130530.5 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
214891 ns |
228943 ns |
0.94 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2617583.5 ns |
2673291.5 ns |
0.98 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
2007958 ns |
2003917 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
2002687.5 ns |
2004458 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4949021 ns |
4938271 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
260037 ns |
246670.5 ns |
1.05 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
891295 ns |
761778 ns |
1.17 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
1542 ns |
291 ns |
5.30 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
1541 ns |
375 ns |
4.11 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1583 ns |
333 ns |
4.75 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1459 ns |
250 ns |
5.84 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
37528 ns |
32758 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1208714 ns |
1196400 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
287833 ns |
263500 ns |
1.09 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
53039 ns |
46921 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8250 ns |
6542 ns |
1.26 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8625 ns |
6833 ns |
1.26 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8542 ns |
6667 ns |
1.28 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8604.5 ns |
6333 ns |
1.36 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
223115.5 ns |
229162.5 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
22661444 ns |
21326390 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5050750 ns |
4918333 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
370542 ns |
360398.5 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2380792 ns |
2389042 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2405875 ns |
2375416 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2402999.5 ns |
2399208 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2376458 ns |
2395167 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
202379 ns |
205752 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8175895.5 ns |
7986200 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1391083 ns |
1428354 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
410186 ns |
375378.5 ns |
1.09 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4622209 ns |
4650833 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4659083 ns |
4663624.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4674875 ns |
4666416.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4668000 ns |
4657125 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
906677.5 ns |
922860 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
47558050 ns |
50907571 ns |
0.93 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6580750 ns |
6979416.5 ns |
0.94 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1259277 ns |
1386483.5 ns |
0.91 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
14042 ns |
13458.5 ns |
1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
14125 ns |
7333 ns |
1.93 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
16312.5 ns |
7708 ns |
2.12 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
14813 ns |
6416.5 ns |
2.31 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
25922 ns |
23918 ns |
1.08 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1196182 ns |
1244282 ns |
0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal |
269583.5 ns |
235958 ns |
1.14 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
47007 ns |
40260 ns |
1.17 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
33645.5 ns |
46271 ns |
0.73 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
45875 ns |
63375 ns |
0.72 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
34041 ns |
52500 ns |
0.65 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
57042 ns |
33708.5 ns |
1.69 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
221870 ns |
220952 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
13439959 ns |
10877336.5 ns |
1.24 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal |
2109584 ns |
1059416 ns |
1.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
241391 ns |
264808 ns |
0.91 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
21166 ns |
20208.5 ns |
1.05 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
27146 ns |
25708 ns |
1.06 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
24708 ns |
24770.5 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
5208.5 ns |
5291 ns |
0.98 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
17003 ns |
17145 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
74790 ns |
83681 ns |
0.89 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
13084 ns |
12646 ns |
1.03 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
11125 ns |
10645.5 ns |
1.05 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
11459 ns |
10500 ns |
1.09 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
18937.5 ns |
18146 ns |
1.04 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
232004 ns |
230722.5 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
354352 ns |
371984 ns |
0.95 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
409042 ns |
405208 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
299834 ns |
297166 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
300166.5 ns |
297541 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
765625 ns |
762459 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
55053.5 ns |
46892 ns |
1.17 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI |
1389435 ns |
1423487.5 ns |
0.98 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal |
419458.5 ns |
335000 ns |
1.25 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
99876 ns |
88571 ns |
1.13 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1481812.5 ns |
1475875 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1176417 ns |
1169208 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1178375 ns |
1166834 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2397000 ns |
2378771 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
321915 ns |
287503 ns |
1.12 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI |
14142027 ns |
12647035 ns |
1.12 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal |
2086041 ns |
2003291.5 ns |
1.04 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
277749 ns |
380444 ns |
0.73 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
435500 ns |
432000 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
438833 ns |
436541 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
438375 ns |
436708 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
449792 ns |
448208 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
60743 ns |
54845 ns |
1.11 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1011327 ns |
1004553 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1136916 ns |
1035833 ns |
1.10 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
388225 ns |
234772.5 ns |
1.65 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3909292 ns |
3891459 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4012979.5 ns |
4027292 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4023646 ns |
4026478.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3817500 ns |
3684083 ns |
1.04 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
264711 ns |
268195 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31210175 ns |
32271096.5 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10463083 ns |
10269354.5 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1219017 ns |
1382008.5 ns |
0.88 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
11333 ns |
8750 ns |
1.30 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
10209 ns |
7667 ns |
1.33 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
10292 ns |
7667 ns |
1.34 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
14708 ns |
12417 ns |
1.18 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
31120 ns |
24204 ns |
1.29 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2299766 ns |
2100905 ns |
1.09 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal |
223896 ns |
211416 ns |
1.06 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
210884 ns |
209352 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
53250 ns |
45042 ns |
1.18 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
53500 ns |
45791 ns |
1.17 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
53709 ns |
45208 ns |
1.19 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
53209 ns |
44959 ns |
1.18 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
373214 ns |
348332 ns |
1.07 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
13401439 ns |
12300844.5 ns |
1.09 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal |
1767583.5 ns |
1700187.5 ns |
1.04 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
523769 ns |
655376 ns |
0.80 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
124500 ns |
121916.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
85958 ns |
144917 ns |
0.59 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
88166 ns |
88625 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
150334 ns |
105229.5 ns |
1.43 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
189707 ns |
189408.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6262796.5 ns |
5999999 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1963583 ns |
1936000 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
222936 ns |
220412 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1958209 ns |
2017208 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2014500 ns |
2018750 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2023209 ns |
2014000 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2028167 ns |
2017500 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
539409 ns |
544732 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
27916416 ns |
27836425 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9213437.5 ns |
9082333.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
909308 ns |
961460 ns |
0.95 |
This comment was automatically generated by workflow using github-action-benchmark.
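The Ratio column in the table above appears to be the current timing divided by the previous one, so a value above 1 means the current commit ran slower. Here is a minimal sketch of that arithmetic, using three rows copied from the table. The `ratio` and `slower_than` helpers and the 1.5 cutoff are illustrative assumptions, not part of this repository's benchmark workflow.

```python
# Rows copied from the benchmark table: (name, current_ns, previous_ns).
ROWS = [
    ("batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)", 7041.0, 500.0),
    ("batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 1542.0, 291.0),
    ("dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU", 277749.0, 380444.0),
]

# Hypothetical cutoff for illustration only; not taken from the workflow.
THRESHOLD = 1.5


def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / previous; above 1 means the current run was slower."""
    return current_ns / previous_ns


def slower_than(rows, threshold):
    """Return (name, ratio) for every row whose ratio exceeds the threshold."""
    return [
        (name, ratio(cur, prev))
        for name, cur, prev in rows
        if ratio(cur, prev) > threshold
    ]


if __name__ == "__main__":
    for name, r in slower_than(ROWS, THRESHOLD):
        print(f"{r:.2f}  {name}")
```

Most of the largest ratios in this run fall on very small CPU workloads (hundreds of nanoseconds to a few microseconds), where a fixed overhead of a few microseconds dominates the measurement, so a relative cutoff like this is best read alongside the absolute timings.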