This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
[DO NOT MERGE] feat: generate cpu kernels using KA #136
Draft
avik-pal wants to merge 4 commits into main from ap/ka_cpu
avik-pal force-pushed the ap/ka_cpu branch 2 times, most recently from addd437 to b57f2a1 (August 20, 2024 01:59)
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
Coverage Diff | main | #136 | +/- |
---|---|---|---|
Coverage | 83.93% | 68.91% | -15.02% |
Files | 37 | 36 | -1 |
Lines | 1867 | 1586 | -281 |
Hits | 1567 | 1093 | -474 |
Misses | 300 | 493 | +193 |
LuxLib Benchmarks
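The Ratio column in the table that follows is consistent with Current divided by Previous (for example, 7729 ns / 6083 ns ≈ 1.27), so values above 1 mean the current commit is slower. A minimal sketch of that arithmetic, assuming two-decimal rounding; the helper names and the 1.5 cutoff are hypothetical and not taken from the benchmark tooling:

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time, rounded to two decimals."""
    return round(current_ns / previous_ns, 2)


def is_regression(current_ns: float, previous_ns: float,
                  threshold: float = 1.5) -> bool:
    """Flag a slowdown. The threshold is an illustrative assumption,
    not a value stated anywhere in this report."""
    return ratio(current_ns, previous_ns) > threshold


if __name__ == "__main__":
    # layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
    print(ratio(7729, 6083))
    # batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
    print(is_regression(2939917, 58542))
```

Read this way, the large ratios in the table are concentrated in the batchnorm and groupnorm CPU rows, while the GPU rows stay close to 1.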
Benchmark suite | Current: f056a2d | Previous: c185f04 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7729 ns | 6083 ns | 1.27 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5584 ns | 5417 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7583 ns | 8021 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6229.5 ns | 6146 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 124269 ns | 120417 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2920363 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 865375 ns | 812042 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 420534 ns | 424375 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 10250 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10375.5 ns | 9917 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10729.5 ns | 10125 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9917 ns | 11792 ns | 0.84 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 534855 ns | 556460 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17679999 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 2599750 ns | 2542833 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 688785 ns | 686027 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1417 ns | 1500 ns | 0.94 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3187 ns | 2792 ns | 1.14 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1708.5 ns | 0.95 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 2917 ns | 1583 ns | 1.84 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 21391 ns | 22218 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1301041.5 ns | | |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 199625 ns | 205792 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 31370 ns | 29920 ns | 1.05 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3750 ns | 3542 ns | 1.06 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3834 ns | 4209 ns | 0.91 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4125 ns | 4271 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4187 ns | 4229 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 141714.5 ns | 148035 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 8405798 ns | | |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1476354 ns | 1621188 ns | 0.91 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 149951.5 ns | 151742 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2939917 ns | 58542 ns | 50.22 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1600958.5 ns | 46375 ns | 34.52 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2726854 ns | 46584 ns | 58.54 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 4035979 ns | 83708 ns | 48.21 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36748 ns | 37608 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 561012.5 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1015666 ns | 1081917 ns | 0.94 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 81321 ns | 84866 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5227041.5 ns | 2027833 ns | 2.58 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3104542 ns | 2085458 ns | 1.49 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4209562.5 ns | 2090292 ns | 2.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 7866125 ns | 1999000 ns | 3.94 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 221458.5 ns | 233327.5 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 8196194 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8040333 ns | 7717583 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1091659 ns | 1460226 ns | 0.75 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 169375.5 ns | 145375 ns | 1.17 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 173374.5 ns | 147458 ns | 1.18 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 157042 ns | 150584 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 183708.5 ns | 170437.5 ns | 1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166229.5 ns | 166412 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7336842 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1527917 ns | 1615604.5 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 180901 ns | 202872 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1109479.5 ns | 1119083.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1112791 ns | 1109000 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1120229 ns | 1118458 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1106354.5 ns | 1116145.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 672058.5 ns | 707978 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 37159528 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6453167 ns | 5932000 ns | 1.09 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 918327 ns | 1046946 ns | 0.88 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4375 ns | 5104.5 ns | 0.86 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4666.5 ns | 4250 ns | 1.10 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5334 ns | 5895.5 ns | 0.90 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5937.5 ns | 5624.5 ns | 1.06 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 88193 ns | 93783.5 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5394921 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 618250 ns | 721583.5 ns | 0.86 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 70180 ns | 70761 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8958 ns | 8875 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8583 ns | 8792 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9083 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8750 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 579885 ns | 603451.5 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 37743853 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5811854 ns | 6400917 ns | 0.91 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388183 ns | 388649.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 576020.5 ns | 20083 ns | 28.68 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 323333 ns | 18812.5 ns | 17.19 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 421833 ns | 20958 ns | 20.13 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 871208.5 ns | 18000 ns | 48.40 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 67844 ns | 68784 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3055826 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1366916.5 ns | 1334334 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 74841 ns | 83861 ns | 0.89 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 758875 ns | 224229.5 ns | 3.38 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 461208 ns | 219416 ns | 2.10 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 697312.5 ns | 219062.5 ns | 3.18 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1108021 ns | 212958 ns | 5.20 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 348707.5 ns | 360915.5 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 12489751.5 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5797395.5 ns | 5929666 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 481369 ns | 478315 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 625 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 791 ns | 709 ns | 1.12 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 917 ns | 1041 ns | 0.88 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 625 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20510 ns | 21396 ns | 0.96 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1150889.5 ns | | |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 286334 ns | 303750 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 33230 ns | 32981 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1417 ns | 1458 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1375 ns | 1459 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1625 ns | 1542 ns | 1.05 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 ns | 1542 ns | 0.89 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 122441.5 ns | 127634 ns | 0.96 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8825165 ns | | |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1492875 ns | 1626542 ns | 0.92 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 127486 ns | 138112 ns | 0.92 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 414458 ns | 7333 ns | 56.52 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 239375 ns | 6125 ns | 39.08 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 502000 ns | 6125 ns | 81.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 550250 ns | 10333 ns | 53.25 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23668 ns | 24384 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1312160 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 596958 ns | 700271 ns | 0.85 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47410 ns | 46841 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 723229 ns | 221166 ns | 3.27 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 454209 ns | 238834 ns | 1.90 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 857167 ns | 230666 ns | 3.72 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1019917 ns | 251250 ns | 4.06 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 189502 ns | 193817 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31508207 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8962979 ns | 8912375 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 612555 ns | 653712 ns | 0.94 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4084 ns | 4125 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4125 ns | 4125 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4167 ns | 4125 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4083 ns | 4083 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 24005 ns | 24189 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 2002725 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 218459 ns | 223791 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48990 ns | 49151 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16833 ns | 16584 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 17000 ns | 16917 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17375 ns | 17042 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 17000 ns | 16750 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 190469.5 ns | 199158 ns | 0.96 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 10178467 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 1025083 ns | 963270.5 ns | 1.06 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 180751 ns | 176322 ns | 1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 511729.5 ns | 512792 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 405667 ns | 404292 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 406833 ns | 404896 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 865750 ns | 864583 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113107.5 ns | 113852 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 398260 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 419334 ns | 448709 ns | 0.93 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 249302 ns | 250173 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2320167 ns | 2271145.5 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2032666 ns | 2031292 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2032750 ns | 2033750 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3296979.5 ns | 3280292 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 235772 ns | 247459 ns | 0.95 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 9316032 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 1906146 ns | 2065875 ns | 0.92 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 763141 ns | 765823 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6438 ns | 7145.5 ns | 0.90 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7459 ns | 6958.5 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7500 ns | 8541 ns | 0.88 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 7625.5 ns | 6479.5 ns | 1.18 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 91012.5 ns | 93682.5 ns | 0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5337612 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 766083 ns | 806084 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 60811 ns | 68781 ns | 0.88 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11250 ns | 11708.5 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12500 ns | 11875 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12375 ns | 11000 ns | 1.13 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11750 ns | 12020.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 635802.5 ns | 642017 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 36545961 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5443750 ns | 5707875 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 410863.5 ns | 421135 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23646 ns | 24054 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2173927 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 318375 ns | 228333 ns | 1.39 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 51881 ns | 54330 ns | 0.95 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2084 ns | 2125 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2084 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2250 ns | 2167 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2083 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 227819 ns | 237805 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 10657257.5 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 1962541.5 ns | 1998833 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 181706.5 ns | 190172 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 36458 ns | 9333.5 ns | 3.91 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 37208.5 ns | 9104 ns | 4.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 68375 ns | 10521 ns | 6.50 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 33583 ns | 8959 ns | 3.75 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 110413 ns | 113550 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3069971.5 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 782958.5 ns | 875353.5 ns | 0.89 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74930 ns | 78760 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 61834 ns | 16729.5 ns | 3.70 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 60417 ns | 18250 ns | 3.31 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 63250 ns | 18104 ns | 3.49 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 60792 ns | 18458 ns | 3.29 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 631585 ns | 643636 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 16329413.5 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 4518292 ns | 5156541 ns | 0.88 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 387253 ns | 396545 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 30375 ns | 500 ns | 60.75 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 27958 ns | 459 ns | 60.91 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 28000 ns | 625 ns | 44.80 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 27333 ns | 500 ns | 54.67 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 35255 ns | 35808 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1145044 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 274042 ns | 323000 ns | 0.85 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 46671 ns | 46571 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 45666.5 ns | 10375 ns | 4.40 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 41458 ns | 9791.5 ns | 4.23 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 43667 ns | 10375 ns | 4.21 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 42083.5 ns | 10750 ns | 3.91 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 261124 ns | 262020 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 17726081.5 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4572625 ns | 5294125 ns | 0.86 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 377973 ns | 382009.5 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397125 ns | 399000 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288083 ns | 288125 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287792 ns | 288292 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756250 ns | 755625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111029 ns | 113561 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 330051 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 364417 ns | 367729.5 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 77050 ns | 77481 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1457750.5 ns | 1393333 ns | 1.05 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1134000 ns | 1136083.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1133146 ns | 1131458.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2441375 ns | 2438041 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 202543.5 ns | 212129 ns | 0.95 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10206496 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1612958 ns | 1596167 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 327223 ns | 329854 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7229 ns | 7708 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7125 ns | 7458.5 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8250 ns | 9000 ns | 0.92 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7375 ns | 7812 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 151960.5 ns | 159498.5 ns | 0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5735314.5 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 664375 ns | 481750 ns | 1.38 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 60471 ns | 60340 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16083 ns | 14667 ns | 1.10 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15854 ns | 15437.5 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16208 ns | 15479.5 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 11666 ns | 14979 ns | 0.78 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 984433 ns | 1030852 ns | 0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41610226 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5845542 ns | 6424458 ns | 0.91 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 438113.5 ns | 435905 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25625 ns | 26958 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 28875 ns | 25209 ns | 1.15 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28583.5 ns | 27208 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 28791 ns | 24584 ns | 1.17 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 224317.5 ns | 228128 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7495067 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 784833 ns | 1045041.5 ns | 0.75 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 115131 ns | 120221 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 151104 ns | 103791 ns | 1.46 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 151833 ns | 150833 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 108020.5 ns | 148187 ns | 0.73 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 113208 ns | 116292 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1188662.5 ns | 1163495 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42135085 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5881500 ns | 6459417 ns | 0.91 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 596435 ns | 607082 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 80312 ns | 76416 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 77791 ns | 81020.5 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 79583 ns | 85083 ns | 0.94 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 82062 ns | 79625 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 233891 ns | 234622 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7191541 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 526958 ns | 628124.5 ns | 0.84 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 127251 ns | 127432 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 277958 ns | 283166.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 295125 ns | 316541 ns | 0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 301229.5 ns | 302917 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 274292 ns | 315041.5 ns | 0.87 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1226024.5 ns | 1204655 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 40089795 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6466583 ns | 6660083 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 696586 ns | 700322.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16333 ns | 16708.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17375 ns | 17333 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 17708.5 ns | 17854.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16854.5 ns | 16479 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 164956.5 ns | 167006 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5631236 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 439750 ns | 446708 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 239672 ns | 239982 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27208 ns | 26125 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25604 ns | 26917 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28833 ns | 27208 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27708 ns | 25333 ns | 1.09 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 1024329 ns | 1047898 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 44101221 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5779875 ns | 6661333.5 ns | 0.87 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 705756 ns | 718328 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 46000 ns | 11084 ns | 4.15 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 47166 ns | 11625 ns | 4.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 50208 ns | 13000 ns | 3.86 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 44396.5 ns | 11666.5 ns | 3.81 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 138159.5 ns | 141188 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3525333.5 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 785667 ns | 897333.5 ns | 0.88 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 241372 ns | 243182.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 66208 ns | 22145.5 ns | 2.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 62437.5 ns | 21875 ns | 2.85 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 66708.5 ns | 22667 ns | 2.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 62584 ns | 21792 ns | 2.87 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 740432.5 ns | 756695 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22275014 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5147542 ns | 5374500 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 685850 ns | 695018 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
595667 ns |
63937.5 ns |
9.32 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
351042 ns |
63500 ns |
5.53 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
434375 ns |
66042 ns |
6.58 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
875667 ns |
63666.5 ns |
13.75 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
121655 ns |
124307.5 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3306959 ns |
||
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1367291 ns |
1367917 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
228942 ns |
241283 ns |
0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
1040166 ns |
437854 ns |
2.38 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
754187.5 ns |
464833 ns |
1.62 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
938959 ns |
474208 ns |
1.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
1319167 ns |
437729.5 ns |
3.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
552090.5 ns |
560487 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
21939037.5 ns |
||
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6260750 ns |
6247083 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
718076 ns |
733228 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7667 ns |
7104.5 ns |
1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7770.5 ns |
7083 ns |
1.10 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7917 ns |
8334 ns |
0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7334 ns |
7604 ns |
0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
159948.5 ns |
163142 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
5546005 ns |
||
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
437729 ns |
463833.5 ns |
0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
59871 ns |
68371 ns |
0.88 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14958 ns |
14542 ns |
1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15708 ns |
15396 ns |
1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
17042 ns |
15458.5 ns |
1.10 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
13500 ns |
14750 ns |
0.92 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
1010082 ns |
1022438 ns |
0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
38267417 ns |
||
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5491083.5 ns |
6461041 ns |
0.85 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
405844 ns |
412334 ns |
0.98 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) |
6162395.5 ns |
6159375 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) |
6376146 ns |
6372249.5 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) |
6367229 ns |
6374125 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) |
11918000 ns |
11910167 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA |
301585.5 ns |
302029 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU |
295713 ns |
302953 ns |
0.98 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) |
19131208.5 ns |
19119687 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) |
19953937.5 ns |
19945437.5 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) |
19969917 ns |
20008771 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) |
36476917 ns |
36510208.5 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA |
1015986 ns |
1019652 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU |
1169730 ns |
1173152.5 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
959 ns |
1000 ns |
0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
917 ns |
958 ns |
0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
959 ns |
959 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
917 ns |
959 ns |
0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
23705 ns |
23843 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI |
2142526 ns |
||
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal |
215416.5 ns |
335916 ns |
0.64 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
215342 ns |
215882 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
3667 ns |
3625 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
3708 ns |
3666 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
3750 ns |
3750 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
3667 ns |
3667 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
293053.5 ns |
300289 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI |
11377670 ns |
||
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal |
2046437 ns |
2148500 ns |
0.95 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
646355 ns |
644731.5 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
17479 ns |
8334 ns |
2.10 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
17041 ns |
8104 ns |
2.10 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
24583 ns |
9750 ns |
2.52 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
17417 ns |
8500 ns |
2.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
134611 ns |
137456 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3320553.5 ns |
||
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
791583 ns |
796375 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
67721 ns |
68311 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
17646 ns |
11666 ns |
1.51 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
19250 ns |
12083 ns |
1.59 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
20270.5 ns |
12583 ns |
1.61 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
18625 ns |
12750 ns |
1.46 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
703363 ns |
721292 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
20902978.5 ns |
||
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
5212209 ns |
5345750 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
362123 ns |
373344 ns |
0.97 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
291 ns |
291 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
250 ns |
292 ns |
0.86 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
23131 ns |
23235 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI |
2015510 ns |
||
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal |
315458.5 ns |
226791 ns |
1.39 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU |
54431 ns |
51721 ns |
1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2917 ns |
2917 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
3042 ns |
3083 ns |
0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
3417 ns |
3250 ns |
1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
3167 ns |
2834 ns |
1.12 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
207060.5 ns |
216259.5 ns |
0.96 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI |
9314773 ns |
||
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal |
1563646 ns |
1692958 ns |
0.92 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
173121.5 ns |
161612 ns |
1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
47041.5 ns |
11625 ns |
4.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
46250 ns |
11229.5 ns |
4.12 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
49562.5 ns |
13250 ns |
3.74 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
46000 ns |
12166 ns |
3.78 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
135869.5 ns |
139967.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3451740 ns |
||
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
845208 ns |
892584 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
240542 ns |
243863 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
62917 ns |
21083 ns |
2.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
58750 ns |
20396 ns |
2.88 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
61708 ns |
26062.5 ns |
2.37 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
58625 ns |
21604 ns |
2.71 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
635101 ns |
652418 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19617768 ns |
||
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
4454000 ns |
4821708.5 ns |
0.92 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
654596 ns |
672612 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4416 ns |
4417 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4375 ns |
4375 ns |
1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4458 ns |
4416 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4458 ns |
4333 ns |
1.03 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
24739 ns |
24831 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI |
2281052 ns |
||
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal |
220917 ns |
223938 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU |
52320 ns |
52890.5 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16750 ns |
16333.5 ns |
1.03 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16625 ns |
16750 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16834 ns |
16583 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16667 ns |
16625 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
347277.5 ns |
356581 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI |
12189793 ns |
||
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal |
1125021 ns |
1752937.5 ns |
0.64 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
210311.5 ns |
210052 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
31041 ns |
1958 ns |
15.85 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
31625 ns |
1917 ns |
16.50 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
32209 ns |
2166 ns |
14.87 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
30792 ns |
2084 ns |
14.78 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
36064 ns |
36754.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1217452 ns |
||
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
274750 ns |
299041 ns |
0.92 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
208381 ns |
208032 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
51854 ns |
16958.5 ns |
3.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
51125 ns |
19042 ns |
2.68 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
52291.5 ns |
17458 ns |
3.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
51042 ns |
18062.5 ns |
2.83 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
299796 ns |
307642 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
20632540.5 ns |
||
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
4717208 ns |
5677458.5 ns |
0.83 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
702336 ns |
709468 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
60270.5 ns |
59125 ns |
1.02 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
65125 ns |
66208 ns |
0.98 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
65603.5 ns |
66083.5 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
53812.5 ns |
51334 ns |
1.05 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66512 ns |
66592 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU |
95131 ns |
113701 ns |
0.84 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
190937.5 ns |
210458 ns |
0.91 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
164000.5 ns |
143000 ns |
1.15 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
160125 ns |
119583 ns |
1.34 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
317979 ns |
307688 ns |
1.03 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
229832 ns |
234156 ns |
0.98 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU |
581689.5 ns |
598956 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
4059916 ns |
123833.5 ns |
32.79 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2016750 ns |
123125 ns |
16.38 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1840083 ns |
86500 ns |
21.27 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
6874334 ns |
82958 ns |
82.87 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
192888 ns |
190129 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5436632 ns |
||
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2069896 ns |
1825667 ns |
1.13 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
170621.5 ns |
188412 ns |
0.91 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5660770.5 ns |
1927375 ns |
2.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3272042 ns |
1909416.5 ns |
1.71 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3180250 ns |
1906875 ns |
1.67 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
8897583.5 ns |
1931021 ns |
4.61 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
567485 ns |
578778.5 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
25370262 ns |
||
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9213458 ns |
9303959 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1081814 ns |
1081141.5 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
291 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
333 ns |
291 ns |
1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
291 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
21777 ns |
22349 ns |
0.97 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI |
2224619 ns |
||
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal |
347709 ns |
372291 ns |
0.93 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU |
45540 ns |
45590 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1792 ns |
1792 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1833 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1875 ns |
1833 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1833 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
262738 ns |
272164 ns |
0.97 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI |
9744251 ns |
||
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal |
1535584 ns |
1469500 ns |
1.04 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
184041.5 ns |
187152 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
16917 ns |
9250 ns |
1.83 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
16458 ns |
8708 ns |
1.89 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
20750 ns |
11166 ns |
1.86 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
19354 ns |
10459 ns |
1.85 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
131285 ns |
134628.5 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3471620.5 ns |
||
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
840250 ns |
897749.5 ns |
0.94 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
238722 ns |
241763 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
14270.5 ns |
10125 ns |
1.41 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
15583 ns |
8458.5 ns |
1.84 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
16000 ns |
14375 ns |
1.11 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
15125 ns |
9542 ns |
1.59 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
572024 ns |
584537 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20186707 ns |
||
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
4247687.5 ns |
4632562 ns |
0.92 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
633925 ns |
645752 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
3974792 ns |
58375 ns |
68.09 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1894375 ns |
46625 ns |
40.63 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1739604 ns |
46708 ns |
37.24 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
6779583 ns |
82000 ns |
82.68 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
39793 ns |
40806 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1345357.5 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1158875 ns |
1140854.5 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
80431 ns |
78371 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5226750 ns |
1934584 ns |
2.70 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3143125.5 ns |
1981708 ns |
1.59 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3775208 ns |
1989334 ns |
1.90 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
8311708 ns |
1899750 ns |
4.38 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
234142.5 ns |
239556 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
30862071 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11350417 ns |
11301583 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1021228 ns |
1030691 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
419521 ns |
422125 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
418166 ns |
417583 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
422375 ns |
419750 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
417708.5 ns |
416292 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
236927 ns |
241184 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7988410.5 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
527104.5 ns |
546083 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
288373 ns |
289943 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
677458 ns |
752875.5 ns |
0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
681312.5 ns |
755666 ns |
0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
760583 ns |
675729 ns |
1.13 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
747563 ns |
760021 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1135579.5 ns |
1151706 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
46261240 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6579583 ns |
6939708 ns |
0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
927168 ns |
927380 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
3447833 ns |
3457437.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
3420292 ns |
3437021 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
3413833 ns |
3434709 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
3462042 ns |
3439146 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
174701 ns |
201324 ns |
0.87 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8218002.5 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1372271 ns |
1424084 ns |
0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
436313 ns |
412665 ns |
1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
6182625.5 ns |
6238000 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
6194250 ns |
6200250 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
6217479.5 ns |
6194458 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
6188333 ns |
6143770.5 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1072717 ns |
1091727.5 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
50181237 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
7350271 ns |
8063541.5 ns |
0.91 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1561243 ns |
1569386 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
472292 ns |
473666 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
341917 ns |
340792 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
340625 ns |
342166 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
903292 ns |
905125 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
46946.5 ns |
46953 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI |
390354 ns |
||
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal |
413792 ns |
496959 ns |
0.83 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
250932 ns |
251203 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2324958 ns |
2275334 ns |
1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
2037313 ns |
2043625 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
2038958.5 ns |
2032437 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3288312.5 ns |
3282416.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
254014 ns |
283225 ns |
0.90 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
8472255 ns |
||
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal |
2170416.5 ns |
2237145.5 ns |
0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
791641 ns |
791808 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2963708 ns |
57833 ns |
51.25 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1577375 ns |
45958 ns |
34.32 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2716541 ns |
46250 ns |
58.74 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
4279458 ns |
82792 ns |
51.69 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
28036 ns |
28918 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
1033104 ns |
||
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1161396 ns |
1145250 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
75880 ns |
78811 ns |
0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5369084 ns |
2000229 ns |
2.68 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3250416.5 ns |
2089833 ns |
1.56 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4448458.5 ns |
2077250 ns |
2.14 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
8412583 ns |
1980437.5 ns |
4.25 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
237966 ns |
244212 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
37959803 ns |
||
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11555333 ns |
11407979 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1047409 ns |
1055251 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
3974229 ns |
58000 ns |
68.52 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1921167 ns |
46250 ns |
41.54 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1742209 ns |
46666 ns |
37.33 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
6685833.5 ns |
83041 ns |
80.51 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
49792 ns |
50656 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
818474 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1117792 ns |
1123000 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
78246 ns |
73121 ns |
1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5155834 ns |
1903916 ns |
2.71 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3043500 ns |
1902541 ns |
1.60 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3837667 ns |
1978250 ns |
1.94 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
7749958 ns |
1902959 ns |
4.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
244579 ns |
251664 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
18235453 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10215937.5 ns |
9794437.5 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
927928 ns |
936124.5 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4333 ns | 333 ns | 13.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4209 ns | 292 ns | 14.41 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4834 ns | 416 ns | 11.62 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3875 ns | 292 ns | 13.27 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 35010 ns | 35119.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1256608 ns | | |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 281667 ns | 308104.5 ns | 0.91 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48850 ns | 50550 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 13250 ns | 7937.5 ns | 1.67 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 13375 ns | 7625 ns | 1.75 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 15500 ns | 7625 ns | 2.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 12459 ns | 8167 ns | 1.53 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 206263 ns | 218323.5 ns | 0.94 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20165898 ns | | |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4657770.5 ns | 4836354 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 373924 ns | 381674 ns | 0.98 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31771 ns | 33417 ns | 0.95 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1188142 ns | | |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 255291.5 ns | 259375 ns | 0.98 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 39421 ns | 43851 ns | 0.90 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2708 ns | 2792 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 2875 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3459 ns | 2916 ns | 1.19 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2667 ns | 2667 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 195647.5 ns | 205231.5 ns | 0.95 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7917427.5 ns | | |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 1054125 ns | 1294875 ns | 0.81 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 152102 ns | 166746 ns | 0.91 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 4033875 ns | 437042 ns | 9.23 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2203417 ns | 422021 ns | 5.22 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1936229 ns | 424229 ns | 4.56 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 7212459 ns | 425834 ns | 16.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 139985 ns | 142985.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6197825 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2258375 ns | 2238375 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 355557.5 ns | 375684 ns | 0.95 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 7756667 ns | 3809770.5 ns | 2.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5382333 ns | 3802375 ns | 1.42 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5281167 ns | 3804250 ns | 1.39 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 11556729.5 ns | 3793125 ns | 3.05 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 770872 ns | 782254 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32375322 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10991875 ns | 11146187.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1497828 ns | 1312364 ns | 1.14 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49915312.5 ns | 49907416.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35517417 ns | 35559584 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35546604.5 ns | 35529250 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 98239417 ns | 96899084 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1611413 ns | 1625871 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1014223.5 ns | 1003290 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154593958 ns | 154966354 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112413083.5 ns | 112363000 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112280959 ns | 112555750 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 300055687.5 ns | 296527604.5 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6477347 ns | 6450345 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5540958 ns | 5530212.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 20083 ns | 19374.5 ns | 1.04 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 18562.5 ns | 18750 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 16041 ns | 17353.5 ns | 0.92 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15375 ns | 15188 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20323 ns | 20779 ns | 0.98 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1119679 ns | | |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 223979.5 ns | 224333 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 26100 ns | 26660 ns | 0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10667 ns | 10917 ns | 0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 8854.5 ns | 8834 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9083 ns | 9291 ns | 0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17708 ns | 17291 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 290480 ns | 299343 ns | 0.97 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 9939862 ns | | |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1566792 ns | 1655375 ns | 0.95 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 151932 ns | 155331 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 16500 ns | 8312.5 ns | 1.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 17583 ns | 8459 ns | 2.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 20083 ns | 10895.5 ns | 1.84 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 16500 ns | 9312.5 ns | 1.77 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 129278.5 ns | 142637 ns | 0.91 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3610260.5 ns | | |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 796375.5 ns | 798083 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 238692 ns | 241143 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 15834 ns | 10333.5 ns | 1.53 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 15416 ns | 9042 ns | 1.70 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 17687.5 ns | 9583 ns | 1.85 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 15375 ns | 8937.5 ns | 1.72 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 686867 ns | 705801.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23595623.5 ns | | |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 4793187 ns | 5435917 ns | 0.88 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 666035 ns | 657647 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 36458.5 ns | 9020.5 ns | 4.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 38333 ns | 10229 ns | 3.75 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 74333.5 ns | 11250 ns | 6.61 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 37458.5 ns | 9792 ns | 3.83 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 133913.5 ns | 137059 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3390967 ns | | |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 833187.5 ns | 882166.5 ns | 0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 75120 ns | 78120 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 54083 ns | 13020.5 ns | 4.15 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 50583 ns | 12583.5 ns | 4.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 54750 ns | 13583 ns | 4.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 50854 ns | 13458 ns | 3.78 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 639697.5 ns | 651470.5 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19609315 ns | | |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4485208.5 ns | 4779312.5 ns | 0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 346863 ns | 356033 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4208 ns | 459 ns | 9.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4250 ns | 458 ns | 9.28 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4834 ns | 625 ns | 7.73 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4083 ns | 459 ns | 8.90 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 34711 ns | 35430 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1210488 ns | | |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 275854.5 ns | 385417 ns | 0.72 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 208282 ns | 210072 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 14146 ns | 8166 ns | 1.73 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 13937.5 ns | 8000 ns | 1.74 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 15666 ns | 8937.5 ns | 1.75 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 13708 ns | 8208 ns | 1.67 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 230721 ns | 238141 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22496798.5 ns | | |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4937375 ns | 5550500 ns | 0.89 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 666285.5 ns | 670717 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 17541.5 ns | 16417 ns | 1.07 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 16250 ns | 16709 ns | 0.97 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 14521 ns | 15209 ns | 0.95 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 11125 ns | 10312.5 ns | 1.08 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 21465 ns | 21707 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1152895.5 ns | | |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 208042 ns | 217458 ns | 0.96 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 188121 ns | 194532 ns | 0.97 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 32062.5 ns | 31854.5 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32417 ns | 32167 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32208 ns | 32250 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 32395.5 ns | 32125 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 306637 ns | 316460 ns | 0.97 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11224318 ns | | |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1690625 ns | 1889916 ns | 0.89 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 606015 ns | 608847 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3934979 ns | 450417 ns | 8.74 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2217312.5 ns | 482813 ns | 4.59 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2342145.5 ns | 444604 ns | 5.27 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 6834708 ns | 440875 ns | 15.50 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194727 ns | 193879 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5990269 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2080250 ns | 2124500 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 352143 ns | 376794 ns | 0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 7550250 ns | 3673458 ns | 2.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5240834 ns | 3802062.5 ns | 1.38 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5157500 ns | 3822709 ns | 1.35 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 10890000 ns | 3821333 ns | 2.85 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 578849 ns | 588897 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 29191067 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10413125 ns | 9577042 ns | 1.09 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1218310 ns | 1393435 ns | 0.87 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 784817521 ns | 783185125 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 542901125 ns | 542907542 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 544360541 ns | 543132625 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1524372458 ns | 1514951833.5 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22761020.5 ns | 22763713 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 13995220 ns | 14159478.5 ns | 0.99 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2525053917 ns | 2527739209 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 3158105667 ns | 1799023667 ns | 1.76 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1790903166 ns | 1787795417 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4813364000 ns | 4787274417 ns | 1.01 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 374548499 ns | 333649192 ns | 1.12 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 88056302 ns | 88087394 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77334 ns | 76666.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76667 ns | 79083 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 78916.5 ns | 79375 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 77375 ns | 78124.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 235013.5 ns | 238895.5 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7866147 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 524979 ns | 542209 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 109721 ns | 111271 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 280208.5 ns | 277000 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 293084 ns | 278895.5 ns | 1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 285646 ns | 194979 ns | 1.47 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 192125 ns | 259250 ns | 0.74 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1120762 ns | 1134646.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 47294846.5 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6215791 ns | 6160709 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 643071 ns | 645127 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199524833.5 ns | 199977437.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 139211333 ns | 139216750 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139220750 ns | 139454459 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 389521625 ns | 389873250 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5844935.5 ns | 5849131.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3420559 ns | 3425810.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 619399375 ns | 621409333 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 439802250 ns | 440537375 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 439385416.5 ns | 440145604 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1198327959 ns | 1186223625 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26665852 ns | 26711378 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21821827 ns | 21741902 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 553792 ns | 7291 ns | 75.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 280542 ns | 6084 ns | 46.11 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 390937.5 ns | 6291 ns | 62.14 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 847750 ns | 10292 ns | 82.37 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27899 ns | 28202.5 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1268229 ns | | |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 559687.5 ns | 601583 ns | 0.93 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48850 ns | 48405.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 708542 ns | 220749.5 ns | 3.21 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 455333 ns | 222374.5 ns | 2.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 890499.5 ns | 222542 ns | 4.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1003292 ns | 217625 ns | 4.61 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 237988 ns | 245623 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32870826 ns | | |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8991520.5 ns | 8971334 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 532595 ns | 543906 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 17521 ns | 8145.5 ns | 2.15 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 16791.5 ns | 10083 ns | 1.67 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 23145.5 ns | 10833.5 ns | 2.14 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 20334 ns | 10000.5 ns | 2.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 130687 ns | 136003.5 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3317034 ns | | |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 863959 ns | 906833 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 72930 ns | 72945.5 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 13104.5 ns | 7500 ns | 1.75 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 13000 ns | 7209 ns | 1.80 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 14667 ns | 8292 ns | 1.77 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 13416 ns | 7500 ns | 1.79 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 569833.5 ns | 587405 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20622054.5 ns | | |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4421166.5 ns | 4757959 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 320658 ns | 326203 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 30500 ns | 500 ns | 61 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 27459 ns | 458 ns | 59.95 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 28375 ns | 542 ns | 52.35 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 27666 ns | 375 ns | 73.78 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 26372 ns | 26999 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1201158 ns | | |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 459833 ns | 493458.5 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 49171 ns | 49231 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 46604.5 ns | 9458 ns | 4.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 44708 ns | 10250 ns | 4.36 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 46374.5 ns | 10521.5 ns | 4.41 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 44666.5 ns | 10125 ns | 4.41 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 267493.5 ns | 275766.5 ns | 0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23479888 ns | | |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5496458 ns | 6076395.5 ns | 0.90 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 392323 ns | 401444 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 108437.5 ns | 107104.5 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 99500 ns | 99896 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 99917 ns | 101145.5 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146520.5 ns | 146459 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 24197 ns | 24813 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 1207580 ns | | |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 260875 ns | 277416.5 ns | 0.94 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 190092 ns | 192192 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 478708 ns | 479500 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 523083 ns | 494084 ns | 1.06 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 497334 ns | 478958 ns | 1.04 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 479062.5 ns | 528667 ns | 0.91 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 248813 ns | 258431 ns | 0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11991369.5 ns | | |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2125229 ns | 2276458 ns | 0.93 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 622225 ns | 624467 ns | 1.00 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5562.5 ns | 5750.5 ns | 0.97 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 5917 ns | 6917 ns | 0.86 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7479 ns | 6833.5 ns | 1.09 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4334 ns | 4458 ns | 0.97 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 16547 ns | 18139 ns | 0.91 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 72491 ns | 73231 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11687.5 ns | 11854 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 10584 ns | 10500.5 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 11187.5 ns | 11104.5 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 16541 ns | 17083 ns | 0.97 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 231345 ns | 235890 ns | 0.98 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 377104 ns | 372074 ns | 1.01 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 40000 ns | 38750 ns | 1.03 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 52437 ns | 51292 ns | 1.02 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52125.5 ns | 52729.5 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13791 ns | 15834 ns | 0.87 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 20260 ns | 20456 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 79691 ns | 87011 ns | 0.92 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 41500 ns | 36875 ns | 1.13 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 30416.5 ns | 34729 ns | 0.88 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 31125 ns | 32167 ns | 0.97 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57292 ns | 57000 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 208650 ns | 212876 ns | 0.98 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 403134 ns | 418835 ns | 0.96 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1791 ns | 1791 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1709 ns | 1708 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2104.5 ns | 2187.5 ns | 0.96 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 2125 ns | 1875 ns | 1.13 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 20031.5 ns | 20570.5 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1108740 ns | | |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 302250 ns | 329917 ns | 0.92 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 28990 ns | 31020 ns | 0.93 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2292 ns | 2209 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2166.5 ns | 2250 ns | 0.96 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2333 ns | 2500 ns | 0.93 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2208 ns | 2208 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 218720 ns | 226270.5 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 9455550 ns | | |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1546125 ns | 1683458.5 ns | 0.92 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 139222 ns | 142136.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5583.5 ns | 5042 ns | 1.11 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4646 ns | 4500 ns | 1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5583 ns | 6208.5 ns | 0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6417 ns | 4666.5 ns | 1.38 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 159519 ns | 163224.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5766275 ns | | |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 443417 ns | 800792 ns | 0.55 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 60471 ns | 75611 ns | 0.80 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8458 ns | 8291 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8083 ns | 8209 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8750 ns | 8583 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8000 ns | 8250 ns | 0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 940829 ns | 960930 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 39155504 ns | | |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5477667 ns | 5752708 ns | 0.95 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 385303 ns | 398144 ns | 0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 549708 ns | 56791 ns | 9.68 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 313333 ns | 57459 ns | 5.45 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 420167 ns | 57667 ns | 7.29 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 879208 ns | 58208 ns | 15.10 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37342 ns | 38436 ns | 0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1173763 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 356041.5 ns | 411813 ns | 0.86 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 207582 ns | 218852 ns | 0.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 965958 ns | 448812.5 ns | 2.15 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 690250 ns | 499084 ns | 1.38 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 979708 ns | 465709 ns | 2.10 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1175375 ns | 481396 ns | 2.44 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 273355 ns | 282356.5 ns | 0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26028179 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8107333.5 ns | 7964500 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 800411.5 ns | 842729 ns | 0.95 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3334104.5 ns | 3322916 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 2338854.5 ns | 2338771 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2337104 ns | 2339375 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6329958 ns | 6304166.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 205352 ns | 204545 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 205572 ns | 202912 ns | 1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11443250 ns | 11552375 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 8304750.5 ns | 8313541.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8298563 ns | 8336875 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21095145.5 ns | 21101437.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 735485 ns | 734673 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1063824.5 ns | 1078791.5 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4854.5 ns | 6166 ns | 0.79 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5583 ns | 4916.5 ns | 1.14 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6291 ns | 6541 ns | 0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6791.5 ns | 4875 ns | 1.39 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 152797.5 ns | 158133 ns | 0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 5417356 ns | | |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 774375 ns | 887167 ns | 0.87 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56590 ns | 57035.5 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7208 ns | 7166 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7041.5 ns | 7209 ns | 0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 7292 ns | 1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7459 ns | 7083 ns | 1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 789393 ns | 816855 ns | 0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 34656454 ns | | |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5257437.5 ns | 6166979.5 ns | 0.85 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 378563 ns | 384744.5 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3119334 ns | 123458 ns | 25.27 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1745083 ns | 131229 ns | 13.30 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2891458.5 ns | 100000 ns | 28.91 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 4560645.5 ns | 94625 ns | 48.20 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 159047 ns | 160516.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6050883 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2930958 ns | 2207458 ns | 1.33 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 185181 ns | 187112 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6113166 ns | 1964000 ns | 3.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3539750 ns | 2023146 ns | 1.75 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3525062.5 ns | 2028667 ns | 1.74 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 9720709 ns | 2018916.5 ns | 4.81 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 771116 ns | 789517 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33201774 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10973750 ns | 11417250 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1117479.5 ns | 1260093 ns | 0.89 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 34146 ns | 33813 ns | 1.01 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36146 ns | 36729 ns | 0.98 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 34479.5 ns | 34708.5 ns | 0.99 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 708 ns | 667 ns | 1.06 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15370 ns | 15818 ns | 0.97 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 80930 ns | 82161 ns | 0.99 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2750 ns | 2583 ns | 1.06 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 2792 ns | 2709 ns | 1.03 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3167 ns | 2959 ns | 1.07 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2229.5 ns | 2125 ns | 1.05 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 146666.5 ns | 152979.5 ns | 0.96 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 353043 ns | 352884 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 559458 ns | 7250 ns | 77.17 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 282458 ns | 6042 ns | 46.75 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 394500 ns | 6125 ns | 64.41 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 826875 ns | 9958 ns | 83.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36672 ns | 37656 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1162707 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 658604.5 ns | 431042 ns | 1.53 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49520 ns | 49591 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 696041 ns | 214000 ns | 3.25 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 435812 ns | 232937.5 ns | 1.87 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 870770.5 ns | 221834 ns | 3.93 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 985604.5 ns | 232000 ns | 4.25 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 252602.5 ns | 258714 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26336772 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8008125 ns | 7857271 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 519460 ns | 526085 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 4000 ns | 3917 ns | 1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 4000 ns | 3958 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22005 ns | 22767 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2135206 ns | | |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 242458 ns | 244500 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 46161 ns | 47941 ns | 0.96 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15042 ns | 14667 ns | 1.03 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14958 ns | 15000 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15125 ns | 14959 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14917 ns | 14959 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 334532.5 ns | 344878 ns | 0.97 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11453292 ns | | |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 997792 ns | 1074437.5 ns | 0.93 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 192172 ns | 201792 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 4158541.5 ns | 120021 ns | 34.65 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2056417 ns | 98958.5 ns | 20.78 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1881542 ns | 104666.5 ns | 17.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 7039958 ns | 144250 ns | 48.80 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 148404 ns | 160419 ns | 0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5796388 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2225209 ns | 2228291 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 184542 ns | 170682 ns | 1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5874083 ns | 1891375 ns | 3.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3428459 ns | 1833541.5 ns | 1.87 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3434291 ns | 1894375 ns | 1.81 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 9513125 ns | 1924667 ns | 4.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 754215 ns | 772105.5 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 30850166 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10913312.5 ns | 10866208 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1087499 ns | 1240333 ns | 0.88 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 467937.5 ns | 20250 ns | 23.11 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 278209 ns | 18937.5 ns | 14.69 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 528667 ns | 20542 ns | 25.74 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 599187.5 ns | 20208 ns | 29.65 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 124428.5 ns | 127944 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3459386 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1420625 ns | 1385750 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81731 ns | 82111 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 771479 ns | 216708 ns | 3.56 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 473208 ns | 255583 ns | 1.85 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 708938 ns | 218146 ns | 3.25 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1080250 ns | 217458 ns | 4.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 558260 ns | 580859 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 19290584 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6248458 ns | 6240292 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 474794.5 ns | 484605 ns | 0.98 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 25104 ns | 25687 ns | 0.98 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 31666 ns | 31687.5 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 28333 ns | 29145.5 ns | 0.97 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1291.5 ns | 1541 ns | 0.84 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16491 ns | 17059 ns | 0.97 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 82631 ns | 83471 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4708 ns | 4896 ns | 0.96 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4854.5 ns | 4687.5 ns | 1.04 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5542 ns | 5208 ns | 1.06 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4708 ns | 4708 ns | 1 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 224621.5 ns | 231729 ns | 0.97 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 385044 ns | 400815 ns | 0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 304124.5 ns | 304916 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 304917 ns | 307083.5 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 306208 ns | 310250 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 309792 ns | 307458 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 256784 ns | 260954.5 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8317280 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1129124.5 ns | 1003667 ns | 1.12 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 277373 ns | 282392 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 585104 ns | 530417 ns | 1.10 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 529541.5 ns | 536417 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 543124.5 ns | 533416.5 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 529708.5 ns | 540917 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1169406 ns | 1194615.5 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44300140 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6115584 ns | 6650583.5 ns | 0.92 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 867787 ns | 886938 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 597416.5 ns | 19292 ns | 30.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 329083.5 ns | 20437.5 ns | 16.10 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 410917 ns | 21583 ns | 19.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 891521 ns | 19250 ns | 46.31 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 128708.5 ns | 134679 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3777831.5 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1542125 ns | 1513292 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76581 ns | 76825.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 800041 ns | 215083 ns | 3.72 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 475291.5 ns | 212625 ns | 2.24 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 701041.5 ns | 215021 ns | 3.26 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1201209 ns | 249312.5 ns | 4.82 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 853223 ns | 889532 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24974069 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7559729 ns | 7210062.5 ns | 1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 541885 ns | 554056 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6312.5 ns | 6583 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6666 ns | 6937.5 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7875 ns | 9208 ns | 0.86 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 7875 ns | 6792 ns | 1.16 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 155050 ns | 160487 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5878771.5 ns | | |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 786750 ns | 869792 ns | 0.90 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69765.5 ns | 69890 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10187.5 ns | 10000 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9958 ns | 9854.5 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11645.5 ns | 10292 ns | 1.13 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9791.5 ns | 10375 ns | 0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 870241 ns | 896806 ns | 0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37856106 ns | | |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5278479 ns | 5937375 ns | 0.89 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 388448.5 ns | 398234 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4500 ns | 4125 ns | 1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5042 ns | 5542 ns | 0.91 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6750 ns | 0.89 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6729.5 ns | 4750 ns | 1.42 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 158522 ns | 162556 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5543448 ns | | |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 772000 ns | 844750 ns | 0.91 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 62160 ns | 62561 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7417 ns | 7500 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7542 ns | 7250 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8312.5 ns | 7667 ns | 1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7541 ns | 7166 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 816695 ns | 844691 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 38779263 ns | | |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5543708.5 ns | 5794250.5 ns | 0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 392668.5 ns | 401898.5 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14518750 ns | 14528708 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10129624.5 ns | 10144083 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10127042 ns | 10119791 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27754541 ns | 27783209 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 528414 ns | 561716 ns | 0.94 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 390163 ns | 405538.5 ns | 0.96 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46474417 ns | 46624812 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33475041.5 ns | 33411666.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33403292 ns | 33562500 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85360084 ns | 85401583 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2627902 ns | 2800168 ns | 0.94 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3282258.5 ns | 3289235 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 610750 ns | 66500 ns | 9.18 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 374458 ns | 68375 ns | 5.48 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 440146 ns | 68875 ns | 6.39 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 974562.5 ns | 67250 ns | 14.49 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 136474.5 ns | 138855.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3590577 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1546353.5 ns | 1526666.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 228622.5 ns | 238492 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 1080542 ns | 444500.5 ns | 2.43 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 751042 ns | 442146 ns | 1.70 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 968354 ns | 441583 ns | 2.19 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1376125 ns | 493750 ns | 2.79 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 789159.5 ns | 807637.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26131425 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7922542 ns | 7704542 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 799502 ns | 803267.5 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 22333 ns | 542 ns | 41.20 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 23375 ns | 625 ns | 37.40 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 49958 ns | 666 ns | 75.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 18041 ns | 500 ns | 36.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32198 ns | 33435 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1217074 ns | | |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 290584 ns | 422917 ns | 0.69 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 49801 ns | 52200 ns | 0.95 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 45042 ns | 9458.5 ns | 4.76 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 40959 ns | 10771 ns | 3.80 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 43250 ns | 10416.5 ns | 4.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 41083.5 ns | 9666.5 ns | 4.25 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 294520 ns | 303460.5 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21753846 ns | | |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4830770.5 ns | 5666958.5 ns | 0.85 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 387483 ns | 397994 ns | 0.97 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9834 ns | 9875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9834 ns | 9916 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9875 ns | 9792 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9875 ns | 9792 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23374 ns | 24011 ns | 0.97 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2073950.5 ns | | |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 218625 ns | 225541 ns | 0.97 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 216122 ns | 218962 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 46334 ns | 46000 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45833 ns | 46167 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46458 ns | 46416 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 46166 ns | 46334 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 301086 ns | 315869 ns | 0.95 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11347035 ns | | |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 963208 ns | 1098270.5 ns | 0.88 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 625096 ns | 628475.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 551625 ns | 56250 ns | 9.81 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 317875 ns | 57208 ns | 5.56 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 413916 ns | 57167 ns | 7.24 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 884791.5 ns | 57833 ns | 15.30 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28786 ns | 29662 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1291553 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 647917 ns | 616041 ns | 1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 206132 ns | 218842 ns | 0.94 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 995541.5 ns | 450333 ns | 2.21 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 719479.5 ns | 473958 ns | 1.52 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 1170021 ns | 468792 ns | 2.50 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1244792 ns | 442709 ns | 2.81 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 252769 ns | 260564.5 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 33932766 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9473042 ns | 9323750 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 845987 ns | 849638 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 614625 ns | 607437.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 650709 ns | 677167 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 649833 ns | 619062.5 ns | 1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 642250 ns | 645083.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 227111 ns | 227369 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8674519 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1360563 ns | 1393791.5 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 249152.5 ns | 251853 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2246395.5 ns | 2229542 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2228000 ns | 2242667 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2229937.5 ns | 2238417 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2234687.5 ns | 2233500 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1048698.5 ns | 1055691 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48260078.5 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7344750 ns | 7106083 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1381842 ns | 1380353 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 501104.5 ns | 20396 ns | 24.57 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 291562.5 ns | 20625 ns | 14.14 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 528291.5 ns | 21333.5 ns | 24.76 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 607896 ns | 23708 ns | 25.64 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 127286.5 ns | 128483 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3562244 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1528187.5 ns | 1530250 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76185.5 ns | 82281 ns | 0.93 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 802209 ns | 219229.5 ns | 3.66 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 495791.5 ns | 223875 ns | 2.21 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 717333.5 ns | 221083 ns | 3.24 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 1168708 ns | 219917 ns | 5.31 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 826637 ns | 851484 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27600838 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7736667 ns | 7710292 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 564375 ns | 562290 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 23167 ns | 500 ns | 46.33 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 22709 ns | 584 ns | 38.89 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 49750 ns | 625 ns | 79.60 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 19792 ns | 500 ns | 39.58 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23038 ns | 23568 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1182719 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 323125 ns | 453729.5 ns | 0.71 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 50411 ns | 50170 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 46875 ns | 10937.5 ns | 4.29 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 44417 ns | 10479 ns | 4.24 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 46979 ns | 11166 ns | 4.21 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 44541.5 ns | 10208 ns | 4.36 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 273384 ns | 278881.5 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23625982 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5366166 ns | 6153209 ns | 0.87 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 411123 ns | 418644 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 19687.5 ns | 10416.5 ns | 1.89 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 18500 ns | 9500 ns | 1.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 23937 ns | 11000 ns | 2.18 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 21083.5 ns | 8958 ns | 2.35 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 134758.5 ns | 137213 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3415558 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 867333 ns | 886500 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 67810 ns | 74561 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 13333 ns | 7458 ns | 1.79 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 12937.5 ns | 7750 ns | 1.67 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 14500 ns | 8208 ns | 1.77 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 13459 ns | 7541 ns | 1.78 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 538019 ns | 553485 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18934516.5 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 3936125 ns | 4191417 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 335303 ns | 340023 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1666.5 ns | 1791.5 ns | 0.93 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1500 ns | 1625 ns | 0.92 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1833 ns | 2083 ns | 0.88 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1416.5 ns | 1583 ns | 0.89 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 20929 ns | 21340 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1201516 ns | | |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 293750 ns | 310625 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 192642 ns | 192401.5 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3395.5 ns | 3292 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 3375 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3500 ns | 3583 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3500 ns | 3375 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 235765.5 ns | 244685 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11074080 ns | | |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1582208 ns | 1830688 ns | 0.86 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 596540 ns | 598576 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148250 ns | 148667 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 129854 ns | 128833 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 128895.5 ns | 129604 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225979.5 ns | 225042 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24092.5 ns | 24647 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1235643 ns | | |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 292042 ns | 278416 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 36870 ns | 37400 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 158833 ns | 143709 ns | 1.11 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 124333 ns | 124625 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 124000 ns | 110395.5 ns | 1.12 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 250771 ns | 287812.5 ns | 0.87 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 236446 ns | 242298 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 11052857 ns | | |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2005854 ns | 2059479 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 225482 ns | 238587 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 429084 ns | 7125 ns | 60.22 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 237167 ns | 6000 ns | 39.53 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 498750 ns | 6000 ns | 83.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 488542 ns | 10062.5 ns | 48.55 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32663 ns | 33200 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1216110 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 339854 ns | 358750 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50830 ns | 52880 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 700417 ns | 220291 ns | 3.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 436375 ns | 231500 ns | 1.88 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 888520.5 ns | 229125 ns | 3.88 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 954833 ns | 245229.5 ns | 3.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 267203 ns | 272719 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29188160 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8201709 ns | 8345291.5 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 526085 ns | 536095 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15396 ns | 15417 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15125 ns | 15167 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16000 ns | 17042 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15625 ns | 15375 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 154544.5 ns | 158597.5 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5785019 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 755084 ns | 852042 ns | 0.89 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 239652 ns | 242502 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23708 ns | 23937.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23833 ns | 24666 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24334 ns | 23874.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 22750 ns | 23291.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 912012 ns | 931616 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 40200102.5 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5378958 ns | 5615896 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 693006 ns | 698756 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 45792 ns | 9958 ns | 4.60 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 42792 ns | 10166 ns | 4.21 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 50250 ns | 12333 ns | 4.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 43833.5 ns | 9083 ns | 4.83 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 137439.5 ns | 141537 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3438224.5 ns | | |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 725875 ns | 805292 ns | 0.90 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 70921 ns | 77251 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 57979.5 ns | 14083 ns | 4.12 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 54833 ns | 14250 ns | 3.85 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 58625 ns | 14208.5 ns | 4.13 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 55750 ns | 13375 ns | 4.17 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 746727 ns | 768706 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 20973525 ns | | |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 4583646 ns | 5278042 ns | 0.87 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 367203 ns | 378343 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 46479 ns | 10104.5 ns | 4.60 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 44250 ns | 10000 ns | 4.42 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 50770.5 ns | 11354 ns | 4.47 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 44583 ns | 9791.5 ns | 4.55 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 136494.5 ns | 139922.5 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3475417.5 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 874292 ns | 897959 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72190 ns | 77161 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 52500 ns | 12333 ns | 4.26 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 49958 ns | 12709 ns | 3.93 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 53854.5 ns | 13000 ns | 4.14 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 50520.5 ns | 12833.5 ns | 3.94 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 610073.5 ns | 626138 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19162614 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4104208 ns | 4505687.5 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 343583 ns | 350573 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 31333 ns | 27729 ns | 1.13 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 33271 ns | 35375 ns | 0.94 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31167 ns | 32291 ns | 0.97 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1979.5 ns | 2041 ns | 0.97 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16542 ns | 16815 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 73361 ns | 83101 ns | 0.88 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5291 ns | 5291.5 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5125 ns | 5146 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5417 ns | 5209 ns | 1.04 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6541.5 ns | 6229.5 ns | 1.05 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 147388 ns | 151130 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 371153 ns | 372413 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4292 ns | 250 ns | 17.17 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4084 ns | 292 ns | 13.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4875 ns | 375 ns | 13 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4167 ns | 250 ns | 16.67 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 25733 ns | 26290 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1193675 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 340687 ns | 357500 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48841 ns | 48805.5 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 13625 ns | 7354 ns | 1.85 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 13833 ns | 7250 ns | 1.91 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 15708 ns | 8041 ns | 1.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 14208.5 ns | 6979 ns | 2.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 194695.5 ns | 200306 ns | 0.97 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24326336 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5508000 ns | 6097521 ns | 0.90 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 392403 ns | 397569 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 31709 ns | 1958 ns | 16.19 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 31625 ns | 2041 ns | 15.49 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 32167 ns | 2084 ns | 15.44 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 31167 ns | 2000 ns | 15.58 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 26408 ns | 27273 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1195704 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 309499.5 ns | 493416.5 ns | 0.63 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 209172 ns | 209702 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 54000 ns | 17541 ns | 3.08 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 51250 ns | 18166.5 ns | 2.82 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 54084 ns | 17667 ns | 3.06 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 52208 ns | 17666.5 ns | 2.96 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 279056 ns | 285604 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25130314 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5457854.5 ns | 6161167 ns | 0.89 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 716901 ns | 724677 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 148062.5 ns | 174417 ns | 0.85 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 152625 ns | 167583.5 ns | 0.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 156854 ns | 151417 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 148833 ns | 145583 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 220541 ns | 225867 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7910330 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1433416.5 ns | 1429395.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 176132 ns | 227572 ns | 0.77 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1331604 ns | 1321729 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1325375 ns | 1323417 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1321771 ns | 1328313 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1315145.5 ns | 1325750 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 983917 ns | 1001329 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 47509928.5 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6671667 ns | 6753917 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1112170 ns | 1011639.5 ns | 1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25750 ns | 24896.5 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25333 ns | 25250 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28395.5 ns | 28000 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25729 ns | 25542 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 265795.5 ns | 271026.5 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7976824.5 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 967791.5 ns | 986750 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 117731 ns | 119521 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 172583 ns | 117833 ns | 1.46 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 118708 ns | 120083 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 170583.5 ns | 118375 ns | 1.44 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 117292 ns | 176875 ns | 0.66 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1197612.5 ns | 1213900 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45677207 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6174459 ns | 6376312.5 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 600295 ns | 614965 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3458 ns | 291 ns | 11.88 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3334 ns | 375 ns | 8.89 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6792 ns | 375 ns | 18.11 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 2875 ns | 292 ns | 9.85 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22675 ns | 23468 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1239187 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 290750 ns | 446666 ns | 0.65 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 49411 ns | 49170 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 13917 ns | 7562.5 ns | 1.84 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 13791.5 ns | 7584 ns | 1.82 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 16000 ns | 8000 ns | 2 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 14250 ns | 6875 ns | 2.07 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 200825 ns | 206004 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24296939 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5238458 ns | 5961166 ns | 0.88 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 396233 ns | 407824 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6750 ns | 5812 ns | 1.16 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5937.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6792 ns | 7333 ns | 0.93 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6500 ns | 6854.5 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 165626 ns | 167575 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5771286 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 502625 ns | 672646 ns | 0.75 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 238537 ns | 240143 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9959 ns | 9875 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9791.5 ns | 9834 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10125 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9542 ns | 9854 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 959124 ns | 978859.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 41918838 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5914646 ns | 5692125.5 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 679656 ns | 683826 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 666 ns | 708 ns | 0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 666 ns | 666 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 667 ns | 625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23007 ns | 23025 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2142862 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 222000 ns | 214354 ns | 1.04 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 215712 ns | 216952 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4583 ns | 4667 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4709 ns | 4667 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4917 ns | 4958 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4625 ns | 4542 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 229000 ns | 242032 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9639894 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1615667 ns | 1648667 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 600836 ns | 606251 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 17333 ns | 8750 ns | 1.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 17792 ns | 8500 ns | 2.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 22917 ns | 9917 ns | 2.31 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 19458 ns | 8375 ns | 2.32 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 134999.5 ns | 139395 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3466882.5 ns | | |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 756875 ns | 800687.5 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69155.5 ns | 77821 ns | 0.89 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 14583 ns | 8625 ns | 1.69 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 14542 ns | 8625 ns | 1.69 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 16375 ns | 8979.5 ns | 1.82 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 14896 ns | 8479 ns | 1.76 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 657528.5 ns | 674531.5 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 20992340 ns | | |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4676208 ns | 4665667 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 351823 ns | 358963 ns | 0.98 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 127208 ns | 126000 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129354 ns | 130375 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130583.5 ns | 129416 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183354 ns | 183687.5 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46414 ns | 46315 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 96011 ns | 97061 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 303500 ns | 332208 ns | 0.91 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 345209 ns | 323917 ns | 1.07 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 327375 ns | 315709 ns | 1.04 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 615750 ns | 569000 ns | 1.08 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 205613.5 ns | 209770 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 489039 ns | 517105 ns | 0.95 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396875 ns | 397958 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287958 ns | 288166 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287917 ns | 288250 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 754667 ns | 756041.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43605 ns | 44247 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1378944 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 421709 ns | 421167 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 83791 ns | 84151 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1454625 ns | 1380646 ns | 1.05 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1138125 ns | 1132937.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1134958 ns | 1131583.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2484042 ns | 2441875 ns | 1.02 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 260159.5 ns | 276054.5 ns | 0.94 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 12659523.5 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1794291 ns | 1744958 ns | 1.03 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 354928 ns | 354794 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 666271 ns | 655000 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 676375.5 ns | 645458 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 653250 ns | 606125 ns | 1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 626667 ns | 651333 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 219293.5 ns | 211637.5 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8648715 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1345459 ns | 1332417 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 244872 ns | 234477 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2450104.5 ns | 2442417 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2446958 ns | 2443729 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2449083.5 ns | 2460479.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2434250 ns | 2466125 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1069321.5 ns | 1084419 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 53497546 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7292437.5 ns | 9616354 ns | 0.76 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1446123 ns | 1491474 ns | 0.97 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 33792 ns | 32917 ns | 1.03 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 35375 ns | 35833 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34583 ns | 35333 ns | 0.98 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 895.5 ns | 958 ns | 0.93 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16123 ns | 16181 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 72861 ns | 81781 ns | 0.89 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3000 ns | 3083 ns | 0.97 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3333 ns | 3166 ns | 1.05 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3542 ns | 3417 ns | 1.04 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3000 ns | 3042 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 146404.5 ns | 149907.5 ns | 0.98 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 348953 ns | 345503 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3811625 ns | 406833.5 ns | 9.37 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2040334 ns | 408833 ns | 4.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1910292 ns | 409208.5 ns | 4.67 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 6752062.5 ns | 420333 ns | 16.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42704 ns | 44137 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1384773 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1162708 ns | 1179333.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 243042 ns | 242582 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 7177541 ns | 3874541 ns | 1.85 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4995166 ns | 3981625 ns | 1.25 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5603937.5 ns | 3995271 ns | 1.40 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 10187834 ns | 3778020.5 ns | 2.70 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 247421 ns | 254416 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 37946938 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11824125 ns | 12000083 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1245751 ns | 1240627 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 4000 ns | 3917 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 4166 ns | 3958 ns | 1.05 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33191 ns | 35129 ns | 0.94 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1226175 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 178041 ns | 181625 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 40780 ns | 42720 ns | 0.95 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 16209 ns | 15500 ns | 1.05 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 16000 ns | 15708 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 16375 ns | 16084 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15708 ns | 15834 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 267073.5 ns | 276415 ns | 0.97 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 9646077.5 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 855042 ns | 889271 ns | 0.96 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 164241.5 ns | 176511 ns | 0.93 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404791 ns | 404209 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295667 ns | 295395.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 294792 ns | 295625 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760625 ns | 760584 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113148 ns | 113822.5 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 1051645.5 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 413438 ns | 409229 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89061 ns | 92275.5 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1489875 ns | 1418500 ns | 1.05 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1163354 ns | 1143416 ns | 1.02 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1163000 ns | 1157042 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2466958 ns | 2464062 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 262128 ns | 252054 ns | 1.04 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10492873 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1875708 ns | 1932667 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 355908 ns | 360264 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4042 ns | 458 ns | 8.83 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3875 ns | 542 ns | 7.15 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4584 ns | 583 ns | 7.86 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3959 ns | 500 ns | 7.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 26052 ns | 26614 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1182454 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 293750 ns | 362145.5 ns | 0.81 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 209722 ns | 209492 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 15187.5 ns | 8375 ns | 1.81 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 15208 ns | 8542 ns | 1.78 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 17125 ns | 9125 ns | 1.88 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 15229 ns | 8208 ns | 1.86 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 212892 ns | 219325 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25606131.5 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5648583 ns | 6248208.5 ns | 0.90 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 694801.5 ns | 707647 ns | 0.98 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 835562.5 ns | 835021 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 620125 ns | 618583 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 620687.5 ns | 620791 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1544666.5 ns | 1547209 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 132526.5 ns | 131693 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 168801 ns | 167721 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2686104.5 ns | 2699249.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 2002979 ns | 2010542 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2010625 ns | 2008750 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4942145.5 ns | 4923458 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 254138 ns | 254591.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 866423 ns | 880209 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3334 ns | 333 ns | 10.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3375 ns | 375 ns | 9 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6812.5 ns | 375 ns | 18.17 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 2709 ns | 291 ns | 9.31 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32032 ns | 32661 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1203251 ns | | |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 272916 ns | 283208 ns | 0.96 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 49530 ns | 49561 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 13291.5 ns | 7417 ns | 1.79 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 13000 ns | 7417 ns | 1.75 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 14708 ns | 7958 ns | 1.85 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 13083 ns | 7083 ns | 1.85 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 224632 ns | 230552 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22001406 ns | | |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4935958 ns | 5450104 ns | 0.91 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 370213 ns | 374993 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2421500 ns | 2388875 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2393375 ns | 2390042 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2375042 ns | 2387625 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2390396 ns | 2385041 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 219055 ns | 222782 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8092339.5 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1468042 ns | 1608854 ns | 0.91 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 359023 ns | 336514 ns | 1.07 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4647833 ns | 4653250 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4651167 ns | 4641333 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4677124.5 ns | 4667333 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4648750 ns | 4656333 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 974219 ns | 986732 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 46807279 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6424625 ns | 6571104 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1416527.5 ns | 1423514 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 22500 ns | 6875 ns | 3.27 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7083 ns | 7396 ns | 0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7209 ns | 7583 ns | 0.95 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6792 ns | 6958.5 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 23346 ns | 24376 ns | 0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1158059.5 ns | | |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 260417 ns | 275584 ns | 0.94 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 38300 ns | 34810 ns | 1.10 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 49750 ns | 33521 ns | 1.48 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 49520.5 ns | 33500 ns | 1.48 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33000 ns | 33583 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 46542 ns | 32667 ns | 1.42 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 235206 ns | 243530.5 ns | 0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10567375 ns | | |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2058708 ns | 2038145.5 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 238552 ns | 242918 ns | 0.98 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 23375 ns | 21625 ns | 1.08 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 25562.5 ns | 26250 ns | 0.97 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 23604 ns | 25209 ns | 0.94 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5042 ns | 5167 ns | 0.98 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 17704 ns | 18282 ns | 0.97 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 85431 ns | 86261 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12292 ns | 11875 ns | 1.04 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10542 ns | 10417 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10708 ns | 10833 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18000 ns | 17792 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 246035 ns | 249417 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 391284 ns | 378534 ns | 1.03 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406625 ns | 406250 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 297042 ns | 297375 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296584 ns | 296750 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762208 ns | 762958 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46469 ns | 47260 ns | 0.98 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1389885 ns | | |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 432958 ns | 509104 ns | 0.85 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 90831 ns | 89561 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1487458 ns | 1445500 ns | 1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1169791 ns | 1166562.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1167417 ns | 1168167 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2472979.5 ns | 2472542 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 303943 ns | 314496 ns | 0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 11208548 ns | | |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2046584 ns | 2114437.5 ns | 0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 378743 ns | 384754 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 4011667 ns | 434833.5 ns | 9.23 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2111583 ns | 436583 ns | 4.84 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2016833.5 ns | 436750 ns | 4.62 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 7086979.5 ns | 447625 ns | 15.83 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 54688 ns | 55692 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1025990 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1128125 ns | 1118708.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 237392 ns | 238023 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 7118333 ns | 3881625 ns | 1.83 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4967562.5 ns | 4013979 ns | 1.24 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5565146.5 ns | 4029083 ns | 1.38 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 9774125 ns | 3805271 ns | 2.57 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 267660 ns | 274092 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 30789701 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10301187.5 ns | 10308938 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1242376 ns | 1240392 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8750 ns | 8792 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7833 ns | 7666 ns | 1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7834 ns | 7709 ns | 1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12416 ns | 12417 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24327 ns | 24383 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2162560 ns | | |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 218521 ns | 228000 ns | 0.96 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 217622 ns | 220382 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 46041 ns | 44708 ns | 1.03 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45542 ns | 45250 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45667 ns | 45208 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45042 ns | 45209 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 357694 ns | 367981 ns | 0.97 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 12955258 ns | | |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1668416.5 ns | 1846667 ns | 0.90 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 673576 ns | 666456 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3128875 ns | 83167 ns | 37.62 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1714291 ns | 83416 ns | 20.55 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2883292 ns | 83917 ns | 34.36 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 4743542 ns | 94583 ns | 50.15 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190024.5 ns | 190250 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 6261245 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2084333.5 ns | 2072000 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 204737 ns | 172412 ns | 1.19 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5745041.5 ns | 1982729 ns | 2.90 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3425542 ns | 2023063 ns | 1.69 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3393458.5 ns | 2022000 ns | 1.68 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 9080250 ns | 2016958 ns | 4.50 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 572657 ns | 583620.5 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27146313 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9775416 ns | 9865250 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1095715 ns | 1098935.5 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
avik-pal force-pushed the ap/ka_cpu branch 4 times, most recently from ca5d8df to f524b7e on August 22, 2024 16:58
[skip tests]
[skip tests]
Currently, this is a performance disaster. Locally, I see slowdowns of at least 5-10x. Let's see the numbers on the dedicated benchmarks.
The main advantage of this approach is that the maintenance burden goes down significantly. So how can we fix the performance? (This is probably better tracked as a KA issue.)
Finer control of the `CPU` backend from KA: `@simd` and `@simd ivdep` loop info, either enabled by default or supplied through the backend object. See "Make CPU loops simd & ivdep" (JuliaGPU/KernelAbstractions.jl#436).
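For context, here is a minimal sketch of the kind of loop annotation being asked for, written in plain Julia with no KA involved. The function name `bias_relu_simd!` and the features-by-batch array layout are assumptions made for illustration. This is not the code KA generates, nor what LuxLib currently ships.

```julia
# Hypothetical hand-written CPU loop for a fused bias + relu.
# `@simd ivdep` tells the compiler that iterations of the inner loop carry
# no memory dependencies on each other, so it may vectorize freely.
# That promise only holds if `y` does not overlap `x` or `b` in a way that
# makes one iteration read what another iteration writes.
function bias_relu_simd!(y::AbstractMatrix, x::AbstractMatrix, b::AbstractVector)
    @assert size(y) == size(x)
    @assert size(x, 1) == length(b)
    z = zero(eltype(y))
    for j in axes(x, 2)
        @simd ivdep for i in axes(x, 1)
            @inbounds y[i, j] = max(x[i, j] + b[i], z)
        end
    end
    return y
end

x = Float32[-1 2; 3 -4]
b = Float32[0.5, -0.5]
y = similar(x)
bias_relu_simd!(y, x, b)
```

The request in the item above is for the KA `CPU` backend to emit (or allow opting into) this loop info for the loops it generates, so that kernels written once with KA would not need a hand-tuned CPU variant like this one.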